Audio-Driven Co-Speech Gesture Video Generation

Cited by: 0
Authors:
Liu, Xian [1]
Wu, Qianyi [2]
Zhou, Hang [1]
Du, Yuanqi [3]
Wu, Wayne [4]
Lin, Dahua [1, 4]
Liu, Ziwei [5]
Affiliations:
[1] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong, Peoples R China
[2] Monash Univ, Clayton, Vic, Australia
[3] Cornell Univ, Ithaca, NY USA
[4] Shanghai AI Lab, Shanghai, Peoples R China
[5] Nanyang Technol Univ, S Lab, Singapore, Singapore
Keywords:
DOI: Not available
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate a speaker's image sequence driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from the implicit motion representation into codebooks; 2) a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos. A demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE
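The two-stage design in the abstract (a discrete codebook of reusable gesture patterns, plus a GPT that predicts code indices from audio) follows the general VQ-VAE recipe. Below is a minimal, illustrative PyTorch sketch of the quantization step only; the class name VQMotionCodebook, the hyperparameters, and the tensor shapes are assumptions made for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class VQMotionCodebook(nn.Module):
    """Illustrative VQ step (hypothetical, not ANGIE's actual code):
    snap continuous motion features to their nearest codebook entries."""

    def __init__(self, num_codes: int = 512, code_dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, code_dim) continuous motion features from an encoder.
        # Distance from every frame's feature to every codebook vector;
        # batch dims broadcast, giving (batch, frames, num_codes).
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        idx = dist.argmin(dim=-1)        # (batch, frames) discrete pattern indices
        z_q = self.codebook(idx)         # (batch, frames, code_dim) quantized features
        # Straight-through estimator: forward pass uses z_q, but gradients
        # flow back to z as if quantization were the identity.
        z_q = z + (z_q - z).detach()
        return z_q, idx

# Usage: 64 frames of 128-d motion features -> discrete tokens.
vq = VQMotionCodebook()
z_q, codes = vq(torch.randn(2, 64, 128))
print(codes.shape)  # torch.Size([2, 64])
```

The discrete indices returned here are the kind of token sequence a GPT-style model would then predict autoregressively from audio features; the straight-through trick lets gradients bypass the non-differentiable argmin during training.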
Pages: 14
Related Papers (50 records in total)
  • [1] Zhu, Lingting; Liu, Xian; Liu, Xuanyu; Qian, Rui; Liu, Ziwei; Yu, Lequan. Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 10544-10553.
  • [2] Yang, Sicheng; Wu, Zhiyong; Li, Minglei; Zhang, Zhensong; Hao, Lei; Bao, Weihong; Cheng, Ming; Xiao, Long. DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2023: 5860-5868.
  • [3] Qi, Xingqun; Liu, Chen; Li, Lincheng; Hou, Jie; Xin, Haoran; Yu, Xin. EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation. IEEE Transactions on Multimedia, 2024, 26: 10420-10430.
  • [4] Hogue, Steven; Zhang, Chenxu; Daruger, Hamza; Tian, Yapeng; Guo, Xiaohu. DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024: 1922-1931.
  • [5] Nyatsanga, S.; Kucherenko, T.; Ahuja, C.; Henter, G. E.; Neff, M. A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer Graphics Forum, 2023, 42(2): 569-596.
  • [6] Zhou, Yang; Yang, Jimei; Li, Dingzeyu; Saito, Jun; Aneja, Deepali; Kalogerakis, Evangelos. Audio-driven Neural Gesture Reenactment with Video Motion Graphs. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022: 3408-3418.
  • [7] Mahapatra, Aniruddha; Mishra, Richa; Li, Renda; Chen, Ziyi; Ding, Boyang; Wang, Shoulei; Zhu, Jun-Yan; Chang, Peng; Han, Mei; Xiao, Jing. Co-speech Gesture Video Generation with 3D Human Meshes. Computer Vision - ECCV 2024, Pt LXXXIX, 2025, 15147: 172-189.
  • [8] Liang, Jiadong; Lu, Feng. Audio-driven Talking Face Video Generation with Emotion. 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 2024: 863-864.
  • [9] Ye, Sheng; Wen, Yu-Hui; Sun, Yanan; He, Ying; Zhang, Ziyang; Wang, Yaoyuan; He, Weihua; Liu, Yong-Jin. Audio-Driven Stylized Gesture Generation with Flow-Based Model. Computer Vision - ECCV 2022, Pt V, 2022, 13665: 712-728.
  • [10] Liu, Haiyang; Zhu, Zihao; Becherini, Giorgio; Peng, Yichen; Su, Mingyang; Zhou, You; Zhe, Xuefei; Iwamoto, Naoya; Zheng, Bo; Black, Michael J. EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 1144-1154.