EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Cited by: 1
Authors
Liu, Haiyang [1]; Zhu, Zihao [2]; Becherini, Giorgio [3]; Peng, Yichen [4]; Su, Mingyang [5]; Zhou, You; Zhe, Xuefei; Iwamoto, Naoya; Zheng, Bo; Black, Michael J.
Affiliations
[1] University of Tokyo, Tokyo, Japan
[2] Keio University, Tokyo, Japan
[3] Max Planck Institute for Intelligent Systems, Stuttgart, Germany
[4] Japan Advanced Institute of Science and Technology (JAIST), Nomi, Japan
[5] Tsinghua University, Beijing, China
DOI: 10.1109/CVPR52733.2024.00115
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-capture dataset. EMAGE leverages masked body-gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer that enables joint training on audio-to-gesture generation and masked gesture reconstruction, effectively encoding audio and body-gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content, and utilizes four compositional VQ-VAEs to enhance the fidelity and diversity of the results. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and flexibly accepts predefined spatiotemporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available.
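The abstract couples two mechanisms: discretizing gestures with compositional VQ-VAEs (EMAGE uses four, covering face, local body, hands, and global motion) and masked gesture modeling, where randomly hidden frames are reconstructed from audio features plus the remaining unmasked gesture hints. The sketch below illustrates both ideas in miniature; it is not EMAGE's actual implementation, and all module names, dimensions, and the mask ratio are assumptions chosen for illustration.

    # Minimal PyTorch sketch: one VQ codebook + masked gesture reconstruction.
    # Everything here (names, sizes, mask_ratio) is illustrative, not EMAGE's code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        """Nearest-neighbor codebook lookup with a straight-through gradient."""
        def __init__(self, num_codes=256, dim=128, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta  # commitment-loss weight from the VQ-VAE objective

        def forward(self, z):  # z: (batch, frames, dim)
            w = self.codebook.weight
            # squared Euclidean distance to every code: (batch, frames, num_codes)
            dist = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.t() + w.pow(2).sum(-1)
            idx = dist.argmin(-1)               # discrete gesture tokens
            z_q = self.codebook(idx)
            # codebook and commitment terms (the standard VQ-VAE losses)
            vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
            z_q = z + (z_q - z).detach()        # straight-through estimator
            return z_q, idx, vq_loss

    class MaskedGestureModel(nn.Module):
        """Reconstructs masked gesture frames from audio + unmasked gesture hints."""
        def __init__(self, pose_dim=78, audio_dim=64, dim=128):
            super().__init__()
            self.pose_in = nn.Linear(pose_dim, dim)
            self.audio_in = nn.Linear(audio_dim, dim)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.quantizer = VectorQuantizer(dim=dim)
            self.pose_out = nn.Linear(dim, pose_dim)

        def forward(self, poses, audio, mask_ratio=0.5):
            B, T, _ = poses.shape
            hidden = torch.rand(B, T, device=poses.device) < mask_ratio  # True = masked
            h = self.pose_in(poses)
            h = torch.where(hidden.unsqueeze(-1), self.mask_token.expand(B, T, -1), h)
            h = self.encoder(h + self.audio_in(audio))  # fuse audio with gesture hints
            z_q, _, vq_loss = self.quantizer(h)
            recon = self.pose_out(z_q)
            # supervise only the frames that were hidden, as in masked autoencoding
            return F.mse_loss(recon[hidden], poses[hidden]) + vq_loss

    # Usage: one training step on random stand-in data.
    model = MaskedGestureModel()
    poses = torch.randn(2, 32, 78)   # (batch, frames, pose parameters)
    audio = torch.randn(2, 32, 64)   # frame-aligned audio features
    loss = model(poses, audio)
    loss.backward()

In the full method, per the abstract, the quantization is compositional across four body-part streams, and the encoded hints from unmasked gestures condition separate facial and body decoders rather than a single reconstruction head.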
Pages: 1144-1154 (11 pages)
Related Papers (29 in total; 10 listed below)
  • [1] Liu, Xian; Wu, Qianyi; Zhou, Hang; Du, Yuanqi; Wu, Wayne; Lin, Dahua; Liu, Ziwei. Audio-Driven Co-Speech Gesture Video Generation. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  • [2] Zhu, Lingting; Liu, Xian; Liu, Xuanyu; Qian, Rui; Liu, Ziwei; Yu, Lequan. Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10544-10553.
  • [3] Zhi, Yihao; Cun, Xiaodong; Chen, Xuelin; Shen, Xi; Guo, Wen; Huang, Shaoli; Gao, Shenghua. LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 20750-20760.
  • [4] Yang, Sicheng; Wu, Zhiyong; Li, Minglei; Zhang, Zhensong; Hao, Lei; Bao, Weihong; Cheng, Ming; Xiao, Long. DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2023, pp. 5860-5868.
  • [5] Ahuja, Chaitanya; Joshi, Pratik; Ishii, Ryo; Morency, Louis-Philippe. Continual Learning for Personalized Co-Speech Gesture Generation. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 20836-20846.
  • [6] Liang, Yuanzhi; Feng, Qianyu; Zhu, Linchao; Hu, Li; Pan, Pan; Yang, Yi. SEEG: Semantic Energized Co-speech Gesture Generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10463-10472.
  • [7] Zhang, Heng; Yu, Chuang; Tapus, Adriana. Towards a Framework for Social Robot Co-speech Gesture Generation with Semantic Expression. Social Robotics (ICSR 2022), Part I, vol. 13817, 2022, pp. 110-119.
  • [8] Deichler, Anna; Mehta, Shivam; Alexanderson, Simon; Beskow, Jonas. Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. Proceedings of the 25th International Conference on Multimodal Interaction (ICMI), 2023, pp. 755-762.
  • [9] Wang, Zheng; Zhang, Wei; Ye, Long; Zeng, Dan; Mei, Tao. Cross-Modal Quantization for Co-Speech Gesture Generation. IEEE Transactions on Multimedia, vol. 26, 2024, pp. 10251-10263.
  • [10] Zhang, Jian; Yoshie, Osamu. Learning Hierarchical Discrete Prior for Co-Speech Gesture Generation. Neurocomputing, vol. 595, 2024.