EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Cited by: 1
Authors
Liu, Haiyang [1 ]
Zhu, Zihao [2 ]
Becherini, Giorgio [3 ]
Peng, Yichen [4 ]
Su, Mingyang [5 ]
Zhou, You
Zhe, Xuefei
Iwamoto, Naoya
Zheng, Bo
Black, Michael J.
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] Keio Univ, Tokyo, Japan
[3] Max Planck Inst Intelligent Syst, Stuttgart, Germany
[4] JAIST, Nomi, Japan
[5] Tsinghua Univ, Beijing, Peoples R China
DOI
10.1109/CVPR52733.2024.00115
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer that facilitates joint training on audio-to-gesture generation and masked gesture reconstruction, effectively encoding audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the fidelity and diversity of the results. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available.
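The following is a minimal PyTorch sketch of the masked audio-gesture modeling idea summarized in the abstract. It is an illustrative assumption, not the authors' implementation: the module name MaskedAudioGestureModel, all layer sizes, and the 165-D SMPL-X-style pose vector are invented for demonstration, and EMAGE's compositional VQ-VAEs and separate face/body decoding paths are omitted. A learned mask token hides a random subset of gesture frames, and the model is trained to reconstruct them from the audio and the visible gesture hints.

import torch
import torch.nn as nn

class MaskedAudioGestureModel(nn.Module):
    def __init__(self, audio_dim=128, pose_dim=165, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project audio features
        self.pose_proj = nn.Linear(pose_dim, d_model)     # project gesture frames
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [MASK]
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, pose_dim)          # reconstruct pose vectors

    def forward(self, audio, poses, mask):
        # audio: (B, T, audio_dim); poses: (B, T, pose_dim)
        # mask:  (B, T) bool, True where a gesture frame is hidden from the model
        g = self.pose_proj(poses)
        g = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(g), g)
        a = self.audio_proj(audio)
        # Concatenate audio and gesture tokens along time, so masked gesture
        # positions can attend to both the audio and the visible gesture hints.
        h = self.encoder(torch.cat([a, g], dim=1))
        return self.head(h[:, audio.size(1):])  # predictions at the gesture slots

# One training step: the masked-modeling loss penalizes errors only on the
# hidden frames, as in masked reconstruction objectives generally.
model = MaskedAudioGestureModel()
audio = torch.randn(2, 32, 128)
poses = torch.randn(2, 32, 165)
mask = torch.rand(2, 32) < 0.5
pred = model(audio, poses, mask)
loss = ((pred - poses) ** 2)[mask].mean()
loss.backward()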
Pages: 1144-1154
Page count: 11