Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Times Cited: 0
Authors
He, Xu [1]
Huang, Qiaochu [1]
Zhang, Zhensong [2]
Lin, Zhiwei [1]
Wu, Zhiyong [1,4]
Yang, Sicheng [1]
Li, Minglei [3]
Chen, Zhiyi [3]
Xu, Songcen [2]
Wu, Xiaofei [2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Huawei Noahs Ark Lab, Hong Kong, Peoples R China
[3] Huawei Cloud Comp Technol Co Ltd, Hong Kong, Peoples R China
[4] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52733.2024.00220
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even when the sequences are of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features that preserve essential appearance information. A transformer-based diffusion model then learns the temporal correlation between gestures and speech and performs generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details of certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
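To make the abstract's central idea concrete, the sketch below shows one way a transformer-based denoiser could run a diffusion process in a latent motion space conditioned on per-frame audio features. This is a minimal illustration, not the authors' implementation: all module names, dimensions, and the simple DDPM-style noise schedule are assumptions, and the paper's TPS-based motion encoder, optimal motion selection module, and refinement network are not modeled here.

```python
# Minimal sketch (assumed, not the paper's code): a transformer denoiser that
# predicts the noise added to a sequence of latent motion features, conditioned
# on temporally aligned audio features.
import torch
import torch.nn as nn


class AudioConditionedDenoiser(nn.Module):
    """Predicts the noise added to latent motion features at timestep t."""

    def __init__(self, motion_dim=64, audio_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                        nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # noisy_motion: (B, T, motion_dim), audio_feat: (B, T, audio_dim),
        # t: (B,) normalized diffusion timesteps in [0, 1].
        h = self.motion_proj(noisy_motion) + self.audio_proj(audio_feat)
        h = h + self.time_embed(t.float().view(-1, 1)).unsqueeze(1)
        return self.out(self.encoder(h))


def diffusion_training_step(model, motion, audio, n_steps=1000):
    """One DDPM-style training step: noise the motion at a random timestep
    and regress the added noise with an MSE loss (illustrative schedule)."""
    b = motion.size(0)
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, n_steps, (b,))
    a = alphas_cum[t].view(b, 1, 1)
    noise = torch.randn_like(motion)
    noisy = a.sqrt() * motion + (1 - a).sqrt() * noise
    pred = model(noisy, audio, t / n_steps)
    return nn.functional.mse_loss(pred, noise)


if __name__ == "__main__":
    model = AudioConditionedDenoiser()
    motion = torch.randn(2, 40, 64)   # latent motion features for 40 frames
    audio = torch.randn(2, 40, 128)   # aligned per-frame audio features
    print(diffusion_training_step(model, motion, audio).item())
```

At sampling time, such a denoiser would be run iteratively from Gaussian noise to produce a latent motion sequence, which a separate decoder (in the paper, built around the TPS-based motion representation) would then render into video frames.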
Pages: 2263 - 2273
Page count: 11
Related Papers
50 records in total
  • [1] Audio-Driven Co-Speech Gesture Video Generation
    Liu, Xian
    Wu, Qianyi
    Zhou, Hang
    Du, Yuanqi
    Wu, Wayne
    Lin, Dahua
    Liu, Ziwei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [2] Co-speech Gesture Video Generation with 3D Human Meshes
    Mahapatra, Aniruddha
    Mishra, Richa
    Li, Renda
    Chen, Ziyi
    Ding, Boyang
    Wang, Shoulei
    Zhu, Jun-Yan
    Chang, Peng
    Han, Mei
    Xiao, Jing
    COMPUTER VISION - ECCV 2024, PT LXXXIX, 2025, 15147 : 172 - 189
  • [3] Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
    Zhu, Lingting
    Liu, Xian
    Liu, Xuanyu
    Qian, Rui
    Liu, Ziwei
    Yu, Lequan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10544 - 10553
  • [4] Continual Learning for Personalized Co-Speech Gesture Generation
    Ahuja, Chaitanya
    Joshi, Pratik
    Ishii, Ryo
    Morency, Louis-Philippe
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 20836 - 20846
  • [5] SEEG: Semantic Energized Co-speech Gesture Generation
    Liang, Yuanzhi
    Feng, Qianyu
    Zhu, Linchao
    Hu, Li
    Pan, Pan
    Yang, Yi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10463 - 10472
  • [6] DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models
    Yang, Sicheng
    Wu, Zhiyong
    Li, Minglei
    Zhang, Zhensong
    Hao, Lei
    Bao, Weihong
    Cheng, Ming
    Xiao, Long
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 5860 - 5868
  • [7] Cross-Modal Quantization for Co-Speech Gesture Generation
    Wang, Zheng
    Zhang, Wei
    Ye, Long
    Zeng, Dan
    Mei, Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 10251 - 10263
  • [8] Learning hierarchical discrete prior for co-speech gesture generation
    Zhang, Jian
    Yoshie, Osamu
    NEUROCOMPUTING, 2024, 595
  • [9] EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
    Liu, Haiyang
    Zhu, Zihao
    Becherini, Giorgio
    Peng, Yichen
    Su, Mingyang
    Zhou, You
    Zhe, Xuefei
    Iwamoto, Naoya
    Zheng, Bo
    Black, Michael J.
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 1144 - 1154
  • [10] Co-speech gesture in bimodal bilinguals
    Casey, Shannon
    Emmorey, Karen
LANGUAGE AND COGNITIVE PROCESSES, 2009, 24 (02): 290 - 312