Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Cited by: 0
Authors
He, Xu [1]
Huang, Qiaochu [1]
Zhang, Zhensong [2]
Lin, Zhiwei [1]
Wu, Zhiyong [1,4]
Yang, Sicheng [1]
Li, Minglei [3]
Chen, Zhiyi [3]
Xu, Songcen [2]
Wu, Xiaofei [2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Huawei Noah's Ark Lab, Hong Kong, Peoples R China
[3] Huawei Cloud Comp Technol Co Ltd, Hong Kong, Peoples R China
[4] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52733.2024.00220
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, thereby omitting appearance information, we focus on the direct generation of audio-driven co-speech gesture videos. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements while retaining crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned, even for sequences of arbitrary length. To address these problems, we present a novel motion-decoupled framework for generating co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear thin-plate spline (TPS) transformation to obtain latent motion features that preserve essential appearance information. A transformer-based diffusion model is then proposed to learn the temporal correlation between gestures and speech and to perform generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion-related and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
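For readers who want a concrete picture of the generation step described in the abstract, the following PyTorch sketch illustrates the general idea of an audio-conditioned diffusion model operating on latent motion features. It is a minimal illustration under assumed names and dimensions (MotionDenoiser, motion_dim, audio_dim, and the DDPM noise schedule alphas_cumprod are all hypothetical), not the authors' implementation; the actual code lives in the linked repository.

import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Transformer that predicts the noise added to a latent motion
    sequence, conditioned on frame-aligned audio features and the
    diffusion timestep (standard epsilon-prediction parameterization).
    All dimensions here are illustrative assumptions."""

    def __init__(self, motion_dim=64, audio_dim=128, d_model=256,
                 nhead=4, num_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio, t):
        # noisy_motion: (B, T, motion_dim); audio: (B, T, audio_dim); t: (B,)
        h = self.motion_in(noisy_motion) + self.audio_in(audio)
        h = h + self.t_embed(t.float().unsqueeze(-1)).unsqueeze(1)
        return self.out(self.encoder(h))

def training_step(model, x0, audio, alphas_cumprod):
    """One DDPM training step: corrupt the clean latent motion x0 at a
    random timestep and regress the injected noise (MSE epsilon loss)."""
    B = x0.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    return nn.functional.mse_loss(model(x_t, audio, t), noise)

At inference time, one would start from Gaussian noise and iteratively denoise conditioned on the audio features; the optimal motion selection module mentioned in the abstract, which keeps long videos coherent across generated segments, is outside the scope of this sketch.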
Pages: 2263-2273
Page count: 11
Related Papers
50 items in total
  • [31] Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates
    Qian, Shenhan
    Tu, Zhi
    Zhi, Yihao
    Liu, Wen
    Gao, Shenghua
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 11057-11066
  • [32] Relations between syntactic encoding and co-speech gestures: Implications for a model of speech and gesture production
    Kita, Sotaro
    Özyürek, Asli
    Allen, Shanley
    Brown, Amanda
    Furman, Reyhan
    Ishizuka, Tomoko
    LANGUAGE AND COGNITIVE PROCESSES, 2007, 22 (08): 1212-1236
  • [33] Verbal working memory and co-speech gesture processing
    Momsen, Jacob
    Gordon, Jared
    Wu, Ying Choon
    Coulson, Seana
    BRAIN AND COGNITION, 2020, 146
  • [34] Gesture2Vec: Clustering Gestures using Representation Learning Methods for Co-speech Gesture Generation
    Yazdian, Payam Jome
    Chen, Mo
    Lim, Angelica
    2022 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2022: 3100-3107
  • [35] "The More You Gesture, the Less I Gesture": Co-Speech Gestures as a Measure of Mental Model Quality
    Cutica, Ilaria
    Bucciarelli, Monica
    JOURNAL OF NONVERBAL BEHAVIOR, 2011, 35 (03): 173-187
  • [36] Social eye gaze modulates processing of speech and co-speech gesture
    Holler, Judith
    Schubotz, Louise
    Kelly, Spencer
    Hagoort, Peter
    Schuetze, Manuela
    Özyürek, Asli
    COGNITION, 2014, 133 (03): 692-697
  • [37] Towards Real-time Co-speech Gesture Generation in Online Interaction in Social XR
    Krome, Niklas
    Kopp, Stefan
    PROCEEDINGS OF THE 23RD ACM INTERNATIONAL CONFERENCE ON INTELLIGENT VIRTUAL AGENTS, IVA 2023, 2023
  • [38] Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
    Deichler, Anna
    Mehta, Shivam
    Alexanderson, Simon
    Beskow, Jonas
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023: 755-762
  • [39] Augmented Co-Speech Gesture Generation: Including Form and Meaning Features to Guide Learning-Based Gesture Synthesis
    Voss, Hendric
    Kopp, Stefan
    PROCEEDINGS OF THE 23RD ACM INTERNATIONAL CONFERENCE ON INTELLIGENT VIRTUAL AGENTS, IVA 2023, 2023
  • [40] Learning Co-Speech Gesture for Multimodal Aphasia Type Detection
    Lee, Daeun
    Son, Sejung
    Jeon, Hyolim
    Kim, Seungbae
    Han, Jinyoung
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 9287-9303