Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Cited by: 0
Authors
He, Xu [1 ]
Huang, Qiaochu [1 ]
Zhang, Zhensong [2 ]
Lin, Zhiwei [1 ]
Wu, Zhiyong [1 ,4 ]
Yang, Sicheng [1 ]
Li, Minglei [3 ]
Chen, Zhiyi [3 ]
Xu, Songcen [2 ]
Wu, Xiaofei [2 ]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Huawei Noah's Ark Lab, Hong Kong, Peoples R China
[3] Huawei Cloud Comp Technol Co Ltd, Hong Kong, Peoples R China
[4] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52733.2024.00220
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements while retaining crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned, even for sequences of arbitrary length. To address these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS (thin-plate spline) transformation to obtain latent motion features that preserve essential appearance information. A transformer-based diffusion model then learns the temporal correlation between gestures and speech and performs generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion-related and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
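The TPS (thin-plate spline) transformation mentioned in the abstract is a classical smooth warp driven by sparse keypoint correspondences: it fits a deformation that maps source keypoints exactly onto target keypoints while minimizing bending energy elsewhere. The sketch below is only a minimal illustration of the underlying 2-D thin-plate spline interpolation, not the authors' implementation; the function name `tps_warp` and its interface are hypothetical.

```python
import numpy as np

def tps_warp(src_pts, dst_pts, query_pts):
    """Fit a 2-D thin-plate spline mapping src_pts -> dst_pts
    exactly, then evaluate the warp at query_pts."""
    def U(r):
        # TPS radial basis r^2 * log(r^2); the epsilon avoids log(0)
        return r**2 * np.log(r**2 + 1e-12)

    n = len(src_pts)
    # Pairwise radial-basis matrix between control points
    K = U(np.linalg.norm(src_pts[:, None] - src_pts[None, :], axis=-1))
    # Affine part: [1, x, y] per control point
    P = np.hstack([np.ones((n, 1)), src_pts])
    # Standard TPS linear system: [K P; P^T 0] [w; a] = [dst; 0]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.vstack([dst_pts, np.zeros((3, 2))])
    params = np.linalg.solve(A, b)
    w, a = params[:n], params[n:]
    # Evaluate: radial terms from each control point plus the affine term
    dq = np.linalg.norm(query_pts[:, None] - src_pts[None, :], axis=-1)
    return U(dq) @ w + np.hstack([np.ones((len(query_pts), 1)), query_pts]) @ a
```

In the paper's setting, such keypoint-conditioned warps let motion be represented in a compact latent space decoupled from appearance, so the diffusion model only has to predict motion while the source frame supplies the appearance.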
Pages: 2263-2273 (11 pages)