Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Cited: 0
Authors
He, Xu [1]
Huang, Qiaochu [1]
Zhang, Zhensong [2]
Lin, Zhiwei [1]
Wu, Zhiyong [1,4]
Yang, Sicheng [1]
Li, Minglei [3]
Chen, Zhiyi [3]
Xu, Songcen [2]
Wu, Xiaofei [2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Huawei Noah's Ark Lab, Hong Kong, Peoples R China
[3] Huawei Cloud Comp Technol Co Ltd, Hong Kong, Peoples R China
[4] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52733.2024.00220
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned, even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS (thin-plate spline) transformation to obtain latent motion features that preserve essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech and to perform generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion- and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
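To make the diffusion stage concrete, below is a minimal, illustrative sketch of a transformer-based denoiser that predicts the noise on a latent motion sequence conditioned on frame-aligned audio features, in the spirit of the pipeline the abstract describes. All module names, dimensions, the toy cosine noise schedule, and the training step here are assumptions made purely for exposition; this is not the authors' implementation, which is available in the linked repository.

import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Illustrative denoiser: predicts noise added to a latent motion sequence,
    conditioned on frame-aligned audio features (all sizes are assumptions)."""
    def __init__(self, motion_dim=128, audio_dim=80, d_model=256, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)  # per-frame audio conditioning
        self.time_mlp = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                      nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio, t):
        # noisy_motion: (B, T, motion_dim), audio: (B, T, audio_dim), t: (B,)
        h = self.motion_in(noisy_motion) + self.audio_in(audio)
        h = h + self.time_mlp(t.float().view(-1, 1)).unsqueeze(1)  # diffusion-step embedding
        return self.motion_out(self.encoder(h))  # predicted noise, (B, T, motion_dim)

# One DDPM-style training step on random tensors (shapes are illustrative).
model = MotionDenoiser()
x0 = torch.randn(2, 64, 128)    # clean latent motion features (e.g., TPS parameters)
audio = torch.randn(2, 64, 80)  # aligned audio features (e.g., mel-spectrogram frames)
t = torch.randint(0, 1000, (2,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2  # toy cosine noise schedule
noise = torch.randn_like(x0)
x_t = alpha_bar.sqrt().view(-1, 1, 1) * x0 + (1 - alpha_bar).sqrt().view(-1, 1, 1) * noise
loss = nn.functional.mse_loss(model(x_t, audio, t), noise)
loss.backward()

At inference time, such a denoiser would be run iteratively from Gaussian noise, and the sampled latent motion sequence would then drive the warping and refinement stages; those parts, along with the optimal motion selection for long sequences, are omitted from this sketch.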
Pages: 2263-2273
Number of pages: 11