Audio-Driven Co-Speech Gesture Video Generation

Cited by: 0
Authors
Liu, Xian [1 ]
Wu, Qianyi [2 ]
Zhou, Hang [1 ]
Du, Yuanqi [3 ]
Wu, Wayne [4 ]
Lin, Dahua [1 ,4 ]
Liu, Ziwei [5 ]
Affiliations
[1] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong, Peoples R China
[2] Monash Univ, Clayton, Vic, Australia
[3] Cornell Univ, Ithaca, NY USA
[4] Shanghai AI Lab, Shanghai, Peoples R China
[5] Nanyang Technol Univ, S Lab, Singapore, Singapore
Keywords
DOI
N/A
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study the challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequences driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representations into codebooks, and 2) a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. A demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE
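The abstract's central decomposition — snapping continuous motion features to a codebook of common gesture patterns while keeping a residual for fine rhythmic detail — can be illustrated with a minimal vector-quantization sketch. This is a toy NumPy illustration only, with a random codebook, made-up dimensions, and a hypothetical `quantize` helper; it is not the authors' released implementation of the VQ-Motion Extractor.

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray):
    """Snap each motion feature to its nearest codebook entry (toy VQ step).

    features: (T, D) motion features for T frames
    codebook: (K, D) candidate gesture-pattern codes
    Returns (token indices (T,), quantized features (T, D), residual (T, D)).
    """
    # Squared L2 distance between every feature and every code: shape (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)      # discrete gesture-pattern tokens
    quantized = codebook[indices]       # reusable common motion patterns
    residual = features - quantized     # subtle rhythmic detail left over
    return indices, quantized, residual

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))     # K=8 codes, D=4 dims (toy sizes)
# Build features that are codebook entries 2, 2, 5 plus a tiny perturbation,
# so quantization should recover those indices.
features = codebook[[2, 2, 5]] + 0.01 * rng.normal(size=(3, 4))
idx, quant, res = quantize(features, codebook)
print(idx.tolist())
```

In the full framework the discrete token sequence `idx` is what a sequence model such as the paper's Co-Speech GPT would predict from audio, with a refinement stage restoring detail analogous to the residual.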
Pages: 14