Audio-Driven Co-Speech Gesture Video Generation

Cited by: 0
Authors
Liu, Xian [1 ]
Wu, Qianyi [2 ]
Zhou, Hang [1 ]
Du, Yuanqi [3 ]
Wu, Wayne [4 ]
Lin, Dahua [1 ,4 ]
Liu, Ziwei [5 ]
Affiliations
[1] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong, Peoples R China
[2] Monash Univ, Clayton, Vic, Australia
[3] Cornell Univ, Ithaca, NY USA
[4] Shanghai AI Lab, Shanghai, Peoples R China
[5] Nanyang Technol Univ, S Lab, Singapore, Singapore
Keywords
DOI
N/A
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study the challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequences driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representations into codebooks, and 2) a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. A demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE
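The abstract's central decomposition — snapping continuous motion features to a codebook of common gesture patterns while keeping a residual for fine rhythmic detail — can be illustrated with a minimal vector-quantization sketch. This is a toy NumPy illustration only, with a random codebook, made-up dimensions, and a hypothetical `quantize` helper; it is not the authors' released implementation of the VQ-Motion Extractor.

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray):
    """Snap each motion feature to its nearest codebook entry (toy VQ step).

    features: (T, D) motion features for T frames
    codebook: (K, D) candidate gesture-pattern codes
    Returns (token indices (T,), quantized features (T, D), residual (T, D)).
    """
    # Squared L2 distance between every feature and every code: shape (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)      # discrete gesture-pattern tokens
    quantized = codebook[indices]       # reusable common motion patterns
    residual = features - quantized     # subtle rhythmic detail left over
    return indices, quantized, residual

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))     # K=8 codes, D=4 dims (toy sizes)
# Build features that are codebook entries 2, 2, 5 plus a tiny perturbation,
# so quantization should recover those indices.
features = codebook[[2, 2, 5]] + 0.01 * rng.normal(size=(3, 4))
idx, quant, res = quantize(features, codebook)
print(idx.tolist())
```

In the full framework the discrete token sequence `idx` is what a sequence model such as the paper's Co-Speech GPT would predict from audio, with a refinement stage restoring detail analogous to the residual.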
Pages: 14