Learning Speech-driven 3D Conversational Gestures from Video

Cited by: 32
Authors
Habibie, Ikhsanul [1 ]
Xu, Weipeng [2 ]
Mehta, Dushyant [1 ]
Liu, Lingjie [1 ]
Seidel, Hans-Peter [1 ]
Pons-Moll, Gerard [3 ]
Elgharib, Mohamed [1 ]
Theobalt, Christian [1 ]
Affiliations
[1] Max Planck Institute for Informatics, Saarbrücken, Germany
[2] Facebook Reality Labs, Redmond, WA, USA
[3] University of Tübingen, Tübingen, Germany
Keywords
gesture synthesis; character control; audio-driven pose estimation
DOI
10.1145/3472306.3478335
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.
Pages: 101-108
Page count: 8
Related papers
50 in total
  • [41] Semiautomatic Learning of 3D Objects from Video Streams
    Carrara, Fabio
    Falchi, Fabrizio
    Gennaro, Claudio
    [J]. SIMILARITY SEARCH AND APPLICATIONS, SISAP 2015, 2015, 9371 : 217 - 228
  • [42] Video Content Production Support System with Speech-Driven Embodied Entrainment Character by Speech and Hand Motion Inputs
    Yamamoto, Michiya
    Osaki, Kouzi
    Watanabe, Tomio
    [J]. HUMAN-COMPUTER INTERACTION, PT III: AMBIENT, UBIQUITOUS AND INTELLIGENT INTERACTION, 2009, 5612 : 358 - +
  • [43] VISUAL SPEECH SYNTHESIS FROM 3D MESH SEQUENCES DRIVEN BY COMBINED SPEECH FEATURES
    Kuhnke, Felix
    Ostermann, Joern
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 1075 - 1080
  • [44] SRG3: Speech-driven Robot Gesture Generation with GAN
    Yu, Chuang
    Tapus, Adriana
    [J]. 16TH IEEE INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV 2020), 2020, : 759 - 766
  • [45] Speech driven 3D head gesture synthesis
    Sargin, M. E.
    Erzin, E.
    Yemez, Y.
    Tekalp, A. M.
    Erdem, A. Tanju
    [J]. 2006 IEEE 14TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1 AND 2, 2006, : 237 - +
  • [46] The Power is in Your Hands: 3D Analysis of Hand Gestures in Naturalistic Video
    Ohn-Bar, Eshed
    Trivedi, Mohan M.
    [J]. 2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), 2013, : 912 - 917
  • [47] 3D Perception Algorithms: Towards Perceptually Driven Compression of 3D Video
    Ruimin Hu
    Rui Zhong
    Zhongyuan Wang
    Zhen Han
    [J]. ZTE Communications, 2013, 11 (01) : 11 - 16
  • [48] Video Communication System with Speech-Driven Embodied Entrainment Audience Characters with Partner's Face
    Nakayama, Shiho
    Watanabe, Tomio
    Ishii, Yutaka
    [J]. 2013 IEEE/SICE INTERNATIONAL SYMPOSIUM ON SYSTEM INTEGRATION (SII), 2013, : 873 - 878
  • [49] An Augmented Reality Application with Hand Gestures for Learning 3D Geometry
    Le, Hong-Quan
    Kim, Jee-In
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2017, : 34 - 41
  • [50] Model-Based Synthesis of Visual Speech Movements from 3D Video
    Edge, James D.
    Hilton, Adrian
    Jackson, Philip
    [J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2009,