Learning Speech-driven 3D Conversational Gestures from Video

Cited by: 32

Authors:

Habibie, Ikhsanul ^{[1]}

Xu, Weipeng ^{[2]}

Mehta, Dushyant ^{[1]}

Liu, Lingjie ^{[1]}

Seidel, Hans-Peter ^{[1]}

Pons-Moll, Gerard ^{[3]}

Elgharib, Mohamed ^{[1]}

Theobalt, Christian ^{[1]}

Affiliations:

[1] Max Planck Institute for Informatics, Saarbrücken, Germany

[2] Facebook Reality Labs, Redmond, WA, USA

[3] University of Tübingen, Tübingen, Germany

Source:

PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON INTELLIGENT VIRTUAL AGENTS (IVA) | 2021

Keywords:

gesture synthesis; character control; audio-driven pose estimation;

DOI:

10.1145/3472306.3478335

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.

Citation

Pages: 101-108

Number of pages: 8

