FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

Cited by: 3
Authors
Haque, Kazi Injamamul [1 ]
Yumak, Zerrin [1 ]
Affiliations
[1] Univ Utrecht, Utrecht, Netherlands
Keywords
facial animation synthesis; deep learning; digital humans;
DOI
10.1145/3577190.3614157
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents FaceXHuBERT, a text-less speech-driven 3D facial animation generation method that generates facial cues driven by an emotional expressiveness condition. In addition, it can handle audio recorded in a variety of situations (e.g. background noise, multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate 3D facial animation. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck. The resulting animations still have issues with accurate lip-syncing, emotional expressivity, person-specific facial cues, and generalizability. In this work, we first surpass the state of the art on the speech-driven 3D facial animation generation task by effectively employing the self-supervised pretrained HuBERT speech model, which makes it possible to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Second, we incorporate an emotional expressiveness modality by guiding the network with a binary emotion condition. We carried out extensive objective and subjective evaluations against the ground truth and the state of the art. A perceptual user study demonstrates that facial animations generated expressively with our approach are indeed perceived as more realistic and are preferred over non-expressive ones. In addition, we show that a strong audio encoder alone eliminates the need for a complex decoder in the network architecture, significantly reducing network complexity and training time. We provide the code(1) publicly and recommend watching the video.
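The abstract's core idea of conditioning animation on a binary emotion flag alongside self-supervised speech features can be sketched minimally as follows. This is not the paper's implementation: the feature dimensionality (768, the base HuBERT hidden size), the vertex count `V`, and the single linear "decoder" are illustrative assumptions standing in for the actual pretrained HuBERT encoder and trained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level speech features: T audio frames, 768-dim,
# standing in for outputs of a pretrained HuBERT encoder (assumed size).
T, D = 50, 768
speech_feats = rng.standard_normal((T, D))

# Binary emotion condition (1 = expressive, 0 = neutral), broadcast
# over time and concatenated onto every frame's feature vector.
emotion = 1
cond = np.full((T, 1), emotion, dtype=speech_feats.dtype)
conditioned = np.concatenate([speech_feats, cond], axis=1)  # (T, D + 1)

# Lightweight linear stand-in for the decoder, mapping each conditioned
# frame to per-vertex 3D displacements of a face mesh (V vertices).
V = 5023                             # hypothetical mesh vertex count
W = rng.standard_normal((D + 1, V * 3)) * 0.01
offsets = conditioned @ W            # (T, V*3)
animation = offsets.reshape(T, V, 3)  # one mesh deformation per frame
print(animation.shape)               # (50, 5023, 3)
```

The point the abstract makes is that when the audio features are this informative, the mapping after conditioning can stay simple, which is what the single linear layer here caricatures.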
Pages: 282-291
Page count: 10