Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF

Cited by: 2
Authors
Shin, Ah-Hyung [1 ]
Lee, Jae-Ho [1 ]
Hwang, Jiwon [1 ]
Kim, Yoonhyung [2 ]
Park, Gyeong-Moon [1 ]
Affiliations
[1] Kyung Hee Univ, Yongin, South Korea
[2] Elect & Telecommun Res Inst ETRI, Daejeon, South Korea
Keywords
Talking head generation; Neural radiance fields; Cross-modal generation; Audio-visual; Wavelet transform
DOI
10.1016/j.imavis.2024.105104
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Talking head generation is an essential task in various real-world applications such as film making and virtual reality. To this end, recent works focus on NeRF-based methods, which can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, the existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that the previous methods do not explicitly consider audio-visual representations, which are crucial for precise lip synchronization. Moreover, the existing methods struggle to generate high-frequency details, making the generation results unnatural. To overcome these problems, we propose a novel audio-synced and high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder for fast generation of the whole image, which allows us to employ a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and the audio representations. In addition, we integrate the wavelet transform into our framework by proposing a wavelet loss function that enhances high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and outperforms current NeRF-based state-of-the-art methods on average across four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet Confidence (+154.7%).
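The abstract describes the wavelet loss only at a high level. Below is a minimal sketch of the idea, assuming a single-level Haar decomposition with an L1 penalty on the three high-frequency subbands; the function names haar_dwt and wavelet_loss and the 0.1 weight are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def haar_dwt(x: torch.Tensor):
    """Single-level 2D Haar DWT of a batch of images.

    x: (B, C, H, W) with even H and W. Returns the four subbands
    (LL, LH, HL, HH), each of shape (B, C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-pass: coarse structure
    lh = (a + b - c - d) / 2  # top-minus-bottom: horizontal edge detail
    hl = (a - b + c - d) / 2  # left-minus-right: vertical edge detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh


def wavelet_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 penalty on the high-frequency subbands only, pushing the
    generator to reproduce fine detail. The wavelet family, the
    decomposition depth, and the subband weighting used in the actual
    paper are assumptions here."""
    _, lh_p, hl_p, hh_p = haar_dwt(pred)
    _, lh_t, hl_t, hh_t = haar_dwt(target)
    return (F.l1_loss(lh_p, lh_t)
            + F.l1_loss(hl_p, hl_t)
            + F.l1_loss(hh_p, hh_t))


# Usage on dummy frames: combine with the usual photometric term.
pred = torch.rand(2, 3, 256, 256)    # rendered frames
target = torch.rand(2, 3, 256, 256)  # ground-truth frames
total = F.l1_loss(pred, target) + 0.1 * wavelet_loss(pred, target)
```

Penalizing only the LH/HL/HH subbands leaves coarse structure to the photometric term and concentrates the gradient on high-frequency detail, which is the stated motivation for the wavelet term.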
Pages: 14