Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF

Cited by: 2
Authors
Shin, Ah-Hyung [1 ]
Lee, Jae-Ho [1 ]
Hwang, Jiwon [1 ]
Kim, Yoonhyung [2 ]
Park, Gyeong-Moon [1 ]
Affiliations
[1] Kyung Hee Univ, Yongin, South Korea
[2] Elect & Telecommun Res Inst ETRI, Daejeon, South Korea
Keywords
Talking head generation; Neural radiance fields; Cross-modal generation; Audio-visual; Wavelet transform
DOI
10.1016/j.imavis.2024.105104
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Talking head generation is an essential task in various real-world applications such as film making and virtual reality. To this end, recent works focus on NeRF-based methods, which can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, the existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that the previous methods do not explicitly consider audio-visual representations, which are crucial for precise lip synchronization. Moreover, the existing methods struggle to generate high-frequency details, making the generation results unnatural. To overcome these problems, we propose a novel audio-synced and high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder for fast generation of the whole image, which allows us to employ a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and the audio representations. In addition, we integrate the wavelet transform into our framework by proposing a wavelet loss function that enhances high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and outperforms current NeRF-based state-of-the-art methods on average across four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet Confidence (+154.7%).
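The abstract describes the wavelet loss only at a high level. Below is a minimal sketch of the idea, assuming a single-level Haar decomposition with an L1 penalty on the three high-frequency subbands; the function names haar_dwt and wavelet_loss and the 0.1 weight are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def haar_dwt(x: torch.Tensor):
    """Single-level 2D Haar DWT of a batch of images.

    x: (B, C, H, W) with even H and W. Returns the four subbands
    (LL, LH, HL, HH), each of shape (B, C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-pass: coarse structure
    lh = (a + b - c - d) / 2  # top-minus-bottom: horizontal edge detail
    hl = (a - b + c - d) / 2  # left-minus-right: vertical edge detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh


def wavelet_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 penalty on the high-frequency subbands only, pushing the
    generator to reproduce fine detail. The wavelet family, the
    decomposition depth, and the subband weighting used in the actual
    paper are assumptions here."""
    _, lh_p, hl_p, hh_p = haar_dwt(pred)
    _, lh_t, hl_t, hh_t = haar_dwt(target)
    return (F.l1_loss(lh_p, lh_t)
            + F.l1_loss(hl_p, hl_t)
            + F.l1_loss(hh_p, hh_t))


# Usage on dummy frames: combine with the usual photometric term.
pred = torch.rand(2, 3, 256, 256)    # rendered frames
target = torch.rand(2, 3, 256, 256)  # ground-truth frames
total = F.l1_loss(pred, target) + 0.1 * wavelet_loss(pred, target)
```

Penalizing only the LH/HL/HH subbands leaves coarse structure to the photometric term and concentrates the gradient on high-frequency detail, which is the stated motivation for the wavelet term.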
Pages: 14