CMFF-Face: Attention-Based Cross-Modal Feature Fusion for High-Quality Audio-Driven Talking Face Generation

Cited by: 0
Authors
Zhao, Guangzhe [1]
Liu, Yanan [1]
Wang, Xueping [1]
Yan, Feihu [1]
Affiliations
[1] Beijing University of Civil Engineering and Architecture, School of Electrical and Information Engineering, Beijing, People's Republic of China
Funding
National Science Foundation (USA)
Keywords
Talking Face Generation; Cross-Modal Feature Fusion; Attention Mechanism; Lip Synchronization; High-Quality Face;
DOI
10.1145/3652583.3658055
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Audio-driven talking face generation creates lip-synchronized, high-quality face videos from given audio and target face images. The task is challenging because of the inherent modal gap between audio and face images. To address this issue, we propose an attention-based Cross-Modal Feature Fusion network for talking Face generation, called CMFF-Face. Specifically, we introduce a cross-modal feature fusion generator that incorporates a fusion process in each convolutional encoder layer, allowing layer-wise fusion of interactive audio and face features to generate high-quality talking faces. Additionally, a lip synchronization discriminator is designed to improve audio-lip synchronization; it uses a two-branch cross-attention mechanism to capture the associations between synchronized audio and face more effectively. Finally, we employ a CLIP-based audio-lip synchronization loss that helps distinguish between positive and negative sample pairs to enhance lip synchronization. Comprehensive experiments on the LRS2 and LRW datasets demonstrate that our method outperforms state-of-the-art methods in terms of lip synchronization and visual quality.
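The abstract does not spell out the CLIP-based audio-lip synchronization loss, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive loss, not the authors' formulation. It assumes a batch in which row i of the audio embeddings and row i of the lip embeddings form a positive pair and every other pairing is a negative. The function name `clip_style_sync_loss`, the `temperature` value, and the embedding shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def clip_style_sync_loss(audio_emb, lip_emb, temperature=0.07):
    """Symmetric contrastive loss over paired embeddings (illustrative only).

    audio_emb, lip_emb: arrays of shape (batch, dim). Row i of each array is
    assumed to be a synchronized (positive) pair; all other row pairings in
    the batch are treated as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = lip_emb / np.linalg.norm(lip_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = a @ v.T / temperature

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax along each row.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the positive pair).
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-lip and lip-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With a loss of this shape, aligned audio-lip pairs should score a lower value than deliberately misaligned ones, which is the behavior a synchronization objective needs in order to separate positive from negative sample pairs.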
Pages: 101-110
Number of pages: 10
Related papers
3 in total
  • [1] Multihead Attention-based Audio Image Generation with Cross-Modal Shared Weight Classifier
    Xu, Yiming
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [2] Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
    Hou, Yuanbo
    Yu, Zhesong
    Liang, Xia
    Du, Xingjian
    Zhu, Bilei
    Ma, Zejun
    Botteldooren, Dick
    INTERSPEECH 2021, 2021, : 321 - 325
  • [3] CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis
    Luo, Xiaodong
    Chen, Xiang
    He, Xiaohai
    Qing, Linbo
    Tan, Xinyue
    KNOWLEDGE-BASED SYSTEMS, 2022, 255