CMFF-Face: Attention-Based Cross-Modal Feature Fusion for High-Quality Audio-Driven Talking Face Generation

Cited by: 0

Authors:

Zhao, Guangzhe [1]

Liu, Yanan [1]

Wang, Xueping [1]

Yan, Feihu [1]

Affiliations:

[1] Beijing Univ Civil Engn & Architecture, Sch Elect & Informat Engn, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024 | 2024

Funding:

U.S. National Science Foundation;

Keywords:

Talking Face Generation; Cross-Modal Feature Fusion; Attention Mechanism; Lip Synchronization; High-Quality Face;

DOI:

10.1145/3652583.3658055

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Audio-driven talking face generation creates lip-synchronized, high-quality face videos from given audio and target face images. The task is challenging because of the inherent modal gap between audio and face images. To address this issue, we propose an attention-based Cross-Modal Feature Fusion network for talking Face generation, called CMFF-Face. Specifically, we introduce a cross-modal feature fusion generator that incorporates a fusion process in each convolutional encoder layer, allowing layer-wise fusion of interacting audio and face features to generate high-quality talking faces. Additionally, a lip synchronization discriminator is designed to improve audio-lip synchronization; it uses a two-branch cross-attention mechanism to capture the associations between synchronized audio and face more effectively. Finally, we employ a CLIP-based audio-lip synchronization loss that helps distinguish between positive and negative sample pairs to enhance lip synchronization. Comprehensive experiments on the LRS2 and LRW datasets demonstrate that our method outperforms state-of-the-art methods in terms of lip synchronization and visual quality.
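
The abstract does not give the form of the CLIP-based audio-lip synchronization loss. As an illustration only, the sketch below assumes it resembles the symmetric contrastive objective used by CLIP: within a batch, the audio and lip embeddings at the same index are treated as the positive (synchronized) pair and all other combinations as negatives. The function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_sync_loss(audio_emb, lip_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: audio_emb[i] and lip_emb[i] form a synchronized (positive)
    pair; every other combination in the batch is treated as a negative.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    lips = [l2_normalize(v) for v in lip_emb]
    n = len(audio)
    # Cosine-similarity logits, scaled by the (assumed) temperature.
    logits = [
        [sum(x * y for x, y in zip(a, v)) / temperature for v in lips]
        for a in audio
    ]
    # Audio-to-lip direction: the correct column for row i is column i.
    a2l = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Lip-to-audio direction: same computation on the transposed logits.
    transposed = [list(col) for col in zip(*logits)]
    l2a = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (a2l + l2a)
```

Under these assumptions the loss is small when each audio embedding is most similar to its own lip embedding and grows when the pairing is broken, which is the positive-versus-negative separation the abstract describes.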


Pages: 101-110

Page count: 10

3 items in total

[1] Multihead Attention-based Audio Image Generation with Cross-Modal Shared Weight Classifier
Xu, Yiming
2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
[2] Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Hou, Yuanbo; Yu, Zhesong; Liang, Xia; Du, Xingjian; Zhu, Bilei; Ma, Zejun; Botteldooren, Dick
INTERSPEECH 2021, 2021: 321-325
[3] CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis
Luo, Xiaodong; Chen, Xiang; He, Xiaohai; Qing, Linbo; Tan, Xinyue
KNOWLEDGE-BASED SYSTEMS, 2022, 255
