Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation

Cited by: 0

Authors:

Liu, Debang [1]

Zhang, Tianqi [1]

Christensen, Mads Graesboll [2]

Yi, Chen [1]

An, Zeliang [1]

Affiliations:

[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China

[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Feature extraction; Computational modeling; Time-domain analysis; Convolution; Context modeling; Speech enhancement; Audio-visual multimodal fusion; speech separation; attention mechanism; time-domain; NEURAL-NETWORKS; ENHANCEMENT; INFORMATION;

DOI:

10.1109/TASLP.2024.3463411

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Current audio-visual speech separation methods use the correlation between a speaker's audio and visual information to help separate the target speaker's speech. However, these methods commonly obtain the fused audio-visual features by concatenating features and applying a linear mapping, which motivates a deeper exploration of audio-visual fusion. In this paper, based on the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with temporal convolutional attention network for speech separation (AVTCA). In this method, we design a temporal convolutional attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct the sequence learning and fusion network. In the overall deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then use TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. The fused audio-visual feature sequences are then fed to the separation network to predict masks and separate each speaker's source. Finally, comparative experiments on the Vox2, GRID, LRS2 and TCD-TIMIT datasets indicate that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it is more efficient in terms of computation and model size.
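The cross-attention and masking steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' TCANet or AVTCA implementation: it uses generic scaled dot-product attention without learned projections, assumes that audio frames act as queries over visual frames, and assumes a simple residual sum for fusion. All function names (`cross_attention`, `fuse`, `apply_mask`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame attends over all key frames.

    queries is a list of T_q vectors; keys and values are lists of T_k vectors.
    Learned projection matrices are omitted (assumption, for brevity).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query frame.
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended


def fuse(audio, visual):
    """Residual fusion (assumption): audio frames plus their attended visual context."""
    context = cross_attention(audio, visual, visual)
    return [
        [a + c for a, c in zip(a_vec, c_vec)]
        for a_vec, c_vec in zip(audio, context)
    ]


def apply_mask(features, mask):
    """Element-wise masking of encoded mixture features with a predicted mask."""
    return [
        [f * m for f, m in zip(f_vec, m_vec)]
        for f_vec, m_vec in zip(features, mask)
    ]
```

For instance, `fuse(audio, visual)` with three audio frames and two visual frames of the same feature dimension returns three fused frames, so the two sequences do not need the same length. In a full separation model the mask passed to `apply_mask` would come from a trained separation network rather than being supplied by hand.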

Pages: 4647-4660

Page count: 14
