Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation

Times Cited: 0
Authors
Liu, Debang [1]
Zhang, Tianqi [1]
Christensen, Mads Græsbøll [2]
Yi, Chen [1]
An, Zeliang [1]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China
[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Computational modeling; Time-domain analysis; Convolution; Context modeling; Speech enhancement; Audio-visual multimodal fusion; speech separation; attention mechanism; time-domain; NEURAL-NETWORKS; ENHANCEMENT; INFORMATION;
DOI
10.1109/TASLP.2024.3463411
CLC Number
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Current audio-visual speech separation methods exploit the correlation between a speaker's audio and visual streams to help separate the target speaker's speech. However, these methods commonly obtain the fused audio-visual features through simple feature concatenation followed by a linear mapping, which motivates a deeper exploration of audio-visual fusion. In this paper, drawing on the movements of the speaker's mouth landmarks during speech, we propose a novel time-domain, single-channel audio-visual speech separation method: the audio-visual fusion with temporal convolutional attention network for speech separation model (AVTCA). In this method, we design a temporal convolutional attention network (TCANet), built on the attention mechanism, to model the contextual relationships between audio and visual sequences, and we use the TCANet as the basic unit of the sequence learning and fusion network. Within the overall separation framework, we first apply cross attention to capture the cross-correlation between the audio and visual sequences, and then use the TCANet to fuse the audio-visual feature sequences while preserving their temporal dependencies and cross-correlations. The fused audio-visual feature sequences are then fed into the separation network, which predicts a mask for each speaker and separates the corresponding source. Finally, comparative experiments on the Vox2, GRID, LRS2, and TCD-TIMIT datasets show that AVTCA outperforms other state-of-the-art (SOTA) separation methods, while offering greater efficiency in computational cost and model size.
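To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of its three stages: cross attention between the audio and visual feature sequences, a TCANet-style block combining a dilated depthwise temporal convolution with self-attention, and mask-based separation of the mixture encoding. This is not the authors' implementation; the module names (CrossAttentionFusion, TemporalConvAttentionBlock, MaskSeparator), tensor layout, dimensions, dilation schedule, and the sigmoid mask head are all illustrative assumptions.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Audio queries attend over visual keys/values (hypothetical layout)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (B, T, D) time-domain encoder features
        # visual: (B, T, D) mouth-landmark features brought to the audio frame rate
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual connection

class TemporalConvAttentionBlock(nn.Module):
    """One TCANet-style unit: dilated depthwise conv for temporal context,
    then self-attention over the sequence (an assumed arrangement)."""
    def __init__(self, dim: int, dilation: int = 1, num_heads: int = 4):
        super().__init__()
        # padding = dilation keeps the sequence length unchanged for kernel_size=3
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=dilation,
                              dilation=dilation, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); Conv1d expects (B, D, T)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        y, _ = self.attn(y, y, y)
        return self.norm(x + y)

class MaskSeparator(nn.Module):
    """Predicts one sigmoid mask per speaker over the mixture encoding."""
    def __init__(self, dim: int, num_speakers: int = 2):
        super().__init__()
        self.num_speakers = num_speakers
        self.proj = nn.Linear(dim, dim * num_speakers)

    def forward(self, fused: torch.Tensor, mix_enc: torch.Tensor) -> torch.Tensor:
        B, T, D = mix_enc.shape
        masks = torch.sigmoid(self.proj(fused)).view(B, T, self.num_speakers, D)
        return masks * mix_enc.unsqueeze(2)  # (B, T, S, D) masked encodings

# Usage: fuse, refine with stacked TCANet-style blocks, then mask the mixture.
B, T, D = 2, 100, 64
audio, visual = torch.randn(B, T, D), torch.randn(B, T, D)
fusion = CrossAttentionFusion(D)
blocks = nn.Sequential(*[TemporalConvAttentionBlock(D, d) for d in (1, 2, 4)])
x = blocks(fusion(audio, visual))
separated = MaskSeparator(D)(x, audio)
print(separated.shape)  # torch.Size([2, 100, 2, 64])

The residual connections and layer normalization here are common design choices for stacked attention blocks, not details taken from the paper; a decoder mapping the masked encodings back to waveforms is omitted.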
Pages: 4647-4660
Number of Pages: 14