Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation

Cited by: 0
Authors:
Liu, Debang [1 ]
Zhang, Tianqi [1 ]
Christensen, Mads Graesboll [2 ]
Yi, Chen [1 ]
An, Zeliang [1 ]
Affiliations:
[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China
[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark
Funding: National Natural Science Foundation of China
Keywords:
Visualization; Feature extraction; Computational modeling; Time-domain analysis; Convolution; Context modeling; Speech enhancement; Audio-visual multimodal fusion; speech separation; attention mechanism; time-domain; NEURAL-NETWORKS; ENHANCEMENT; INFORMATION;
DOI: 10.1109/TASLP.2024.3463411
CLC Number: O42 [Acoustics]
Subject Classification Codes: 070206; 082403
Abstract
Current audio-visual speech separation methods exploit the correlation between a speaker's audio and visual streams to help separate the target speaker's speech. However, they commonly obtain the fused audio-visual features through simple feature concatenation followed by a linear mapping, which motivates a deeper exploration of audio-visual fusion. Therefore, guided by the movements of the speaker's mouth landmarks during speech, this paper proposes a novel time-domain single-channel audio-visual speech separation method: the audio-visual fusion with temporal convolutional attention network for speech separation model (AVTCA). In this method, we design a temporal convolutional attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use the TCANet as the basic unit to construct the sequence learning and fusion network. In the overall separation framework, cross attention first captures the cross-correlation information between the audio and visual sequences, and the TCANet then fuses the audio-visual feature sequences while preserving their temporal dependencies and cross-correlations. The fused audio-visual feature sequences are then fed into the separation network to predict a mask for each speaker and reconstruct each speaker's source. Finally, comparative experiments on the Vox2, GRID, LRS2 and TCD-TIMIT datasets indicate that AVTCA outperforms other state-of-the-art (SOTA) separation methods while offering greater efficiency in computational cost and model size.
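As a rough illustration of the fusion strategy described in the abstract, the sketch below pairs a cross-attention step (audio queries attending to visual keys/values) with a depthwise temporal convolution block, in the spirit of a TCANet-style unit. The module name AVCrossFusion, the layer sizes, and the exact block structure are illustrative assumptions and do not reproduce the paper's architecture.

```python
# Minimal, illustrative sketch of cross-attention audio-visual fusion followed by a
# temporal convolution block. Names, dimensions, and block layout are assumptions for
# illustration only, not the AVTCA/TCANet implementation from the paper.
import torch
import torch.nn as nn


class AVCrossFusion(nn.Module):
    def __init__(self, dim=256, heads=4, kernel_size=3):
        super().__init__()
        # Audio queries attend to visual keys/values (cross-correlation step).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Depthwise temporal convolution to model local temporal dependencies,
        # followed by a pointwise convolution to mix channels.
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.PReLU(),
            nn.Conv1d(dim, dim, 1),
        )

    def forward(self, audio, visual):
        # audio:  (batch, T, dim) audio feature sequence
        # visual: (batch, T, dim) visual (mouth-landmark) features, time-aligned to audio
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        fused = self.norm(audio + fused)  # residual connection + layer norm
        fused = self.temporal_conv(fused.transpose(1, 2)).transpose(1, 2)
        return fused  # (batch, T, dim) fused audio-visual features


if __name__ == "__main__":
    # Random tensors stand in for encoder outputs at the audio frame rate.
    a = torch.randn(2, 100, 256)
    v = torch.randn(2, 100, 256)
    print(AVCrossFusion()(a, v).shape)  # torch.Size([2, 100, 256])
```

In the paper's framework, features fused in this manner are passed to the separation network, which estimates a mask per speaker; the sketch only covers the fusion stage.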
Pages: 4647-4660 (14 pages)