Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation

Times Cited: 0
Authors
Liu, Debang [1 ]
Zhang, Tianqi [1 ]
Christensen, Mads Graesboll [2 ]
Yi, Chen [1 ]
An, Zeliang [1 ]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China
[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Computational modeling; Time-domain analysis; Convolution; Context modeling; Speech enhancement; Audio-visual multimodal fusion; speech separation; attention mechanism; time-domain; NEURAL-NETWORKS; ENHANCEMENT; INFORMATION;
DOI
10.1109/TASLP.2024.3463411
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Currently, audio-visual speech separation methods exploit the correlation between a speaker's audio and visual information to help separate the target speaker's speech. However, these methods commonly obtain the fused audio-visual features through feature concatenation followed by a linear mapping, which motivates a deeper exploration of audio-visual fusion. Therefore, in this paper, guided by the movements of the speaker's mouth landmarks during speech, we propose a novel time-domain single-channel audio-visual speech separation method: the audio-visual fusion with temporal convolutional attention network for speech separation model (AVTCA). In this method, we design a temporal convolutional attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and we use the TCANet as the basic unit to construct the sequence learning and fusion network. In the overall separation framework, we first use cross attention to capture the cross-correlation between the audio and visual sequences, and then use the TCANet to fuse the audio-visual feature sequences while modeling their temporal dependencies and cross-correlations. The fused audio-visual feature sequences are then fed into the separation network to predict a mask and separate each speaker's source. Finally, comparative experiments on the Vox2, GRID, LRS2, and TCD-TIMIT datasets show that AVTCA outperforms other state-of-the-art (SOTA) separation methods while offering greater efficiency in computation and model size.
Pages: 4647-4660
Number of Pages: 14
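
The abstract outlines a three-stage pipeline: cross attention over the audio and visual sequences, TCANet-based fusion of the resulting features, and mask prediction for each speaker. The PyTorch sketch below illustrates that general shape only; the module names (CrossAttentionFusion, TCANetBlock, Separator), the depthwise-convolution-plus-self-attention layout, and all hyperparameters are illustrative assumptions, not the authors' implementation.

# Minimal sketch, assuming the pipeline shape described in the abstract.
# All names and hyperparameters here are illustrative, not from the paper.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Attend audio queries over visual keys/values to capture their cross-correlation."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim); the visual stream is assumed to be
        # upsampled to the audio frame rate beforehand.
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual connection


class TCANetBlock(nn.Module):
    """Depthwise temporal convolution followed by self-attention (assumed layout)."""
    def __init__(self, dim, kernel_size=3, dilation=1, n_heads=4):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=pad,
                              dilation=dilation, groups=dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + y)     # local temporal context
        y, _ = self.attn(x, x, x)
        return self.norm2(x + y)  # long-range dependencies


class Separator(nn.Module):
    """Cross-attention fusion, stacked TCANet-style blocks, then per-speaker masks."""
    def __init__(self, dim=256, n_blocks=4, n_speakers=2):
        super().__init__()
        self.fuse = CrossAttentionFusion(dim)
        self.blocks = nn.Sequential(*[TCANetBlock(dim, dilation=2 ** i)
                                      for i in range(n_blocks)])
        self.mask = nn.Linear(dim, dim * n_speakers)
        self.n_speakers = n_speakers

    def forward(self, audio_feats, visual_feats):
        x = self.blocks(self.fuse(audio_feats, visual_feats))
        masks = torch.sigmoid(self.mask(x))                    # (B, T, dim * S)
        masks = masks.view(*x.shape[:2], self.n_speakers, -1)  # (B, T, S, dim)
        # Masked encoder features, one stream per speaker, for a waveform decoder.
        return masks * audio_feats.unsqueeze(2)


# Toy usage: 200 frames of 256-dim fused features, two speakers.
out = Separator()(torch.randn(1, 200, 256), torch.randn(1, 200, 256))
print(out.shape)  # torch.Size([1, 200, 2, 256])

The dilated depthwise convolutions widen the temporal receptive field block by block, while the self-attention sub-layer covers long-range dependencies; both are common choices for time-domain separation backbones and stand in here for whatever internal structure TCANet actually uses.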