Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation

Cited: 0
Authors
Liu, Debang [1 ]
Zhang, Tianqi [1 ]
Christensen, Mads Graesboll [2 ]
Wei, Ying [1 ]
An, Zeliang [1 ]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China
[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark
Source
INTERSPEECH 2023
Funding
National Natural Science Foundation of China;
Keywords
audio-visual fusion; time-domain; speech separation; temporal convolutional attention; training cost;
DOI
10.21437/Interspeech.2023-801
CLC Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Audio-only speech separation methods cannot fully exploit the audio-visual correlations of a speaker, which limits their separation performance. In addition, audio-visual separation methods usually fuse audio and visual features by the traditional recipe of feature concatenation followed by a linear mapping, an approach that leaves the fusion process itself largely unexamined. Therefore, in this paper, drawing on the movements of the speaker's mouth landmarks, we propose a time-domain audio-visual temporal convolutional attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better capture the contextual dependencies of temporal sequences. We then build the separation model from a sequence-learning and fusion network composed of MTCA blocks. AVTA achieves competitive performance on different datasets and, compared with the baseline methods, strikes a better balance among training cost, computational complexity, and separation performance.
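The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the general idea behind a multiscale temporal convolutional attention block: parallel dilated 1-D convolutions gather context at several time scales, and a learned per-frame gate re-weights the fused features along the time axis. The module name MTCABlock, the dilation rates, the depthwise-convolution choice, and the sigmoid gating are all assumptions for illustration, not the authors' published design.

import torch
import torch.nn as nn

class MTCABlock(nn.Module):
    """Illustrative multiscale temporal convolutional attention block.

    NOT the paper's published architecture: a minimal sketch of the idea
    named in the abstract. Dilation rates, depthwise convolutions, and
    the sigmoid gate are assumptions made for this example.
    """

    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One depthwise dilated conv per time scale; padding preserves length.
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=d, padding=d, groups=channels)
            for d in dilations
        ])
        # 1x1 conv merges the concatenated multiscale features.
        self.merge = nn.Conv1d(channels * len(dilations), channels, 1)
        # Temporal attention: a per-frame gate in (0, 1).
        self.gate = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        multiscale = torch.cat([branch(x) for branch in self.branches], dim=1)
        fused = self.merge(multiscale)
        attended = fused * self.gate(fused)  # re-weight each time step
        return self.norm(x + attended)       # residual connection

if __name__ == "__main__":
    block = MTCABlock(channels=64)
    feats = torch.randn(2, 64, 200)   # dummy (batch, channel, frame) features
    print(block(feats).shape)         # torch.Size([2, 64, 200])

In an AVTA-style model, blocks of this kind would presumably process both the audio encoding and a mouth-landmark embedding before fusion, but the abstract does not specify that wiring.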
Pages: 3694-3698
Page count: 5
Related Papers
50 records in total
  • [1] Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation
    Liu, Debang
    Zhang, Tianqi
    Christensen, Mads Graesboll
    Yi, Chen
    An, Zeliang
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32: 4647-4660
  • [2] TIME-DOMAIN AUDIO-VISUAL SPEECH SEPARATION ON LOW QUALITY VIDEOS
    Wu, Yifei
    Li, Chenda
    Bai, Jinfeng
    Wu, Zhongqin
    Qian, Yanmin
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 256-260
  • [3] Efficient audio-visual information fusion using encoding pace synchronization for Audio-Visual Speech Separation
    Xu, Xinmeng
    Tu, Weiping
    Yang, Yuhong
    INFORMATION FUSION, 2025, 115
  • [4] DEEP AUDIO-VISUAL SPEECH SEPARATION WITH ATTENTION MECHANISM
    Li, Chenda
    Qian, Yanmin
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2020: 7314-7318
  • [5] Audio-Visual Speech Enhancement Based on Multiscale Features and Parallel Attention
    Jia, Shifan
    Zhang, Xinman
    Han, Weiqi
    2024 23RD INTERNATIONAL SYMPOSIUM INFOTEH-JAHORINA (INFOTEH), 2024
  • [6] TIME DOMAIN AUDIO VISUAL SPEECH SEPARATION
    Wu, Jian
    Xu, Yong
    Zhang, Shi-Xiong
    Chen, Lian-Wu
    Yu, Meng
    Xie, Lei
    Yu, Dong
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2019: 667-673
  • [7] Audio-visual speech processing and attention
    Sams, M
    PSYCHOPHYSIOLOGY, 2003, 40: S5-S6
  • [8] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
    Wei, Jie
    Hu, Guanyu
    Yang, Xinyu
    Luu, Anh Tuan
    Dong, Yizhuo
    INTERSPEECH 2022, 2022: 1988-1992
  • [9] PERFORMANCE STUDY OF A CONVOLUTIONAL TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME SPEECH DENOISING
    Sonning, Samuel
    Schüldt, Christian
    Erdogan, Hakan
    Wisdom, Scott
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2020: 831-835
  • [10] FaceFilter: Audio-visual speech separation using still images
    Chung, Soo-Whan
    Choe, Soyeon
    Chung, Joon Son
    Kang, Hong-Goo
    INTERSPEECH 2020, 2020: 3481-3485