Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation

Cited by: 0
Authors
Liu, Debang [1 ]
Zhang, Tianqi [1 ]
Christensen, Mads Graesboll [2 ]
Wei, Ying [1 ]
An, Zeliang [1 ]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China
[2] Aalborg Univ, Audio Anal Lab, CREATE, DK-9000 Aalborg, Denmark
Source
INTERSPEECH 2023
Funding
National Natural Science Foundation of China
Keywords
audio-visual fusion; time-domain; speech separation; temporal convolutional attention; training cost;
DOI
10.21437/Interspeech.2023-801
CLC Number
O42 [Acoustics]
Discipline Code
070206; 082403
Abstract
Audio-only speech separation methods cannot fully exploit the audio-visual correlation information of speakers, which limits separation performance. Moreover, audio-visual separation methods usually fuse audio and visual features through the traditional approach of feature concatenation followed by linear mapping, which leaves the fusion process itself underexplored. Therefore, in this paper, drawing on the changes of speaker mouth landmarks, we propose a time-domain audio-visual temporal convolutional attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better capture the contextual dependencies of time sequences. We then build a separation model for the speech separation task from a sequence-learning and fusion network composed of MTCA. AVTA achieves competitive performance on different datasets and, compared with baseline methods, strikes a better balance between training cost, computational complexity, and separation performance.
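The abstract names two components, a multiscale temporal convolutional attention block and an audio-visual fusion network, without implementation details. Below is a minimal, hypothetical PyTorch sketch of what such a pair could look like; the module names (MTCA, AVFusion), the dilation factors, and the feature dimensions are illustrative assumptions, not the authors' published architecture.

```python
# Hypothetical sketch: multiscale temporal convolutional attention (MTCA)
# and an audio-visual fusion step, loosely following the abstract.
import torch
import torch.nn as nn

class MTCA(nn.Module):
    """Multiscale temporal convolutional attention (sketch).

    Parallel dilated 1-D convolutions gather context at several time
    scales; their summed output drives a sigmoid gate (attention map)
    applied to the input sequence.
    """
    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=d, padding=d)  # padding=d keeps length T
            for d in dilations
        )
        self.gate = nn.Sequential(nn.Conv1d(channels, channels, 1),
                                  nn.Sigmoid())

    def forward(self, x):            # x: (batch, channels, time)
        ctx = sum(branch(x) for branch in self.branches)
        return x * self.gate(ctx)    # attention-weighted features


class AVFusion(nn.Module):
    """Fuse audio features with visual (mouth-landmark) features (sketch)."""
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.visual_proj = nn.Conv1d(visual_dim, audio_dim, 1)
        self.mtca = MTCA(2 * audio_dim)
        self.out = nn.Conv1d(2 * audio_dim, audio_dim, 1)

    def forward(self, audio, visual):
        # audio:  (batch, audio_dim,  T)  encoder output of the mixture
        # visual: (batch, visual_dim, T)  mouth-landmark features, assumed
        #         already upsampled to the audio frame rate
        v = self.visual_proj(visual)
        fused = torch.cat([audio, v], dim=1)
        return self.out(self.mtca(fused))


# Smoke test with toy shapes.
fusion = AVFusion(audio_dim=256, visual_dim=64)
a = torch.randn(2, 256, 300)
v = torch.randn(2, 64, 300)
print(fusion(a, v).shape)  # torch.Size([2, 256, 300])
```

Gating the concatenated streams through MTCA, rather than only projecting the concatenation linearly, is one way to weight audio against visual evidence per time step, which is the kind of fusion the abstract argues for over plain feature splicing and linear mapping.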
Pages: 3694-3698
Page count: 5
Related Papers
50 records in total
  • [31] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
    Tamura, Satoshi
    Ishikawa, Masato
    Hashiba, Takashi
    Takeuchi, Shin'ichi
    Hayamizu, Satoru
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
  • [32] Speech Pattern Discovery using Audio-Visual Fusion and Canonical Correlation Analysis
    Xie, Lei
    Xu, Yinqing
    Zheng, Lilei
    Huang, Qiang
    Li, Bingfeng
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2371 - 2374
  • [33] Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training
    Zhang, Peng
    Xu, Jiaming
    Shi, Jing
    Hao, Yunzhe
    Qin, Lei
    Xu, Bo
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [34] AUDIO-VISUAL SPEECH SEPARATION USING CROSS-MODAL CORRESPONDENCE LOSS
    Makishima, Naoki
    Ihori, Mana
    Takashima, Akihiko
    Tanaka, Tomohiro
    Orihashi, Shota
    Masumura, Ryo
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6673 - 6677
  • [35] AUDIO-VISUAL SPEECH RECOGNITION WITH A HYBRID CTC/ATTENTION ARCHITECTURE
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 513 - 520
  • [36] Multi-Kernel Attention Encoder For Time-Domain Speech Separation
    Liu, Zengrun
    Shi, Diya
    Wei, Ying
    2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [37] Audio-Visual Fusion Based on Interactive Attention for Person Verification
    Jing, Xuebin
    He, Liang
    Song, Zhida
    Wang, Shaolei
    SENSORS, 2023, 23 (24)
  • [38] Robust Audio-Visual Speech Recognition Based on Hybrid Fusion
    Liu, Hong
    Li, Wenhao
    Yang, Bing
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7580 - 7586
  • [39] Audio-Visual Fusion for Sound Source Localization and Improved Attention
    Lee, Byoung-gi
    Choi, JongSuk
    Yoon, SangSuk
    Choi, Mun-Taek
    Kim, Munsang
    Kim, Daijin
    TRANSACTIONS OF THE KOREAN SOCIETY OF MECHANICAL ENGINEERS A, 2011, 35 (07) : 737 - 743
  • [40] Attention-Based Audio-Visual Fusion for Video Summarization
    Fang, Yinghong
    Zhang, Junpeng
    Lu, Cewu
    NEURAL INFORMATION PROCESSING (ICONIP 2019), PT II, 2019, 11954 : 328 - 340