An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

Cited by: 1
Authors
Sun, Zhongbo [1 ]
Wang, Yannan [2 ]
Cao, Li [1 ]
Affiliations
[1] Tsinghua University, Department of Automation, Beijing, China
[2] Tencent, Media Lab, Shenzhen, China
Keywords
Speech enhancement; Audio-visual; Attention mechanism; Deep learning; Noise
DOI
10.1007/978-3-030-37734-2_60
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Speech enhancement aims to improve speech quality in noisy environments. While most speech enhancement methods use only audio as input, incorporating visual information can yield better results. In this paper, we present an attention-based, speaker-independent audio-visual deep learning model for single-channel speech enhancement. We apply both time-wise attention and spatial attention in the video feature extraction module so that the network focuses on the most informative features. Audio and video features are then concatenated along the time dimension to form the audio-visual features. The proposed video feature extraction module can be attached to an audio-only model without extensive modification. Experimental results show that the proposed method outperforms recent audio-visual speech enhancement methods.
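The abstract's pipeline (attend over video features, then fuse with time-aligned audio features) can be sketched in NumPy. This is an illustrative sketch only: the paper's actual model is a learned deep network, while here a fixed random projection stands in for learned attention parameters, and all shapes and names (T, D, H, W, C) are assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def time_wise_attention(video_feats, rng):
    # video_feats: (T, D), one feature vector per video frame.
    # Score each frame and re-weight so informative frames dominate.
    w = rng.standard_normal(video_feats.shape[1])  # stand-in for learned weights
    scores = softmax(video_feats @ w)              # (T,), sums to 1 over time
    return video_feats * scores[:, None]

def spatial_attention(frame_map):
    # frame_map: (H, W, C) feature map of a single frame.
    # Weight each spatial location by a softmax over its mean activation.
    scores = softmax(frame_map.mean(axis=-1).ravel()).reshape(frame_map.shape[:2])
    return frame_map * scores[..., None]

def fuse_audio_visual(audio_feats, video_feats):
    # Align the two streams frame-by-frame and concatenate their feature
    # vectors at each time step (one reading of the abstract's
    # "concatenated along the time dimension").
    assert audio_feats.shape[0] == video_feats.shape[0]
    return np.concatenate([audio_feats, video_feats], axis=1)

rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 257))  # e.g. 100 frames of spectral features
video = rng.standard_normal((100, 64))   # 100 frames of video embeddings
fused = fuse_audio_visual(audio, time_wise_attention(video, rng))
print(fused.shape)  # (100, 321)
```

In the paper's setting the attention weights would be produced by trained layers and the fused features fed to an enhancement network; this sketch only shows the shape bookkeeping of the fusion step.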
Pages: 722-728
Number of pages: 7
Related Papers
50 records total
  • [1] Ephrat, Ariel; Mosseri, Inbar; Lang, Oran; Dekel, Tali; Wilson, Kevin; Hassidim, Avinatan; Freeman, William T.; Rubinstein, Michael. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. ACM Transactions on Graphics, 2018, 37(4).
  • [2] Morrone, Giovanni; Pasa, Luca; Tikhanoff, Vadim; Bergamaschi, Sonia; Fadiga, Luciano; Badino, Leonardo. Face Landmark-Based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6900-6904.
  • [3] Wang, Jing; Luo, Yiyu; Yi, Weiming; Xie, Xiang. Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments. IEICE Transactions on Information and Systems, 2022, E105D(4), pp. 766-777.
  • [4] Zhang, Y.; Levinson, S.; Huang, T. Speaker Independent Audio-Visual Speech Recognition. 2000 IEEE International Conference on Multimedia and Expo (ICME), 2000, pp. 1073-1076.
  • [5] Liang, L. H.; Liu, X. X.; Zhao, Y. B.; Pi, X. B.; Nefian, A. V. Speaker Independent Audio-Visual Continuous Speech Recognition. IEEE International Conference on Multimedia and Expo (ICME), 2002, pp. A25-A28.
  • [6] Michelsanti, Daniel; Tan, Zheng-Hua; Zhang, Shi-Xiong; Xu, Yong; Yu, Meng; Yu, Dong; Jensen, Jesper. An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29, pp. 1368-1396.
  • [7] Michelsanti, Daniel; Tan, Zheng-Hua; Sigurdsson, Sigurdur; Jensen, Jesper. Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect. Speech Communication, 2019, 115, pp. 38-50.
  • [8] Jia, Shifan; Zhang, Xinman; Han, Weiqi. Audio-Visual Speech Enhancement Based on Multiscale Features and Parallel Attention. 2024 23rd International Symposium INFOTEH-JAHORINA (INFOTEH), 2024.
  • [9] Li, Chenda; Qian, Yanmin. Deep Audio-Visual Speech Separation with Attention Mechanism. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7314-7318.
  • [10] Morrone, Giovanni; Michelsanti, Daniel; Tan, Zheng-Hua; Jensen, Jesper. Audio-Visual Speech Inpainting with Deep Learning. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6653-6657.