Audio-Visual Speech Enhancement Based on Multiscale Features and Parallel Attention

Cited by: 0
Authors
Jia, Shifan [1 ]
Zhang, Xinman [1 ]
Han, Weiqi [2 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Automat Sci & Engn, Xian, Shaanxi, Peoples R China
[2] Zhengzhou Univ, Sch Management, Zhengzhou, Peoples R China
Keywords
speech enhancement; time-frequency domain; audio-visual; multiscale features; attention;
DOI
10.1109/INFOTEH60418.2024.10495981
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Audio-visual speech enhancement (AVSE) uses visual information to assist noise reduction when performing speech enhancement in multimodal scenes. In the AVSE task, especially in low signal-to-noise-ratio scenarios, lip movements play an important role in hearing; building on this observation, we design a more effective model to improve AVSE performance. In this paper, we propose an innovative AVSE model that assists speech enhancement by extracting visual features. The network consists of three main parts. First, ResNet-18, a feature pyramid network (FPN), and coordinate attention (CA) modules are combined to extract multi-scale visual features. Second, multi-scale speech features are extracted by a dual-branch structure combining dilated convolution and cascaded convolution, and the temporal dynamics are modeled with a temporal convolutional network (TCN) module. Finally, for the fused audio-visual features, time- and frequency-domain features are extracted with a parallel conformer module to better aggregate the global and local information of the sequence. Experiments on the GRID audio-visual dataset show that the model outperforms common single-channel speech enhancement models, and ablation studies demonstrate the effectiveness of each module.
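To make the three-stage pipeline in the abstract concrete, below is a minimal PyTorch sketch under stated assumptions: the 88x88 lip-crop resolution, the 257-bin spectrogram input, all layer sizes, and the mask-based output head are illustrative choices, not the authors' configuration, and the sketch stands in for (rather than reproduces) the paper's FPN/CA, TCN, and parallel-conformer modules.

```python
# Illustrative sketch of the AVSE pipeline: visual encoder, dual-branch
# multi-scale audio encoder, and a parallel attention/convolution fusion
# block. Shapes and hyperparameters are assumptions for demonstration.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VisualEncoder(nn.Module):
    """ResNet-18 backbone for lip frames; the paper additionally attaches
    FPN and coordinate-attention modules for multi-scale visual features."""
    def __init__(self, dim=256):
        super().__init__()
        net = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(net.children())[:-1])  # drop fc head
        self.proj = nn.Linear(512, dim)

    def forward(self, lips):                        # lips: (B*T, 3, 88, 88)
        return self.proj(self.backbone(lips).flatten(1))   # (B*T, dim)


class DualBranchAudioEncoder(nn.Module):
    """Two parallel conv stacks over spectrogram frames: dilated convolutions
    for a wide receptive field, cascaded convolutions for local detail."""
    def __init__(self, n_bins=257, dim=256):        # 257 STFT bins (assumed)
        super().__init__()
        self.dilated = nn.Sequential(
            nn.Conv1d(n_bins, dim, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=4, dilation=4), nn.ReLU())
        self.cascaded = nn.Sequential(
            nn.Conv1d(n_bins, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU())

    def forward(self, spec):                        # spec: (B, n_bins, T)
        return self.dilated(spec) + self.cascaded(spec)     # (B, dim, T)


class ParallelAttentionBlock(nn.Module):
    """Global self-attention and a local convolution run in parallel and are
    summed, loosely mirroring a conformer's attention/convolution pairing."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, 3, padding=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                           # x: (B, T, dim)
        glob, _ = self.attn(x, x, x)                # global context
        loc = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local context
        return self.norm(x + glob + loc)


# Toy forward pass: 2 clips of 40 frames with 257-bin magnitude spectrograms.
B, T, n_bins = 2, 40, 257
lips = torch.randn(B * T, 3, 88, 88)
spec = torch.randn(B, n_bins, T)

v = VisualEncoder()(lips).view(B, T, -1)            # (B, T, 256)
a = DualBranchAudioEncoder()(spec).transpose(1, 2)  # (B, T, 256)
fused = ParallelAttentionBlock()(torch.cat([v, a], dim=-1))  # (B, T, 512)
mask = torch.sigmoid(nn.Linear(512, n_bins)(fused))          # spectral mask
enhanced = mask.transpose(1, 2) * spec              # masked spectrogram
```

A mask-based head is one common design choice for single-channel enhancement (the network predicts a per-bin gain in [0, 1] applied to the noisy spectrogram); the paper's actual output formulation is not specified in this record.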
Pages: 6