A Time-Frequency Attention Module for Neural Speech Enhancement

Cited by: 13
Authors
Zhang, Qiquan [1 ,2 ,3 ]
Qian, Xinyuan [1 ,4 ]
Ni, Zhaoheng [5 ]
Nicolson, Aaron [6 ]
Ambikairajah, Eliathamby [2 ]
Li, Haizhou [3 ,7 ,8 ,9 ,10 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
[2] Univ New South Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia
[3] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[4] Chinese Univ Hong Kong, Shenzhen 518172, Peoples R China
[5] Meta AI, New York, NY 10003 USA
[6] CSIRO, Australian Hlth Res Ctr, Black Mt, ACT 2601, Australia
[7] Chinese Univ Hong Kong, Guangdong Prov Key Lab Big Data Comp, Shenzhen 518172, Peoples R China
[8] Shenzhen Res Inst Big Data, Shenzhen 51872, Peoples R China
[9] Univ Bremen, D-28359 Bremen, Germany
[10] Kriston AI Lab, Xiamen, Peoples R China
Keywords
Speech enhancement; time-frequency attention; ResTCN; transformer; training targets; self-attention; noise; network; amplitude; suppression; algorithm; masking
DOI
10.1109/TASLP.2022.3225649
CLC Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies on speech enhancement tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution of speech spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet highly effective network module, termed the T-F attention (TFA) module, which uses two parallel attention branches, i.e., time-frame attention and frequency-channel attention, to explicitly exploit position information and generate a 2-D attention map that characterises the salient T-F distribution of speech. We validate the TFA module as part of two widely used backbone networks (the residual temporal convolution network and the Transformer) and conduct speech enhancement with four of the most popular training objectives. Our extensive experiments demonstrate that the proposed TFA module consistently yields substantial performance improvements in terms of the five most widely used objective metrics, with negligible parameter overhead. In addition, we evaluate the efficacy of speech enhancement as a front-end for a downstream speech recognition task. The evaluation results show that the TFA module significantly improves the robustness of the system under noisy conditions.
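The abstract describes two parallel attention branches that pool across opposite axes of the T-F representation and combine into a 2-D attention map applied to the spectrogram. The sketch below illustrates that idea in NumPy; it is not the authors' implementation (the paper's branches use learned layers), and the placeholder scalar weights and mean pooling here are illustrative assumptions only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tfa_sketch(spec, rng=None):
    """Toy T-F attention over a magnitude spectrogram of shape (T, F).

    Each branch pools the spectrogram across the opposite axis, passes the
    pooled statistics through a placeholder scalar weight (a stand-in for
    the learned transforms in the paper), and a sigmoid yields attention
    weights in (0, 1). The outer product of the two weight vectors gives
    the 2-D T-F attention map, applied elementwise to the input.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Time-frame attention branch: pool over frequency -> one weight per frame
    time_stats = spec.mean(axis=1)                  # (T,)
    w_t = rng.standard_normal()                     # placeholder, learned in practice
    time_att = sigmoid(w_t * time_stats)            # (T,), values in (0, 1)
    # Frequency-channel attention branch: pool over time -> one weight per bin
    freq_stats = spec.mean(axis=0)                  # (F,)
    w_f = rng.standard_normal()                     # placeholder, learned in practice
    freq_att = sigmoid(w_f * freq_stats)            # (F,), values in (0, 1)
    # 2-D attention map via outer product, applied elementwise
    att_map = np.outer(time_att, freq_att)          # (T, F)
    return spec * att_map, att_map
```

Because each branch only needs a vector of per-frame or per-bin weights, the module adds almost no parameters relative to the backbone, which is consistent with the "negligible parameter overhead" claim in the abstract.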
Pages: 462 - 475 (14 pages)