Multi-stage Progressive Learning-Based Speech Enhancement Using Time-Frequency Attentive Squeezed Temporal Convolutional Networks

Cited by: 2
Authors
Jannu, Chaitanya [1]
Vanambathina, Sunny Dayal [1]
Affiliations
[1] VIT AP Univ, Sch Elect Engn, Beside AP Secretariat, Amaravati 522237, Andhra Pradesh, India
Keywords
Speech enhancement (SE); Squeezed temporal convolutional networks (S-TCN); Time-frequency attention (TFA); Deep neural network (DNN); Multi-stage learning; NEURAL-NETWORK; SELF-ATTENTION; NOISE; CNN;
DOI
10.1007/s00034-023-02455-7
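Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809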
Abstract
Speech enhancement improves speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech, and several recent studies have examined ways to enhance speech by capturing long-term contextual information. The time-frequency (T-F) distribution of speech spectral components is also important for speech enhancement, but these studies usually ignore it. Multi-stage learning is an effective way to integrate various deep learning modules, and its benefit is that the optimization target can be updated iteratively, stage by stage. This paper investigates speech enhancement with a multi-stage structure in which time-frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. A feature fusion block (FB) is inserted at the input of later stages to reinject the original information, reducing the possibility that speech information is lost in the early stages. The S-TCN blocks are responsible for temporal sequence modelling. The TFA is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterizing the salient T-F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently improves on existing baselines on two widely used objective metrics, PESQ and STOI. The evaluation results also show that the TFA module significantly improves the system's robustness to noise.
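The two architectural ideas named in the abstract can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the record does not describe the internals of the TFA block, so the pooling-plus-sigmoid branches and the outer-product combination are assumptions, and the names `tf_attention_map` and `receptive_field` are hypothetical. The receptive-field formula is the standard one for a stack of 1-D dilated convolutions whose dilation doubles at every layer.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tf_attention_map(spec):
    """Build a 2D attention map from two parallel 1D branches.

    spec: array of shape (T, F) holding magnitude features
          (T time frames, F frequency bins).
    Assumption: each branch is a mean-pooled descriptor passed
    through a sigmoid; a learned model would use trainable layers.
    """
    time_desc = spec.mean(axis=1)  # (T,) pooled over frequency
    freq_desc = spec.mean(axis=0)  # (F,) pooled over time
    time_att = sigmoid(time_desc - time_desc.mean())  # time-frame branch
    freq_att = sigmoid(freq_desc - freq_desc.mean())  # frequency branch
    # Combine the two 1D attention vectors into one (T, F) map.
    return np.outer(time_att, freq_att)


def receptive_field(kernel_size, num_layers):
    """Receptive field (in frames) of stacked dilated 1D convolutions
    with dilation rates 1, 2, 4, ..., 2**(num_layers - 1)."""
    dilations = [2 ** i for i in range(num_layers)]
    return 1 + (kernel_size - 1) * sum(dilations)


# Toy input: 100 frames by 161 frequency bins (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((100, 161)))

att = tf_attention_map(spec)
weighted = spec * att  # attention-weighted features

rf = receptive_field(kernel_size=3, num_layers=6)
```

With a kernel size of 3 and six layers, the dilation rates sum to 63, so `rf` is 1 + 2 * 63 = 127 frames. This shows how exponentially increasing dilation lets a short stack cover long-range context.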
Pages: 7467-7493
Page count: 27
Related papers
50 in total
  • [1] Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks
    Chaitanya Jannu
    Sunny Dayal Vanambathina
    [J]. Circuits, Systems, and Signal Processing, 2023, 42 : 7467 - 7493
  • [2] Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks
    Lin, Ju
    van Wijngaarden, Adriaan J. de Lind
    Wang, Kuang-Ching
    Smith, Melissa C.
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3440 - 3450
  • [3] TIME-FREQUENCY MASKING BASED ONLINE SPEECH ENHANCEMENT WITH MULTI-CHANNEL DATA USING CONVOLUTIONAL NEURAL NETWORKS
    Chakrabarty, Soumitro
    Wang, DeLiang
    Habets, Emanuel A. P.
    [J]. 2018 16TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), 2018, : 476 - 480
  • [4] Speech enhancement using progressive learning-based convolutional recurrent neural network
    Li, Andong
    Yuan, Minmin
    Zheng, Chengshi
    Li, Xiaodong
    [J]. APPLIED ACOUSTICS, 2020, 166
  • [5] Time-Frequency Masking Based Online Multi-Channel Speech Enhancement With Convolutional Recurrent Neural Networks
    Chakrabarty, Soumitro
    Habets, Emanuel A. P.
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2019, 13 (04) : 787 - 799
  • [6] Real time speech enhancement using densely connected neural networks and Squeezed temporal convolutional modules
    Sunny Dayal Vanambathina
    Manaswini Burra
    Bhumika Edupalli
    Eswar Reddy Vallem
    Venkata Sravani Nellore
    [J]. Multimedia Tools and Applications, 2024, 83 : 50289 - 50305
  • [7] Real time speech enhancement using densely connected neural networks and Squeezed temporal convolutional modules
    Vanambathina, Sunny Dayal
    Burra, Manaswini
    Edupalli, Bhumika
    Vallem, Eswar Reddy
    Nellore, Venkata Sravani
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (17) : 50289 - 50305
  • [8] Frequency Gating: Improved Convolutional Neural Networks for Speech Enhancement in the Time-Frequency Domain
    Oostermeijer, Koen
    Wang, Qing
    Du, Jun
    [J]. 2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 465 - 470
  • [9] Time-Frequency Mask-based Speech Enhancement using Convolutional Generative Adversarial Network
    Shah, Neil
    Patil, Hemant A.
    Soni, Meet H.
    [J]. 2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018, : 1246 - 1251
  • [10] Efficient Wavelet Boost Learning-Based Multi-stage Progressive Refinement Network for Underwater Image Enhancement
    Huo, Fushuo
    Li, Bingheng
    Zhu, Xuegui
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 1944 - 1952