Multi-stage Progressive Learning-Based Speech Enhancement Using Time-Frequency Attentive Squeezed Temporal Convolutional Networks

Cited by: 2
Authors
Jannu, Chaitanya [1]
Vanambathina, Sunny Dayal [1]
Affiliations
[1] VIT AP Univ, Sch Elect Engn, Beside AP Secretariat, Amaravati 522237, Andhra Pradesh, India
Keywords
Speech enhancement (SE); Squeezed temporal convolutional networks (S-TCN); Time-frequency attention (TFA); Deep neural network (DNN); Multi-stage learning; NEURAL-NETWORK; SELF-ATTENTION; NOISE; CNN
DOI
10.1007/s00034-023-02455-7
Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract
Speech enhancement is an important method for improving speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech. Several recent studies have examined ways to enhance speech by capturing long-term contextual information. The time-frequency (T-F) distribution of speech spectral components is also important for speech enhancement, but it is usually ignored in these studies. Multi-stage learning is an effective way to integrate several deep learning modules, and its benefit is that the optimization target can be updated iteratively, stage by stage. In this paper, speech enhancement is investigated using a multi-stage structure in which time-frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. A feature fusion block (FB) is inserted at the input of the later stages to reinject the original information, which reduces the possibility that speech information is lost in the early stages. The S-TCN blocks are responsible for temporal sequence modelling. The TFA block is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterizing the salient T-F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently improves on existing baselines across two widely used objective metrics, PESQ and STOI. The evaluation results also show that the TFA module significantly improves the system's robustness to noise.
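The two mechanisms the abstract names can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the kernel size, the use of plain average pooling followed by a sigmoid in place of learned attention layers, and the combination of the two branches by outer product are all assumptions made for illustration.

```python
import math


def dilation_schedule(num_layers):
    """Exponentially increasing dilation rates: 1, 2, 4, ..."""
    return [2 ** i for i in range(num_layers)]


def receptive_field(kernel_size, dilations):
    """Frames covered by a stack of stride-1 dilated 1-D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tfa_map(spec):
    """Build a 2D attention map from two parallel 1-D branches.

    spec: list of T frames, each a list of F magnitudes.
    ASSUMPTION: each branch is average pooling plus a sigmoid; the paper's
    learned layers are omitted, and the outer product is an assumed way of
    merging the branches into a T x F map.
    """
    num_frames, num_bins = len(spec), len(spec[0])
    # Time-frame branch: pool over frequency, one weight per frame.
    time_att = [sigmoid(sum(frame) / num_bins) for frame in spec]
    # Frequency branch: pool over time, one weight per frequency bin.
    freq_att = [
        sigmoid(sum(spec[t][f] for t in range(num_frames)) / num_frames)
        for f in range(num_bins)
    ]
    # Outer product of the two branches gives the T x F attention map.
    return [[ta * fa for fa in freq_att] for ta in time_att]


if __name__ == "__main__":
    rates = dilation_schedule(4)
    print(rates, receptive_field(3, rates))
    print(tfa_map([[0.0, 0.0], [0.0, 0.0]]))
```

With a hypothetical kernel size of 3, each added layer roughly doubles the temporal context, which is why exponentially increasing dilation suits long-range dependency modelling at low cost.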
Pages: 7467-7493
Page count: 27