WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-End Speech Enhancement

Cited by: 49
Authors
Hsieh, Tsun-An [1 ]
Wang, Hsin-Min [2 ]
Lu, Xugang [3 ]
Tsao, Yu [1 ]
Affiliations
[1] Acad Sinica, Res Ctr Informat Technol Innovat, Taipei 11529, Taiwan
[2] Acad Sinica, Inst Informat Sci, Taipei 11529, Taiwan
[3] NICT, Koganei, Tokyo 1848795, Japan
Keywords
Speech enhancement; Feature extraction; Task analysis; Noise reduction; Convolution; Noise measurement; Training; Compressed speech restoration; convolutional recurrent neural networks; raw waveform speech enhancement; simple recurrent unit; DEEP; DOMAIN; SEPARATION;
DOI
10.1109/LSP.2020.3040693
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808 ; 0809 ;
Abstract
Due to their simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. To improve the performance of an E2E model, the local and sequential properties of speech should be efficiently taken into account during modeling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to realize. In this letter, we propose an efficient E2E SE model, termed WaveCRN. Compared with models based on convolutional neural networks (CNN) or long short-term memory (LSTM), WaveCRN uses a CNN module to capture the locality features of speech and a stacked simple recurrent unit (SRU) module to model the sequential property of those locality features. Unlike conventional recurrent neural networks and LSTM, the SRU can be efficiently parallelized in computation, with even fewer model parameters. To more effectively suppress noise components in noisy speech, we derive a novel restricted feature masking approach, which performs enhancement on the feature maps in the hidden layers; this differs from the approaches commonly used in speech separation, which apply the estimated ratio mask to the noisy spectral features. Experimental results on speech denoising and compressed speech restoration tasks confirm that, with the SRU and the restricted feature map, WaveCRN performs comparably to other state-of-the-art approaches with notably reduced model complexity and inference time.
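The parallelism claim in the abstract rests on the structure of the SRU recurrence: all matrix multiplications depend only on the input sequence, so they can be batched across timesteps, leaving only a cheap elementwise update as the sequential part. A minimal, framework-free sketch of one SRU layer follows, based on the standard SRU formulation; NumPy stands in for the paper's actual implementation, and the layer size and weights here are illustrative, not taken from WaveCRN.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, Wf, bf, Wr, br):
    """One simple-recurrent-unit (SRU) layer over a sequence.

    x: array of shape (T, d) -- T timesteps, d features.
    Every matrix multiplication below touches only the input x, so it
    can be computed for all timesteps at once; the only sequential
    part is the elementwise update of the cell state c.
    """
    x_tilde = x @ W.T              # candidate values, all timesteps at once
    f = sigmoid(x @ Wf.T + bf)     # forget gates, all timesteps at once
    r = sigmoid(x @ Wr.T + br)     # reset gates, all timesteps at once

    c = np.zeros_like(x_tilde)
    c_prev = np.zeros(x.shape[1])
    for t in range(x.shape[0]):    # elementwise recurrence only
        c_prev = f[t] * c_prev + (1.0 - f[t]) * x_tilde[t]
        c[t] = c_prev

    # Highway connection: mix the squashed cell state with the raw input.
    return r * np.tanh(c) + (1.0 - r) * x

# Toy usage: a short 6-step sequence of 4-dimensional features.
rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.standard_normal((T, d))
W, Wf, Wr = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
bf = np.zeros(d)
br = np.zeros(d)
h = sru_layer(x, W, Wf, bf, Wr, br)
print(h.shape)  # (6, 4)
```

Compare this with an LSTM, where the gates at step t depend on the hidden state h[t-1], forcing every matrix multiplication into the sequential loop; this is the source of the SRU's speed advantage that the letter exploits.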
Pages: 2149-2153
Page count: 5
Related Papers
50 records
  • [1] End-to-End Deep Convolutional Recurrent Models for Noise Robust Waveform Speech Enhancement
    Ullah, Rizwan
    Wuttisittikulkij, Lunchakorn
    Chaudhary, Sushank
    Parnianifard, Amir
    Shah, Shashi
    Ibrar, Muhammad
    Wahab, Fazal-E
    [J]. SENSORS, 2022, 22 (20)
  • [2] A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement
    Borgstrom, Bengt J.
    Brandstein, Michael S.
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2418 - 2431
  • [3] ADIEU FEATURES? END-TO-END SPEECH EMOTION RECOGNITION USING A DEEP CONVOLUTIONAL RECURRENT NETWORK
    Trigeorgis, George
    Ringeval, Fabien
    Brueckner, Raymond
    Marchi, Erik
    Nicolaou, Mihalis A.
    Schuller, Bjoern
    Zafeiriou, Stefanos
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5200 - 5204
  • [4] IMPROVING END-TO-END SPEECH SYNTHESIS WITH LOCAL RECURRENT NEURAL NETWORK ENHANCED TRANSFORMER
    Zheng, Yibin
    Li, Xinhui
    Xie, Fenglong
    Lu, Li
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6734 - 6738
  • [5] Segmental Recurrent Neural Networks for End-to-end Speech Recognition
    Lu, Liang
    Kong, Lingpeng
    Dyer, Chris
    Smith, Noah A.
    Renals, Steve
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 385 - 389
  • [6] Towards End-to-End Speech Recognition with Recurrent Neural Networks
    Graves, Alex
    Jaitly, Navdeep
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 2), 2014, 32 : 1764 - 1772
  • [7] End-to-End Speech Emotion Recognition Based on One-Dimensional Convolutional Neural Network
    Gao, Mengna
    Dong, Jing
    Zhou, Dongsheng
    Zhang, Qiang
    Yang, Deyun
    [J]. 3RD INTERNATIONAL CONFERENCE ON INNOVATION IN ARTIFICIAL INTELLIGENCE (ICIAI 2019), 2019, : 78 - 82
  • [8] FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions
    Zhu, Yuanyuan
    Xu, Xu
    Ye, Zhongfu
    [J]. APPLIED ACOUSTICS, 2020, 170
  • [9] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
    Zhang, Ying
    Pezeshki, Mohammad
    Brakel, Philemon
    Zhang, Saizheng
    Laurent, Cesar
    Bengio, Yoshua
    Courville, Aaron
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 410 - 414
  • [10] Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition
    Parcollet, Titouan
    Zhang, Ying
    Morchid, Mohamed
    Trabelsi, Chiheb
    Linares, Georges
    De Mori, Renato
    Bengio, Yoshua
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 22 - 26