WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-End Speech Enhancement

Times cited: 53

Authors:

Hsieh, Tsun-An [1]

Wang, Hsin-Min [2]

Lu, Xugang [3]

Tsao, Yu [1]

Affiliations:

[1] Acad Sinica, Res Ctr Informat Technol Innovat, Taipei 11529, Taiwan

[2] Acad Sinica, Inst Informat Sci, Taipei 11529, Taiwan

[3] NICT, Koganei, Tokyo 1848795, Japan

Source:

IEEE SIGNAL PROCESSING LETTERS | 2020 / Vol. 27 / Issue 27

Keywords:

Speech enhancement; Feature extraction; Task analysis; Noise reduction; Convolution; Noise measurement; Training; Compressed speech restoration; convolutional recurrent neural networks; raw waveform speech enhancement; simple recurrent unit; DEEP; DOMAIN; SEPARATION;

DOI:

10.1109/LSP.2020.3040693

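CLC number (Chinese Library Classification):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification code:

0808; 0809;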

Abstract:

Owing to their simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. To improve the performance of E2E models, the local and sequential properties of speech should be modeled efficiently. However, in most current E2E SE models, these properties are either not fully considered or are too complex to realize. In this letter, we propose an efficient E2E SE model, termed WaveCRN. Unlike models based solely on convolutional neural networks (CNNs) or long short-term memory (LSTM) networks, WaveCRN uses a CNN module to capture the locality features of speech and a stacked simple recurrent unit (SRU) module to model the sequential property of these locality features. Unlike conventional recurrent neural networks and LSTM, the SRU can be efficiently parallelized in computation, with even fewer model parameters. To suppress noise components in the noisy speech more effectively, we derive a novel restricted feature masking approach, which performs enhancement on the feature maps in the hidden layers; this differs from the approaches commonly used in speech separation, which apply the estimated ratio mask to the noisy spectral features. Experimental results on speech denoising and compressed speech restoration tasks confirm that, with the SRU and restricted feature masking, WaveCRN performs comparably to other state-of-the-art approaches with notably reduced model complexity and inference time.
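The abstract credits WaveCRN's efficiency to two ingredients. The first is an SRU whose matrix products do not depend on the previous state. The second is a mask applied to hidden-layer feature maps rather than to a spectrogram. The NumPy sketch below illustrates both, and it rests on several assumptions that do not come from the record. It uses a simplified form of the original SRU recurrence (Lei et al.) and a single unidirectional layer in place of the stacked module the abstract describes. It reads "restricted" as a sigmoid that keeps mask values in (0, 1). The encoder is a strided 1-D convolution and the decoder an overlap-add transposed convolution. All function names, parameter shapes, and sizes are hypothetical, and the weights are random, so the code shows data flow only and enhances nothing.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv1d_frames(wave, kernels, stride):
    """Strided 1-D convolution without padding: wave (N,) -> features (T, C)."""
    k = kernels.shape[1]
    n_frames = (len(wave) - k) // stride + 1
    frames = np.stack([wave[t * stride:t * stride + k] for t in range(n_frames)])
    return frames @ kernels.T  # one C-dim feature vector per frame


def sru_layer(x, W, b):
    """Single unidirectional SRU layer (simplified form, an assumption).

    x: (T, d) input, W: (3d, d) weights, b: (2d,) forget/reset gate biases.
    """
    d = x.shape[1]
    # Every matrix product depends only on x_t, never on c_{t-1}, so all
    # time steps are projected in one batched matmul (the parallel part).
    U = x @ W.T
    x_tilde = U[:, :d]
    f = sigmoid(U[:, d:2 * d] + b[:d])  # forget gates
    r = sigmoid(U[:, 2 * d:] + b[d:])   # reset (highway) gates
    c = np.zeros(d)
    h = np.empty_like(x)
    # Only this cheap element-wise recurrence has to run step by step.
    for t in range(x.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
        h[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * x[t]
    return h


def deconv1d_frames(feats, kernels, stride):
    """Transposed 1-D convolution as overlap-add: (T, C) -> (N,)."""
    n_frames, k = feats.shape[0], kernels.shape[1]
    out = np.zeros((n_frames - 1) * stride + k)
    segments = feats @ kernels  # (T, k): one waveform segment per frame
    for t in range(n_frames):
        out[t * stride:t * stride + k] += segments[t]
    return out


def wavecrn_like_forward(wave, p, stride):
    """Hypothetical forward pass: conv encoder -> SRU -> mask -> decoder."""
    feats = conv1d_frames(wave, p["enc"], stride)       # CNN: local features
    h = sru_layer(feats, p["W_sru"], p["b_sru"])        # SRU: sequence model
    mask = sigmoid(h @ p["W_mask"].T + p["b_mask"])     # assumed (0, 1) bound
    return deconv1d_frames(mask * feats, p["dec"], stride), mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, k, stride = 8, 16, 8  # hypothetical sizes, not the paper's settings
    p = {
        "enc": 0.1 * rng.standard_normal((C, k)),
        "W_sru": 0.1 * rng.standard_normal((3 * C, C)),
        "b_sru": np.zeros(2 * C),
        "W_mask": 0.1 * rng.standard_normal((C, C)),
        "b_mask": np.zeros(C),
        "dec": 0.1 * rng.standard_normal((C, k)),
    }
    noisy = rng.standard_normal(16000)  # random stand-in for 1 s at 16 kHz
    enhanced, mask = wavecrn_like_forward(noisy, p, stride)
    print(enhanced.shape, mask.shape)  # (16000,) (1999, 8)
```

The loop in `sru_layer` is the only sequential work. In an LSTM, the gate matrix products would also have to sit inside that loop, because they depend on the previous hidden state. That difference is the contrast the abstract draws between the SRU and conventional recurrent networks.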


Pages: 2149-2153

Page count: 5
