Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Cited: 0
Authors
Wang, Jing [1 ]
Saleem, Nasir [2 ,3 ]
Gunawan, Teddy Surya [3 ]
Institutions
[1] Yunnan Univ, Sch Mat Sci & Engn, Kunming City, Yunnan Province, Peoples R China
[2] Gomal Univ, Fac Engn & Technol, Dept Elect Engn, Dera Ismail Khan 29050, Pakistan
[3] Int Islamic Univ Malaysia IIUM, Dept Elect & Comp Engn, Kuala Lumpur, Malaysia
Keywords
Deep learning; Speech enhancement; Speech recognition; Skip connections; LSTM; Acoustic features; Attention process; Noise
DOI
10.1007/s12559-024-10288-y
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Long short-term memory (LSTM) networks have proven effective at modeling sequential data and play a central role in speech enhancement, where they capture the temporal dependencies in speech signals; however, they can struggle to model long-term dependencies accurately. This paper introduces a variable-neurons-based LSTM designed to capture long-term temporal dependencies by progressively reducing the number of neurons per layer without loss of information. Skip connections between nonadjacent layers are added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited for real-time processing without relying on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time-frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed architecture improves both speech intelligibility and perceptual quality. Composite measures accounting for residual noise (Cbak) and speech distortion (Csig) further substantiated the performance. The proposed model achieved a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database; on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over the noisy mixtures. The proposed architecture outperforms deep neural networks (DNNs) under various stationary and nonstationary background noise conditions. An automatic speech recognition (ASR) system was trained on the enhanced speech using the Kaldi toolkit and evaluated in terms of word error rate (WER); with the proposed LSTM at the front end, WER dropped markedly, reaching 15.13% across different noisy backgrounds.
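The two training targets named in the abstract, IRM and IBM, are standard time-frequency masks computed from the clean-speech and noise magnitude spectra. A minimal NumPy sketch of their usual definitions follows; the compression exponent `beta` and the local SNR criterion `lc_db` are common defaults, not values taken from this paper:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: soft mask in [0, 1] per time-frequency unit."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion lc_db, else 0."""
    snr_db = 20.0 * np.log10((speech_mag + 1e-12) / (noise_mag + 1e-12))
    return (snr_db > lc_db).astype(np.float32)

# Toy example: 2 frequency bins x 3 frames of magnitude spectra.
S = np.array([[1.0, 0.5, 0.1],
              [2.0, 0.2, 0.0]])
N = np.array([[1.0, 0.5, 1.0],
              [0.5, 0.2, 1.0]])
irm = ideal_ratio_mask(S, N)   # soft values in [0, 1]
ibm = ideal_binary_mask(S, N)  # hard 0/1 decision per T-F unit
```

In a mask-based enhancement pipeline, the network predicts one of these masks from noisy features and the mask is applied to the noisy magnitude spectrogram before resynthesis.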
Pages: 1221-1236
Page count: 16
Related Papers (50 total)
  • [1] Speech Enhancement Method Based On LSTM Neural Network for Speech Recognition
    Liu, Ming
    Wang, Yujun
    Wang, Jin
    Wang, Jing
    Xie, Xiang
    [J]. PROCEEDINGS OF 2018 14TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP), 2018, : 245 - 249
  • [2] DEEP RECURRENT REGULARIZATION NEURAL NETWORK FOR SPEECH RECOGNITION
    Chien, Jen-Tzung
    Lu, Tsai-Wei
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4560 - 4564
  • [3] Deep neural network architectures for dysarthric speech analysis and recognition
    Zaidi, Brahim Fares
    Selouani, Sid Ahmed
    Boudraa, Malika
    Sidi Yakoub, Mohammed
    [J]. NEURAL COMPUTING & APPLICATIONS, 2021, 33 (15): : 9089 - 9108
  • [4] Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network
    Li, Zhenqing
    Basit, Abdul
    Daraz, Amil
    Jan, Atif
    [J]. PLOS ONE, 2024, 19 (01):
  • [5] Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network
    Goel, Dev Priya
    Mahajan, Kushagra
    Ngoc Duy Nguyen
    Srinivasan, Natesan
    Lim, Chee Peng
    [J]. NEURAL COMPUTING & APPLICATIONS, 2023, 35 (03): : 2457 - 2469
  • [6] GAUSSIAN PROCESS LSTM RECURRENT NEURAL NETWORK LANGUAGE MODELS FOR SPEECH RECOGNITION
    Lam, Max W. Y.
    Chen, Xie
    Hu, Shoukang
    Yu, Jianwei
    Liu, Xunying
    Meng, Helen
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7235 - 7239
  • [7] Speech Enhancement for Speaker Recognition Using Deep Recurrent Neural Networks
    Tkachenko, Maxim
    Yamshinin, Alexander
    Lyubimov, Nikolay
    Kotov, Mikhail
    Nastasenko, Marina
    [J]. SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 690 - 699
  • [8] TOWARDS STRUCTURED DEEP NEURAL NETWORK FOR AUTOMATIC SPEECH RECOGNITION
    Liao, Yi-Hsiu
    Lee, Hung-yi
    Lee, Lin-shan
    [J]. 2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 137 - 144