Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Cited by: 0
Authors
Wang, Jing [1 ]
Saleem, Nasir [2 ,3 ]
Gunawan, Teddy Surya [3 ]
Affiliations
[1] Yunnan Univ, Sch Mat Sci & Engn, Kunming City, Yunnan Province, Peoples R China
[2] Gomal Univ, Fac Engn & Technol, Dept Elect Engn, Dera Ismail Khan 29050, Pakistan
[3] Int Islamic Univ Malaysia IIUM, Dept Elect & Comp Engn, Kuala Lumpur, Malaysia
Keywords
Deep learning; Speech enhancement; Speech recognition; Skip connections; LSTM; Acoustic features; Attention process; NOISE
DOI
10.1007/s12559-024-10288-y
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Long short-term memory (LSTM) networks have proven effective at modeling sequential data and play a central role in speech enhancement by capturing the temporal dependencies in speech signals; however, they can struggle to capture long-term dependencies accurately. This paper introduces a variable-neurons-based LSTM that captures long-term temporal dependencies by reducing the neuron representation across layers without loss of information. A skip connection between nonadjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited for real-time processing without relying on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time-frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) showed that the proposed architecture improves speech intelligibility and perceptual quality, and composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiated this performance. The proposed model achieved a 16.21% improvement in STOI and a 0.69 improvement in PESQ on the TIMIT database; on the LibriSpeech database, STOI and PESQ improved by 16.41% and 0.71 over the noisy mixtures. The proposed architecture outperforms deep neural networks (DNNs) under both stationary and nonstationary background noise. An automatic speech recognition (ASR) system trained on the enhanced speech is evaluated with the Kaldi toolkit in terms of word error rate (WER); with the proposed LSTM as a front end, WER is reduced to 15.13% across different noisy backgrounds.
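As a concrete illustration of the architecture outlined in the abstract, the following PyTorch snippet is a minimal sketch of a stacked LSTM with decreasing hidden sizes ("variable neurons"), a gated skip connection between nonadjacent layers, and a sigmoid output layer that predicts an IRM-style time-frequency mask. The layer sizes, the 257-bin feature dimension, and the simple sigmoid attention gate are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch (assumed configuration) of a variable-neurons LSTM masker:
# stacked unidirectional LSTM layers with shrinking hidden sizes, a skip
# connection from layer 1 to layer 3 gated by a sigmoid attention weight,
# and a sigmoid head that outputs an IRM-like mask in [0, 1] per T-F bin.
import torch
import torch.nn as nn


class SkipLSTMMasker(nn.Module):
    def __init__(self, n_feats=257, hidden=(1024, 512, 256)):
        super().__init__()
        self.lstm1 = nn.LSTM(n_feats, hidden[0], batch_first=True)
        self.lstm2 = nn.LSTM(hidden[0], hidden[1], batch_first=True)
        # third layer receives layer-2 output plus the gated skip from layer 1
        self.lstm3 = nn.LSTM(hidden[1] + hidden[0], hidden[2], batch_first=True)
        # attention gate applied to the skipped (nonadjacent) features
        self.attn = nn.Sequential(nn.Linear(hidden[0], hidden[0]), nn.Sigmoid())
        self.mask = nn.Sequential(nn.Linear(hidden[2], n_feats), nn.Sigmoid())

    def forward(self, x):                     # x: (batch, frames, n_feats)
        h1, _ = self.lstm1(x)                 # causal: no future frames used
        h2, _ = self.lstm2(h1)
        skip = self.attn(h1) * h1             # highlight salient spectral cues
        h3, _ = self.lstm3(torch.cat([h2, skip], dim=-1))
        return self.mask(h3)                  # values in [0, 1], IRM-style


if __name__ == "__main__":
    noisy = torch.randn(4, 100, 257).abs()    # toy noisy magnitude spectra
    irm_hat = SkipLSTMMasker()(noisy)
    enhanced = irm_hat * noisy                # apply the estimated ratio mask
    print(irm_hat.shape, enhanced.shape)

Because each nn.LSTM here is unidirectional, the mask for a frame depends only on current and past frames, matching the causal, real-time constraint the abstract describes.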
Pages: 1221-1236
Page count: 16