An efficient joint training model for monaural noisy-reverberant speech recognition

Cited by: 0
Authors
Lian, Xiaoyu [1 ]
Xia, Nan [1 ]
Dai, Gaole [1 ]
Yang, Hongqin [1 ]
Affiliations
[1] School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, Liaoning, China
Keywords
Background noise;
DOI
10.1016/j.apacoust.2024.110322
Abstract
Noise and reverberation can severely degrade speech quality and intelligibility, hurting the performance of downstream speech recognition. This paper constructs a jointly trained network for speech recognition in monaural noisy-reverberant environments. In the speech enhancement model, a complex-valued channel and temporal-frequency attention (CCTFA) is integrated to focus on the key features of the complex spectrum, and the CCTFA network (CCTFANet) is built around it to reduce the influence of noise and reverberation. In the speech recognition model, an element-wise linear attention (EWLA) module is proposed to linearize the attention complexity and to reduce the parameters and computation required by the attention module; the EWLA Conformer (EWLAC) is then constructed as an efficient end-to-end speech recognition model. On an open-source dataset, joint training of CCTFANet with EWLAC reduces the CER by 3.27%. Compared with other speech recognition models, EWLAC maintains a comparable CER while requiring far fewer parameters, less computation, and delivering higher inference speed. © 2024 Elsevier Ltd
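For readers unfamiliar with how linearizing attention reduces cost, the sketch below shows one common kernelized linear-attention formulation that replaces the frame-by-frame softmax attention matrix, cutting the per-layer cost from O(T²·d) to O(T·d²) for T frames of d-dimensional features. The module name, the elu(x)+1 feature map, and the single-head layout are illustrative assumptions; this is not the authors' EWLA implementation, only a generic example of the linear-attention idea the abstract alludes to.

```python
# Minimal sketch of a kernelized linear attention layer (assumed formulation,
# NOT the paper's EWLA module).  Cost is linear in the number of frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttentionSketch(nn.Module):
    """Self-attention with O(T * d^2) cost instead of softmax's O(T^2 * d)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) acoustic features
        q = F.elu(self.to_q(x)) + 1.0          # positive feature map phi(q)
        k = F.elu(self.to_k(x)) + 1.0          # positive feature map phi(k)
        v = self.to_v(x)

        # Reorder the computation: accumulate (dim x dim) summaries once
        # instead of forming a (frames x frames) attention matrix.
        kv = torch.einsum("btd,bte->bde", k, v)        # sum_t phi(k_t) v_t^T
        k_sum = k.sum(dim=1)                           # sum_t phi(k_t)
        num = torch.einsum("btd,bde->bte", q, kv)      # phi(q_t)^T (sum kv)
        den = torch.einsum("btd,bd->bt", q, k_sum)     # phi(q_t)^T (sum k)
        out = num / den.unsqueeze(-1).clamp(min=1e-6)  # normalized output
        return self.proj(out)


if __name__ == "__main__":
    feats = torch.randn(2, 200, 256)                  # 200 frames, 256-dim
    print(LinearAttentionSketch(256)(feats).shape)    # torch.Size([2, 200, 256])
```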