An Overview of Noise-Robust Automatic Speech Recognition

Cited by: 364
Authors
Li, Jinyu [1 ]
Deng, Li [1 ]
Gong, Yifan [1 ]
Haeb-Umbach, Reinhold [2 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] Univ Paderborn, Dept Commun Engn, D-33098 Paderborn, Germany
Keywords
Speech recognition; noise; robustness; distortion modeling; compensation; uncertainty processing; joint model training; NONNEGATIVE MATRIX FACTORIZATION; PREDICTIVE CLASSIFICATION APPROACH; MAXIMUM-LIKELIHOOD-ESTIMATION; RAPID SPEAKER ADAPTATION; HISTOGRAM EQUALIZATION; LINEAR-REGRESSION; MASK ESTIMATION; FEATURE ENHANCEMENT; JOINT COMPENSATION; ENVIRONMENT MODEL;
DOI
10.1109/TASLP.2014.2304637
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.
Pages: 745-777
Page count: 33
Related papers
50 in total
  • [21] Orthogonalized distinctive phonetic feature extraction for noise-robust automatic speech recognition
    Fukuda, T
    Nitta, T
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2004, E87D (05) : 1110 - 1118
  • [22] Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition
    Woo Lee, Geon
    Kook Kim, Hong
    Kong, Duk-Jo
    [J]. IEEE ACCESS, 2024, 12 : 72707 - 72720
  • [23] EXPLOITING SYNCHRONY SPECTRA AND DEEP NEURAL NETWORKS FOR NOISE-ROBUST AUTOMATIC SPEECH RECOGNITION
    Ma, Ning
    Marxer, Ricard
    Barker, Jon
    Brown, Guy J.
    [J]. 2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015 : 490 - 495
  • [24] An Efficient Noise-Robust Automatic Speech Recognition System using Artificial Neural Networks
    Gupta, Santosh
    Bhurchandi, Kishor M.
    Keskar, Avinash G.
    [J]. 2016 INTERNATIONAL CONFERENCE ON COMMUNICATION AND SIGNAL PROCESSING (ICCSP), VOL. 1, 2016 : 1873 - 1877
  • [25] An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition
    Bhiksha Raj
    Lorenzo Turicchia
    Bent Schmidt-Nielsen
    Rahul Sarpeshkar
    [J]. EURASIP Journal on Audio, Speech, and Music Processing, 2007
  • [26] An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition
    Raj, Bhiksha
    Turicchia, Lorenzo
    Schmidt-Nielsen, Bent
    Sarpeshkar, Rahul
    [J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2007, 2007 (1)
  • [27] An engineering model of the masking for the noise-robust speech recognition
    Park, KY
    Lee, SY
    [J]. NEUROCOMPUTING, 2003, 52-4 : 615 - 620
  • [28] Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition
    Shimada, Kazuki
    Bando, Yoshiaki
    Mimura, Masato
    Itoyama, Katsutoshi
    Yoshii, Kazuyoshi
    Kawahara, Tatsuya
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (05) : 960 - 971
  • [29] Noise-robust speech recognition based on difference of power spectrum
    Xu, JF
    Wei, G
    [J]. ELECTRONICS LETTERS, 2000, 36 (14) : 1247 - 1248
  • [30] Deep Maxout Networks Applied to Noise-Robust Speech Recognition
    de-la-Calle-Silos, F.
    Gallardo-Antolin, A.
    Pelaez-Moreno, C.
    [J]. ADVANCES IN SPEECH AND LANGUAGE TECHNOLOGIES FOR IBERIAN LANGUAGES, IBERSPEECH 2014, 2014, 8854 : 109 - 118