Speech recognition by humans and machines under conditions with severe channel variability and noise

Cited by: 1
|
Authors
Lippmann, RP
Carlson, BA
Keywords
speech recognition; speech perception; missing features; filtering; noise; robust; neural network
DOI
10.1117/12.271525
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Despite dramatic recent advances in speech recognition technology, speech recognizers still perform much worse than humans. The difference in performance between humans and machines is most dramatic when variable amounts and types of filtering and noise are present during testing. For example, humans readily understand speech that is low-pass filtered below 3 kHz or high-pass filtered above 1 kHz. Machines trained with wide-band speech, however, degrade dramatically under these conditions. An approach to compensate for variable, unknown sharp filtering and noise is presented which uses mel-filter-bank magnitudes as input features, estimates the signal-to-noise ratio (SNR) for each filter, and uses missing-feature theory to dynamically modify the probability computations performed by Gaussian mixture or radial basis function neural network classifiers embedded within hidden Markov model (HMM) recognizers. The approach was successfully demonstrated on a talker-independent digit recognition task, where recognition accuracy across many conditions rose from below 50% to above 95%. These promising results suggest future work to dynamically estimate SNRs and to explore the dynamics of human adaptation to channel and noise variability.
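A minimal sketch of the missing-feature idea summarized in the abstract, assuming a diagonal-covariance Gaussian mixture for one HMM state; this is not the authors' implementation, and the function name, the fixed 10 dB threshold, and the synthetic data are illustrative only. Filter-bank channels whose estimated SNR falls below the threshold are treated as missing and marginalized out, which for diagonal covariances amounts to dropping those dimensions from the likelihood product.

import numpy as np

def gmm_loglik_missing(x, reliable, weights, means, variances):
    """Log-likelihood of frame x under a diagonal-covariance GMM,
    marginalizing over unreliable (low-SNR) dimensions.

    x         : (D,) mel-filter-bank magnitudes for one frame
    reliable  : (D,) boolean mask, True where the estimated SNR is high enough
    weights   : (M,) mixture weights
    means     : (M, D) component means
    variances : (M, D) component variances
    """
    xr = x[reliable]                 # keep only the reliable channels
    mu = means[:, reliable]
    var = variances[:, reliable]
    # Per-component log density computed over the reliable dimensions only.
    log_comp = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2.0 * np.pi * var) + (xr - mu) ** 2 / var, axis=1)
    )
    # Log-sum-exp over the mixture components.
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Illustrative use with synthetic data (20 mel filters, 4 mixture components).
rng = np.random.default_rng(0)
D, M = 20, 4
x = rng.normal(size=D)
snr_db = rng.uniform(-5.0, 30.0, size=D)   # hypothetical per-filter SNR estimates
reliable = snr_db > 10.0                   # threshold chosen for illustration
weights = np.full(M, 1.0 / M)
means = rng.normal(size=(M, D))
variances = np.ones((M, D))
print(gmm_loglik_missing(x, reliable, weights, means, variances))

In an HMM recognizer, a per-state score of this kind would replace the full-band Gaussian mixture likelihood during decoding, so the reliability mask can change from frame to frame as the per-filter SNR estimates change.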
Pages: 46-57
Page count: 12
Related papers
50 records in total
  • [1] Speech recognition by machines and humans
    Lippmann, RP
    [J]. SPEECH COMMUNICATION, 1997, 22 (01) : 1 - 15
  • [2] NOISE ROBUST SPEECH RECOGNITION ON AURORA4 BY HUMANS AND MACHINES
    Qian, Yanmin
    Tan, Tian
    Hu, Hu
    Liu, Qi
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5604 - 5608
  • [3] English Conversational Telephone Speech Recognition by Humans and Machines
    Saon, George
    Kurata, Gakuto
    Sercu, Tom
    Audhkhasi, Kartik
    Thomas, Samuel
    Dimitriadis, Dimitrios
    Cui, Xiaodong
    Ramabhadran, Bhuvana
    Picheny, Michael
    Lim, Lynn-Li
    Roomi, Bergul
    Hall, Phil
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 132 - 136
  • [4] ENGLISH BROADCAST NEWS SPEECH RECOGNITION BY HUMANS AND MACHINES
    Thomas, Samuel
    Suzuki, Masayuki
    Huang, Yinghui
    Kurata, Gakuto
    Tuske, Zoltan
    Saon, George
    Kingsbury, Brian
    Picheny, Michael
    Dibert, Tom
    Kaiser-Schatzlein, Alice
    Samko, Bern
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6455 - 6459
  • [5] Generative Noise Modeling and Channel Simulation for Robust Speech Recognition in Unseen Conditions
    Soni, Meet
    Joshi, Sonal
    Panda, Ashish
    [J]. INTERSPEECH 2019, 2019, : 441 - 445
  • [6] Assessing Costa Rican children speech recognition by humans and machines
    Morales-Rodriguez, Maribel
    Coto-Jimenez, Marvin
    [J]. TECNOLOGIA EN MARCHA, 2022, 35
  • [7] SYNTHESIS AND RECOGNITION OF SPEECH - VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES
    FLANAGAN, JL
    [J]. IEEE TRANSACTIONS ON SONICS AND ULTRASONICS, 1982, 29 (03): : 158 - 158
  • [8] SPEECH RECOGNITION IN UNSEEN AND NOISY CHANNEL CONDITIONS
    Mitra, Vikramjit
    Franco, Horacio
    Bartels, Chris
    van Hout, Julien
    Graciarena, Martin
    Vergyri, Dimitra
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5215 - 5219
  • [9] A novel channel estimate for noise robust speech recognition
    Vanderreydt, Geoffroy
    Demuynck, Kris
    [J]. COMPUTER SPEECH AND LANGUAGE, 2024, 86
  • [10] Speech Emotion Recognition under White Noise
    Huang, Chengwei
    Chen, Guoming
    Yu, Hua
    Bao, Yongqiang
    Zhao, Li
    [J]. ARCHIVES OF ACOUSTICS, 2013, 38 (04) : 457 - 463