Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Cited by: 74
Authors
Audhkhasi, Kartik [1 ]
Ramabhadran, Bhuvana [1 ]
Saon, George [1 ]
Picheny, Michael [1 ]
Nahamoo, David [1 ]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Keywords
automatic speech recognition; neural networks; end-to-end;
DOI
10.21437/Interspeech.2017-546
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
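To illustrate what recognizing words "without any LM or decoder" can look like in practice, the following is a minimal sketch of greedy (best-path) CTC decoding over word units. It is not taken from the paper: the function name `ctc_greedy_decode`, the toy vocabulary, the blank index of 0, and the posterior values are all assumptions made for illustration. The sketch picks the most likely unit per frame, merges consecutive repeats, and drops the blank symbol.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_greedy_decode(frame_posteriors, vocab, blank=BLANK):
    """Best-path CTC decoding over per-frame posteriors.

    frame_posteriors: list of per-frame probability lists, one entry per unit.
    vocab: list mapping unit index to word; vocab[blank] is the blank symbol.
    """
    # Step 1: take the most likely output unit at every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]
    # Step 2: merge runs of the same unit into a single occurrence.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # Step 3: remove blanks; what remains is the word sequence.
    return [vocab[idx] for idx in collapsed if idx != blank]


# Toy example with a three-unit vocabulary (values are made up).
vocab = ["<blank>", "hello", "world"]
posteriors = [
    [0.10, 0.80, 0.10],  # "hello"
    [0.20, 0.70, 0.10],  # "hello" again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "world"
]
words = ctc_greedy_decode(posteriors, vocab)
print(" ".join(words))
```

Because the output units are whole words, no pronunciation dictionary or search over a decoding graph is needed in this sketch; a repeated word is only emitted twice when a blank frame separates the two occurrences.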
Pages: 959-963
Number of pages: 5
Related papers
50 in total
  • [1] BUILDING COMPETITIVE DIRECT ACOUSTICS-TO-WORD MODELS FOR ENGLISH CONVERSATIONAL SPEECH RECOGNITION
    Audhkhasi, Kartik
    Kingsbury, Brian
    Ramabhadran, Bhuvana
    Saon, George
    Picheny, Michael
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4759 - 4763
  • [2] ACOUSTICALLY GROUNDED WORD EMBEDDINGS FOR IMPROVED ACOUSTICS-TO-WORD SPEECH RECOGNITION
    Settle, Shane
    Audhkhasi, Kartik
    Livescu, Karen
    Picheny, Michael
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5641 - 5645
  • [3] On the limit of English conversational speech recognition
    Tuske, Zoltan
    Saon, George
    Kingsbury, Brian
    [J]. INTERSPEECH 2021, 2021, : 2062 - 2066
  • [4] Improving English Conversational Telephone Speech Recognition
    Medennikov, Ivan
    Prudnikov, Alexey
    Zatvornitskiy, Alexander
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2 - 6
  • [5] ON MODULAR TRAINING OF NEURAL ACOUSTICS-TO-WORD MODEL FOR LVCSR
    Chen, Zhehuai
    Liu, Qi
    Li, Hao
    Yu, Kai
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4754 - 4758
  • [6] English Conversational Telephone Speech Recognition by Humans and Machines
    Saon, George
    Kurata, Gakuto
    Sercu, Tom
    Audhkhasi, Kartik
    Thomas, Samuel
    Dimitriadis, Dimitrios
    Cui, Xiaodong
    Ramabhadran, Bhuvana
    Picheny, Michael
    Lim, Lynn-Li
    Roomi, Bergul
    Hall, Phil
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 132 - 136
  • [7] Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition
    Yazgan, A
    Saraclar, M
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 745 - 748
  • [8] On the acoustics of overlapping laughter in conversational speech
    Truong, Khiet P.
    Trouvain, Juergen
    [J]. 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 850 - 853
  • [9] The IBM 2016 English Conversational Telephone Speech Recognition System
    Saon, George
    Sercu, Tom
    Rennie, Steven
    Kuo, Hong-Kwang J.
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 7 - 11
  • [10] The IBM 2015 English Conversational Telephone Speech Recognition System
    Saon, George
    Kuo, Hong-Kwang J.
    Rennie, Steven
    Picheny, Michael
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 3140 - 3144