On the limit of English conversational speech recognition

被引:10
|
作者
Tuske, Zoltan [1 ]
Saon, George [1 ]
Kingsbury, Brian [1 ]
机构
[1] IBM Res AI, Yorktown Hts, NY 10598 USA
来源
关键词
encoder-decoder; attention; speech recognition; AdamW; Switchboard; i-vector; FEATURES; NOISE;
D O I
10.21437/Interspeech.2021-211
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
In our previous work we demonstrated that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through use of an improved optimizer, speaker vector embeddings, and alternative speech representations we reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative. Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models. Our study also considers the recently proposed conformer, and more advanced self-attention based language models. Overall, the conformer shows similar performance to the LSTM; nevertheless, their combination and decoding with an improved LM reaches a new record on Switchboard-300, 5.0% and 10.0% WER on SWB and CHM. Our findings are also confirmed on Switchboard-2000, and a new state of the art is reported, practically reaching the limit of the benchmark.
引用
收藏
页码:2062 / 2066
页数:5
相关论文
共 50 条
  • [1] Improving English Conversational Telephone Speech Recognition
    Medennikov, Ivan
    Prudnikov, Alexey
    Zatvornitskiy, Alexander
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2 - 6
  • [2] English Conversational Telephone Speech Recognition by Humans and Machines
    Saon, George
    Kurata, Gakuto
    Sercu, Tom
    Audhkhasi, Kartik
    Thomas, Samuel
    Dimitriadis, Dimitrios
    Cui, Xiaodong
    Ramabhadran, Bhuvana
    Picheny, Michael
    Lim, Lynn-Li
    Roomi, Bergul
    Hall, Phil
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 132 - 136
  • [3] The IBM 2016 English Conversational Telephone Speech Recognition System
    Saon, George
    Sercu, Tom
    Rennie, Steven
    Kuo, Hong-Kwang J.
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 7 - 11
  • [4] The IBM 2015 English Conversational Telephone Speech Recognition System
    Saon, George
    Kuo, Hong-Kwang J.
    Rennie, Steven
    Picheny, Michael
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 3140 - 3144
  • [5] The 2001 BYBLOS english large vocabulary conversational speech recognition system
    Matsoukas, S
    Colthurst, T
    Kimball, O
    Solomonoff, A
    Richardson, F
    Quillen, C
    Gish, H
    Dognin, P
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 721 - 724
  • [6] Direct Acoustics-to-Word Models for English Conversational Speech Recognition
    Audhkhasi, Kartik
    Ramabhadran, Bhuvana
    Saon, George
    Picheny, Michael
    Nahamoo, David
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 959 - 963
  • [7] A FIRST SPEECH RECOGNITION SYSTEM FOR MANDARIN-ENGLISH CODE-SWITCH CONVERSATIONAL SPEECH
    Ngoc Thang Vu
    Lyu, Dau-Cheng
    Weiner, Jochen
    Telaar, Dominic
    Schlippe, Tim
    Blaicher, Fabian
    Chng, Eng-Siong
    Schultz, Tanja
    Li, Haizhou
    [J]. 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4889 - 4892
  • [8] A FIRST SPEECH RECOGNITION SYSTEM FOR MANDARIN-ENGLISH CODE-SWITCH CONVERSATIONAL SPEECH
    Ngoc Thang Vu
    Lyu, Dau-Cheng
    Weiner, Jochen
    Telaar, Dominic
    Schlippe, Tim
    Blaicher, Fabian
    Chng, Eng-Siong
    Schultz, Tanja
    Li, Haizhou
    [J]. 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4889 - 4892
  • [9] Conversational telephone speech recognition
    Gauvain, JL
    Lamel, L
    Schwenk, H
    Adda, G
    Chen, L
    Lefèvre, F
    [J]. 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING I, 2003, : 212 - 215
  • [10] BUILDING COMPETITIVE DIRECT ACOUSTICS-TO-WORD MODELS FOR ENGLISH CONVERSATIONAL SPEECH RECOGNITION
    Audhkhasi, Kartik
    Kingsbury, Brian
    Ramabhadran, Bhuvana
    Saon, George
    Picheny, Michael
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4759 - 4763