ACOUSTIC-TO-WORD RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS

被引:0
|
作者
Palaskar, Shruti [1 ]
Metze, Florian [1 ]
机构
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
关键词
end-to-end speech recognition; encoder-decoder; acoustic-to-word; speech embeddings;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the out-of-vocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which have to be composed into words using a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that location-aware attention naturally represents words as a single speech-word-vector, despite spanning multiple frames in the input. We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forced-alignments for the Switchboard corpus.
引用
收藏
页码:397 / 404
页数:8
相关论文
共 50 条
  • [1] LEVERAGING SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR ENHANCING ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Mimura, Masato
    Ueno, Sei
    Inaguma, Hirofumi
    Sakai, Shinsuke
    Kawahara, Tatsuya
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 477 - 484
  • [2] MULTI-SPEAKER SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR DATA AUGMENTATION IN ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6161 - 6165
  • [3] A Comparison of Sequence-to-Sequence Models for Speech Recognition
    Prabhavalkar, Rohit
    Rao, Kanishka
    Sainath, Tara N.
    Li, Bo
    Johnson, Leif
    Jaitly, Navdeep
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 939 - 943
  • [4] SUPERVISED ATTENTION IN SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION
    Yang, Gene-Ping
    Tang, Hao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7222 - 7226
  • [5] STATE-OF-THE-ART SPEECH RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS
    Chiu, Chung-Cheng
    Sainath, Tara N.
    Wu, Yonghui
    Prabhavalkar, Rohit
    Nguyen, Patrick
    Chen, Zhifeng
    Kannan, Anjuli
    Weiss, Ron J.
    Rao, Kanishka
    Gonina, Ekaterina
    Jaitly, Navdeep
    Li, Bo
    Chorowski, Jan
    Bacchiani, Michiel
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4774 - 4778
  • [6] COUPLED TRAINING OF SEQUENCE-TO-SEQUENCE MODELS FOR ACCENTED SPEECH RECOGNITION
    Unni, Vinit
    Joshi, Nitish
    Jyothi, Preethi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8254 - 8258
  • [7] Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models
    Watson, Daniel
    Zalmout, Nasser
    Habash, Nizar
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 837 - 843
  • [8] Sparse Sequence-to-Sequence Models
    Peters, Ben
    Niculae, Vlad
    Martins, Andre F. T.
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 1504 - 1519
  • [9] SEQUENCE-TO-SEQUENCE AUTOMATIC SPEECH RECOGNITION WITH WORD EMBEDDING REGULARIZATION AND FUSED DECODING
    Liu, Alexander H.
    Sung, Tzu-Wei
    Chuang, Shun-Po
    Lee, Hung-yi
    Lee, Lin-shah
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7879 - 7883
  • [10] SEQUENCE-TO-SEQUENCE MODELS FOR PUNCTUATED TRANSCRIPTION COMBINING LEXICAL AND ACOUSTIC FEATURES
    Klejch, Ondrej
    Bell, Peter
    Renals, Steve
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5700 - 5704