ACOUSTIC-TO-WORD RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS

Cited by: 0
Authors
Palaskar, Shruti [1 ]
Metze, Florian [1 ]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
end-to-end speech recognition; encoder-decoder; acoustic-to-word; speech embeddings;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring, or a lexicon. While character-based models offer a natural solution to the out-of-vocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which must be composed into words in a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that location-aware attention naturally represents words as a single speech-word-vector, despite spanning multiple frames in the input. Finally, we show that the Acoustic-to-Word model also learns to segment speech into words, with a mean standard deviation of 3 frames compared with human-annotated forced alignments for the Switchboard corpus.
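The abstract credits location-aware attention with representing each word as a single speech-word-vector even though the word spans many input frames. As context for that claim, the following is a minimal, hypothetical PyTorch sketch of location-aware attention in the style of Chorowski et al. (2015); it is not the authors' implementation, and every module name, layer size, and hyperparameter here (e.g. conv_channels, kernel_size, the 320/128 dimensions in the toy usage) is an illustrative assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareAttention(nn.Module):
    """Attention over encoder frames whose scores also condition on the
    previous alignment via a 1-D convolution ("location" features)."""

    def __init__(self, enc_dim, dec_dim, attn_dim, conv_channels=10, kernel_size=201):
        super().__init__()
        self.W_query = nn.Linear(dec_dim, attn_dim, bias=False)   # decoder state projection
        self.W_keys = nn.Linear(enc_dim, attn_dim, bias=False)    # encoder frame projection
        self.W_loc = nn.Linear(conv_channels, attn_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, conv_channels, kernel_size,
                                  padding=kernel_size // 2, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=True)                # scalar score per frame

    def forward(self, dec_state, enc_out, prev_align):
        # dec_state:  (batch, dec_dim)     current decoder hidden state
        # enc_out:    (batch, T, enc_dim)  encoder hidden states
        # prev_align: (batch, T)           attention weights from the previous step
        loc_feat = self.loc_conv(prev_align.unsqueeze(1))          # (batch, C, T)
        loc_feat = loc_feat.transpose(1, 2)                        # (batch, T, C)
        scores = self.v(torch.tanh(
            self.W_query(dec_state).unsqueeze(1)                   # (batch, 1, A)
            + self.W_keys(enc_out)                                 # (batch, T, A)
            + self.W_loc(loc_feat)                                 # (batch, T, A)
        )).squeeze(-1)                                             # (batch, T)
        align = F.softmax(scores, dim=-1)                          # new alignment over frames
        context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1)  # (batch, enc_dim)
        return context, align


# Toy usage with assumed dimensions: 50 encoder frames, batch of 2.
attn = LocationAwareAttention(enc_dim=320, dec_dim=320, attn_dim=128)
enc = torch.randn(2, 50, 320)
dec = torch.randn(2, 320)
prev = torch.full((2, 50), 1.0 / 50)    # uniform initial alignment
ctx, align = attn(dec, enc, prev)       # ctx: (2, 320), align: (2, 50)

At word level, the alignment produced at each decoder step tends to concentrate on the frames of one word, so the context vector ctx acts as the per-word speech embedding the abstract describes.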
Pages: 397-404
Page count: 8
Related Papers (50 in total)
  • [31] Zhang, Biao; Titov, Ivan; Sennrich, Rico. On Sparsifying Encoder Outputs in Sequence-to-Sequence Models. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021: 2888-2900.
  • [32] Yang, Qun; Shen, Dejian. Learning Damage Representations with Sequence-to-Sequence Models. Sensors, 2022, 22(2).
  • [33] Irie, Kazuki; Prabhavalkar, Rohit; Kannan, Anjuli; Bruguier, Antoine; Rybach, David; Nguyen, Patrick. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition. Interspeech 2019, 2019: 3800-3804.
  • [34] Karafiat, Martin; Baskar, Murali Karthick; Watanabe, Shinji; Hori, Takaaki; Wiesner, Matthew; Cernocky, Jan Honza. Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems. Interspeech 2019, 2019: 2220-2224.
  • [35] Zhou, Xiao; Ling, Zhen-Hua; Dai, Li-Rong. Extracting Unit Embeddings Using Sequence-to-Sequence Acoustic Models for Unit Selection Speech Synthesis. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020: 7659-7663.
  • [36] Cannizzaro, Giuseppe; Leone, Michele; Bernasconi, Anna; Canakoglu, Arif; Carman, Mark J. Automated Integration of Genomic Metadata with Sequence-to-Sequence Models. Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, ECML PKDD 2020, Part V, 2021, 12461: 187-203.
  • [37] Konstas, Ioannis; Iyer, Srinivasan; Yatskar, Mark; Choi, Yejin; Zettlemoyer, Luke. Neural AMR: Sequence-to-Sequence Models for Parsing and Generation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vol. 1, 2017: 146-157.
  • [38] Zhou, Xiao; Ling, Zhen-Hua; Dai, Li-Rong. UnitNet: A Sequence-to-Sequence Acoustic Model for Concatenative Speech Synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2643-2655.
  • [39] Dutil, Francis; Gulcehre, Caglar; Trischler, Adam; Bengio, Yoshua. Plan, Attend, Generate: Planning for Sequence-to-Sequence Models. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, 30.
  • [40] Kutsuzawa, Kyo; Sakaino, Sho; Tsuji, Toshiaki. Sequence-to-Sequence Models for Trajectory Deformation of Dynamic Manipulation. IECON 2017 - 43rd Annual Conference of the IEEE Industrial Electronics Society, 2017: 5227-5232.