ACOUSTIC-TO-WORD RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS

被引:0
|
作者
Palaskar, Shruti [1 ]
Metze, Florian [1 ]
机构
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
关键词
end-to-end speech recognition; encoder-decoder; acoustic-to-word; speech embeddings;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the out-of-vocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which have to be composed into words using a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that location-aware attention naturally represents words as a single speech-word-vector, despite spanning multiple frames in the input. We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forced-alignments for the Switchboard corpus.
引用
收藏
页码:397 / 404
页数:8
相关论文
共 50 条
  • [21] Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings
    Boito, Marcely Zanon
    Villavicencio, Aline
    Besacier, Laurent
    INTERSPEECH 2019, 2019, : 2688 - 2692
  • [22] CONFIDENCE ESTIMATION FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION
    Li, Qiujia
    Qiu, David
    Zhang, Yu
    Li, Bo
    He, Yanzhang
    Woodland, Philip C.
    Cao, Liangliang
    Strohman, Trevor
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6388 - 6392
  • [23] MINIMUM WORD ERROR RATE TRAINING FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS
    Prabhavalkar, Rohit
    Sainath, Tara N.
    Wu, Yonghui
    Nguyen, Patrick
    Chen, Zhifeng
    Chiu, Chung-Cheng
    Kannan, Anjuli
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4839 - 4843
  • [24] Advancing sequence-to-sequence based speech recognition
    Tuske, Zoltan
    Audhkhasi, Kartik
    Saon, George
    INTERSPEECH 2019, 2019, : 3780 - 3784
  • [25] Sequence-to-Sequence Contrastive Learning for Text Recognition
    Aberdam, Aviad
    Litman, Ron
    Tsiper, Shahar
    Anschel, Oron
    Slossberg, Ron
    Mazor, Shai
    Manmatha, R.
    Perona, Pietro
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 15297 - 15307
  • [26] Deep Reinforcement Learning for Sequence-to-Sequence Models
    Keneshloo, Yaser
    Shi, Tian
    Ramakrishnan, Naren
    Reddy, Chandan K.
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (07) : 2469 - 2489
  • [27] Sequence-to-Sequence Models for Automated Text Simplification
    Botarleanu, Robert-Mihai
    Dascalu, Mihai
    Crossley, Scott Andrew
    McNamara, Danielle S.
    ARTIFICIAL INTELLIGENCE IN EDUCATION (AIED 2020), PT II, 2020, 12164 : 31 - 36
  • [28] Sequence-to-Sequence Models for Emphasis Speech Translation
    Quoc Truong Do
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (10) : 1873 - 1883
  • [29] Sequence-to-sequence Models for Cache Transition Systems
    Peng, Xiaochang
    Song, Linfeng
    Gildea, Daniel
    Satta, Giorgio
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 1842 - 1852
  • [30] On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models
    Michel, Paul
    Li, Xian
    Neubig, Graham
    Pino, Juan Miguel
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 3103 - 3114