Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR

Cited by: 13
Authors
Chen, Zhehuai [1 ,2 ]
Jain, Mahaveer [2 ]
Wang, Yongqiang [2 ]
Seltzer, Michael L. [2 ]
Fuegen, Christian [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, Shanghai, Peoples R China
[2] Facebook AI, One Hacker Way, Menlo Pk, CA 94025 USA
Source
INTERSPEECH 2019
Keywords
End-to-end Speech Recognition; Deep Context; CLAS; Sequence Pooling; Grapheme-to-Phoneme (G2P);
DOI
10.21437/Interspeech.2019-1434
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Classification Code
100104; 100213
Abstract
End-to-end approaches to automatic speech recognition, such as Listen-Attend-Spell (LAS), blend all components of a traditional speech recognizer into a unified model. Although this simplifies training and decoding pipelines, a unified model is hard to adapt when there is a mismatch between training and test data, especially if this contextual information changes dynamically. The Contextual LAS (CLAS) framework addresses this problem by encoding contextual entities into fixed-dimensional embeddings and using an attention mechanism to model the probability of seeing these entities. In this work, we improve the CLAS approach by proposing several new strategies to extract embeddings for the contextual entities. We compare these embedding extractors based on graphemic and phonetic input and/or output sequences and show that an encoder-decoder model trained jointly towards graphemes and phonemes outperforms the other approaches. Leveraging phonetic information yields better discrimination between similarly written graphemic sequences and also helps the model generalize better to graphemic sequences unseen in training. We show significant improvements over the original CLAS approach and also demonstrate that the proposed method scales much better to a large number of contextual entities across multiple domains.
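The sketch below is a minimal, illustrative rendering of the CLAS-style biasing mechanism the abstract describes: each contextual entity is pooled into a fixed-dimensional embedding, and the ASR decoder attends over these bias embeddings at each step, with the attention weights acting as soft evidence for each entity. It is not the authors' implementation; it assumes PyTorch, and the class names (BiasEncoder, BiasAttention), the LSTM last-state pooling, and all dimensions are illustrative choices. The paper's key contribution, training the embedding extractor jointly on grapheme and phoneme sequences, is not shown here.

```python
# Hypothetical sketch of CLAS-style contextual biasing (assumes PyTorch).
# All module names and dimensions are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class BiasEncoder(nn.Module):
    """Embeds each contextual entity (a grapheme or phoneme token sequence)
    into one fixed-dimensional vector via an LSTM and last-state pooling."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, entity_tokens: torch.Tensor) -> torch.Tensor:
        # entity_tokens: (num_entities, max_len) integer ids, 0 = padding
        x = self.embed(entity_tokens)
        _, (h_n, _) = self.lstm(x)           # h_n: (1, num_entities, hidden_dim)
        return h_n.squeeze(0)                # (num_entities, hidden_dim)


class BiasAttention(nn.Module):
    """Attends from the decoder state over the bias embeddings; the attention
    weights serve as soft scores for how relevant each contextual entity is."""

    def __init__(self, dec_dim: int, bias_dim: int, attn_dim: int = 128):
        super().__init__()
        self.query = nn.Linear(dec_dim, attn_dim)
        self.key = nn.Linear(bias_dim, attn_dim)
        self.value = nn.Linear(bias_dim, attn_dim)

    def forward(self, dec_state: torch.Tensor, bias_embs: torch.Tensor):
        # dec_state: (batch, dec_dim); bias_embs: (num_entities, bias_dim)
        q = self.query(dec_state)                      # (batch, attn_dim)
        k = self.key(bias_embs)                        # (num_entities, attn_dim)
        scores = q @ k.t() / k.shape[-1] ** 0.5        # (batch, num_entities)
        weights = torch.softmax(scores, dim=-1)
        context = weights @ self.value(bias_embs)      # (batch, attn_dim)
        return context, weights


# Usage sketch: 3 contextual entities with padded grapheme ids, one decoder step.
if __name__ == "__main__":
    enc = BiasEncoder(vocab_size=30)
    attn = BiasAttention(dec_dim=256, bias_dim=128)
    entities = torch.randint(1, 30, (3, 6))    # 3 entities, up to 6 graphemes each
    dec_state = torch.randn(2, 256)            # batch of 2 decoder states
    ctx, w = attn(dec_state, enc(entities))
    print(ctx.shape, w.shape)                  # (2, 128) and (2, 3)
```

In a full CLAS-style setup, the attended bias context vector is typically concatenated with the acoustic context before the decoder's output projection, and a dedicated "no-bias" entry is usually included so the model can choose to ignore the bias list entirely.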
Pages: 3490-3494
Number of pages: 5