INDEPENDENT LANGUAGE MODELING ARCHITECTURE FOR END-TO-END ASR

Cited: 0
Authors
Van Tung Pham [1 ]
Xu, Haihua [1 ]
Khassanov, Yerbolat [1 ,2 ]
Zeng, Zhiping [1 ]
Chng, Eng Siong [1 ]
Ni, Chongjia [3 ]
Ma, Bin [3 ]
Li, Haizhou [4 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] Nazarbayev Univ, ISSAI, Astana, Kazakhstan
[3] Alibaba Grp, Machine Intelligence Technol, Hangzhou, Peoples R China
[4] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
Keywords
Independent language model; low-resource ASR; pre-training; fine-tuning; catastrophic forgetting;
DOI
10.1109/icassp40776.2020.9054116
CLC number
O42 [Acoustics];
Subject classification
070206 ; 082403 ;
Abstract
The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which plays the role of the language model (LM), is conditioned on the encoder output. This entangles the acoustic encoder with the language model and prevents the LM from being trained separately on external text data. To address this problem, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using external text data. We study two strategies for updating the new architecture. Experimental results show that 1) the independent LM architecture benefits from external text data, achieving 9.3% and 22.8% relative character and word error rate reductions on the Mandarin HKUST and English NSC datasets, respectively; and 2) the proposed architecture works well with an external LM and generalizes to different amounts of labelled data.
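The decoupling described in the abstract can be illustrated with a minimal sketch: the LM subnet's recurrent state depends only on previously emitted tokens (so it can be pre-trained on text alone), and the acoustic path contributes only at a late fusion point. All names, sizes, and the additive fusion rule below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 6, 8  # hypothetical vocabulary and hidden sizes

class LMSubnet:
    """Toy LM subnet: conditioned only on previous tokens,
    so it can be (pre-)trained on external text with no audio."""
    def __init__(self):
        self.emb = rng.normal(size=(V, H))   # token embeddings
        self.W = rng.normal(size=(H, H))     # recurrent weights
        self.out = rng.normal(size=(H, V))   # output projection

    def step(self, token, state):
        # State update uses only the token history, never encoder output.
        state = np.tanh(self.emb[token] + state @ self.W)
        return state, state @ self.out       # new state, LM logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fused_step(lm, token, state, acoustic_logits):
    """Late fusion: acoustic evidence joins only at the output layer,
    keeping the LM subnet independently trainable/replaceable."""
    state, lm_logits = lm.step(token, state)
    return state, softmax(lm_logits + acoustic_logits)

lm = LMSubnet()
state = np.zeros(H)
acoustic = rng.normal(size=V)  # stand-in for the encoder/attention context
state, probs = fused_step(lm, 0, state, acoustic)
assert probs.shape == (V,) and abs(probs.sum() - 1.0) < 1e-9
```

Because `LMSubnet.step` never sees `acoustic`, its weights can be updated on text-only data (the paper's motivation), with only the fusion path requiring paired speech-text data.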
Pages: 7059-7063
Page count: 5