Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition

Cited by: 28
Authors
Yi, Cheng [1,2]
Zhou, Shiyu [1]
Xu, Bo [1]
Affiliations
[1] Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[2] University of Chinese Academy of Sciences, Beijing 100190, China
Keywords
Acoustics; Bit error rate; Linguistics; Task analysis; Training; Decoding; Data models; BERT; end-to-end modeling; low-resource ASR; pre-training; wav2vec; CTC; ASR
DOI
10.1109/LSP.2021.3071668
CLC Classification
TM (Electrical Engineering); TN (Electronics and Communication Technology)
Discipline Codes
0808; 0809
Abstract
End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already shown impressive ASR performance, but the available transcriptions are still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism that introduces no additional parameters, and a fully connected layer maps hidden representations between the two modalities. We further propose a scheduled fine-tuning strategy to preserve and exploit the text-context modeling ability of the pre-trained linguistic encoder. Experiments show that our approach makes effective use of the pre-trained modules: the model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
Pages: 788-792
Page count: 5
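
To make the fusion described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation. The checkpoint names (facebook/wav2vec2-base, bert-base-uncased), the adaptive-average-pooling length matcher standing in for the paper's parameter-free monotonic attention, and the freeze/unfreeze helper standing in for the scheduled fine-tuning strategy are all illustrative assumptions based only on the abstract.

# Minimal sketch (assumptions noted above): fuse a pre-trained wav2vec2.0
# acoustic encoder with a pre-trained BERT linguistic encoder for ASR.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model, BertModel


class FusedASR(nn.Module):
    def __init__(self, vocab_size: int,
                 wav2vec_name: str = "facebook/wav2vec2-base",
                 bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained(wav2vec_name)
        self.linguistic = BertModel.from_pretrained(bert_name)
        a_dim = self.acoustic.config.hidden_size
        l_dim = self.linguistic.config.hidden_size
        # Fully connected layer mapping acoustic hidden states into the
        # linguistic encoder's hidden space, as described in the abstract.
        self.bridge = nn.Linear(a_dim, l_dim)
        self.classifier = nn.Linear(l_dim, vocab_size)

    def set_linguistic_trainable(self, trainable: bool) -> None:
        # Stand-in for the scheduled fine-tuning strategy: keep BERT frozen
        # early on so its text-context modeling ability is preserved.
        for p in self.linguistic.parameters():
            p.requires_grad = trainable

    def forward(self, waveform: torch.Tensor, target_len: int) -> torch.Tensor:
        # (B, T_audio) raw audio -> (B, T_frames, a_dim) acoustic features.
        acoustic_out = self.acoustic(waveform).last_hidden_state
        # Parameter-free, monotonic length matching between modalities.
        # The paper uses a monotonic attention mechanism; adaptive average
        # pooling over time is only a simple parameter-free approximation.
        pooled = F.adaptive_avg_pool1d(
            acoustic_out.transpose(1, 2), target_len).transpose(1, 2)
        # Map into BERT's hidden space and run the linguistic encoder.
        linguistic_in = self.bridge(pooled)
        linguistic_out = self.linguistic(inputs_embeds=linguistic_in).last_hidden_state
        return self.classifier(linguistic_out)  # (B, target_len, vocab_size)


if __name__ == "__main__":
    model = FusedASR(vocab_size=5000)
    model.set_linguistic_trainable(False)   # first stage: BERT frozen
    audio = torch.randn(2, 16000)           # 1 s of 16 kHz audio, batch of 2
    logits = model(audio, target_len=10)    # assume 10 output tokens
    print(logits.shape)                     # torch.Size([2, 10, 5000])

Keeping the linguistic encoder frozen in the first stage mirrors the stated intent of the scheduled fine-tuning strategy: only the bridge and classifier learn the speech-to-language transfer at first, and BERT is unfrozen later once the mapping has stabilized.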