Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition

Cited by: 28
Authors
Yi, Cheng [1 ,2 ]
Zhou, Shiyu [1 ]
Xu, Bo [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100190, Peoples R China
Keywords
Acoustics; Bit error rate; Linguistics; Task analysis; Training; Decoding; Data models; BERT; end-to-end modeling; low-resource ASR; pre-training; wav2vec; CTC; ASR
DOI
10.1109/LSP.2021.3071668
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Code
0808; 0809
Abstract
End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, the available labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already demonstrated impressive ASR performance, yet the transcriptions remain inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into a single end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters, and a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text-context modeling ability of the pre-trained linguistic encoder. Experiments show that our model utilizes the pre-trained modules effectively and achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
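Since the abstract describes the fusion only at an architectural level, the following minimal PyTorch sketch illustrates one way such a fusion could be wired together: acoustic hidden states are aligned to the target token length by a parameter-free monotonic rule (contiguous segment averaging), mapped by a fully connected layer into the linguistic encoder's hidden space, and then passed through the linguistic encoder. The class names, the dummy stand-in encoders, and the segment-averaging alignment rule are illustrative assumptions, not the authors' released implementation; the paper's monotonic attention may be realized differently.

# Sketch of fusing a pre-trained acoustic encoder with a pre-trained
# linguistic encoder: parameter-free monotonic length matching plus a
# fully connected mapping layer between the two hidden spaces.
import torch
import torch.nn as nn


class FusedASRModel(nn.Module):
    def __init__(self, acoustic_encoder: nn.Module, linguistic_encoder: nn.Module,
                 acoustic_dim: int, linguistic_dim: int, vocab_size: int):
        super().__init__()
        self.acoustic_encoder = acoustic_encoder      # stands in for wav2vec2.0
        self.linguistic_encoder = linguistic_encoder  # stands in for BERT
        # Fully connected layer mapping acoustic hidden states into the
        # linguistic encoder's hidden space (the "hidden mapping" in the abstract).
        self.mapping = nn.Linear(acoustic_dim, linguistic_dim)
        self.output = nn.Linear(linguistic_dim, vocab_size)

    @staticmethod
    def monotonic_downsample(acoustic: torch.Tensor, target_len: int) -> torch.Tensor:
        # Parameter-free monotonic length matching (illustrative assumption):
        # split the T acoustic frames into `target_len` contiguous segments and
        # average each segment, so frame order is preserved and no weights are learned.
        batch, T, dim = acoustic.shape
        # Assign every frame to a token index in a monotone, left-to-right way.
        idx = torch.linspace(0, target_len - 1e-6, T, device=acoustic.device).long()
        pooled = acoustic.new_zeros(batch, target_len, dim)
        counts = acoustic.new_zeros(target_len).scatter_add_(
            0, idx, torch.ones(T, device=acoustic.device))
        pooled.index_add_(1, idx, acoustic)
        return pooled / counts.clamp(min=1).view(1, -1, 1)

    def forward(self, speech: torch.Tensor, target_len: int) -> torch.Tensor:
        acoustic = self.acoustic_encoder(speech)              # (B, T, acoustic_dim)
        aligned = self.monotonic_downsample(acoustic, target_len)
        mapped = self.mapping(aligned)                        # (B, N, linguistic_dim)
        linguistic = self.linguistic_encoder(mapped)          # (B, N, linguistic_dim)
        return self.output(linguistic)                        # per-token logits


if __name__ == "__main__":
    # Dummy encoders stand in for the frozen pre-trained wav2vec2.0 and BERT.
    acoustic_enc = nn.Sequential(nn.Linear(80, 512), nn.ReLU())
    linguistic_enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
        num_layers=2)
    model = FusedASRModel(acoustic_enc, linguistic_enc,
                          acoustic_dim=512, linguistic_dim=768, vocab_size=5000)
    feats = torch.randn(2, 200, 80)   # 2 utterances, 200 frames of 80-dim features
    logits = model(feats, target_len=20)
    print(logits.shape)               # torch.Size([2, 20, 5000])

In this sketch both encoders would stay mostly frozen, and only the mapping layer (and any unfrozen top layers) adapts during fine-tuning, mirroring the abstract's claim that the fused model only learns the speech-to-language transfer from limited labeled data.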
Pages: 788-792
Number of pages: 5
Related Papers
50 in total
  • [41] Speech-to-speech Low-resource Translation
    Liu, Hsiao-Chuan
    Day, Min-Yuh
    Wang, Chih-Chien
    2023 IEEE 24TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE, IRI, 2023, : 91 - 95
  • [42] Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition
    Hou, Wenxin
    Zhu, Han
    Wang, Yidong
    Wang, Jindong
    Qin, Tao
Xu, Renjun
    Shinozaki, Takahiro
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 317 - 329
  • [43] Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition
    Zhou, Rui
    Koshikawa, Takaki
    Ito, Akinori
    Nose, Takashi
    Chen, Chia-Ping
    IEEE ACCESS, 2024, 12 : 158493 - 158504
  • [44] A Method Improves Speech Recognition with Contrastive Learning in Low-Resource Languages
    Sun, Lixu
    Yolwas, Nurmemet
    Jiang, Lina
    APPLIED SCIENCES-BASEL, 2023, 13 (08):
  • [45] Multitask Learning of Deep Neural Networks for Low-Resource Speech Recognition
    Chen, Dongpeng
    Mak, Brian Kan-Wing
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23 (07) : 1172 - 1183
  • [46] Exploring End-to-End Techniques for Low-Resource Speech Recognition
    Bataev, Vladimir
    Korenevsky, Maxim
    Medennikov, Ivan
    Zatvornitskiy, Alexander
    SPEECH AND COMPUTER (SPECOM 2018), 2018, 11096 : 32 - 41
  • [47] A General Procedure for Improving Language Models in Low-Resource Speech Recognition
    Liu, Qian
    Zhang, Wei-Qiang
    Liu, Jia
    Liu, Yao
    PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 428 - 433
  • [48] Language-universal phonetic encoder for low-resource speech recognition
    Feng, Siyuan
    Tu, Ming
    Xia, Rui
    Huang, Chuanzeng
    Wang, Yuxuan
    INTERSPEECH 2023, 2023, : 1429 - 1433
  • [49] A Novel Self-training Approach for Low-resource Speech Recognition
    Singh, Satwinder
    Hou, Feng
    Wang, Ruili
    INTERSPEECH 2023, 2023, : 1588 - 1592
  • [50] Exploring the Potential of Prompting Methods in Low-Resource Speech Recognition with Whisper
    Chen, Yaqi
    Zhang, Wenlin
    Zhang, Hao
    Yang, Xukui
    Qu, Dan
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 382 - 393