ANALYZING ASR PRETRAINING FOR LOW-RESOURCE SPEECH-TO-TEXT TRANSLATION

Cited: 0
Authors
Stoian, Mihaela C. [1]
Bansal, Sameer [1]
Goldwater, Sharon [1]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh, Midlothian, Scotland
Keywords
speech-to-text translation; transfer learning; pretraining; speech recognition; data augmentation; neural networks; recurrent
DOI
10.1109/icassp40776.2020.9053847
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
Previous work has shown that for low-resource source languages, automatic speech-to-text translation (AST) can be improved by pretraining an end-to-end model on automatic speech recognition (ASR) data from a high-resource language. However, it is not clear which factors (e.g., language relatedness or size of the pretraining data) yield the biggest improvements, or whether pretraining can be effectively combined with other methods such as data augmentation. Here, we experiment with pretraining on datasets of varying sizes, including languages related and unrelated to the AST source language. We find that the best predictor of final AST performance is the word error rate of the pretrained ASR model, and that differences in ASR/AST performance correlate with how phonetic information is encoded in the later RNN layers of our model. We also show that pretraining and data augmentation yield complementary benefits for AST.
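The abstract's headline finding is that the word error rate (WER) of the pretrained ASR model best predicts downstream AST quality. For context, WER is the standard ASR metric: the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length. A minimal self-contained sketch of the computation (not the paper's code, just the standard definition):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sang")` is 1/3 (one substitution over three reference words). Libraries such as `jiwer` provide the same metric off the shelf.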
Pages: 7909-7913 (5 pages)
Related Papers (50 total; first 10 shown)
  • [1] Bansal, S., Kamper, H., Livescu, K., Lopez, A., Goldwater, S. Low-Resource Speech-to-Text Translation. Interspeech 2018: 1298-1302
  • [2] Bansal, S., Kamper, H., Livescu, K., Lopez, A., Goldwater, S. Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation. NAACL-HLT 2019, Vol. 1: 58-68
  • [3] Chen, J., Ma, M., Zheng, R., Huang, L. Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR. Findings of ACL-IJCNLP 2021: 4618-4624
  • [4] Mi, C., Xie, L., Zhang, Y. Improving Data Augmentation for Low-Resource Speech-to-Text Translation with Diverse Paraphrasing. Neural Networks, 2022, 148: 194-205
  • [5] Medeiros, E., Corado, L., Rato, L., Quaresma, P., Salgueiro, P. Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning. Future Internet, 2023, 15(5)
  • [6] Wiesner, M., Renduchintala, A., Watanabe, S., Liu, C., Dehak, N., Khudanpur, S. Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings. Interspeech 2019: 4375-4379
  • [7] Liu, H.-C., Day, M.-Y., Wang, C.-C. Speech-to-Speech Low-Resource Translation. IEEE International Conference on Information Reuse and Integration for Data Science (IRI) 2023: 91-95
  • [8] Dong, Q., Wang, M., Zhou, H., Xu, S., Xu, B., Li, L. Consecutive Decoding for Speech-to-Text Translation. AAAI 2021, 35: 12738-12748
  • [9] Chung, Y.-A., Weng, W.-H., Tong, S., Glass, J. Towards Unsupervised Speech-to-Text Translation. IEEE ICASSP 2019: 7170-7174
  • [10] Schnell, M., Küstner, M., Jokisch, O., Hoffmann, R. Text-to-Speech for Low-Resource Systems. IEEE Workshop on Multimedia Signal Processing, 2002: 259-262