Unified Speech-Text Pre-training for Speech Translation and Recognition

Cited by: 0
Authors
Tang, Yun [1]
Gong, Hongyu [1]
Dong, Ning [1]
Wang, Changhan [1]
Hsu, Wei-Ning [1]
Gu, Jiatao [1]
Baevski, Alexei [1]
Li, Xian [1]
Mohamed, Abdelrahman [1]
Auli, Michael [1]
Pino, Juan [1]
Affiliations
[1] Meta AI, Menlo Park, CA 94025 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross-modality learning. A self-supervised speech subtask leverages unlabeled speech data, and a (self-)supervised text-to-text subtask makes use of abundant text training data. Two auxiliary supervised speech tasks are included to unify the speech and text modeling spaces. Our contribution lies in integrating linguistic information from the text corpus into speech pre-training. Detailed analysis reveals learning interference among subtasks. Two pre-training configurations, one for speech translation and one for speech recognition, are presented to alleviate subtask interference. Our experiments show that the proposed method can effectively fuse speech and text information into one model. It achieves an improvement of between 1.7 and 2.3 BLEU over the state of the art on the MuST-C speech translation dataset and WERs comparable to wav2vec 2.0 on the LibriSpeech speech recognition task.
Pages: 1488-1499
Page count: 12
Related papers
50 in total
  • [21] GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING FOR AUTOMATIC SPEECH RECOGNITION
    Khare, Aparna
    Wu, Minhua
    Bhati, Saurabhchand
    Droppo, Jasha
    Maas, Roland
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 174 - 181
  • [22] Speech Recognition, Machine Translation, and Speech Translation-A Unified Discriminative Learning Paradigm
    He, Xiaodong
    Deng, Li
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2011, 28 (05) : 126 - 133
  • [23] A STUDY ON THE EFFICACY OF MODEL PRE-TRAINING IN DEVELOPING NEURAL TEXT-TO-SPEECH SYSTEM
    Zhang, Guangyan
    Leng, Yichong
    Tan, Daxin
    Qin, Ying
    Song, Kaitao
    Tan, Xu
    Zhao, Sheng
    Lee, Tan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6087 - 6091
  • [24] ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks
    Pelloin, Valentin
    Dary, Franck
    Herve, Nicolas
    Favre, Benoit
    Camelin, Nathalie
    Laurent, Antoine
    Besacier, Laurent
    [J]. INTERSPEECH 2022, 2022, : 3453 - 3457
  • [25] Training Speech Recognition Model with Speech Synthesis and Text Discriminator
    Lin, Hou-an
    Chen, Chia-ping
    [J]. JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2024, 40 (02) : 359 - 373
  • [26] Investigating Self-supervised Pre-training for End-to-end Speech Translation
    Ha Nguyen
    Bougares, Fethi
    Tomashenko, Natalia
    Esteve, Yannick
    Besacier, Laurent
    [J]. INTERSPEECH 2020, 2020, : 1466 - 1470
  • [27] Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning
    Shang, Yanan
    Fu, Tianqi
    [J]. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2024, 24
  • [28] Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding
    Liu, Yuchen
    Zhang, Jiajun
    Xiong, Hao
    Zhou, Long
    He, Zhongjun
    Wu, Hua
    Wang, Haifeng
    Zong, Chengqing
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 8417 - 8424
  • [29] Neural speech enhancement with unsupervised pre-training and mixture training
    Hao, Xiang
    Xu, Chenglin
    Xie, Lei
    [J]. NEURAL NETWORKS, 2023, 158 : 216 - 227
  • [30] GENERATIVE PRE-TRAINING FOR SPEECH WITH AUTOREGRESSIVE PREDICTIVE CODING
    Chung, Yu-An
    Glass, James
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3497 - 3501