Unified Speech-Text Pre-training for Speech Translation and Recognition

Cited: 0
Authors
Tang, Yun [1 ]
Gong, Hongyu [1 ]
Dong, Ning [1 ]
Wang, Changhan [1 ]
Hsu, Wei-Ning [1 ]
Gu, Jiatao [1 ]
Baevski, Alexei [1 ]
Li, Xian [1 ]
Mohamed, Abdelrahman [1 ]
Auli, Michael [1 ]
Pino, Juan [1 ]
Affiliations
[1] Meta AI, Menlo Park, CA 94025, USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross-modality learning: a self-supervised speech subtask that leverages unlabelled speech data, a (self-)supervised text-to-text subtask that makes use of abundant text training data, and two auxiliary supervised speech subtasks that unify the speech and text modeling spaces. Our contribution lies in integrating linguistic information from the text corpus into speech pre-training. Detailed analysis reveals learning interference among the subtasks, so two pre-training configurations, one for speech translation and one for speech recognition, are presented to alleviate this interference. Our experiments show that the proposed method effectively fuses speech and text information into one model. It achieves a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset and word error rates comparable to wav2vec 2.0 on the LibriSpeech speech recognition task.
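To make the multi-task setup above concrete, the following is a minimal PyTorch sketch of a shared encoder-decoder trained on a weighted sum of four subtask losses. It is not the authors' implementation: all module names, dimensions, subtask labels, and loss weights are hypothetical, and every subtask is simplified here to a sequence-to-sequence cross-entropy loss (the paper's actual self-supervised speech objective is more involved).

```python
# Illustrative sketch only, not the paper's code. It shows one shared
# Transformer encoder-decoder consuming either speech features or text
# tokens, with four subtask losses combined into a single weighted sum.
import torch
import torch.nn as nn

class UnifiedSpeechTextModel(nn.Module):
    """Hypothetical shared encoder-decoder over speech and text inputs."""
    def __init__(self, vocab_size=1000, n_mels=80, d_model=256):
        super().__init__()
        # Stand-in for a convolutional speech feature extractor.
        self.speech_frontend = nn.Linear(n_mels, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt_tokens, src_is_speech):
        # Route the source through its modality-specific frontend,
        # then decode target tokens with the shared Transformer.
        src = self.speech_frontend(src) if src_is_speech else self.text_embed(src)
        hidden = self.transformer(src, self.text_embed(tgt_tokens))
        return self.out_proj(hidden)

def multitask_step(model, batches, weights):
    """One update over all subtasks: total loss is a weighted sum of
    per-subtask cross-entropy losses (weights are a tuning choice)."""
    ce = nn.CrossEntropyLoss()
    total = 0.0
    for name, (src, tgt, is_speech) in batches.items():
        logits = model(src, tgt[:, :-1], is_speech)  # teacher forcing
        total = total + weights[name] * ce(
            logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    return total

# Toy usage with random data standing in for the four subtasks.
model = UnifiedSpeechTextModel()
speech = torch.randn(2, 50, 80)         # (batch, frames, mel bins)
text = torch.randint(0, 1000, (2, 20))  # (batch, source tokens)
tgt = torch.randint(0, 1000, (2, 12))   # (batch, target tokens)
batches = {
    "speech_self_sup": (speech, tgt, True),   # simplified to seq2seq here
    "text_to_text":    (text, tgt, False),
    "aux_speech_1":    (speech, tgt, True),   # e.g. ASR-style supervision
    "aux_speech_2":    (speech, tgt, True),   # e.g. ST-style supervision
}
weights = {k: 1.0 for k in batches}
loss = multitask_step(model, batches, weights)
loss.backward()
```

A single weighted sum like this is the naive combination in which the subtask interference mentioned in the abstract would surface; the paper's remedy is to use separate pre-training configurations for translation and recognition rather than one mix for both.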
Pages: 1488-1499
Page count: 12
Related papers
50 in total (10 shown)
  • [1] Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation
    Bansal, Sameer
    Kamper, Herman
    Livescu, Karen
    Lopez, Adam
    Goldwater, Sharon
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 58 - 68
  • [2] Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
    Yu, Tianshu
    Gao, Haoyu
    Lin, Ting-En
    Yang, Min
    Wu, Yuchuan
    Ma, Wentao
    Wang, Chao
    Huang, Fei
    Li, Yongbin
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 7900 - 7913
  • [3] Self-Training and Pre-Training Are Complementary for Speech Recognition
    Xu, Qiantong
    Baevski, Alexei
    Likhomanenko, Tatiana
    Tomasello, Paden
    Conneau, Alexis
    Collobert, Ronan
    Synnaeve, Gabriel
    Auli, Michael
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3030 - 3034
  • [4] STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation
    Fang, Qingkai
    Ye, Rong
    Li, Lei
    Feng, Yang
    Wang, Mingxuan
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7050 - 7062
  • [5] Sentiment-Aware Automatic Speech Recognition Pre-training for Enhanced Speech Emotion Recognition
    Ghriss, Ayoub
    Yang, Bo
    Rozgic, Viktor
    Shriberg, Elizabeth
    Wang, Chao
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7347 - 7351
  • [6] Curriculum Pre-training for End-to-End Speech Translation
    Wang, Chengyi
    Wu, Yu
    Liu, Shujie
    Zhou, Ming
    Yang, Zhenglu
    [J]. 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3728 - 3738
  • [7] Improved Deliberation Network with Text Pre-training for Code-Switching Automatic Speech Recognition
    Shen, Zhijie
    Guo, Wu
    [J]. INTERSPEECH 2022, 2022, : 3854 - 3858
  • [8] VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning
    Zhu, Qiushi
    Zhou, Long
    Zhang, Ziqiang
    Liu, Shujie
    Jiao, Binxing
    Zhang, Jie
    Dai, Lirong
    Jiang, Daxin
    Li, Jinyu
    Wei, Furu
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1055 - 1064
  • [9] A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
    Meng, Weijing
    Yolwas, Nurmemet
    [J]. SENSORS, 2023, 23 (02)
  • [10] LatticeBART: Lattice-to-Lattice Pre-training for Speech Recognition
    Dai, Lingfeng
    Chen, Lu
    Zhou, Zhikai
    Yu, Kai
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6112 - 6116