Unified Speech-Text Pre-training for Speech Translation and Recognition

Cited: 0
Authors
Tang, Yun [1 ]
Gong, Hongyu [1 ]
Dong, Ning [1 ]
Wang, Changhan [1 ]
Hsu, Wei-Ning [1 ]
Gu, Jiatao [1 ]
Baevski, Alexei [1 ]
Li, Xian [1 ]
Mohamed, Abdelrahman [1 ]
Auli, Michael [1 ]
Pino, Juan [1 ]
Affiliations
[1] Meta AI, Menlo Park, CA 94025, USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross-modality learning: a self-supervised speech subtask that leverages unlabelled speech data, a (self-)supervised text-to-text subtask that makes use of abundant text training data, and two auxiliary supervised speech subtasks that unify the speech and text modeling spaces. Our contribution lies in integrating linguistic information from the text corpus into speech pre-training. Detailed analysis reveals learning interference among the subtasks, so two pre-training configurations, one for speech translation and one for speech recognition, are presented to alleviate this interference. Our experiments show that the proposed method effectively fuses speech and text information into one model. It achieves a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset and word error rates comparable to wav2vec 2.0 on the LibriSpeech speech recognition task.
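To make the multi-task setup above concrete, the following is a minimal PyTorch sketch of a shared encoder-decoder trained on a weighted sum of four subtask losses. It is not the authors' implementation: all module names, dimensions, subtask labels, and loss weights are hypothetical, and every subtask is simplified here to a sequence-to-sequence cross-entropy loss (the paper's actual self-supervised speech objective is more involved).

```python
# Illustrative sketch only, not the paper's code. It shows one shared
# Transformer encoder-decoder consuming either speech features or text
# tokens, with four subtask losses combined into a single weighted sum.
import torch
import torch.nn as nn

class UnifiedSpeechTextModel(nn.Module):
    """Hypothetical shared encoder-decoder over speech and text inputs."""
    def __init__(self, vocab_size=1000, n_mels=80, d_model=256):
        super().__init__()
        # Stand-in for a convolutional speech feature extractor.
        self.speech_frontend = nn.Linear(n_mels, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt_tokens, src_is_speech):
        # Route the source through its modality-specific frontend,
        # then decode target tokens with the shared Transformer.
        src = self.speech_frontend(src) if src_is_speech else self.text_embed(src)
        hidden = self.transformer(src, self.text_embed(tgt_tokens))
        return self.out_proj(hidden)

def multitask_step(model, batches, weights):
    """One update over all subtasks: total loss is a weighted sum of
    per-subtask cross-entropy losses (weights are a tuning choice)."""
    ce = nn.CrossEntropyLoss()
    total = 0.0
    for name, (src, tgt, is_speech) in batches.items():
        logits = model(src, tgt[:, :-1], is_speech)  # teacher forcing
        total = total + weights[name] * ce(
            logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    return total

# Toy usage with random data standing in for the four subtasks.
model = UnifiedSpeechTextModel()
speech = torch.randn(2, 50, 80)         # (batch, frames, mel bins)
text = torch.randint(0, 1000, (2, 20))  # (batch, source tokens)
tgt = torch.randint(0, 1000, (2, 12))   # (batch, target tokens)
batches = {
    "speech_self_sup": (speech, tgt, True),   # simplified to seq2seq here
    "text_to_text":    (text, tgt, False),
    "aux_speech_1":    (speech, tgt, True),   # e.g. ASR-style supervision
    "aux_speech_2":    (speech, tgt, True),   # e.g. ST-style supervision
}
weights = {k: 1.0 for k in batches}
loss = multitask_step(model, batches, weights)
loss.backward()
```

A single weighted sum like this is the naive combination in which the subtask interference mentioned in the abstract would surface; the paper's remedy is to use separate pre-training configurations for translation and recognition rather than one mix for both.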
Pages: 1488-1499
Page count: 12
Related papers
50 in total (10 shown)
  • [1] Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation
    Bansal, Sameer
    Kamper, Herman
    Livescu, Karen
    Lopez, Adam
    Goldwater, Sharon
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 58 - 68
  • [2] Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
    Yu, Tianshu
    Gao, Haoyu
    Lin, Ting-En
    Yang, Min
    Wu, Yuchuan
    Ma, Wentao
    Wang, Chao
    Huang, Fei
    Li, Yongbin
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 7900 - 7913
  • [3] Self-Training and Pre-Training Are Complementary for Speech Recognition
    Xu, Qiantong
    Baevski, Alexei
    Likhomanenko, Tatiana
    Tomasello, Paden
    Conneau, Alexis
    Collobert, Ronan
    Synnaeve, Gabriel
    Auli, Michael
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3030 - 3034
  • [4] STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation
    Fang, Qingkai
    Ye, Rong
    Li, Lei
    Feng, Yang
    Wang, Mingxuan
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7050 - 7062
  • [5] Sentiment-Aware Automatic Speech Recognition Pre-training for Enhanced Speech Emotion Recognition
    Ghriss, Ayoub
    Yang, Bo
    Rozgic, Viktor
    Shriberg, Elizabeth
    Wang, Chao
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7347 - 7351
  • [6] Curriculum Pre-training for End-to-End Speech Translation
    Wang, Chengyi
    Wu, Yu
    Liu, Shujie
    Zhou, Ming
    Yang, Zhenglu
    [J]. 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3728 - 3738
  • [7] Improved Deliberation Network with Text Pre-training for Code-Switching Automatic Speech Recognition
    Shen, Zhijie
    Guo, Wu
    [J]. INTERSPEECH 2022, 2022, : 3854 - 3858
  • [8] VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning
    Zhu, Qiushi
    Zhou, Long
    Zhang, Ziqiang
    Liu, Shujie
    Jiao, Binxing
    Zhang, Jie
    Dai, Lirong
    Jiang, Daxin
    Li, Jinyu
    Wei, Furu
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1055 - 1064
  • [9] A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
    Meng, Weijing
    Yolwas, Nurmemet
    [J]. SENSORS, 2023, 23 (02)
  • [10] LatticeBART: Lattice-to-Lattice Pre-training for Speech Recognition
    Dai, Lingfeng
    Chen, Lu
    Zhou, Zhikai
    Yu, Kai
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6112 - 6116