STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Citations: 0

Authors:

Fang, Qingkai [1,2,3]

Ye, Rong [3]

Li, Lei [3,4]

Feng, Yang [1,2]

Wang, Mingxuan [3]
Affiliations:

[1] Chinese Acad Sci ICT CAS, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Beijing, Peoples R China

[3] ByteDance AI Lab, Beijing, Peoples R China

[4] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA

Source:

PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS) | 2022

Funding:

National Key Research and Development Program;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

How can we learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of different modalities, feed both unimodal speech sequences and multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
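
The abstract's two ingredients, mixing representation sequences across modalities and regularizing the two output predictions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that speech frames have already been grouped into word-aligned segments, that speech and text vectors share one dimension, that mixing is done per word with a probability `p`, and that the regularizer is a Jensen-Shannon divergence. The names `mix_sequences`, `js_divergence`, and `p` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mix_sequences(speech_segments, text_embeddings, p, rng):
    """Build a mixed sequence from word-aligned speech and text representations.

    speech_segments: list of arrays, each of shape (n_frames_i, d)  (assumed alignment)
    text_embeddings: list of arrays, each of shape (d,)
    p: probability of taking the text embedding at each aligned position (assumption)
    """
    mixed = []
    for segment, embedding in zip(speech_segments, text_embeddings):
        if rng.random() < p:
            mixed.append(embedding[None, :])  # one text vector, shape (1, d)
        else:
            mixed.append(segment)             # speech frames, shape (n_frames_i, d)
    return np.concatenate(mixed, axis=0)


def js_divergence(p_dist, q_dist, eps=1e-12):
    """Jensen-Shannon divergence between two distributions over the last axis."""
    m = 0.5 * (p_dist + q_dist)

    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)

    return 0.5 * kl(p_dist, m) + 0.5 * kl(q_dist, m)


# Toy usage: two words, hidden size 4.
rng = np.random.default_rng(0)
speech = [rng.normal(size=(3, 4)), rng.normal(size=(2, 4))]
text = [rng.normal(size=4), rng.normal(size=4)]
mixed = mix_sequences(speech, text, p=0.5, rng=rng)

# Stand-ins for the model's output distributions on the two inputs.
pred_speech = np.array([0.7, 0.2, 0.1])
pred_mixed = np.array([0.6, 0.3, 0.1])
consistency_loss = js_divergence(pred_speech, pred_mixed)
```

In a full system the two prediction vectors would come from the same translation model run on the speech-only and mixed inputs, and the consistency term would be added to the usual translation losses. The sketch only shows the shape of the computation.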

Pages: 7050-7062

Page count: 13
