STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Cited by: 0
Authors
Fang, Qingkai [1 ,2 ,3 ]
Ye, Rong [3 ]
Li, Lei [3 ,4 ]
Feng, Yang [1 ,2 ]
Wang, Mingxuan [3 ]
Affiliations
[1] Chinese Acad Sci ICT CAS, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] ByteDance AI Lab, Beijing, Peoples R China
[4] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
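Funding
National Key Research and Development Program of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract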
How can we learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of different modalities, feed both unimodal speech sequences and multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
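The mixing and regularization steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the mixing granularity, the mixing rule, or the form of the regularizer, so the sketch assumes that speech frames are already grouped into segments aligned one-to-one with text tokens, that each segment is swapped for its text embedding with a probability `p_text`, and that the two output distributions are compared with the Jensen-Shannon divergence. The names `mix_sequences`, `js_divergence`, and `p_text` are hypothetical.

```python
import math
import random


def mix_sequences(speech_segments, text_embeddings, p_text, rng):
    """Build a mixed sequence from aligned speech segments and text embeddings.

    Assumption: speech_segments[i] is a list of frame vectors aligned with
    text_embeddings[i], a single token vector of the same dimension.
    """
    if len(speech_segments) != len(text_embeddings):
        raise ValueError("expected one text embedding per speech segment")
    mixed = []
    for frames, token_vec in zip(speech_segments, text_embeddings):
        if rng.random() < p_text:
            # Substitute the whole speech segment with its text embedding.
            mixed.append(list(token_vec))
        else:
            # Keep the original speech frames for this segment.
            mixed.extend(list(frame) for frame in frames)
    return mixed


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, used here as an assumed consistency loss."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy usage: two aligned units with 2-dimensional representations.
rng = random.Random(0)
speech = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
text = [[1.0, 1.0], [2.0, 2.0]]
mixed = mix_sequences(speech, text, p_text=0.5, rng=rng)

# In training, the same translation model would score both the speech-only
# and the mixed input; the two predicted distributions below are stand-ins.
pred_speech = [0.7, 0.2, 0.1]
pred_mixed = [0.6, 0.3, 0.1]
consistency_loss = js_divergence(pred_speech, pred_mixed)
```

In a full system the consistency term would be added to the usual translation losses on both inputs; how the terms are weighted is left open here because the abstract does not specify it.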
Pages: 7050-7062
Number of pages: 13