STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Cited by: 0
Authors
Fang, Qingkai [1, 2, 3]
Ye, Rong [3]
Li, Lei [3, 4]
Feng, Yang [1, 2]
Wang, Mingxuan [3]
Affiliations
[1] Chinese Acad Sci ICT CAS, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] ByteDance AI Lab, Beijing, Peoples R China
[4] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Funding
National Key Research and Development Program of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
How can we learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of different modalities, feed both unimodal speech sequences and multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
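The two ingredients named in the abstract, mixing representation sequences across modalities and regularizing the two output predictions, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the mixing granularity, the mixing rule, or the regularizer, so the following are assumptions made only for illustration: speech frames are pre-grouped into segments aligned one-to-one with text units, each unit is taken from the text side with probability `p`, and the consistency term is a Jensen-Shannon divergence. The names `mix_sequences`, `js_divergence`, and `p` are hypothetical.

```python
import math
import random


def mix_sequences(speech_segments, text_embeddings, p, rng):
    """Build a mixed sequence from aligned speech segments and text embeddings.

    speech_segments: list of segments, each a list of frame vectors.
    text_embeddings: list of vectors, one per aligned text unit.
    p: probability of taking the text embedding for a unit (assumed rule).
    """
    mixed = []
    for segment, embedding in zip(speech_segments, text_embeddings):
        if rng.random() < p:
            mixed.append(embedding)   # one text vector replaces the segment
        else:
            mixed.extend(segment)     # keep all speech frames of the segment
    return mixed


def kl_divergence(p_dist, q_dist):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_dist, q_dist) if pi > 0)


def js_divergence(p_dist, q_dist):
    """Jensen-Shannon divergence, used here as an assumed consistency term."""
    mean = [(pi + qi) / 2 for pi, qi in zip(p_dist, q_dist)]
    return 0.5 * kl_divergence(p_dist, mean) + 0.5 * kl_divergence(q_dist, mean)


# Toy data: two aligned units, speech segments of 2 and 1 frames.
speech = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
text = [[1.0, 1.0], [2.0, 2.0]]

mixed = mix_sequences(speech, text, p=0.5, rng=random.Random(0))

# Hypothetical next-token distributions from the speech-only and mixed inputs.
pred_speech = [0.7, 0.2, 0.1]
pred_mixed = [0.6, 0.3, 0.1]
consistency_loss = js_divergence(pred_speech, pred_mixed)
```

In a training loop, a term such as `consistency_loss` would be added to the usual translation losses for the speech-only and mixed inputs; how those terms are weighted is not stated in the abstract.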
Pages: 7050-7062
Number of pages: 13