Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Cited by: 0
Authors:
Kang, Yu [1]
Liu, Tianqiao [1]
Li, Hang [1]
Hao, Yang [1]
Ding, Wenbiao [1,2]
Affiliations:
[1] TAL Education Group, Beijing, People's Republic of China
[2] Tencent, Beijing, People's Republic of China
Funding: National Key R&D Program of China
Keywords: (none listed)
DOI: (not available)
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Multimodal audio-and-text pre-training has recently proven effective, significantly improving performance on many downstream speech understanding tasks. However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data, which poses a challenge for the many languages that are rich in unimodal corpora but scarce in parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves; (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio) given both a noisy version of that text (audio) and the corresponding translated noisy audio features (text embeddings); and (3) the Iterative Denoising Process (IDP), which iteratively translates raw audio (text), together with the corresponding text embeddings (audio features) translated in the previous iteration, into new, less noisy text embeddings (audio features). We adopt a dual cross-modal Transformer as our backbone model, consisting of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves performance comparable to a model pre-trained on fully parallel data across multiple downstream speech understanding tasks, demonstrating the great potential of the proposed method.
Pages: 10875-10883 (9 pages)