CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Cited: 0
Authors
Li, Hang [1]
Ding, Wenbiao [1]
Kang, Yu [1]
Liu, Tianqiao [1]
Wu, Zhongqin [1]
Liu, Zitao [1]
Institutions
[1] TAL Educ Grp, Beijing, Peoples R China
Funding
National Key R&D Program of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Existing approaches to audio-and-language task-specific prediction focus on building complicated late-fusion mechanisms. However, these models face the challenges of overfitting with limited labels and poor generalization. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra- and inter-modality connections between audio and language through two proxy tasks over a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our CTAL model on multiple downstream audio-and-language tasks, we observe significant improvements across different tasks, including emotion classification, sentiment analysis, and speaker verification. Furthermore, we design a fusion mechanism for the fine-tuning phase, which allows CTAL to achieve better performance. Lastly, we conduct detailed ablation studies to demonstrate that both our novel cross-modality fusion component and our audio-and-language pre-training methods contribute to the promising results. The code and pretrained models are available at https://github.com/tal-al/CTAL_EMNLP2021.
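The two proxy objectives named in the abstract can be sketched as masked-position losses: cross-entropy over masked text tokens, plus a reconstruction error over masked audio frames. The sketch below is illustrative only; the function names, the L1 reconstruction term, and the unweighted sum of the two losses are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_lm_loss(logits, targets, mask):
    """Cross-entropy averaged over masked text positions only."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)  # softmax over vocabulary
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return (nll * mask).sum() / max(mask.sum(), 1)

def masked_acoustic_loss(pred, frames, mask):
    """L1 reconstruction error averaged over masked audio frames only."""
    err = np.abs(pred - frames).mean(axis=-1)
    return (err * mask).sum() / max(mask.sum(), 1)

# Toy batch: 6 text tokens over a 10-word vocabulary, 8 audio frames of dim 4.
logits = rng.normal(size=(6, 10))           # model predictions for text
targets = rng.integers(0, 10, size=6)       # original token ids
text_mask = np.array([1, 0, 0, 1, 0, 0])    # which tokens were masked

pred = rng.normal(size=(8, 4))              # reconstructed audio frames
frames = rng.normal(size=(8, 4))            # original audio features
audio_mask = np.array([0, 1, 1, 0, 0, 0, 1, 0])

total = masked_lm_loss(logits, targets, text_mask) \
    + masked_acoustic_loss(pred, frames, audio_mask)
print(float(total))
```

In pre-training, both masks are sampled per example and the combined loss is minimized jointly, so the encoder must use the unmasked modality (and context) to predict what was hidden.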
Pages: 3966-3977
Page count: 12
Related Papers
50 items in total
  • [1] Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation
    Wu, Siying
    Fu, Xueyang
    Wu, Feng
    Zha, Zheng-Jun
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4233 - 4241
  • [2] Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
    Jiang, Chaoya
    Ye, Wei
    Xu, Haiyang
    Huang, Songfang
    Huang, Fei
    Zhang, Shikun
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 14660 - 14679
  • [3] VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
    Wang, Teng
    Jiang, Wenhao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Yin, Chengguo
    Luo, Ping
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
  • [4] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training
    Li, Gen
    Duan, Nan
    Fang, Yuejian
    Gong, Ming
    Jiang, Daxin
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11336 - 11344
  • [5] UniXcoder: Unified Cross-Modal Pre-training for Code Representation
    Guo, Daya
    Lu, Shuai
    Duan, Nan
    Wang, Yanlin
    Zhou, Ming
    Yin, Jian
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7212 - 7225
  • [6] Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
    Zeng, Yan
    Zhou, Wangchunshu
    Luo, Ao
    Cheng, Ziming
    Zhang, Xinsong
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5731 - 5746
  • [7] VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification
    Bakkali, Souhail
    Ming, Zuheng
    Coustaty, Mickael
    Rusinol, Marcal
    Ramos Terrades, Oriol
    [J]. PATTERN RECOGNITION, 2023, 139
  • [8] CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
    Ma, Zhiyuan
    Li, Jianjun
    Li, Guohui
    Huang, Kaiyan
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4515 - 4524
  • [9] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation
    Wen, Keyu
    Xia, Jin
    Huang, Yuanyuan
    Li, Linyang
    Xu, Jiayan
    Shao, Jie
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2188 - 2197
  • [10] UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
    Zhou, Mingyang
    Zhou, Luowei
    Wang, Shuohang
    Cheng, Yu
    Li, Linjie
    Yu, Zhou
    Liu, Jingjing
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4153 - 4163