Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering

Cited: 0
Authors
You, Chenyu [1 ]
Chen, Nuo [2 ]
Zou, Yuexian [2 ,3 ]
Affiliations
[1] Yale Univ, Dept Elect Engn, New Haven, CT 06520 USA
[2] Peking Univ, Sch ECE, ADSPLAB, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
NETWORKS
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among spoken documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations with a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. In addition, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. In this way, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks.
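To make the contrastive stage concrete, the following is a minimal PyTorch-style sketch of how span-deletion and span-substitution augmentations could feed an InfoNCE-style contrastive objective over utterance representations. The names span_delete, span_substitute, and contrastive_loss are illustrative assumptions for this sketch, not the authors' actual implementation.

    import random
    import torch
    import torch.nn.functional as F

    def span_delete(tokens, max_span=5):
        # Augmentation 1: drop a random contiguous span of token ids.
        if len(tokens) <= max_span:
            return tokens[:]
        start = random.randrange(len(tokens) - max_span)
        length = random.randint(1, max_span)
        return tokens[:start] + tokens[start + length:]

    def span_substitute(tokens, vocab_size, max_span=5):
        # Augmentation 2: overwrite a random contiguous span with random token ids.
        out = tokens[:]
        if len(out) <= max_span:
            return out
        start = random.randrange(len(out) - max_span)
        length = random.randint(1, max_span)
        for i in range(start, start + length):
            out[i] = random.randrange(vocab_size)
        return out

    def contrastive_loss(z1, z2, temperature=0.1):
        # InfoNCE over a batch: z1[i] and z2[i] are embeddings of two augmented
        # views of utterance i; all other pairs in the batch act as negatives.
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)

In such a setup, each utterance would be encoded twice (once per augmented view) by a shared encoder before contrastive_loss is applied; minimizing the loss pulls the two views of the same utterance together, which is what makes the learned representations noise-invariant.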
Pages: 28 - 39
Page count: 12
Related Papers
50 records in total
  • [1] Robust video question answering via contrastive cross-modality representation learning
    Yang, Xun
    Zeng, Jianming
    Guo, Dan
    Wang, Shanshan
    Dong, Jianfeng
    Wang, Meng
    [J]. Science China (Information Sciences), 2024, 67 (10) - 226
  • [2] Robust video question answering via contrastive cross-modality representation learning
    Yang, Xun
    Zeng, Jianming
    Guo, Dan
    Wang, Shanshan
    Dong, Jianfeng
    Wang, Meng
    [J]. SCIENCE CHINA-INFORMATION SCIENCES, 2024, 67 (10)
  • [3] Inter-Intra Cross-Modality Self-Supervised Video Representation Learning by Contrastive Clustering
    Wei, Jiutong
    Luo, Guan
    Li, Bing
    Hu, Weiming
    [J]. 2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 4815 - 4821
  • [4] Self-supervised Graph Contrastive Learning for Video Question Answering
    Yao, Xuan
    Gao, Jun-Yu
    Xu, Chang-Sheng
    [J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): 2083 - 2100
  • [5] Self-supervised Dialogue Learning for Spoken Conversational Question Answering
    Chen, Nuo
    You, Chenyu
    Zou, Yuexian
    [J]. INTERSPEECH 2021, 2021, : 231 - 235
  • [6] Simple contrastive learning in a self-supervised manner for robust visual question answering
    Yang, Shuwen
    Xiao, Luwei
    Wu, Xingjiao
    Xu, Junjie
    Wang, Linlin
    He, Liang
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241
  • [7] Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences
    Jing, Longlong
    Zhang, Ling
    Tian, Yingli
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 1581 - 1591
  • [8] Contrastive Image Synthesis and Self-supervised Feature Adaptation for Cross-Modality Biomedical Image Segmentation
    Hu, Xinrong
    Wang, Corey
    Shi, Yiyu
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2329 - 2338
  • [9] A multi-scale self-supervised hypergraph contrastive learning framework for video question answering
    Wang, Zheng
    Wu, Bin
    Ota, Kaoru
    Dong, Mianxiong
    Li, He
    [J]. NEURAL NETWORKS, 2023, 168 : 272 - 286
  • [10] elBERto: Self-supervised commonsense learning for question answering
    Zhan, Xunlin
    Li, Yuan
    Dong, Xiao
    Liang, Xiaodan
    Hu, Zhiting
    Carin, Lawrence
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 258