Robust video question answering via contrastive cross-modality representation learning

Cited: 0
Authors
Xun YANG [1 ]
Jianming ZENG [1 ,2 ]
Dan GUO [3 ]
Shanshan WANG [4 ]
Jianfeng DONG [5 ]
Meng WANG [3 ,2 ]
Affiliations
[1] School of Information Science and Technology, University of Science and Technology of China
[2] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
[3] School of Computer Science and Information Engineering, Hefei University of Technology
[4] Institutes of Physical Science and Information Technology, Anhui University
[5] School of Computer Science and Technology, Zhejiang Gongshang University
Keywords
DOI
Not available
CLC Classification Number
TP391.41 []; TP391.1 [text information processing];
Subject Classification Number
080203;
Abstract
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking key video content, leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video is crucial for robust VideoQA but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, rather than exploiting shortcuts in the datasets. Specifically, we design a self-supervised contrastive learning objective to contrast positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is pushed closer to that of an intervened input produced by video perturbation. We expect the fused representation to attend to the global context of videos rather than a few static keyframes. Moreover, we introduce an effective temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbations of videos. Our method is model-agnostic and easily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets show the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
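The two distribution-level objectives described in the abstract can be sketched concretely. Below is a minimal PyTorch sketch assuming a generic VideoQA backbone that outputs a fused video-question embedding and answer logits; the function names, the InfoNCE-style contrastive formulation, and the symmetric KL variant are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def contrastive_fusion_loss(fused_orig, fused_pos, fused_neg, tau=0.1):
    """InfoNCE-style objective (illustrative): pull the fused representation
    of the original video-question pair toward that of a mildly perturbed
    (positive) view, and push it away from K heavily intervened (negative)
    views. Shapes: fused_orig, fused_pos: (B, D); fused_neg: (B, K, D)."""
    fused_orig = F.normalize(fused_orig, dim=-1)
    fused_pos = F.normalize(fused_pos, dim=-1)
    fused_neg = F.normalize(fused_neg, dim=-1)
    pos = (fused_orig * fused_pos).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg = torch.einsum('bd,bkd->bk', fused_orig, fused_neg) / tau    # (B, K)
    logits = torch.cat([pos, neg], dim=1)                            # (B, 1+K)
    # The positive pair sits at index 0 of each row.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)

def kl_invariance_loss(logits_orig, logits_pert):
    """Symmetric KL divergence between the answer distributions predicted
    from the original and the temporally perturbed video, encouraging
    predictions to stay invariant under the perturbation."""
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_pert, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, reduction='batchmean', log_target=True)  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, reduction='batchmean', log_target=True)  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```

In training, such terms would typically be added to the standard QA cross-entropy with weighting hyperparameters. The temporal order regularization mentioned in the abstract (e.g., a ranking or order-prediction objective over shuffled clips) is omitted here because the abstract does not specify its form.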
Pages: 211-226
Number of pages: 16
Related Papers
50 items in total
  • [31] Unsupervised Cross-modality Cardiac Image Segmentation via Disentangled Representation Learning and Consistency Regularization
    Wang, Runze
    Zheng, Guoyan
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2021, 2021, 12966 : 517 - 526
  • [32] Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
    Chao, Guan-Lin
    Rastogi, Abhinav
    Yavuz, Semih
    Hakkani-Tur, Dilek
    Chen, Jindong
    Lane, Ian
    20TH ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2019), 2019, : 215 - 225
  • [33] Medical Visual Question Answering via Conditional Reasoning and Contrastive Learning
    Liu, Bo
    Zhan, Li-Ming
    Xu, Li
    Wu, Xiao-Ming
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2023, 42 (05) : 1532 - 1545
  • [34] Efficient Web Video Classification via Cross-modality Knowledge Transferring
    Xia, Shijun
    Li, Tianyu
    Ge, Shengbin
    Dong, Zhengya
    8TH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE (ICIMCS2016), 2016, : 211 - 216
  • [35] Cross-Modality Data Augmentation for Aerial Object Detection with Representation Learning
    Wei, Chiheng
    Bai, Lianfa
    Chen, Xiaoyu
    Han, Jing
    REMOTE SENSING, 2024, 16 (24)
  • [36] Attend to the Difference: Cross-Modality Person Re-Identification via Contrastive Correlation
    Zhang, Shizhou
    Yang, Yifei
    Wang, Peng
    Liang, Guoqiang
    Zhang, Xiuwei
    Zhang, Yanning
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 8861 - 8872
  • [37] Learning Cross-modality Interaction for Robust Depth Perception of Autonomous Driving
    Liang, Yunji
    Chen, Nengzhen
    Yu, Zhiwen
    Tang, Lei
    Yu, Hongkai
    Guo, Bin
    Zeng, Daniel Dajun
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2024, 15 (03)
  • [38] Enhanced Simple Question Answering with Contrastive Learning
    Wang, Xin
    Yang, Lan
    He, Honglian
    Fang, Yu
    Zhan, Huayi
    Zhang, Ji
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, 2022, 13368 : 502 - 515
  • [39] CMMS-GCL: cross-modality metabolic stability prediction with graph contrastive learning
    Du, Bing-Xue
    Long, Yahui
    Li, Xiaoli
    Wu, Min
    Shi, Jian-Yu
    BIOINFORMATICS, 2023, 39 (08)
  • [40] Simple contrastive learning in a self-supervised manner for robust visual question answering
    Yang, Shuwen
    Xiao, Luwei
    Wu, Xingjiao
    Xu, Junjie
    Wang, Linlin
    He, Liang
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241