Robust video question answering via contrastive cross-modality representation learning

Cited by: 0
Authors
Xun YANG [1]
Jianming ZENG [1,2]
Dan GUO [3]
Shanshan WANG [4]
Jianfeng DONG [5]
Meng WANG [3,2]
Affiliations
[1] School of Information Science and Technology, University of Science and Technology of China
[2] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
[3] School of Computer Science and Information Engineering, Hefei University of Technology
[4] Institutes of Physical Science and Information Technology, Anhui University
[5] School of Computer Science and Technology, Zhejiang Gongshang University
Keywords
(not listed)
DOI
Not available
CLC numbers
TP391.41; TP391.1 (text information processing)
Subject classification code
080203
Abstract
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking the key video content, leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video is crucial for robust VideoQA but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in the datasets. Specifically, we design a self-supervised contrastive learning objective that contrasts positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is enforced to be closer to that of the intervened input produced by video perturbation. We expect the fused representation to attend more to the global context of a video than to a few static keyframes. Moreover, we introduce an effective temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation-invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and easily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets demonstrate the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
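As a concrete illustration of the three auxiliary objectives sketched in the abstract, the following is a minimal PyTorch sketch written under our own assumptions: the backbone is abstracted away, and all names (fused_orig, fused_pert, frame_scores), the single-negative InfoNCE form, the temperature tau, and the margin of 1.0 are hypothetical illustrations of the described losses, not the authors' released implementation.

    # Illustrative sketch (our assumptions, not the authors' code) of the
    # auxiliary training objectives described in the abstract, in PyTorch.
    import torch
    import torch.nn.functional as F

    def contrastive_invariance_losses(fused_orig, fused_pert, fused_neg,
                                      logits_orig, logits_pert, tau=0.1):
        """fused_*: (B, D) fused video-question representations for the
        original input, the perturbed (intervened) video, and a negative
        (mismatched) pair. logits_*: (B, C) answer logits for the original
        and perturbed inputs."""
        # Contrastive term: pull the original fused representation toward
        # its perturbed counterpart, push it away from the negative pair.
        pos = F.cosine_similarity(fused_orig, fused_pert, dim=-1) / tau
        neg = F.cosine_similarity(fused_orig, fused_neg, dim=-1) / tau
        l_con = (-pos + torch.logsumexp(torch.stack([pos, neg], dim=0),
                                        dim=0)).mean()

        # KL-based perturbation invariance: the predicted answer
        # distribution should stay stable under temporal perturbation
        # (the KL direction here is our choice).
        l_kl = F.kl_div(F.log_softmax(logits_orig, dim=-1),
                        F.softmax(logits_pert, dim=-1),
                        reduction="batchmean")
        return l_con, l_kl

    def temporal_order_loss(frame_scores):
        """Hypothetical temporal-order regularizer: encourage a learned
        per-frame score of shape (B, T) to increase with frame index via
        a pairwise margin loss."""
        earlier, later = frame_scores[:, :-1], frame_scores[:, 1:]
        return F.relu(1.0 + earlier - later).mean()

In training, such terms would typically be added with tuned weights to the standard QA cross-entropy loss, e.g. loss = l_qa + a * l_con + b * l_kl + c * l_order, where a, b, and c are hyperparameters we introduce here only for illustration.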
Source: SCIENCE CHINA Information Sciences, 2024, 67(10)
Pages: 211-226 (16 pages)