Robust video question answering via contrastive cross-modality representation learning

Cited by: 0
Authors
Xun YANG [1]
Jianming ZENG [1,2]
Dan GUO [3]
Shanshan WANG [4]
Jianfeng DONG [5]
Meng WANG [3,2]
Affiliations
[1] School of Information Science and Technology, University of Science and Technology of China
[2] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
[3] School of Computer Science and Information Engineering, Hefei University of Technology
[4] Institutes of Physical Science and Information Technology, Anhui University
[5] School of Computer Science and Technology, Zhejiang Gongshang University
Keywords
DOI
N/A
CLC number
TP391.41; TP391.1 [Text Information Processing]
Discipline code
080203
Abstract
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models mostly tend to over-rely on superficial correlations rooted in dataset bias while overlooking the key video content, leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video for robust VideoQA is crucial but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in datasets. Specifically, we design a self-supervised contrastive learning objective to contrast positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is constrained to be closer to that of the intervened input produced by video perturbation. We expect the fused representation to focus more on the global context of videos rather than on a few static keyframes. Moreover, we introduce an effective temporal order regularization that enforces the inherent sequential structure of videos in video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and can be easily combined with various VideoQA backbones. Extensive experimental results and analyses on several public datasets demonstrate the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
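To make the three training objectives concrete, the sketch below gives a minimal PyTorch rendering of the losses the abstract describes. It is an illustration under our own assumptions, not the authors' released code: every name (contrastive_fusion_loss, z_orig, and so on) is hypothetical, and the exact loss forms (an InfoNCE-style term for the contrastive objective, a hinge for temporal order) are plausible choices consistent with the description rather than the paper's confirmed formulation.

# Illustrative sketch only; names and exact loss forms are assumptions,
# not the authors' released implementation.
import torch
import torch.nn.functional as F

def contrastive_fusion_loss(z_orig, z_pos, z_neg, tau=0.07):
    """InfoNCE-style objective: pull the fused representation of the
    original video-question input (z_orig) toward that of the consistent
    perturbed input (z_pos) and away from intervened negatives (z_neg).
    Shapes: z_orig, z_pos: (B, D); z_neg: (B, K, D)."""
    z_orig = F.normalize(z_orig, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)
    pos = (z_orig * z_pos).sum(-1, keepdim=True) / tau        # (B, 1)
    neg = torch.einsum('bd,bkd->bk', z_orig, z_neg) / tau     # (B, K)
    logits = torch.cat([pos, neg], dim=1)                     # (B, 1+K)
    target = torch.zeros(logits.size(0), dtype=torch.long)    # positive at index 0
    return F.cross_entropy(logits, target)

def temporal_order_loss(score_ordered, score_shuffled, margin=1.0):
    """Hinge regularizer enforcing the sequential structure of videos:
    the correctly ordered clip sequence should score higher than a
    temporally shuffled one by at least the margin."""
    return F.relu(margin - score_ordered + score_shuffled).mean()

def kl_invariance_loss(logits_orig, logits_pert):
    """KL-divergence regularizer: the answer distribution predicted from
    the temporally perturbed video should match that of the original."""
    log_p = F.log_softmax(logits_pert, dim=-1)
    q = F.softmax(logits_orig, dim=-1)
    return F.kl_div(log_p, q, reduction='batchmean')

if __name__ == "__main__":
    # Random tensors stand in for a backbone's fused features and logits.
    B, K, D, A = 4, 8, 256, 1000  # batch, negatives, feature dim, answer vocab
    z_o, z_p = torch.randn(B, D), torch.randn(B, D)
    z_n = torch.randn(B, K, D)
    logits_o, logits_p = torch.randn(B, A), torch.randn(B, A)
    s_ord, s_shuf = torch.randn(B), torch.randn(B)
    total = (contrastive_fusion_loss(z_o, z_p, z_n)
             + temporal_order_loss(s_ord, s_shuf)
             + kl_invariance_loss(logits_o, logits_p))
    print(total.item())

Because each term only consumes representations and logits, such regularizers can be added on top of an arbitrary VideoQA backbone, which matches the model-agnostic claim in the abstract.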
Pages: 211-226
Number of pages: 16
Related papers
50 in total
  • [41] Question Difficulty Estimation with Directional Modality Association in Video Question Answering
    Kim, Bong-Min
    Park, Seong-Bae
    ADVANCES AND TRENDS IN ARTIFICIAL INTELLIGENCE: THEORY AND PRACTICES IN ARTIFICIAL INTELLIGENCE, 2022, 13343 : 287 - 299
  • [42] Anatomy-Regularized Representation Learning for Cross-Modality Medical Image Segmentation
    Chen, Xu
    Lian, Chunfeng
    Wang, Li
    Deng, Hannah
    Kuang, Tianshu
    Fung, Steve
    Gateno, Jaime
    Yap, Pew-Thian
    Xia, James J.
    Shen, Dinggang
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2021, 40 (01) : 274 - 285
  • [43] Cross-Modality Cardiac Insight Transfer: A Contrastive Learning Approach to Enrich ECG with CMR Features
    Ding, Zhengyao
    Hu, Yujian
    Li, Ziyu
    Zhang, Hongkun
    Wu, Fei
    Xiang, Yilang
    Li, Tian
    Liu, Ziyi
    Chu, Xuesen
    Huang, Zhengxing
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT III, 2024, 15003 : 109 - 119
  • [44] TIR/VIS cross-modality modelling via correlative subspace learning
    Sun, L.
    Wu, M. H.
    Dai, X. X.
ELECTRONICS LETTERS, 2011, 47 (16) : 915+
  • [45] Cross-Modality Binary Code Learning via Fusion Similarity Hashing
    Liu, Hong
    Ji, Rongrong
    Wu, Yongjian
    Huang, Feiyue
    Zhang, Baochang
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 6345 - 6353
  • [46] Not All Pixels Are Matched: Dense Contrastive Learning for Cross-Modality Person Re-Identification
    Sun, Hanzhe
    Liu, Jun
    Zhang, Zhizhong
    Wang, Chengjie
    Qu, Yanyun
    Xie, Yuan
    Ma, Lizhuang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5333 - 5341
  • [47] C3CMR: Cross-Modality Cross-Instance Contrastive Learning for Cross-Media Retrieval
    Wang, Junsheng
    Gong, Tiantian
    Zeng, Zhixiong
    Sun, Changchang
    Yan, Yan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4300 - 4308
  • [48] Robust visual question answering via semantic cross modal augmentation
    Mashrur, Akib
    Luo, Wei
    Zaidi, Nayyar A.
    Robles-Kelly, Antonio
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
  • [49] A Cross-Modality Person Re-Identification Method Based on Joint Middle Modality and Representation Learning
    Ma, Li
    Guan, Zhibin
    Dai, Xinguan
    Gao, Hangbiao
    Lu, Yuanmeng
    ELECTRONICS, 2023, 12 (12)
  • [50] Unsupervised cross-modality domain adaptation via source-domain labels guided contrastive learning for medical image segmentation
    Chen, Wenshuang
    Ye, Qi
    Guo, Lihua
    Wu, Qi
MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING, 2025