Robust video question answering via contrastive cross-modality representation learning

Cited by: 0
Authors
Xun YANG [1]
Jianming ZENG [1,2]
Dan GUO [3]
Shanshan WANG [4]
Jianfeng DONG [5]
Meng WANG [3,2]
Affiliations
[1] School of Information Science and Technology, University of Science and Technology of China
[2] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
[3] School of Computer Science and Information Engineering, Hefei University of Technology
[4] Institutes of Physical Science and Information Technology, Anhui University
[5] School of Computer Science and Technology, Zhejiang Gongshang University
Keywords
DOI
Not available
CLC numbers
TP391.41; TP391.1 (Text Information Processing)
Subject classification code
080203
Abstract
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking the key video content, leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video is crucial for robust VideoQA but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in the datasets. Specifically, we design a self-supervised contrastive learning objective that contrasts positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is enforced to be closer to that of an intervened input produced by video perturbation. We expect the fused representation to focus more on the global context of videos rather than a few static keyframes. Moreover, we introduce an effective temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization on the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and readily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets show the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
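The abstract outlines auxiliary training signals on top of the standard QA loss: a contrastive objective that pulls the fused representation of the original video-question pair toward that of a temporally perturbed version, a temporal order regularization, and a KL-divergence penalty that keeps the answer distribution stable under video perturbation. The snippet below is a minimal PyTorch sketch of the contrastive and KL terms only, assuming a generic backbone that returns a fused embedding and answer logits; all function names, the temperature, and the loss weights are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the contrastive fusion and KL perturbation-invariance losses
# described in the abstract. The backbone interface, temperature, and loss weights
# are assumptions for illustration; the temporal order regularization is omitted.
import torch
import torch.nn.functional as F

def contrastive_fusion_loss(z_orig, z_pert, temperature=0.07):
    """InfoNCE-style loss: the fused representation of the original input should be
    closer to that of its perturbed counterpart than to other samples in the batch."""
    z_orig = F.normalize(z_orig, dim=-1)           # (B, D)
    z_pert = F.normalize(z_pert, dim=-1)           # (B, D)
    logits = z_orig @ z_pert.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    return F.cross_entropy(logits, targets)

def kl_invariance_loss(logits_orig, logits_pert):
    """KL regularization: the predicted answer distribution should remain stable
    under temporal content perturbation of the video."""
    log_p_orig = F.log_softmax(logits_orig, dim=-1)
    p_pert = F.softmax(logits_pert, dim=-1)
    return F.kl_div(log_p_orig, p_pert, reduction="batchmean")

def training_step(model, video, video_perturbed, question, answer,
                  lambda_con=0.1, lambda_kl=0.1):
    """One training step combining the QA loss with the two regularizers.
    `model` is assumed to return (fused_embedding, answer_logits);
    `video_perturbed` is e.g. a temporally shuffled or frame-dropped copy."""
    z_o, logits_o = model(video, question)
    z_p, logits_p = model(video_perturbed, question)
    loss_qa = F.cross_entropy(logits_o, answer)
    loss_con = contrastive_fusion_loss(z_o, z_p)
    loss_kl = kl_invariance_loss(logits_o, logits_p)
    return loss_qa + lambda_con * loss_con + lambda_kl * loss_kl
```

Because the regularizers only touch the backbone's outputs, a sketch like this can wrap different VideoQA architectures unchanged, consistent with the model-agnostic claim in the abstract.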
Pages: 211-226
Number of pages: 16
Related Papers
50 records in total
  • [21] Video question answering via grounded cross-attention network learning
    Ye, Yunan
    Zhang, Shifeng
    Li, Yimeng
    Qian, Xufeng
    Tang, Siliang
    Pu, Shiliang
    Xiao, Jun
    INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
  • [22] HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering
    Hao, Dongze
    Wang, Qunbo
    Zhu, Xinxin
    Liu, Jing
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (10)
  • [24] Cross-modality Representation Interactive Learning For Multimodal Sentiment Analysis
    Huang, Jian
    Ji, Yanli
    Yang, Yang
    Shen, Heng Tao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 426 - 434
  • [25] Cross-modality representation learning from transformer for hashtag prediction
    Khalil, Mian Muhammad Yasir
    Wang, Qingxian
    Chen, Bo
    Wang, Weidong
    JOURNAL OF BIG DATA, 2023, 10 (01)
  • [26] Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering
    Lyu, Chenyang
    Li, Wenxi
    Ji, Tianbo
    Zhou, Liting
    Gurrin, Cathal
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 427 - 438
  • [27] Cross-Modality Feature Learning via Convolutional Autoencoder
    Liu, Xueliang
    Wang, Meng
    Zha, Zheng-Jun
    Hong, Richang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)
  • [28] Object-Centric Representation Learning for Video Question Answering
    Long Hoang Dang
    Thao Minh Le
    Vuong Le
    Truyen Tran
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [29] Robust RGB-T Tracking via Adaptive Modality Weight Correlation Filters and Cross-modality Learning
    Zhou, Mingliang
    Zhao, Xinwen
    Luo, Futing
    Luo, Jun
    Pu, Huayan
    Xiang, Tao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)
  • [30] Hybrid cross-modality fusion network for medical image segmentation with contrastive learning
    Zhou, Xichuan
    Song, Qianqian
    Nie, Jing
    Feng, Yujie
    Liu, Haijun
    Liang, Fu
    Chen, Lihui
    Xie, Jin
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 144