Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Cited by: 13
Authors
Zhang, Xi [1, 2]
Zhang, Feifei [1]
Xu, Changsheng [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
VCR; contrastive learning; counterfactual thinking;
DOI
10.1145/3474085.3475328
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer but also a rationale that justifies it. The task is challenging because it requires understanding diverse visual content, comprehending abstract language, and reasoning over complicated inter-modality relationships. To address these challenges, previous methods either resort to holistic attention mechanisms or explore transformer-based models with pre-training; however, these approaches cannot perform comprehensive understanding and usually suffer from a heavy computational burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR that jointly models the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. The proposed method has several merits. First, with sufficient instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and perform comprehensive understanding of the image and linguistic expressions. Second, taking advantage of counterfactual thinking, we can generate informative factual and counterfactual samples for contrastive learning, which strengthens the perception ability of our model. Third, an auxiliary contrast module is incorporated into our method to directly optimize answer prediction in VCR, which further facilitates representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
Pages: 1793 - 1802
Number of pages: 10