Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Cited by: 10
Authors
Zhang, Xi [1, 2]
Zhang, Feifei [1]
Xu, Changsheng [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
VCR; contrastive learning; counterfactual thinking
DOI
10.1145/3474085.3475328
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer but also a rationale that justifies the answer. The task is challenging because it requires understanding diverse visual content, comprehending abstract language, and reasoning over complicated inter-modality relationships. To address these challenges, previous methods either resort to holistic attention mechanisms or explore transformer-based models with pre-training; these, however, cannot perform comprehensive understanding and usually suffer from a heavy computing burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR that jointly models the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. The proposed method has several merits. First, with sufficient instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and perform comprehensive understanding of the image and the linguistic expressions. Second, taking advantage of counterfactual thinking, we can generate informative factual and counterfactual samples for contrastive learning, which gives our model a stronger perception ability. Third, an auxiliary contrast module is incorporated into our method to directly optimize answer prediction in VCR, which further facilitates representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
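The abstract does not state the form of the contrastive objective, so the following is only a minimal sketch of the general idea of contrasting a factual sample against counterfactual ones. It assumes a generic InfoNCE-style loss over cosine similarities, which is a common choice in contrastive learning and not necessarily the authors' formulation. The names `cosine`, `info_nce`, the temperature value, and the toy 2-D feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, factual, counterfactuals, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    The factual sample acts as the positive and the counterfactual
    samples act as negatives for the given anchor feature.
    """
    pos = math.exp(cosine(anchor, factual) / temperature)
    neg = sum(math.exp(cosine(anchor, c) / temperature) for c in counterfactuals)
    return -math.log(pos / (pos + neg))


# Toy features: the factual sample is close to the anchor,
# the counterfactual samples point elsewhere.
anchor = [1.0, 0.0]
factual = [0.9, 0.1]
counterfactuals = [[0.0, 1.0], [-1.0, 0.0]]

# Loss when the true factual sample is used as the positive.
loss_good = info_nce(anchor, factual, counterfactuals)

# Loss when a counterfactual sample is wrongly treated as the positive.
loss_bad = info_nce(anchor, counterfactuals[0], [factual, counterfactuals[1]])
```

Under these assumptions, `loss_good` is smaller than `loss_bad`: the loss is low when the anchor is closer to its factual sample than to the counterfactual ones, which is the behavior a contrastive objective of this kind rewards.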
Pages: 1793-1802
Page count: 10