Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Cited by: 13
Authors
Zhang, Xi [1, 2]
Zhang, Feifei [1]
Xu, Changsheng [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
VCR; contrastive learning; counterfactual thinking
DOI
10.1145/3474085.3475328
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer but also a rationale that justifies the answer. The task is challenging because it requires understanding diverse visual content, comprehending abstract language, and reasoning over complicated inter-modality relationships. To address these challenges, previous methods either resort to holistic attention mechanisms or explore pre-trained transformer-based models, which, however, cannot achieve comprehensive understanding and usually incur a heavy computational burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR that jointly models the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. The proposed method has several merits. First, with sufficient instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and perform comprehensive understanding of the image and the linguistic expressions. Second, taking advantage of counterfactual thinking, we can generate informative factual and counterfactual samples for contrastive learning, which strengthens the perception ability of our model. Third, an auxiliary contrast module is incorporated into our method to directly optimize answer prediction in VCR, which further facilitates representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
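The abstract does not spell out the loss, so the following is only a generic illustration of contrastive learning with factual and counterfactual samples, not the authors' method. It is a minimal sketch of an InfoNCE-style loss that assumes the factual sample acts as the positive and the counterfactual samples act as negatives; the function names, the toy vectors, and the temperature value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax score of the positive among all candidates.

    Assumption (not from the abstract): one positive, several negatives,
    cosine similarity scaled by a temperature.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy 2-D features (hypothetical values chosen only for illustration).
anchor = [1.0, 0.0]
factual = [0.9, 0.1]                          # treated as the positive
counterfactual = [[0.0, 1.0], [-1.0, 0.2]]    # treated as negatives

good = info_nce(anchor, factual, counterfactual)
# Swapping roles: a counterfactual sample is wrongly used as the positive.
bad = info_nce(anchor, counterfactual[0], [factual, counterfactual[1]])
```

In this toy setup the loss is small when the factual sample is the one closest to the anchor and large when a counterfactual sample is treated as the positive, which is the behavior a contrastive objective of this kind is meant to reward.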
Pages: 1793-1802
Page count: 10
Related papers
50 in total
  • [1] Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
    Wen, Zhang
    Peng, Yuxin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (03) : 1042 - 1054
  • [2] COSIM: Commonsense Reasoning for Counterfactual Scene Imagination
    Kim, Hyounghun
    Zala, Abhay
    Bansal, Mohit
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 911 - 923
  • [3] Synergy in Multi-Level Reasoning
    Steinberg, Alan N.
    Bowman, Christopher L.
    2022 IEEE CONFERENCE ON COGNITIVE AND COMPUTATIONAL ASPECTS OF SITUATION MANAGEMENT, COGSIMA, 2022, : 23 - 30
  • [4] Multi-level Contrastive Learning for Commonsense Question Answering
    Fang, Quntian
    Huang, Zhen
    Zhang, Ziwen
    Hu, Minghao
    Hu, Biao
    Wang, Ankun
    Li, Dongsheng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT IV, KSEM 2023, 2023, 14120 : 318 - 331
  • [5] Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
    Lv, Changsheng
    Zhang, Shuai
    Tian, Yapeng
    Qi, Mengshi
    Ma, Huadong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [6] Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
    Zhang, Shunyu
    Jiang, Xiaoze
    Yang, Zequn
    Wan, Tao
    Qin, Zengchang
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4599 - 4608
  • [7] Visual commonsense reasoning with directional visual connections
    Han, Yahong
    Wu, Aming
    Zhu, Linchao
    Yang, Yi
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2021, 22 (05) : 625 - 637
  • [8] TOWARDS A THEORY OF COMMONSENSE VISUAL REASONING
    CHANDRASEKARAN, B
    NARAYANAN, NH
    LECTURE NOTES IN COMPUTER SCIENCE, 1990, 472 : 387 - 409
  • [9] A MULTI-LEVEL GEOMETRIC REASONING SYSTEM FOR VISION
    BARRY, M
    CYRLUK, D
    KAPUR, D
    MUNDY, J
    NGUYEN, VD
    ARTIFICIAL INTELLIGENCE, 1988, 37 (1-3) : 291 - 332
  • [10] Joint Answering and Explanation for Visual Commonsense Reasoning
    Li, Zhenyang
    Guo, Yangyang
    Wang, Kejie
    Wei, Yinwei
    Nie, Liqiang
    Kankanhalli, Mohan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3836 - 3846