Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Cited by: 13
Authors
Zhang, Xi [1, 2]
Zhang, Feifei [1]
Xu, Changsheng [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
VCR; contrastive learning; counterfactual thinking
DOI
10.1145/3474085.3475328
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer but also a rationale that justifies the answer. The task is challenging because it requires understanding diverse visual content, comprehending abstract language, and reasoning about complicated inter-modality relationships. To address these challenges, previous methods either resort to holistic attention mechanisms or explore transformer-based models with pre-training; however, these approaches cannot perform comprehensive understanding and usually suffer from a heavy computational burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR that jointly models the hierarchical visual content and the inter-modality relationships between the visual and linguistic domains. The proposed method has several merits. First, with instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and perform comprehensive understanding of the image and linguistic expressions. Second, by taking advantage of counterfactual thinking, we can generate informative factual and counterfactual samples for contrastive learning, which strengthens the perception ability of our model. Third, an auxiliary contrast module is incorporated into our method to directly optimize answer prediction in VCR, which further facilitates representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
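The abstract's idea of contrasting factual against counterfactual samples can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions, not the paper's actual objective: the function name `contrastive_loss`, the cosine-similarity scoring, the `temperature` value, and the toy vectors are all hypothetical choices made for illustration. The factual sample plays the role of the positive and the counterfactual samples play the role of negatives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(anchor, factual, counterfactuals, temperature=0.1):
    """InfoNCE-style loss (illustrative assumption, not the paper's loss).

    Pulls the anchor toward the factual (positive) sample and pushes it
    away from the counterfactual (negative) samples.
    """
    a = normalize(anchor)
    pos = dot(a, normalize(factual)) / temperature
    negs = [dot(a, normalize(c)) / temperature for c in counterfactuals]
    logits = [pos] + negs
    # Numerically stable log-sum-exp over positive and negative logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive sample.
    return log_sum - pos


# Toy example: the anchor agrees with the factual sample.
low = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
# Toy example: the anchor agrees with the counterfactual sample instead.
high = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

In this toy setting the loss is near zero when the anchor matches the factual sample and large when it matches the counterfactual one, which is the behavior a contrastive objective of this kind is meant to encourage.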
Pages: 1793-1802
Number of pages: 10