Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Cited by: 10
Authors
Zhang, Xi [1, 2]
Zhang, Feifei [1]
Xu, Changsheng [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
VCR; contrastive learning; counterfactual thinking
DOI
10.1145/3474085.3475328
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer, but also a rationale to justify the answer. It is a challenging task because it requires diverse visual content understanding, abstract language comprehension, and complicated inter-modality relationship reasoning. To address these challenges, previous methods either resort to holistic attention mechanisms or explore transformer-based models with pre-training, which, however, cannot perform comprehensive understanding and usually suffer from a heavy computing burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR by jointly modeling the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. The proposed method has several merits. First, with sufficient instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and perform comprehensive understanding of the image and linguistic expressions. Second, taking advantage of counterfactual thinking, we can generate informative factual and counterfactual samples for contrastive learning, resulting in stronger perception ability of our model. Third, an auxiliary contrast module is incorporated into our method to directly optimize the answer prediction in VCR, which further facilitates the representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
Pages: 1793 - 1802
Page count: 10
Related papers
50 in total
  • [31] Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning
    Qin, Lianhui
    Shwartz, Vered
    West, Peter
    Bhagavatula, Chandra
    Hwang, Jena D.
    Le Bras, Ronan
    Bosselut, Antoine
    Choi, Yejin
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 794 - 805
  • [32] Multi-level Net: A Visual Saliency Prediction Model
    Cornia, Marcella
    Baraldi, Lorenzo
    Serra, Giuseppe
    Cucchiara, Rita
    [J]. COMPUTER VISION - ECCV 2016 WORKSHOPS, PT II, 2016, 9914 : 302 - 315
  • [33] Multi-level Visual Fusion Networks for Image Captioning
    Zhou, Dongming
    Zhang, Canlong
    Li, Zhixin
    Wang, Zhiwen
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [34] Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training
    Liu, An-An
    Huang, Chenxi
    Xu, Ning
    Tian, Hongshuo
    Liu, Jing
    Zhang, Yongdong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1639 - 1651
  • [35] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4187 - 4195
  • [36] Multi-Modal fusion with multi-level attention for Visual Dialog
    Zhang, Jingping
    Wang, Qiang
    Han, Yahong
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
  • [37] A knowledge graph embedding model based on multi-level analogical reasoning
    Zhao, Xiaofei
    Yang, Mengqian
    Yang, Hongji
    [J]. CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (08) : 10553 - 10567
  • [38] AND/OR reasoning graphs for determining prime implicants in multi-level combinational networks
    Stoffel, D
    Kunz, W
    Gerber, S
    [J]. PROCEEDINGS OF THE ASP-DAC '97 - ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 1997, 1996, : 529 - 538
  • [39] An Effective Multi-Level Multi-Shares Visual Cryptography Technique
    Dalvi, Gopal D.
    Wakde, S. D.
    Kale, P. V.
    [J]. 2018 FOURTH INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION (ICCUBEA), 2018,
  • [40] A Multi-Level Study of Undergraduate Computer Science Reasoning about Concurrency
    Lawson, Aubrey
    Kraemer, Eileen T.
    Che, S. Megan
    Kennedy, Cazembe
    [J]. PROCEEDINGS OF THE 2019 ACM CONFERENCE ON INNOVATION AND TECHNOLOGY IN COMPUTER SCIENCE EDUCATION (ITICSE '19), 2019, : 210 - 216