Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

Cited by: 13

Authors:

Zhang, Xi [1, 2]

Zhang, Feifei [1]

Xu, Changsheng [1, 2, 3]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China

[3] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Funding:

National Natural Science Foundation of China; Beijing Natural Science Foundation;

Keywords:

VCR; contrastive learning; counterfactual thinking;

DOI:

10.1145/3474085.3475328

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer but also a rationale that justifies it. The task is challenging because it requires understanding diverse visual content, comprehending abstract language, and reasoning about complicated inter-modality relationships. To address these challenges, previous methods either resort to holistic attention mechanisms or explore transformer-based models with pre-training; however, these approaches cannot achieve comprehensive understanding and usually suffer from a heavy computational burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR that jointly models the hierarchical visual content and the inter-modality relationships between the visual and linguistic domains. The proposed method has several merits. First, with instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and comprehensively understand the image and the linguistic expressions. Second, by taking advantage of counterfactual thinking, we generate informative factual and counterfactual samples for contrastive learning, which strengthens the model's perception ability. Third, an auxiliary contrast module is incorporated to directly optimize answer prediction in VCR, which further facilitates representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
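
The abstract does not give the form of the contrastive objective. As a rough illustration only, the sketch below assumes a standard InfoNCE-style loss in which a factual sample acts as the positive and counterfactual samples act as negatives. The function name `info_nce`, the temperature value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (assumed objective, not the paper's).

    anchor:    feature vector, e.g. a question/answer representation
    positive:  feature of the factual sample (should score high)
    negatives: features of counterfactual samples (should score low)
    """
    a = normalize(anchor)
    pos = dot(a, normalize(positive)) / temperature
    negs = [dot(a, normalize(n)) / temperature for n in negatives]
    logits = [pos] + negs
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - pos


# Toy usage: the factual sample is aligned with the anchor,
# the counterfactual samples are orthogonal or opposite to it.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

In this toy setting the loss is near zero when the factual sample is closest to the anchor and grows when a counterfactual sample is closer. A multi-level variant could apply the same loss separately to instance-level, image-level, and semantic-level features and sum the results, though how the paper combines the levels is not stated in the abstract.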


Pages: 1793-1802

Number of pages: 10

