Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training

Cited by: 2
Authors
Liu, An-An [1 ,2 ]
Huang, Chenxi [1 ]
Xu, Ning [1 ]
Tian, Hongshuo [1 ]
Liu, Jing [1 ]
Zhang, Yongdong [3 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
[3] Univ Sci & Technol China, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Commonsense reasoning; History; Task analysis; Correlation; Knowledge based systems; Computational modeling; Visual dialog; commonsense; multi-modal; counterfactual;
DOI
10.1109/TMM.2023.3284594
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Visual Dialog (VD) requires an agent to answer the current question by engaging in a conversation with humans about an image. Despite recent progress, introducing external commonsense knowledge is beneficial for fully understanding the given image and dialog history. However, existing knowledge-based VD models are inclined to rely on the severe learning bias introduced by commonsense, e.g., the retrieved triples <bus, capable of, transport people>, <bus, is a, public transport>, and <bus, is a, car> can induce a spurious correlation between the question "What is the bus used for?" and the false answer "City bus". Two challenges arise in making commonsense learning more robust against spurious correlations: 1) how to disentangle the true effect of "good" commonsense knowledge from the whole, and 2) how to estimate and remove the effect of "bad" commonsense bias on answers. In this article, we propose a novel CounterFactual Commonsense learning scheme for the Visual Dialog task (CFC-VD). First, compared with the causal graph of existing VD models, we add one new commonsense node and one new link to the multi-modal information from history, question, and image. Since the retrieved knowledge prior is subtle and uncontrollable, we treat it as an unobserved confounder in the commonsense node, which leads to spurious correlations in answer inference. Then, to remove the effect of the confounder, we formulate it as the direct causal effect of commonsense on answers and remove this direct effect by subtracting it from the total causal effect via counterfactual reasoning. Experimental results confirm the effectiveness of our method on the prevailing VisDial v0.9 and VisDial v1.0 datasets.
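The abstract's debiasing step (subtracting the direct causal effect of the biased branch from the total effect) can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the function `debiased_scores`, the logit values, and the scaling factor `alpha` are all hypothetical, standing in for CFC-VD's actual counterfactual inference.

```python
import numpy as np

def debiased_scores(total_logits, direct_logits, alpha=1.0):
    """Counterfactual debiasing sketch: subtract the direct (bias-only)
    branch's logits from the fused logits, keeping the indirect effect."""
    return total_logits - alpha * direct_logits

# Toy example with 4 candidate answers.
total = np.array([2.0, 1.5, 0.3, 0.1])   # fused multi-modal prediction
direct = np.array([1.8, 0.2, 0.1, 0.0])  # commonsense-only (bias) prediction

debiased = debiased_scores(total, direct)
print(int(debiased.argmax()))  # → 1: the bias-favored answer 0 is demoted
```

Here answer 0 tops the fused scores only because the bias branch strongly favors it; after subtraction, answer 1 wins, mimicking how removing the direct commonsense effect suppresses spuriously correlated answers.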
Pages: 1639-1651 (13 pages)