Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training

Cited by: 2
Authors
Liu, An-An [1 ,2 ]
Huang, Chenxi [1 ]
Xu, Ning [1 ]
Tian, Hongshuo [1 ]
Liu, Jing [1 ]
Zhang, Yongdong [3 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
[3] Univ Sci & Technol China, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Commonsense reasoning; History; Task analysis; Correlation; Knowledge based systems; Computational modeling; Visual dialog; commonsense; multi-modal; counterfactual;
DOI
10.1109/TMM.2023.3284594
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Visual Dialog (VD) requires an agent to answer the current question by engaging in a conversation with humans about an image. Despite recent progress, introducing external commonsense knowledge is beneficial for fully understanding the given image and dialog history. However, existing knowledge-based VD models are inclined to rely on the severe learning bias introduced by commonsense, e.g., the retrieved triples <bus, capable of, transport people>, <bus, is a, public transport>, and <bus, is a, car> can induce a spurious correlation between the question "What is the bus used for?" and the false answer "City bus". Two challenges arise in making commonsense learning more robust against spurious correlations: 1) how to disentangle the true effect of "good" commonsense knowledge from the whole, and 2) how to estimate and remove the effect of "bad" commonsense bias on answers. In this article, we propose a novel CounterFactual Commonsense learning scheme for the Visual Dialog task (CFC-VD). First, compared with the causal graph of existing VD models, we add one new commonsense node and one new link to the multi-modal information from history, question, and image. Since the retrieved knowledge prior is subtle and uncontrollable, we treat it as an unobserved confounder in the commonsense node, which leads to spurious correlations in answer inference. Then, to remove the effect of the confounder, we formulate it as the direct causal effect of commonsense on answers and remove this direct effect by subtracting it from the total causal effect via counterfactual reasoning. Experimental results confirm the effectiveness of our method on the prevailing VisDial v0.9 and VisDial v1.0 datasets.
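The abstract's debiasing step (subtracting the direct causal effect of the biased branch from the total effect) can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the function `debiased_scores`, the logit values, and the scaling factor `alpha` are all hypothetical, standing in for CFC-VD's actual counterfactual inference.

```python
import numpy as np

def debiased_scores(total_logits, direct_logits, alpha=1.0):
    """Counterfactual debiasing sketch: subtract the direct (bias-only)
    branch's logits from the fused logits, keeping the indirect effect."""
    return total_logits - alpha * direct_logits

# Toy example with 4 candidate answers.
total = np.array([2.0, 1.5, 0.3, 0.1])   # fused multi-modal prediction
direct = np.array([1.8, 0.2, 0.1, 0.0])  # commonsense-only (bias) prediction

debiased = debiased_scores(total, direct)
print(int(debiased.argmax()))  # → 1: the bias-favored answer 0 is demoted
```

Here answer 0 tops the fused scores only because the bias branch strongly favors it; after subtraction, answer 1 wins, mimicking how removing the direct commonsense effect suppresses spuriously correlated answers.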
Pages: 1639-1651 (13 pages)