Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training

Cited by: 2
Authors
Liu, An-An [1 ,2 ]
Huang, Chenxi [1 ]
Xu, Ning [1 ]
Tian, Hongshuo [1 ]
Liu, Jing [1 ]
Zhang, Yongdong [3 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
[3] Univ Sci & Technol China, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Commonsense reasoning; History; Task analysis; Correlation; Knowledge based systems; Computational modeling; Visual dialog; commonsense; multi-modal; counterfactual
DOI
10.1109/TMM.2023.3284594
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Visual Dialog (VD) requires an agent to answer the current question by engaging in a conversation with humans about an image. Despite recent progress, introducing external commonsense knowledge is beneficial for fully understanding the given image and dialog history. However, existing knowledge-based VD models tend to rely on the severe learning bias brought by commonsense, e.g., the retrieved triples <bus, capable of, transport people>, <bus, is a, public transport>, and <bus, is a, car> can induce a spurious correlation between the question "What is the bus used for?" and the false answer "City bus". Making commonsense learning robust against such spurious correlations poses two challenges: 1) how to disentangle the true effect of "good" commonsense knowledge from the whole, and 2) how to estimate and remove the effect of "bad" commonsense bias on answers. In this article, we propose a novel CounterFactual Commonsense learning scheme for the Visual Dialog task (CFC-VD). First, compared with the causal graph of existing VD models, we add a new commonsense node and a new link to the multi-modal information from the history, question, and image. Since the retrieved knowledge prior is subtle and uncontrollable, we treat it as an unobserved confounder in the commonsense node, which leads to spurious correlations in answer inference. Then, to remove the effect of this confounder, we formulate it as the direct causal effect of commonsense on answers and remove this direct effect by subtracting it from the total causal effect via counterfactual reasoning. Experimental results demonstrate the effectiveness of our method on the prevailing VisDial v0.9 and VisDial v1.0 datasets.
Pages: 1639-1651
Number of pages: 13
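Note: the abstract describes removing the "bad" direct effect of retrieved commonsense by subtracting it from the total causal effect. A minimal sketch of that subtraction, written in the standard counterfactual-debiasing style; the notation (A, k, z, TE, NDE, TIE) is assumed here and may differ from the paper's exact formulation. Let A(k, z) be the answer score when the commonsense branch receives the retrieved knowledge k and the fused multi-modal branch (image, question, history) receives z, with k^* and z^* denoting counterfactual "blocked" inputs:

\mathrm{TE}  = A(k, z)     - A(k^{*}, z^{*})                          % total causal effect of the inputs on the answer
\mathrm{NDE} = A(k, z^{*}) - A(k^{*}, z^{*})                          % direct effect of commonsense alone (the "bad" bias)
\mathrm{TIE} = \mathrm{TE} - \mathrm{NDE} = A(k, z) - A(k, z^{*})     % debiased score after subtracting the direct effect
\hat{a} = \arg\max_{a} \; \mathrm{TIE}(a)                             % answers ranked by the debiased score

Under these assumptions, inference ranks candidate answers by TIE rather than by the raw score A(k, z), which suppresses the commonsense-only shortcut while retaining the contribution of commonsense that flows through the multi-modal fusion.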