Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training

Cited by: 2
|
Authors
Liu, An-An [1 ,2 ]
Huang, Chenxi [1 ]
Xu, Ning [1 ]
Tian, Hongshuo [1 ]
Liu, Jing [1 ]
Zhang, Yongdong [3 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
[3] Univ Sci & Technol China, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Commonsense reasoning; History; Task analysis; Correlation; Knowledge based systems; Computational modeling; Visual dialog; commonsense; multi-modal; counterfactual;
DOI
10.1109/TMM.2023.3284594
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Visual Dialog (VD) requires an agent to answer the current question by engaging in a conversation with humans referring to an image. Despite recent progress, it is beneficial to introduce external commonsense knowledge to fully understand the given image and dialog history. However, existing knowledge-based VD models are inclined to rely on the severe learning bias brought by commonsense, e.g., the retrieved triples <bus, capable of, transport people>, <bus, is a, public transport>, and <bus, is a, car> can induce a spurious correlation between the question "What is the bus used for?" and the false answer "City bus". Two challenges must be addressed to make commonsense learning robust against spurious correlations: 1) how to disentangle the true effect of "good" commonsense knowledge from the whole, and 2) how to estimate and remove the effect of "bad" commonsense bias on answers. In this article, we propose a novel CounterFactual Commonsense learning scheme for the Visual Dialog task (CFC-VD). First, compared with the causal graph of existing VD models, we add one new commonsense node and one new link to the multi-modal information from history, question, and image. Since the retrieved knowledge prior is subtle and uncontrollable, we treat it as an unobserved confounder in the commonsense node, which leads to spurious correlations in answer inference. Then, to remove the effect of the confounder, we formulate it as the direct causal effect of commonsense on answers and remove this direct effect by subtracting it from the total causal effect via counterfactual reasoning. Experimental results confirm the effectiveness of our method on the prevailing VisDial v0.9 and VisDial v1.0 datasets.
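The debiasing step the abstract describes — subtracting the direct causal effect of commonsense from the total causal effect — can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the two-branch setup, and the fusion weight `alpha` are illustrative assumptions, following the common counterfactual-inference recipe of estimating a biased branch separately and removing its contribution from the fused answer scores.

```python
def debiased_answer_scores(fused_scores, commonsense_only_scores, alpha=1.0):
    """Counterfactual debiasing sketch (illustrative, not the CFC-VD code).

    fused_scores: answer scores from the full model (image + history +
        question + commonsense) -- an estimate of the Total Effect (TE).
    commonsense_only_scores: scores from a branch that sees only the
        retrieved commonsense -- an estimate of its direct effect (NDE).
    Returns the debiased scores, TE - alpha * NDE, i.e. the indirect
    effect that remains after removing the commonsense shortcut.
    """
    return [t - alpha * c for t, c in zip(fused_scores, commonsense_only_scores)]


# Toy example: a spurious commonsense prior inflates answer 0 ("City bus"),
# while the correct answer is index 1.
te = [2.0, 1.5, 0.3]    # full-model scores, biased toward answer 0
nde = [1.8, 0.2, 0.1]   # commonsense-only scores: the bias itself
tie = debiased_answer_scores(te, nde)
best = max(range(len(tie)), key=lambda i: tie[i])
print(best)  # after subtraction, answer 1 wins
```

The design choice mirrors counterfactual VQA debiasing: the commonsense-only branch can fit only the shortcut, so subtracting its scores at inference time leaves the evidence that genuinely depends on the image, history, and question.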
Pages: 1639-1651
Page count: 13