ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering

Cited by: 0
|
Authors
Shu, Xinyao [1 ]
Yan, Shiyang [2 ]
Yang, Xu [3 ]
Wu, Ziheng [1 ]
Chen, Zhongfeng [1 ]
Lu, Zhenyu [1 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, Nanjing, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei, Peoples R China
[3] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Visual question answering; Language bias; Distance metric learning; Self-supervised learning; Counterfactual learning; RELEVANCE FEEDBACK;
DOI
10.1016/j.eswa.2023.123125
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions according to visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers based solely on surface-level correlations between question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language-bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method to enhance the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. This module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects based on the given question. The question-relevant visual positive objects are used directly to generate the predicted answer, reducing the influence of visually distracting information on the model's understanding of the image and ensuring that they are the actual cause of the answer. The question-irrelevant visual negative objects are treated as counterfactual samples to guide model training and prevent the model from being driven by language bias. To avoid incorrect classification of images near the classification boundary during training, we propose an adaptive contrastive loss that automatically adjusts the measured distance to push apart images near that boundary. Our method has been extensively evaluated on the VQA-CP dataset, demonstrating its effectiveness and yielding improved results.
Specifically, by leveraging the LMH model as a foundation, we achieve state-of-the-art performance on both the VQA-CPv1 and VQA-CPv2 datasets. Notably, our method significantly enhances the accuracy of the baseline, with improvements of 10.36% on the VQA-CPv2 dataset and 9.38% on the VQA-CPv1 dataset. The source code is publicly available at: https://github.com/shuxy0120/ASCL.
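The mechanism described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the top-k split, the cosine scoring, and the margin schedule are all assumptions made for the sketch. It shows the two ideas the abstract names: splitting detected objects into question-relevant positives and question-irrelevant negatives, and a contrastive loss whose margin widens for samples near the decision boundary.

```python
import numpy as np

def split_objects(obj_feats, q_feat, top_k=3):
    """Score each detected object against the question embedding and split
    the image into question-relevant (positive) and question-irrelevant
    (negative) object index sets (illustrative top-k rule)."""
    # Cosine similarity between each object feature and the question feature.
    scores = obj_feats @ q_feat / (
        np.linalg.norm(obj_feats, axis=1) * np.linalg.norm(q_feat) + 1e-8)
    order = np.argsort(-scores)          # most question-relevant first
    pos_idx, neg_idx = order[:top_k], order[top_k:]
    return pos_idx, neg_idx, scores

def adaptive_contrastive_loss(anchor, positive, negative, base_margin=0.2):
    """Triplet-style contrastive loss whose margin grows when the positive
    and negative distances are close, i.e. for boundary ('hard') samples."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Hardness is high when the two distances are nearly equal; widening the
    # margin then pushes boundary samples further apart.
    hardness = max(0.0, 1.0 - abs(d_neg - d_pos))
    margin = base_margin * (1.0 + hardness)
    return max(0.0, d_pos - d_neg + margin)

# Toy example: 10 detected objects with 4-d features, one question vector.
rng = np.random.default_rng(0)
obj_feats = rng.normal(size=(10, 4))
q_feat = rng.normal(size=4)
pos_idx, neg_idx, scores = split_objects(obj_feats, q_feat, top_k=3)
```

The split objects would then feed two branches: the positives drive the answer prediction, while the negatives serve as counterfactual inputs whose predictions are pushed away from the ground-truth answer by the loss above.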
Pages: 16
Related Papers
50 records in total
  • [1] Simple contrastive learning in a self-supervised manner for robust visual question answering
    Yang, Shuwen
    Xiao, Luwei
    Wu, Xingjiao
    Xu, Junjie
    Wang, Linlin
    He, Liang
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241
  • [2] Overcoming Language Priors with Self-supervised Learning for Visual Question Answering
    Zhu, Xi
    Mao, Zhendong
    Liu, Chunxiao
    Zhang, Peng
    Wang, Bin
    Zhang, Yongdong
    [J]. PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1083 - 1089
  • [3] Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering
    Liang, Zujie
    Jiang, Weitao
    Hu, Haifeng
    Zhu, Jiaying
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 3285 - 3292
  • [4] elBERto: Self-supervised commonsense learning for question answering
    Zhan, Xunlin
    Li, Yuan
    Dong, Xiao
    Liang, Xiaodan
    Hu, Zhiting
    Carin, Lawrence
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 258
  • [5] Self-supervised Dialogue Learning for Spoken Conversational Question Answering
    Chen, Nuo
    You, Chenyu
    Zou, Yuexian
    [J]. INTERSPEECH 2021, 2021, : 231 - 235
  • [6] Self-supervised Graph Contrastive Learning for Video Question Answering
    Yao, Xuan
    Gao, Jun-Yu
    Xu, Chang-Sheng
    [J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): : 2083 - 2100
  • [7] QASAR: Self-Supervised Learning Framework for Extractive Question Answering
    Assem, Haytham
    Sarkar, Rajdeep
    Dutta, Sourav
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 1797 - 1808
  • [8] SELF-SUPERVISED VISION-LANGUAGE PRETRAINING FOR MEDICAL VISUAL QUESTION ANSWERING
    Li, Pengfei
    Liu, Gang
    Tan, Lin
    Liao, Jinying
    Zhong, Shenjun
    [J]. 2023 IEEE 20TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING, ISBI, 2023,
  • [9] PGCL: Prompt guidance and self-supervised contrastive learning-based method for Visual Question Answering
    Gao, Ling
    Zhang, Hongda
    Liu, Yiming
    Sheng, Nan
    Feng, Haotian
    Xu, Hao
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 251
  • [10] Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering
    Chen, Long
    Zheng, Yuhang
    Niu, Yulei
    Zhang, Hanwang
    Xiao, Jun
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13218 - 13234