ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering

Citations: 0
Authors
Shu, Xinyao [1 ]
Yan, Shiyang [2 ]
Yang, Xu [3 ]
Wu, Ziheng [1 ]
Chen, Zhongfeng [1 ]
Lu, Zhenyu [1 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, Nanjing, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei, Peoples R China
[3] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Visual question answering; Language bias; Distance metric learning; Self-supervised learning; Counterfactual learning; Relevance feedback
DOI
10.1016/j.eswa.2023.123125
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions according to visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers based solely on surface-level correlations between the question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method that enhances the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. Given a question, this module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects. The question-relevant positive objects are used directly to generate the predicted answer, which reduces the influence of distracting visual information on the model's understanding of the image and ensures that they are the actual cause of the answer. The question-irrelevant negative objects are treated as counterfactual samples that guide model training and prevent the model from being driven by language bias. To avoid misclassifying images near the decision boundary during training, we propose an adaptive contrastive loss that automatically adjusts the metric distance to push apart images lying on the classification boundary. Our method has been extensively evaluated on the VQA-CP datasets, demonstrating its effectiveness. Specifically, with the LMH model as a foundation, we achieve state-of-the-art performance on both the VQA-CP v1 and VQA-CP v2 datasets. Notably, our method significantly improves the accuracy of the baseline, by 10.36% on VQA-CP v2 and 9.38% on VQA-CP v1. The source code is publicly available at: https://github.com/shuxy0120/ASCL.
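The abstract's core mechanism, a question-guided split of region features into positive and negative views combined with a counterfactual objective, can be sketched in PyTorch as below. This is a minimal illustration, not the authors' released implementation (see the linked repository for that): the names AdaptiveFeatureSelection and ascl_style_loss, the learnable threshold tau, the feature dimensions, and the margin value are all assumptions, and the adaptive contrastive loss the abstract describes is approximated here by a fixed-margin ranking term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureSelection(nn.Module):
    """Question-guided split of region features into question-relevant
    (positive) and question-irrelevant (negative) views, using a learned
    relevance score and a learnable soft threshold `tau` (the threshold
    mechanism here is an assumption, not the paper's exact rule)."""

    def __init__(self, v_dim, q_dim, hidden_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.tau = nn.Parameter(torch.zeros(1))  # adaptive decision boundary

    def forward(self, v, q):
        # v: (batch, num_objects, v_dim) region features
        # q: (batch, q_dim) question embedding
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        rel = self.score(joint).squeeze(-1)            # (batch, num_objects)
        mask = torch.sigmoid(rel - self.tau)           # soft relevance in [0, 1]
        v_pos = v * mask.unsqueeze(-1)                 # question-relevant objects
        v_neg = v * (1.0 - mask).unsqueeze(-1)         # counterfactual view
        return v_pos, v_neg, mask

def ascl_style_loss(logits_pos, logits_neg, target, margin=0.2):
    """Factual answer loss on the positive view, plus a margin term that
    keeps the counterfactual (negative-view) score on the ground-truth
    answer below the factual one; the margin value is illustrative."""
    factual = F.binary_cross_entropy_with_logits(logits_pos, target)
    s_pos = (torch.sigmoid(logits_pos) * target).sum(-1)  # mass on true answers
    s_neg = (torch.sigmoid(logits_neg) * target).sum(-1)
    counterfactual = F.relu(margin - (s_pos - s_neg)).mean()
    return factual + counterfactual

# Toy usage with random tensors standing in for Faster R-CNN regions,
# a question encoder, and one-hot VQA answer targets.
afs = AdaptiveFeatureSelection(v_dim=2048, q_dim=1024)
answer_head = nn.Linear(2048, 3000)                    # 3000 candidate answers
v, q = torch.randn(8, 36, 2048), torch.randn(8, 1024)
target = torch.zeros(8, 3000).scatter_(1, torch.randint(0, 3000, (8, 1)), 1.0)
v_pos, v_neg, _ = afs(v, q)
loss = ascl_style_loss(answer_head(v_pos.sum(1)), answer_head(v_neg.sum(1)), target)
loss.backward()
```

The design intent matches the abstract: only the positive view feeds the answer head, so the answer is grounded in question-relevant objects, while the negative view supplies a counterfactual signal that penalizes answers reachable without the relevant visual evidence.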
Pages: 16
Related papers
50 records in total
  • [41] Self-supervised Visual Representation Learning for Histopathological Images
    Yang, Pengshuai
    Hong, Zhiwei
    Yin, Xiaoxu
    Zhu, Chengzhan
    Jiang, Rui
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT II, 2021, 12902 : 47 - 57
  • [42] Transitive Invariance for Self-supervised Visual Representation Learning
    Wang, Xiaolong
    He, Kaiming
    Gupta, Abhinav
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1338 - 1347
  • [43] Scaling and Benchmarking Self-Supervised Visual Representation Learning
    Goyal, Priya
    Mahajan, Dhruv
    Gupta, Abhinav
    Misra, Ishan
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6400 - 6409
  • [44] Self-Supervised Visual Descriptor Learning for Dense Correspondence
    Schmidt, Tanner
    Newcombe, Richard
    Fox, Dieter
    [J]. IEEE ROBOTICS AND AUTOMATION LETTERS, 2017, 2 (02): : 420 - 427
  • [45] Self-Supervised Visual Representation Learning with Semantic Grouping
    Wen, Xin
    Zhao, Bingchen
    Zheng, Anlin
    Zhang, Xiangyu
    Qi, Xiaojuan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [46] Self-supervised representation learning by predicting visual permutations
    Zhao, Qilu
    Dong, Junyu
    [J]. KNOWLEDGE-BASED SYSTEMS, 2020, 210
  • [47] Self-supervised Visual Attribute Learning for Fashion Compatibility
    Kim, Donghyun
    Saito, Kuniaki
    Mishra, Samarth
    Sclaroff, Stan
    Saenko, Kate
    Plummer, Bryan A.
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 1057 - 1066
  • [48] A novel self-supervised graph model based on counterfactual learning for diversified recommendation
    Ji, Pu
    Yang, Minghui
    Sun, Rui
    [J]. INFORMATION SYSTEMS, 2024, 121
  • [49] Robust Explanations for Visual Question Answering
    Patro, Badri N.
    Patel, Shivansh
    Namboodiri, Vinay P.
    [J]. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1566 - 1575
  • [50] Debiasing Medical Visual Question Answering via Counterfactual Training
    Zhan, Chenlu
    Peng, Peng
    Zhang, Hanrong
    Sun, Haiyue
    Shang, Chunnan
    Chen, Tao
    Wang, Hongsen
    Wang, Gaoang
    Wang, Hongwei
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT II, 2023, 14221 : 382 - 393