Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering

Cited: 2
Authors
Lu, Qiwen [1 ]
Chen, Shengbo [1 ]
Zhu, Xiaoke [1 ]
Affiliations
[1] Henan Univ, Sch Comp & Informat Engn, Kaifeng 475001, Peoples R China
Keywords
visual question answering; collaborative learning; language bias
DOI
10.3390/jimaging10030056
Chinese Library Classification
TB8 [Photographic Technology]
Discipline Code
0804
Abstract
Language bias is a notable concern in visual question answering (VQA): models tend to rely on spurious correlations between questions and answers when predicting, which prevents them from generalizing effectively and degrades performance. To address this bias, we propose a novel modality fusion collaborative de-biasing algorithm (CoD). In our approach, bias is viewed as the model's neglect of information from a particular modality during prediction. We employ a collaborative training scheme in which the modalities model one another, achieving efficient feature fusion and enabling the model to fully exploit multimodal knowledge for prediction. Experiments on the VQA-CP v2, VQA v2, and VQA-VS datasets under different validation strategies demonstrate the effectiveness of our approach; notably, with a basic baseline model, CoD achieves 60.14% accuracy on VQA-CP v2.
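The record contains no implementation details, so the following is only a minimal PyTorch sketch of the general idea the abstract describes: a fused prediction branch trained jointly with per-modality branches so that neither the question nor the image can be ignored. All names (CoDHead, collaborative_loss), the dimensions, the Hadamard fusion, and the specific loss terms (a fused soft-label cross-entropy plus KL consistency terms) are illustrative assumptions, not the paper's actual CoD algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoDHead(nn.Module):
    """Toy collaborative-fusion head (hypothetical): a question-only branch,
    a vision-only branch, and a fused branch that depends on both modalities."""

    def __init__(self, q_dim=1024, v_dim=2048, hidden=1024, n_answers=3129):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.v_proj = nn.Linear(v_dim, hidden)
        self.q_head = nn.Linear(hidden, n_answers)  # question-only classifier
        self.v_head = nn.Linear(hidden, n_answers)  # vision-only classifier
        self.f_head = nn.Linear(hidden, n_answers)  # fused classifier

    def forward(self, q_feat, v_feat):
        q = F.relu(self.q_proj(q_feat))
        v = F.relu(self.v_proj(v_feat))
        fused = q * v  # simple multiplicative (Hadamard) fusion
        return self.q_head(q), self.v_head(v), self.f_head(fused)


def collaborative_loss(q_logits, v_logits, f_logits, target, alpha=0.5):
    """Assumed joint objective: supervise the fused branch with the standard
    VQA soft-label loss, and pull each unimodal branch toward the fused
    distribution so gradients flow through both modalities during training."""
    ce = F.binary_cross_entropy_with_logits(f_logits, target)
    log_p_f = F.log_softmax(f_logits.detach(), dim=-1)  # fused branch as teacher
    kl_q = F.kl_div(F.log_softmax(q_logits, dim=-1), log_p_f,
                    reduction="batchmean", log_target=True)
    kl_v = F.kl_div(F.log_softmax(v_logits, dim=-1), log_p_f,
                    reduction="batchmean", log_target=True)
    return ce + alpha * (kl_q + kl_v)


# Usage with random stand-in features and soft answer labels:
head = CoDHead()
q = torch.randn(32, 1024)   # pooled question features
v = torch.randn(32, 2048)   # pooled image features
y = torch.rand(32, 3129)    # VQA soft answer scores in [0, 1]
loss = collaborative_loss(*head(q, v), y)
loss.backward()
```

The consistency terms here act as mutual distillation between branches; whether CoD uses KL divergence, a different fusion operator, or additional bias branches is not specified in this record.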
Pages: 15
Related Papers
50 records in total
  • [21] Adaptive Attention Fusion Network for Visual Question Answering
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002
  • [22] Relational reasoning and adaptive fusion for visual question answering
    Shen, Xiang
    Han, Dezhi
    Zong, Liang
    Guo, Zihan
    Hua, Jie
    APPLIED INTELLIGENCE, 2024, 54 (06) : 5062 - 5080
  • [23] MUTAN: Multimodal Tucker Fusion for Visual Question Answering
    Ben-younes, Hedi
    Cadene, Remi
    Cord, Matthieu
    Thome, Nicolas
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 2631 - 2639
  • [24] Fusion of Detected Objects in Text for Visual Question Answering
    Alberti, Chris
    Ling, Jeffrey
    Collins, Michael
    Reitter, David
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2131 - 2140
  • [25] Context Relation Fusion Model for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2112 - 2116
  • [26] An Empirical Study on the Language Modal in Visual Question Answering
    Peng, Daowan
    Wei, Wei
    Mao, Xian-Ling
    Fu, Yuanyuan
    Chen, Dangyang
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 4109 - 4117
  • [27] Language Transformers for Remote Sensing Visual Question Answering
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [28] Bridging the Cross-Modality Semantic Gap in Visual Question Answering
    Wang, Boyue
    Ma, Yujian
    Li, Xiaoyan
    Gao, Junbin
    Hu, Yongli
    Yin, Baocai
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (03) : 4519 - 4531
  • [30] Multi-modality Latent Interaction Network for Visual Question Answering
    Gao, Peng
    You, Haoxuan
    Zhang, Zhanpeng
    Wang, Xiaogang
    Li, Hongsheng
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5824 - 5834