Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering

Cited by: 2
Authors
Jiang, Jingjing [1 ]
Liu, Ziyi [1 ]
Zheng, Nanning [1 ]
Institution
[1] Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, Shaanxi, People's Republic of China
Funding
U.S. National Science Foundation;
Keywords
Information bottleneck; Robustness; Visual question answering; Vision-language model
DOI
10.1007/s11263-023-01858-y
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Benefiting from large-scale pretrained vision-language models (VLMs), the performance of visual question answering (VQA) has approached human oracle performance. However, finetuning such models on limited data often suffers from overfitting and poor generalization, leaving the finetuned models lacking robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to a model's ability to withstand visual and linguistic input variations, as well as the shortcut learning induced by those inputs. Generally, the representations obtained by pretrained VLMs inevitably contain information that is irrelevant and redundant for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose the Correlation Information Bottleneck (CIB), which seeks a tradeoff between representation compression and redundancy by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound on the MI between multimodal inputs and representations that incorporates different internal correlations, guiding models to learn more robust representations and facilitating modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
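As a rough illustration of the objective sketched in the abstract, the following minimal PyTorch snippet implements a generic variational information bottleneck loss. This is not the paper's exact CIB bound (whose correlation-aware terms are derived in the full text); the function names vib_loss and reparameterize and the hyperparameter beta are illustrative. The KL divergence to a standard normal prior acts as a tractable upper bound on I(X; Z), and the cross-entropy acts as a lower bound on I(Z; Y) up to a constant, so beta sets the compression-prediction tradeoff.

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Sample z ~ N(mu, diag(sigma^2)) via the reparameterization trick
    std = (0.5 * logvar).exp()
    return mu + std * torch.randn_like(std)

def vib_loss(mu, logvar, logits, labels, beta=1e-3):
    # KL( N(mu, sigma^2) || N(0, I) ): variational upper bound on I(X; Z)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    # Cross-entropy: variational lower bound on I(Z; Y), up to a constant
    ce = F.cross_entropy(logits, labels)
    return ce + beta * kl

Minimizing this loss minimizes an upper bound on -I(Z; Y) + beta * I(X; Z), i.e., the usual information bottleneck Lagrangian; per the abstract, CIB replaces the input-representation term with a tighter bound that accounts for correlations between the visual and linguistic representations.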
Pages: 185-207
Number of pages: 23