Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering

Cited by: 2
Authors
Jiang, Jingjing [1 ]
Liu, Ziyi [1 ]
Zheng, Nanning [1 ]
Affiliations
[1] Xi'an Jiaotong University, Institute of Artificial Intelligence and Robotics, Xi'an 710049, Shaanxi, People's Republic of China
Funding
U.S. National Science Foundation
Keywords
Information bottleneck; Robustness; Visual question answering; Vision-language model; Language
DOI
10.1007/s11263-023-01858-y
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Benefiting from large-scale pretrained vision-language models (VLMs), the performance of visual question answering (VQA) has approached human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization, leading to a lack of robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to withstand visual and linguistic input variations, as well as shortcut learning induced by inputs. In general, the representations obtained by pretrained VLMs inevitably contain information that is irrelevant and redundant for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose the Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound on the MI between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
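For orientation, the tradeoff described in the abstract can be written as the standard information bottleneck Lagrangian. The formulation below is a generic textbook sketch, not the paper's CIB bound; the symbols X (multimodal input), Z (learned representation), Y (answer), and beta (tradeoff weight) are notation assumed for illustration:

\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0,

where minimizing I(X; Z) compresses away input information irrelevant to the task, and maximizing I(Z; Y) keeps the representation sufficient for predicting the answer. Per the abstract, CIB departs from this generic form by replacing the intractable I(X; Z) term with a tighter upper bound that incorporates correlations within and between the visual and linguistic modalities.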
Pages: 185-207
Number of pages: 23