GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

Cited by: 0

Authors:

Li, Xin [1,2]

Lian, Dongze [2]

Lu, Zhihe [2]

Bai, Jiawang [2,3]

Chen, Zhibo [1]

Wang, Xinchao [2]

Affiliations:

[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China

[2] Natl Univ Singapore, Singapore, Singapore

[3] Tsinghua Univ, Beijing, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Funding:

National Research Foundation, Singapore

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Adapter-style efficient transfer learning (ETL) has shown excellent performance in tuning vision-language models (VLMs) in the low-data regime, where only a few additional parameters are introduced to extract task-specific knowledge on top of the general and powerful representations of VLMs. However, most adapter-style works face two limitations: (i) they model task-specific knowledge with a single modality only; and (ii) they overlook the inter-class relationships in downstream tasks, leading to sub-optimal solutions. To mitigate these limitations, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which implements the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph consists of two sub-graphs, a textual knowledge sub-graph and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in the two modalities, respectively. This enables the textual feature of each prompt to leverage task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets show that GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter
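
The abstract describes the idea only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes that each class has one textual feature (a node of the textual sub-graph) and one visual prototype (a node of the visual sub-graph), that edges are softmax-normalized cosine similarities, and that the adapted textual feature is a residual blend of the original feature with the knowledge aggregated from both sub-graphs. All function names and the coefficients `alpha` and `beta` are hypothetical and do not come from the abstract.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is returned unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity, used here as the assumed edge weight between nodes.
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(queries, nodes):
    # For each query feature, return a similarity-weighted mix of the graph nodes.
    dim = len(nodes[0])
    out = []
    for q in queries:
        weights = softmax([cosine(q, n) for n in nodes])
        out.append([sum(w * n[d] for w, n in zip(weights, nodes)) for d in range(dim)])
    return out

def graph_adapter_sketch(text_feats, visual_protos, alpha=0.5, beta=0.7):
    # alpha balances textual vs. visual structure knowledge (assumed);
    # beta is the residual weight kept on the original textual feature (assumed).
    from_text = aggregate(text_feats, text_feats)
    from_visual = aggregate(text_feats, visual_protos)
    adapted = []
    for t, a, b in zip(text_feats, from_text, from_visual):
        struct = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
        mixed = [beta * x + (1 - beta) * s for x, s in zip(t, struct)]
        adapted.append(l2_normalize(mixed))
    return adapted

def classify(image_feat, class_feats):
    # Predict the class whose (adapted) textual feature is most similar to the image.
    scores = [cosine(image_feat, c) for c in class_feats]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy three-class example with made-up 3-dimensional features.
text_feats = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2], [0.7, 0.7, 0.0]]
visual_protos = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.6, 0.6, 0.5]]
adapted = graph_adapter_sketch(text_feats, visual_protos)
print(classify([0.95, 0.05, 0.1], adapted))
```

In this sketch the adapted textual features act as the classifier weights, which mirrors the abstract's statement that each prompt's textual feature draws on structure knowledge from both modalities. A learned graph network would normally replace the fixed similarity-weighted averaging used here.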


Pages: 19
