GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

Cited by: 0
Authors
Li, Xin [1 ,2 ]
Lian, Dongze [2 ]
Lu, Zhihe [2 ]
Bai, Jiawang [2 ,3 ]
Chen, Zhibo [1 ]
Wang, Xinchao [2 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Tsinghua Univ, Beijing, Peoples R China
Funding
National Research Foundation, Singapore;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to extract task-specific knowledge on top of the general and powerful representations of VLMs. However, most adapter-style works face two limitations: (i) they model task-specific knowledge with a single modality only; and (ii) they overlook the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate these limitations, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which builds a textual adapter that explicitly models dual-modality structure knowledge (i.e., the correlations of different semantics/classes in the textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph consists of two sub-graphs, i.e., a textual knowledge sub-graph and a visual knowledge sub-graph, where the nodes represent the semantics/classes and the edges represent their correlations in the two modalities, respectively. This enables the textual feature of each prompt to leverage task-specific structure knowledge from both the textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter.
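The dual-graph mechanism described in the abstract can be illustrated with a minimal NumPy sketch: class text features are propagated over two similarity graphs (one built from textual class embeddings, one from visual class prototypes) and blended back with the original features. All function names, the softmax temperature `tau`, the residual ratio `alpha`, and the equal-weight fusion of the two sub-graphs are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_adj(feats, tau=0.5):
    """Build one knowledge sub-graph: nodes are classes, edge weights
    come from a softmax over cosine similarities (row-stochastic)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                       # pairwise cosine similarity
    a = np.exp(sim / tau)               # positive edge weights
    return a / a.sum(axis=1, keepdims=True)

def graph_adapter(text_feats, vis_feats, alpha=0.7):
    """Hypothetical sketch: each class's text feature aggregates
    structure knowledge from both the textual and the visual sub-graph,
    then is residually blended with the original feature."""
    A_t = cosine_adj(text_feats)        # textual knowledge sub-graph
    A_v = cosine_adj(vis_feats)         # visual knowledge sub-graph
    propagated = 0.5 * (A_t @ text_feats + A_v @ text_feats)
    return alpha * text_feats + (1 - alpha) * propagated

# Toy example: 3 classes with 4-dim text embeddings and visual prototypes.
rng = np.random.default_rng(0)
t = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
out = graph_adapter(t, v)
print(out.shape)  # (3, 4): one adapted classifier weight per class
```

In the real method the adapted text features would then serve as classifier weights against image features; here the point is only that inter-class structure from both modalities flows into each class's representation.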
Pages: 19