Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models

Cited by: 0
Authors
Wang, Yubin [1 ]
Jiang, Xinyang [2 ]
Cheng, De [3 ]
Li, Dongsheng [2 ]
Zhao, Cairong [1 ]
Affiliations
[1] Tongji Univ, Shanghai, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Xidian Univ, Xian, Peoples R China
Keywords
DOI
None available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored using category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions lack the structured information needed to represent the interconnections among the entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates leveraging LLMs to build a graph for each description that models the entities and attributes describing the category, as well as their correlations. Existing prompt tuning methods are ill-equipped to handle this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which models structured and conventional linguistic knowledge simultaneously. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts that model overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that HPT is highly effective and generalizes much better than existing state-of-the-art methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT.
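The abstract describes two technical ingredients: an LLM-built description graph and a relationship-guided attention module that biases attention between entity/attribute tokens according to their graph relations. The following minimal PyTorch sketch illustrates only that biasing idea; every name in it (RelationshipGuidedAttention, rel_ids, rel_bias, the tensor shapes) is a hypothetical illustration, not the authors' implementation, which is available in the repository linked above.

```python
# Minimal sketch of the relationship-guided attention idea from the abstract.
# All names below are hypothetical illustrations, NOT the authors' code; see
# the linked repository for the actual HPT implementation.
import torch
import torch.nn as nn


class RelationshipGuidedAttention(nn.Module):
    """Self-attention whose scores are additively biased by the pairwise
    relation types among entity/attribute nodes of an LLM-built graph."""

    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable scalar bias per relation type (index 0 = no relation).
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, nodes: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # nodes:   (B, N, D) embeddings of graph nodes (entities/attributes)
        # rel_ids: (B, N, N) integer relation type between node i and node j
        q, k, v = self.qkv(nodes).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale       # (B, N, N)
        scores = scores + self.rel_bias(rel_ids).squeeze(-1)  # graph-guided bias
        return self.proj(scores.softmax(dim=-1) @ v)


# Low-level prompts would be derived from these node features, while high- and
# global-level prompts summarize whole descriptions, forming the hierarchy.
if __name__ == "__main__":
    B, N, D = 2, 6, 512
    attn = RelationshipGuidedAttention(dim=D, num_relations=4)
    nodes = torch.randn(B, N, D)
    rel_ids = torch.randint(0, 4, (B, N, N))
    print(attn(nodes, rel_ids).shape)  # torch.Size([2, 6, 512])
```

The additive relation bias mirrors how relative-position biases are injected into transformer attention; HPT's actual module may differ in its details.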
Pages: 5749-5757
Page count: 9
Related papers
50 items in total
  • [41] Vision-Language Models as Success Detectors
    Du, Yuqing
    Konyushkova, Ksenia
    Denil, Misha
    Raju, Akhil
    Landon, Jessica
    Hill, Felix
    de Freitas, Nando
    Cabi, Serkan
    CONFERENCE ON LIFELONG LEARNING AGENTS, VOL 232, 2023, 232 : 120 - 136
  • [42] MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
    Monajatipoor, Masoud
    Li, Liunian Harold
    Rouhsedaghat, Mozhdeh
    Yang, Lin F.
    Chang, Kai-Wei
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023 : 495 - 508
  • [43] Pre-training A Prompt Pool for Vision-Language Model
    Liu, Jun
    Gu, Yang
    Yang, Zhaohua
    Guo, Shuai
    Liu, Huaqiu
    Chen, Yiqiang
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [44] Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification
    Xuan, Yunyi
    Chen, Weijie
    Yang, Shicai
    Xie, Di
    Lin, Luojun
    Zhuang, Yueting
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 4928 - 4938
  • [45] Debiasing vision-language models for vision tasks: a survey
    Zhu, Beier
    Zhang, Hanwang
    Frontiers of Computer Science, 2025, 19 (01)
  • [46] Multiple Prompt Fusion for Zero-Shot Lesion Detection Using Vision-Language Models
    Guo, Miaotian
    Yi, Huahui
    Qin, Ziyuan
    Wang, Haiying
    Men, Aidong
    Lao, Qicheng
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT V, 2023, 14224 : 283 - 292
  • [47] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
    Shu, Manli
    Nie, Weili
    Huang, De-An
    Yu, Zhiding
    Goldstein, Tom
    Anandkumar, Anima
    Xiao, Chaowei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [48] LiFT: Transfer Learning in Vision-Language Models for Downstream Adaptation and Generalization
    Li, Jingzheng
    Sun, Hailong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 4678 - 4687
  • [49] Cross-Modal Concept Learning and Inference for Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Tang, Yushun
    He, Zhihai
    NEUROCOMPUTING, 2024, 583
  • [50] UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge
    Li, Chuanhao
    Li, Zhen
    Jing, Chenchen
    Liu, Shuo
    Shao, Wenqi
    Wu, Yuwei
    Luo, Ping
    Qiao, Yu
    Zhang, Kaipeng
    arXiv,