VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Cited by: 3
Authors
Ma, Han [1 ]
Fan, Baoyu [1 ]
Ng, Benjamin K. [1 ]
Lam, Chan-Tong [1 ]
Affiliations
[1] Macao Polytechnic University, Faculty of Applied Sciences, Taipa 999078, Macao, People's Republic of China
Keywords
vision-language models; multimodal learning; meta-learning; token-level training; visual question answering;
DOI
10.3390/math12020286
CLC Number
O1 [Mathematics]
Subject Classification Codes
0701; 070101
Abstract
Multimodal learning is a promising area of artificial intelligence (AI) that enables models to understand different kinds of data. Existing works typically re-train a new model on top of pre-trained models, which requires a large amount of data, computational power, and time, and is therefore difficult to achieve in low-resource or small-sample situations. We propose VL-Meta, Vision-Language Models for Multimodal Meta-Learning. VL-Meta (1) presents a vision-language mapper and a multimodal fusion mapper, lightweight model structures that reuse existing pre-trained models to map images into the language feature space, saving training data, computational power, and time; (2) constructs a meta-task pool that uses only a small amount of data to build sufficient training tasks and improves the generalization of the model by learning both data knowledge and task knowledge; (3) proposes token-level training, which aligns inputs with outputs during training to improve model performance; and (4) adopts a multi-task fusion loss so that the model learns different abilities. VL-Meta achieves good performance on the Visual Question Answering (VQA) task, which shows the feasibility and effectiveness of the model. Such a solution can also help blind or visually impaired individuals obtain visual information.
Pages: 16
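
To make the components described in the abstract more concrete, the following is a minimal, illustrative PyTorch sketch of a vision-language mapper, a meta-task pool builder, and a token-level training loss. The class and function names, feature dimensions, and episode format are assumptions chosen for illustration and are not taken from the paper's implementation.

```python
# Illustrative sketch only: names, dimensions, and episode format are
# assumptions, not the authors' released VL-Meta code.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionLanguageMapper(nn.Module):
    """Lightweight mapper that projects frozen image features into the
    language model's token-embedding space as a short visual prefix."""

    def __init__(self, vision_dim: int = 768, lang_dim: int = 1024, prefix_len: int = 4):
        super().__init__()
        self.prefix_len = prefix_len
        self.lang_dim = lang_dim
        self.proj = nn.Linear(vision_dim, lang_dim * prefix_len)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, vision_dim) -> (batch, prefix_len, lang_dim)
        return self.proj(image_feats).view(-1, self.prefix_len, self.lang_dim)


def build_meta_task_pool(samples, num_tasks, support_k, query_k):
    """Meta-task pool: repeatedly sample small support/query episodes from a
    small labelled set, so many distinct training tasks are available."""
    pool = []
    for _ in range(num_tasks):
        episode = random.sample(samples, support_k + query_k)
        pool.append({"support": episode[:support_k], "query": episode[support_k:]})
    return pool


def token_level_loss(logits, target_ids, pad_id=0):
    """Token-level training objective: align each predicted token with the
    target token at the same position (cross-entropy, padding ignored)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1), ignore_index=pad_id
    )
```

In a full model, the visual prefix produced by the mapper would be concatenated with the question's token embeddings before being fed to the pre-trained language model, and episodes drawn from the pool would drive meta-training; that wiring and the multimodal fusion mapper are omitted from this sketch.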