VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Cited: 3
Authors
Ma, Han [1 ]
Fan, Baoyu [1 ]
Ng, Benjamin K. [1 ]
Lam, Chan-Tong [1 ]
Affiliations
[1] Macao Polytechnic University, Faculty of Applied Sciences, Taipa 999078, Macao, People's Republic of China
Keywords
vision-language models; multimodal learning; meta-learning; token-level training; visual question answering;
DOI
10.3390/math12020286
CLC Number
O1 [Mathematics]
Discipline Codes
0701; 070101
Abstract
Multimodal learning is a promising area of artificial intelligence (AI) that enables a model to understand several kinds of data. Existing approaches retrain a new model on top of pre-trained models, which requires large amounts of data, computation, and time, and is therefore hard to achieve in low-resource or small-sample settings. We propose VL-Meta, Vision-Language Models for Multimodal Meta-Learning. VL-Meta (1) introduces a vision-language mapper and a multimodal fusion mapper, lightweight structures that reuse existing pre-trained models to map images into the language feature space, saving training data, computation, and time; (2) constructs a meta-task pool that needs only a small amount of data to build sufficient training tasks, improving the model's generalization over both data knowledge and task knowledge; (3) proposes token-level training, which aligns inputs with outputs during training to improve model performance; and (4) adopts a multi-task fusion loss so the model learns different abilities. VL-Meta performs well on the Visual Question Answering (VQA) task, demonstrating the feasibility and effectiveness of the approach. This solution can also help blind or visually impaired individuals obtain visual information.
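Note: the abstract names four mechanisms but gives no implementation detail. Purely as a rough sketch in PyTorch, assuming a frozen CLIP-style vision encoder and a frozen GPT-style language model with only the lightweight mapper trained, the snippet below illustrates the general pattern: a vision-language mapper that projects pooled image features into the language model's embedding space as a short prefix of visual tokens, plus a weighted fusion of a token-level language-modeling loss with an auxiliary task loss. The module and function names, dimensions, prefix length, and loss weighting are all illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch only -- not the paper's released implementation.
# Assumes frozen vision and language backbones; only the lightweight mapper
# below is trained, which matches the abstract's claim of saving data,
# computation, and time. Names and dimensions are invented for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionLanguageMapper(nn.Module):
    """Projects frozen vision-encoder features into the language feature space
    as a short prefix of "visual tokens" the language model can consume."""

    def __init__(self, vision_dim: int = 768, lang_dim: int = 2048, n_prefix: int = 10):
        super().__init__()
        self.n_prefix, self.lang_dim = n_prefix, lang_dim
        # A two-layer MLP keeps the trainable part lightweight.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lang_dim * n_prefix // 2),
            nn.GELU(),
            nn.Linear(lang_dim * n_prefix // 2, lang_dim * n_prefix),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, vision_dim) pooled image features.
        prefix = self.mlp(vision_feats)
        return prefix.view(-1, self.n_prefix, self.lang_dim)


def multi_task_fusion_loss(lm_logits, lm_targets, aux_logits, aux_targets, alpha=0.5):
    """Fuses a token-level language-modeling loss with an auxiliary task loss.
    The abstract mentions a multi-task fusion loss but not its exact form,
    so this simple weighted sum is an assumption."""
    # Token-level training: every generated token is supervised directly.
    lm_loss = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1), ignore_index=-100
    )
    aux_loss = F.cross_entropy(aux_logits, aux_targets)
    return alpha * lm_loss + (1.0 - alpha) * aux_loss


if __name__ == "__main__":
    mapper = VisionLanguageMapper()
    feats = torch.randn(4, 768)      # stand-in for frozen-encoder image features
    print(mapper(feats).shape)       # torch.Size([4, 10, 2048])
```

In the same spirit, the paper's meta-task pool would repeatedly sample small support/query episodes from a limited VQA dataset so that the same examples yield many distinct training tasks; that sampling loop is omitted from the sketch.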
Pages: 16
Related Papers
50 records in total
  • [21] Learning Domain Invariant Prompt for Vision-Language Models
    Zhao, Cairong
    Wang, Yubin
    Jiang, Xinyang
    Shen, Yifei
    Song, Kaitao
    Li, Dongsheng
    Miao, Duoqian
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
  • [22] Transferable Multimodal Attack on Vision-Language Pre-training Models
    Wang, Haodi
    Dong, Kai
    Zhu, Zhilei
    Qin, Haotong
    Liu, Aishan
    Fang, Xiaolin
    Wang, Jiakai
    Liu, Xianglong
    45TH IEEE SYMPOSIUM ON SECURITY AND PRIVACY, SP 2024, 2024, : 1722 - 1740
  • [23] Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning
    Yu, Tong
    Shen, Yilin
    Zhang, Ruiyi
    Zeng, Xiangyu
    Jin, Hongxia
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 39 - 47
  • [24] What Matters For Meta-Learning Vision Regression Tasks?
    Gao, Ning
    Ziesche, Hanna
    Ngo Anh Vien
    Volpp, Michael
    Neumann, Gerhard
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 14756 - 14766
  • [25] Vision-Language Models for Vision Tasks: A Survey
    Zhang, Jingyi
    Huang, Jiaxing
    Jin, Sheng
    Lu, Shijian
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
  • [26] GalLoP: Learning Global and Local Prompts for Vision-Language Models
    Lafon, Marc
    Ramzi, Elias
    Rambour, Clement
    Audebert, Nicolas
    Thome, Nicolas
    COMPUTER VISION - ECCV 2024, PT LXI, 2025, 15119 : 264 - 282
  • [27] Adapting Vision-Language Models via Learning to Inject Knowledge
    Xuan, Shiyu
    Yang, Ming
    Zhang, Shiliang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 5798 - 5809
  • [28] JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
    Guo, Yuncheng
    Guo, Xiaodong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 28695 - 28705
  • [29] Visual In-Context Learning for Large Vision-Language Models
    Zhou, Yucheng
    Le, Xiang
    Wang, Qianning
    Shen, Jianbing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 15890 - 15902
  • [30] Learning the Visualness of Text Using Large Vision-Language Models
    Verma, Gaurav
    Rossi, Ryan A.
    Tensmeyer, Christopher
    Gu, Jiuxiang
    Nenkova, Ani
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2394 - 2408