VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Cited by: 3
Authors
Ma, Han [1 ]
Fan, Baoyu [1 ]
Ng, Benjamin K. [1 ]
Lam, Chan-Tong [1 ]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Taipa 999078, Macao, Peoples R China
Keywords
vision-language models; multimodal learning; meta-learning; token-level training; visual question answering;
DOI
10.3390/math12020286
Chinese Library Classification (CLC)
O1 [Mathematics];
Discipline Codes
0701; 070101;
Abstract
Multimodal learning is a promising area of artificial intelligence (AI) that enables a model to understand different kinds of data. Existing works typically re-train a new model on top of pre-trained models, which requires large amounts of data, computation, and time and is therefore difficult in low-resource or small-sample settings. We therefore propose VL-Meta, Vision-Language Models for Multimodal Meta-Learning. VL-Meta (1) introduces a vision-language mapper and a multimodal fusion mapper, two lightweight structures that reuse existing pre-trained models by mapping image features into the language feature space, saving training data, computation, and time; (2) constructs a meta-task pool that builds sufficient training tasks from only a small amount of data, improving the model's generalization over both data knowledge and task knowledge; (3) proposes token-level training, which aligns inputs with outputs during training to improve performance; and (4) adopts a multi-task fusion loss so the model learns multiple abilities jointly. VL-Meta achieves good performance on the Visual Question Answering (VQA) task, demonstrating the feasibility and effectiveness of the approach. Such a solution can also help blind or visually impaired individuals obtain visual information.
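The abstract's point (1) can be illustrated with a minimal sketch. This is not the authors' code: it assumes the "vision-language mapper" is a small learned projection that turns one frozen vision-encoder feature vector into a short sequence of pseudo-token embeddings that a frozen language model could consume as a prefix, and that the multi-task fusion loss is a weighted sum of per-task losses. All dimensions, names, and weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

VISION_DIM = 512   # assumed feature size of the frozen vision encoder
LANG_DIM = 768     # assumed embedding size of the frozen language model
NUM_PREFIX = 4     # assumed number of pseudo-tokens produced by the mapper

# The only trainable parameters in this sketch: one linear projection that
# maps a single image feature vector to NUM_PREFIX language-space embeddings.
W = rng.normal(scale=0.02, size=(VISION_DIM, NUM_PREFIX * LANG_DIM))
b = np.zeros(NUM_PREFIX * LANG_DIM)

def vision_language_mapper(image_feat: np.ndarray) -> np.ndarray:
    """Map a (VISION_DIM,) image feature to (NUM_PREFIX, LANG_DIM) prefix embeddings."""
    return (image_feat @ W + b).reshape(NUM_PREFIX, LANG_DIM)

def multi_task_fusion_loss(task_losses: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Assumed form of the fusion loss: a weighted sum over per-task losses."""
    return sum(weights[name] * loss for name, loss in task_losses.items())

image_feat = rng.normal(size=VISION_DIM)        # stand-in for encoder output
prefix = vision_language_mapper(image_feat)
print(prefix.shape)                              # (NUM_PREFIX, LANG_DIM)

fused = multi_task_fusion_loss(
    {"vqa": 1.2, "captioning": 0.8},
    {"vqa": 0.7, "captioning": 0.3},
)
print(round(fused, 2))
```

The design point being sketched is that only the mapper's parameters would be trained, while both pre-trained encoders stay frozen, which is what keeps the data, computation, and time requirements low.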
Pages: 16