Using Augmented Small Multimodal Models to Guide Large Language Models for Multimodal Relation Extraction

Cited by: 5

Authors:

He, Wentao [1]

Ma, Hanjie [1]

Li, Shaohua [1]

Dong, Hui [2]

Zhang, Haixiang [1]

Feng, Jie [1]

Affiliations:

[1] Zhejiang Sci Tech Univ, Sch Comp Sci & Technol, Hangzhou 310018, Peoples R China

[2] Hangzhou Codvis Technol Co Ltd, Hangzhou 311100, Peoples R China

Source:

APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 22

Keywords:

multimodal relation extraction; small multimodal guidance; multimodal relation data augmentation; flexible threshold loss; large language model;

DOI:

10.3390/app132212208

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

Multimodal Relation Extraction (MRE) is a core task for constructing Multimodal Knowledge Graphs (MKGs). Most current research is based on fine-tuning small-scale single-modal image and text pre-trained models, but we find that image-text datasets from network media suffer from data scarcity, simple text data, and abstract image information, which require substantial external knowledge for supplementation and reasoning. We use Multimodal Relation Data Augmentation (MRDA) to address the data scarcity problem in MRE, and propose a Flexible Threshold Loss (FTL) to handle the imbalanced entity pair distribution and long-tailed classes. After obtaining prompt information from the small model, which serves as a guide model, we employ a Large Language Model (LLM) as a knowledge engine to acquire common sense and reasoning abilities. Notably, both stages of our framework are flexibly replaceable: the first stage adapts to multimodal classification tasks for small models, and the second stage can be replaced by more powerful LLMs. In experiments, our EMRE2llm model framework achieves state-of-the-art performance on the challenging MNRE dataset, reaching an 82.95% F1 score on the test set.

Pages: 14

50 entries in total

[31] PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models
Prakash, Nirmalendu
Wang, Han
Hoang, Nguyen Khoi
Hee, Ming Shan
Lee, Roy Ka-Wei
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 621 - 631
[32] Do Multimodal Large Language Models and Humans Ground Language Similarly?
Jones, Cameron R.
Bergen, Benjamin
Trott, Sean
COMPUTATIONAL LINGUISTICS, 2024, 50 (04) : 1415 - 1440
[33] ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models
Yang, Jackie Junrui
Shi, Yingtian
Zhang, Yuhan
Li, Karina
Rosli, Daniel Wan
Jain, Anisha
Zhang, Shuning
Li, Tianshi
Landay, James A.
Lam, Monica S.
PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2024), 2024
[34] SEED-Bench: Benchmarking Multimodal Large Language Models
Li, Bohao
Ge, Yuying
Ge, Yixiao
Wang, Guangzhi
Wang, Rui
Zhang, Ruimao
Shi, Ying
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13299 - 13308
[35] VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jain, Jitesh
Yang, Jianwei
Shi, Humphrey
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27992 - 28002
[36] Large Language and Emerging Multimodal Foundation Models: Boundless Opportunities
Forghani, Reza
RADIOLOGY, 2024, 313 (01)
[37] Large Language and Multimodal Models Don't Come Cheap
Anderson, Margo
Perry, Tekla S.
IEEE SPECTRUM, 2023, 60 (07) : 13 - 13
[38] Large Language Models in Rheumatologic Diagnosis: A Multimodal Performance Analysis
Omar, Mahmud
Agbareia, Reem
Klang, Eyal
Naffaa, Mohammaed E.
JOURNAL OF RHEUMATOLOGY, 2025, 52 (02) : 187 - 188
[39] Multimodal large language models for inclusive collaboration learning tasks
Lewis, Armanda
NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2022, : 202 - 210
[40] Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
Zhang, Yichi
Dong, Yinpeng
Zhang, Siyuan
Min, Tianzan
Su, Hang
Zhu, Jun
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26552 - 26562
