Using Augmented Small Multimodal Models to Guide Large Language Models for Multimodal Relation Extraction

Cited by: 5
|
Authors
He, Wentao [1 ]
Ma, Hanjie [1 ]
Li, Shaohua [1 ]
Dong, Hui [2 ]
Zhang, Haixiang [1 ]
Feng, Jie [1 ]
Affiliations
[1] Zhejiang Sci Tech Univ, Sch Comp Sci & Technol, Hangzhou 310018, Peoples R China
[2] Hangzhou Codvis Technol Co Ltd, Hangzhou 311100, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 22
Keywords
multimodal relation extraction; small multimodal guidance; multimodal relation data augmentation; flexible threshold loss; large language model;
DOI
10.3390/app132212208
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703;
Abstract
Multimodal Relation Extraction (MRE) is a core task in constructing Multimodal Knowledge Graphs (MKGs). Most current research fine-tunes small-scale single-modal image and text pre-trained models, but we find that image-text datasets collected from web media suffer from data scarcity, simplistic text, and abstract image content, and therefore require substantial external knowledge for supplementation and reasoning. We use Multimodal Relation Data Augmentation (MRDA) to address the data scarcity problem in MRE, and propose a Flexible Threshold Loss (FTL) to handle the imbalanced entity-pair distribution and long-tailed classes. After obtaining prompt information from the small model, which serves as a guide, we employ a Large Language Model (LLM) as a knowledge engine to supply common-sense knowledge and reasoning ability. Notably, both stages of our framework are flexibly replaceable: the first stage adapts to other multimodal classification tasks for small models, and the second stage can be swapped for more powerful LLMs. In experiments, our EMRE2llm framework achieves state-of-the-art performance on the challenging MNRE dataset, reaching an 82.95% F1 score on the test set.
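The abstract does not spell out the FTL formulation, so the following is only a minimal PyTorch sketch of the threshold-based classification-loss family it appears to belong to (adaptive thresholding in the ATLOP style): a dedicated threshold class must be out-scored by every gold relation and must out-score every negative, which is one standard way to cope with imbalanced entity pairs and long-tailed relation types. All names here (`threshold_loss`, the convention that column 0 is the threshold class) are illustrative assumptions, not the paper's API.

```python
# Sketch of a threshold-based relation-classification loss in the spirit of
# the paper's Flexible Threshold Loss (FTL). The exact FTL formula is not
# given in this abstract; this follows the common adaptive-thresholding idea.
import torch
import torch.nn.functional as F

def threshold_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: [batch, 1 + num_relations]; column 0 is the threshold class
    (an assumption made here). labels: float multi-hot gold relations of the
    same shape, with labels[:, 0] == 0."""
    th_mask = torch.zeros_like(labels)
    th_mask[:, 0] = 1.0

    # Part 1: each gold relation should rank above the threshold class,
    # so softmax only over {gold relations, threshold}.
    pos_logits = logits.masked_fill((1 - labels - th_mask).bool(), -1e30)
    loss_pos = -(F.log_softmax(pos_logits, dim=-1) * labels).sum(-1)

    # Part 2: the threshold class should rank above all negative relations,
    # so softmax only over {negative relations, threshold}.
    neg_logits = logits.masked_fill(labels.bool(), -1e30)
    loss_neg = -(F.log_softmax(neg_logits, dim=-1) * th_mask).sum(-1)

    return (loss_pos + loss_neg).mean()
```

At inference time under this scheme, one would predict every relation whose logit exceeds the threshold-class logit, and output "no relation" for an entity pair when none does; these per-pair decision boundaries are what makes the threshold flexible rather than a fixed global cutoff.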
Pages: 14
Related Papers
50 records in total
  • [31] PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models
    Prakash, Nirmalendu
    Wang, Han
    Hoang, Nguyen Khoi
    Hee, Ming Shan
    Lee, Roy Ka-Wei
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 621 - 631
  • [32] Do Multimodal Large Language Models and Humans Ground Language Similarly?
    Jones, Cameron R.
    Bergen, Benjamin
    Trott, Sean
    COMPUTATIONAL LINGUISTICS, 2024, 50 (04) : 1415 - 1440
  • [33] ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models
    Yang, Jackie Junrui
    Shi, Yingtian
    Zhang, Yuhan
    Li, Karina
    Rosli, Daniel Wan
    Jain, Anisha
    Zhang, Shuning
    Li, Tianshi
    Landay, James A.
    Lam, Monica S.
    PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2024), 2024,
  • [34] SEED-Bench: Benchmarking Multimodal Large Language Models
    Li, Bohao
    Ge, Yuying
    Ge, Yixiao
    Wang, Guangzhi
    Wang, Rui
    Zhang, Ruimao
    Shan, Ying
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13299 - 13308
  • [35] VCoder: Versatile Vision Encoders for Multimodal Large Language Models
    Jain, Jitesh
    Yang, Jianwei
    Shi, Humphrey
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27992 - 28002
  • [36] Large Language and Emerging Multimodal Foundation Models: Boundless Opportunities
    Forghani, Reza
    RADIOLOGY, 2024, 313 (01)
  • [37] Large Language and Multimodal Models Don't Come Cheap
    Anderson, Margo
    Perry, Tekla S.
    IEEE SPECTRUM, 2023, 60 (07) : 13 - 13
  • [38] Large Language Models in Rheumatologic Diagnosis: A Multimodal Performance Analysis
    Omar, Mahmud
    Agbareia, Reem
    Klang, Eyal
    Naffaa, Mohammad E.
    JOURNAL OF RHEUMATOLOGY, 2025, 52 (02) : 187 - 188
  • [39] Multimodal large language models for inclusive collaboration learning tasks
    Lewis, Armanda
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2022, : 202 - 210
  • [40] Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
    Zhang, Yichi
    Dong, Yinpeng
    Zhang, Siyuan
    Min, Tianzan
    Su, Hang
    Zhu, Jun
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26552 - 26562