Using Augmented Small Multimodal Models to Guide Large Language Models for Multimodal Relation Extraction

Cited by: 5
Authors
He, Wentao [1 ]
Ma, Hanjie [1 ]
Li, Shaohua [1 ]
Dong, Hui [2 ]
Zhang, Haixiang [1 ]
Feng, Jie [1 ]
Affiliations
[1] Zhejiang Sci Tech Univ, Sch Comp Sci & Technol, Hangzhou 310018, Peoples R China
[2] Hangzhou Codvis Technol Co Ltd, Hangzhou 311100, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 22
Keywords
multimodal relation extraction; small multimodal guidance; multimodal relation data augmentation; flexible threshold loss; large language model
DOI
10.3390/app132212208
CLC Number
O6 [Chemistry]
Discipline Code
0703
Abstract
Multimodal Relation Extraction (MRE) is a core task for constructing Multimodal Knowledge Graphs (MKGs). Most current research fine-tunes small-scale single-modal pre-trained image and text models, but we find that image-text datasets drawn from online media suffer from data scarcity, simplistic text, and abstract image content, which demands substantial external knowledge for supplementation and reasoning. We use Multimodal Relation Data Augmentation (MRDA) to address the data scarcity problem in MRE, and propose a Flexible Threshold Loss (FTL) to handle the imbalanced entity-pair distribution and long-tailed classes. After obtaining prompt information from the small model, which serves as a guide, we employ a Large Language Model (LLM) as a knowledge engine to supply common sense and reasoning ability. Notably, both stages of our framework are flexibly replaceable: the first stage adapts to multimodal classification tasks suited to small models, and the second stage can be swapped for more powerful LLMs. In experiments, our EMRE2llm framework achieves state-of-the-art performance on the challenging MNRE dataset, reaching an 82.95% F1 score on the test set.
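The abstract names a Flexible Threshold Loss (FTL) for imbalanced entity pairs and long-tailed relation classes but does not give its formula. Below is a minimal, hypothetical sketch (not the paper's actual definition) of a threshold-based loss in the adaptive-thresholding style: a learnable threshold class TH is added to the label set, every gold relation must score above TH, and TH must score above every non-gold relation.

```python
# Hypothetical sketch of a threshold-based loss for long-tailed multi-label
# relation classification. The paper's actual FTL may differ; this follows
# the common adaptive-thresholding formulation as an illustration.
import torch
import torch.nn.functional as F

def threshold_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (batch, 1 + R) raw scores, where index 0 is the threshold
    class TH and indices 1..R are relation types.
    labels: (batch, 1 + R) multi-hot gold relations; labels[:, 0] is 0."""
    th_mask = torch.zeros_like(labels)
    th_mask[:, 0] = 1.0  # one-hot marker for the TH class

    # Positive part: each gold relation must outrank TH. Softmax is taken
    # over {gold relations, TH} only; everything else is masked to -inf.
    pos_mask = labels + th_mask
    pos_logits = logits.masked_fill(pos_mask == 0, float("-inf"))
    loss_pos = -(F.log_softmax(pos_logits, dim=-1) * labels).sum(dim=-1)

    # Negative part: TH must outrank every non-gold relation. Softmax is
    # taken over {non-gold relations, TH}; gold relations are masked out.
    neg_mask = 1.0 - labels
    neg_logits = logits.masked_fill(neg_mask == 0, float("-inf"))
    loss_neg = -(F.log_softmax(neg_logits, dim=-1) * th_mask).sum(dim=-1)

    return (loss_pos + loss_neg).mean()

# Usage with made-up shapes (the number of relation types R is dataset-specific):
R = 23
logits = torch.randn(4, 1 + R)
labels = torch.zeros(4, 1 + R)
labels[0, 5] = 1.0  # sample 0 has one gold relation
loss = threshold_loss(logits, labels)
```

Because the per-instance threshold is learned rather than fixed at a global cutoff, frequent and rare relation types are not forced through the same decision boundary, which is one plausible reading of "flexible threshold" for long-tailed classes.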
Pages: 14