Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Cited by: 0
Authors
Li, Teng [1]
Wang, Jiapeng [1]
Jin, Lianwen [1]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Information Extraction; Large Language Model; Instruction Tuning
DOI
10.1007/978-981-97-8511-7_20
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, leveraging large language models (LLMs) for visually rich document information extraction has made significant progress. Previous studies have simplified visual information extraction into a document visual question answering task, in which each question-answer turn yields a single entity and serves mainly to validate the document understanding capabilities of LLMs. However, such methods incur substantial computational cost when multiple entities must be extracted from a single document, a scenario common in practical document digitization. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch. We also design a layout-aware and task-specific instruction set. To further enhance the model's proficiency in learning document layout information, we first augment the tokenizer's vocabulary; the entire model is then fine-tuned to adapt to the expanded vocabulary and to extract document layout features effectively. By harnessing the language comprehension capabilities of LLMs, our model can perform comprehensive entity extraction for an entire document in a single pass. Benefiting from the generative nature of large language models, a single model can accomplish multiple downstream visual information extraction tasks. Experimental results demonstrate consistent improvements over the baseline model across a range of document visual information extraction tasks.
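The abstract describes augmenting the tokenizer's vocabulary before full-model fine-tuning so that document layout information can be represented. As a rough illustration of that general pattern (not the paper's actual implementation), the sketch below adds hypothetical discretized coordinate tokens to a HuggingFace-style tokenizer and resizes the model's embedding matrix; the base model name, the "<loc_i>" token format, and the 1,000-bucket granularity are all assumptions.

# Minimal sketch: extend a tokenizer with discretized layout-coordinate
# tokens, then resize the LLM's embedding matrix so the new rows can be
# learned during full-model fine-tuning. The "<loc_i>" token format, the
# 1,000-bucket granularity, and the base model are illustrative assumptions,
# not the paper's actual design.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# One token per normalized coordinate bucket (x or y scaled to [0, 1000)).
layout_tokens = [f"<loc_{i}>" for i in range(1000)]
num_added = tokenizer.add_tokens(layout_tokens, special_tokens=True)

# Newly added embedding rows are randomly initialized; the subsequent
# fine-tuning stage adapts the whole model to the expanded vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} layout tokens; vocabulary size is now {len(tokenizer)}")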
Pages: 276-289
Number of pages: 14