Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Cited by: 0

Authors:

Li, Teng [1]

Wang, Jiapeng [1]

Jin, Lianwen [1]

Affiliations:

[1] South China University of Technology, Guangzhou, People's Republic of China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VII | 2025 / Vol. 15037

Funding:

National Natural Science Foundation of China;

Keywords:

Visual Information Extraction; Large Language Model; Instruction Tuning;

DOI:

10.1007/978-981-97-8511-7_20

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, leveraging large language models (LLMs) for visually-rich document information extraction has made significant progress. Previous studies have simplified the task of visual information extraction into a document visual question answering task. This task involves a question-answer session that yields a single entity result at a time, serving as a means of validating the document understanding capabilities of LLMs. However, these methods encounter significant challenges in computational efficiency and cost when addressing the document digitization requirements for extracting multiple entities from a single document, a scenario that is common in practical applications of visual information extraction. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch. We also design a layout-aware and task-specific instruction set. To further enhance the model's proficiency in learning document layout information, we first augment the tokenizer's vocabulary. Subsequently, the entire model undergoes fine-tuning to ensure improved adaptability to the expanded vocabulary and effective extraction of document layout features. By harnessing the exceptional language comprehension capabilities of LLMs, our model is capable of executing comprehensive entity extraction for an entire document in a single pass. Benefiting from the characteristics of generative large language models, we can accomplish multiple downstream tasks of visual information extraction using a single model. Our experimental results demonstrate consistent improvement over the baseline model across a range of document visual information extraction tasks.


Pages: 276-289

Number of pages: 14
