Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Cited by: 0
Authors
Li, Teng [1 ]
Wang, Jiapeng [1 ]
Jin, Lianwen [1 ]
Affiliations
[1] South China University of Technology, Guangzhou, China
Funding
National Natural Science Foundation of China
Keywords
Visual Information Extraction; Large Language Model; Instruction Tuning
DOI
10.1007/978-981-97-8511-7_20
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, leveraging large language models (LLMs) for information extraction from visually rich documents has made significant progress. Previous studies simplify visual information extraction into document visual question answering, where each question-answer exchange yields a single entity and serves mainly to probe an LLM's document understanding capabilities. However, these methods incur substantial computational cost when document digitization requires extracting many entities from a single document, a scenario common in practical applications of visual information extraction. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch, together with a layout-aware, task-specific instruction set. To strengthen the model's ability to learn document layout, we first augment the tokenizer's vocabulary; the entire model is then fine-tuned so that it adapts to the expanded vocabulary and extracts document layout features effectively. By harnessing the language comprehension capabilities of LLMs, our model performs comprehensive entity extraction over an entire document in a single pass. Because the model is generative, a single model can serve multiple downstream visual information extraction tasks. Experimental results demonstrate consistent improvements over the baseline model across a range of document visual information extraction tasks.
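To make the vocabulary-augmentation step concrete, the sketch below shows how layout tokens might be added to an LLM tokenizer and the embedding matrix resized before full-model fine-tuning. This is a minimal sketch using the Hugging Face transformers API, not the authors' released code: the backbone model name, the `<loc_i>` coordinate-bin token format, and the example instruction are all assumptions for illustration.

```python
# Minimal sketch of layout-token vocabulary augmentation (assumed details;
# not the paper's actual implementation). Uses the Hugging Face transformers API.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # assumed backbone; the paper's base LLM may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical special tokens for quantized bounding-box coordinates (0-999 bins),
# letting the model read per-segment layout alongside the OCR text.
layout_tokens = [f"<loc_{i}>" for i in range(1000)]
num_added = tokenizer.add_tokens(layout_tokens, special_tokens=True)

# Give the new tokens trainable embedding rows; per the abstract, the whole
# model is subsequently fine-tuned to adapt to the expanded vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} layout tokens; vocabulary size is now {len(tokenizer)}")

# An illustrative (assumed) layout-aware instruction for single-pass extraction,
# where each OCR line is followed by its quantized box coordinates:
instruction = (
    "Extract all entities (company, date, total) from the document below.\n"
    "Each line is given as: text <loc_x1> <loc_y1> <loc_x2> <loc_y2>.\n"
    "INVOICE <loc_120> <loc_40> <loc_210> <loc_58>\n"
    "Total: $42.00 <loc_118> <loc_512> <loc_260> <loc_530>"
)
```

Under this (assumed) scheme, a single instruction carries the entire document's text and layout, so the model can emit all requested entities in one generation rather than one question-answer round per entity.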
Pages: 276-289
Page count: 14