Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Cited by: 0
Authors
Li, Teng [1 ]
Wang, Jiapeng [1 ]
Jin, Lianwen [1 ]
Affiliations
[1] South China University of Technology, Guangzhou, China
Funding
National Natural Science Foundation of China
Keywords
Visual Information Extraction; Large Language Model; Instruction Tuning
DOI
10.1007/978-981-97-8511-7_20
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, leveraging large language models (LLMs) for information extraction from visually rich documents has made significant progress. Previous studies simplify visual information extraction into document visual question answering, where each question-answer exchange yields a single entity and serves mainly to probe an LLM's document understanding capabilities. However, these methods incur substantial computational cost when document digitization requires extracting many entities from a single document, a scenario common in practical applications of visual information extraction. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch, together with a layout-aware, task-specific instruction set. To strengthen the model's ability to learn document layout, we first augment the tokenizer's vocabulary; the entire model is then fine-tuned so that it adapts to the expanded vocabulary and extracts document layout features effectively. By harnessing the language comprehension capabilities of LLMs, our model performs comprehensive entity extraction over an entire document in a single pass. Because the model is generative, a single model can serve multiple downstream visual information extraction tasks. Experimental results demonstrate consistent improvements over the baseline model across a range of document visual information extraction tasks.
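To make the vocabulary-augmentation step concrete, the sketch below shows how layout tokens might be added to an LLM tokenizer and the embedding matrix resized before full-model fine-tuning. This is a minimal sketch using the Hugging Face transformers API, not the authors' released code: the backbone model name, the `<loc_i>` coordinate-bin token format, and the example instruction are all assumptions for illustration.

```python
# Minimal sketch of layout-token vocabulary augmentation (assumed details;
# not the paper's actual implementation). Uses the Hugging Face transformers API.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # assumed backbone; the paper's base LLM may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical special tokens for quantized bounding-box coordinates (0-999 bins),
# letting the model read per-segment layout alongside the OCR text.
layout_tokens = [f"<loc_{i}>" for i in range(1000)]
num_added = tokenizer.add_tokens(layout_tokens, special_tokens=True)

# Give the new tokens trainable embedding rows; per the abstract, the whole
# model is subsequently fine-tuned to adapt to the expanded vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} layout tokens; vocabulary size is now {len(tokenizer)}")

# An illustrative (assumed) layout-aware instruction for single-pass extraction,
# where each OCR line is followed by its quantized box coordinates:
instruction = (
    "Extract all entities (company, date, total) from the document below.\n"
    "Each line is given as: text <loc_x1> <loc_y1> <loc_x2> <loc_y2>.\n"
    "INVOICE <loc_120> <loc_40> <loc_210> <loc_58>\n"
    "Total: $42.00 <loc_118> <loc_512> <loc_260> <loc_530>"
)
```

Under this (assumed) scheme, a single instruction carries the entire document's text and layout, so the model can emit all requested entities in one generation rather than one question-answer round per entity.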
Pages: 276-289
Page count: 14