Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning

Authors
Li, Teng [1]
Wang, Jiapeng [1]
Jin, Lianwen [1]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Information Extraction; Large Language Model; Instruction Tuning
DOI
10.1007/978-981-97-8511-7_20
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recently, leveraging large language models (LLMs) for information extraction from visually rich documents has made significant progress. Previous studies have simplified visual information extraction into a document visual question answering task, in which each question-answer round yields a single entity, as a means of validating the document understanding capabilities of LLMs. However, these methods face significant challenges in computational efficiency and cost when multiple entities must be extracted from a single document, a scenario that is common in practical document digitization. This paper builds upon a large language model and incorporates document layout information through a document layout modeling branch. We also design a layout-aware, task-specific instruction set. To further enhance the model's proficiency in learning document layout information, we first augment the tokenizer's vocabulary and then fine-tune the entire model so that it adapts to the expanded vocabulary and extracts document layout features effectively. By harnessing the language comprehension capabilities of LLMs, our model can extract all entities from an entire document in a single pass. Because the model is generative, a single model can handle multiple downstream visual information extraction tasks. Our experimental results demonstrate consistent improvement over the baseline model across a range of document visual information extraction tasks.
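
As a rough illustration of the ideas named in the abstract (an expanded vocabulary for layout, and one instruction that requests every entity at once), the following is a minimal sketch and not the authors' implementation. The abstract does not specify the token format, the instruction wording, or the layout branch, so the `<x_i>`/`<y_i>` token scheme, the bin count, and the function names `layout_tokens`, `new_vocab_entries`, and `build_prompt` are all assumptions made for this example.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in pixels; this box convention is an assumption.
Box = Tuple[int, int, int, int]


def layout_tokens(box: Box, page_w: int, page_h: int, bins: int = 1000) -> List[str]:
    """Quantize a pixel box into four discrete layout tokens (hypothetical format)."""
    x0, y0, x1, y1 = box

    def q(value: int, size: int) -> int:
        # Scale to [0, bins) and clamp so page-edge coordinates stay in range.
        return min(bins - 1, max(0, int(value * bins / size)))

    return [
        f"<x_{q(x0, page_w)}>",
        f"<y_{q(y0, page_h)}>",
        f"<x_{q(x1, page_w)}>",
        f"<y_{q(y1, page_h)}>",
    ]


def new_vocab_entries(bins: int = 1000) -> List[str]:
    """List the extra tokens a tokenizer would need for this hypothetical scheme."""
    return [f"<x_{i}>" for i in range(bins)] + [f"<y_{i}>" for i in range(bins)]


def build_prompt(words: List[str], boxes: List[Box], page_w: int, page_h: int,
                 fields: List[str]) -> str:
    """Build one instruction that asks for all target fields in a single pass."""
    instruction = (
        "Extract all of the following fields in one pass and answer as JSON: "
        + ", ".join(fields)
    )
    # Each OCR word is prefixed with its quantized layout tokens.
    lines = [
        "".join(layout_tokens(box, page_w, page_h)) + " " + word
        for word, box in zip(words, boxes)
    ]
    return instruction + "\n" + "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        words=["Total", "42.00"],
        boxes=[(0, 0, 100, 50), (900, 450, 1000, 500)],
        page_w=1000,
        page_h=500,
        fields=["total"],
    )
    print(prompt)
```

In a real pipeline the strings returned by `new_vocab_entries` would be registered with the model's tokenizer and the embedding table enlarged before fine-tuning; that step depends on the specific model library and is left out here.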
Pages: 276-289
Number of pages: 14