LiLTv2: Language-substitutable Layout-image Transformer for Visual Information Extraction

Cited by: 0
Authors
Wang, Jiapeng [1 ]
Lin, Zening [1 ]
Huang, Dayi [2 ]
Xiong, Longfei [2 ]
Jin, Lianwen [3 ]
Affiliations
[1] South China Univ Technol, Dept Elect & Informat, Guangzhou, Peoples R China
[2] Kingsoft Off, Zhuhai, Peoples R China
[3] South China Univ Technol, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual Information Extraction; Multi-Modal Document Understanding; Self-Supervised Pre-Training;
DOI
10.1145/3708351
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Visual Information Extraction (VIE) has seen substantial growth and heightened interest due to its pivotal role in intelligent document processing. However, most existing pre-trained models can only process data in a fixed language or set of languages, often just English, which is a distinct limitation. To address this, we present the Language-substitutable Layout-image Transformer (LiLTv2). It can be pre-trained just once on mono-lingual documents and then combined with off-the-shelf textual models in other languages during fine-tuning. First, LiLTv2 adopts a new dual-stream model architecture: one stream for substitutable text information and the other for layout and image information. Second, LiLTv2 improves both the optimization strategy and the set of tasks used in the pre-training stage. Finally, we propose SegKD, a teacher-student knowledge distillation scheme over segment-level multi-modal features. Extensive experiments on widely used benchmarks demonstrate the superior effectiveness of our method.
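The abstract names SegKD, a distillation over segment-level multi-modal features, but this record does not give its formulation. The toy sketch below shows one plausible shape of such a loss: token features are mean-pooled into segment features, then the student is matched to the teacher with an MSE over L2-normalized features. The function names, the pooling choice, and the normalized-MSE objective are all assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def segment_pool(token_feats, segment_ids):
    """Mean-pool per-token features into per-segment features.

    token_feats: (num_tokens, dim) array; segment_ids: one id per token.
    Returns a (num_segments, dim) array, one row per distinct segment id.
    """
    ids = np.asarray(segment_ids)
    segs = sorted(set(segment_ids))
    return np.stack([token_feats[ids == s].mean(axis=0) for s in segs])

def seg_kd_loss(student_feats, teacher_feats):
    """Hypothetical segment-level distillation loss: MSE between
    L2-normalized student and teacher segment features."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    return float(np.mean((l2norm(student_feats) - l2norm(teacher_feats)) ** 2))

# Toy usage: 4 tokens in 2 segments, 3-dim features.
tokens = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
student_segs = segment_pool(tokens, [0, 0, 1, 1])
teacher_segs = student_segs + 0.1  # stand-in for a teacher's features
loss = seg_kd_loss(student_segs, teacher_segs)
```

A matched student and teacher give zero loss; any divergence in segment features raises it, which is the property a feature-distillation objective needs.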
Pages: 27
Related Papers
1 item
  • [1] Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning
    Li, Teng
    Wang, Jiapeng
    Jin, Lianwen
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VII, 2025, 15037: 276-289