LiLTv2: Language-substitutable Layout-image Transformer for Visual Information Extraction
Authors:
Wang, Jiapeng [1]
Lin, Zening [1]
Huang, Dayi [2]
Xiong, Longfei [2]
Jin, Lianwen [3]
Affiliations:
[1] South China Univ Technol, Dept Elect & Informat, Guangzhou, Peoples R China
[2] Kingsoft Off, Zhuhai, Peoples R China
[3] South China Univ Technol, Guangzhou, Peoples R China
Keywords:
Visual Information Extraction;
Multi-Modal Document Understanding;
Self-Supervised Pre-Training
DOI:
10.1145/3708351
CLC number:
TP [Automation and Computer Technology]
Discipline code:
0812
Abstract:
Visual Information Extraction (VIE) has seen substantial growth and heightened interest due to its pivotal role in intelligent document processing. However, most existing pre-trained models can only process data in a fixed language or set of languages, often just English, which is a distinct limitation. To address this, we present the Language-substitutable Layout-image Transformer (LiLTv2). It can be pre-trained just once on mono-lingual documents and then collaborate with off-the-shelf textual models in other languages during fine-tuning. First, LiLTv2 adopts a new dual-stream architecture: one stream handles the substitutable text information, while the other handles the layout and image information. Second, LiLTv2 improves both the optimization strategy and the diversity of tasks adopted in the pre-training stage. Finally, we propose SegKD, a teacher-student knowledge distillation scheme built on segment-level multi-modal features. Extensive experiments on widely used benchmarks demonstrate the superior effectiveness of our method.
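The abstract names two mechanisms that a short sketch can make concrete. The first is the dual-stream design. The following PyTorch sketch assumes LiLTv2 retains the cross-stream attention-score sharing of its predecessor LiLT (Wang et al., ACL 2022), where the two streams exchange pre-softmax attention scores while keeping separate value paths; every name here (DualStreamAttention, dim_text, dim_layout, etc.) is hypothetical and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamAttention(nn.Module):
    """One attention layer with a text stream and a layout/image stream
    that share pre-softmax attention scores (hypothetical sketch)."""

    def __init__(self, dim_text=768, dim_layout=192, num_heads=12):
        super().__init__()
        assert dim_text % num_heads == 0 and dim_layout % num_heads == 0
        self.num_heads = num_heads
        self.hd_text = dim_text // num_heads
        self.hd_layout = dim_layout // num_heads
        # Independent Q/K/V projections per stream.
        self.qkv_text = nn.Linear(dim_text, dim_text * 3)
        self.qkv_layout = nn.Linear(dim_layout, dim_layout * 3)
        self.out_text = nn.Linear(dim_text, dim_text)
        self.out_layout = nn.Linear(dim_layout, dim_layout)

    def _scores(self, x, qkv, head_dim):
        b, n, _ = x.shape
        q, k, v = qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.num_heads, head_dim).transpose(1, 2)
                   for t in (q, k, v))
        return q @ k.transpose(-2, -1) / head_dim ** 0.5, v

    def forward(self, text, layout):
        s_t, v_t = self._scores(text, self.qkv_text, self.hd_text)
        s_l, v_l = self._scores(layout, self.qkv_layout, self.hd_layout)
        # Each stream attends with the SUM of both score maps, so layout and
        # image cues steer text attention and vice versa, while the value
        # paths stay separate -- this is what keeps the text stream swappable
        # for an off-the-shelf textual model in another language.
        a_t = F.softmax(s_t + s_l, dim=-1)
        a_l = F.softmax(s_l + s_t, dim=-1)
        b, n = text.shape[0], text.shape[1]
        out_t = (a_t @ v_t).transpose(1, 2).reshape(b, n, -1)
        out_l = (a_l @ v_l).transpose(1, 2).reshape(b, n, -1)
        return self.out_text(out_t), self.out_layout(out_l)
```

The second mechanism is SegKD. The abstract only says it distills segment-level multi-modal features from teacher to student, so the sketch below is one plausible reading: pool token features into per-segment features (e.g., one OCR text line per segment) and match the student's pooled features to a frozen teacher's. Both function names are invented; if student and teacher widths differ, a learned projection would be needed before the loss.

```python
def mean_pool_segments(feats, segment_ids, num_segments):
    """Average token features within each segment.
    feats: (B, N, D); segment_ids: (B, N) int64 in [0, num_segments)."""
    b, n, d = feats.shape
    sums = feats.new_zeros(b, num_segments, d)
    sums.scatter_add_(1, segment_ids.unsqueeze(-1).expand(-1, -1, d), feats)
    counts = feats.new_zeros(b, num_segments)
    counts.scatter_add_(1, segment_ids, feats.new_ones(b, n))
    return sums / counts.clamp(min=1).unsqueeze(-1)

def segkd_loss(student_tokens, teacher_tokens, segment_ids, num_segments):
    """Match student segment features to frozen-teacher segment features."""
    s = mean_pool_segments(student_tokens, segment_ids, num_segments)
    t = mean_pool_segments(teacher_tokens, segment_ids, num_segments)
    return F.mse_loss(s, t.detach())
```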