GeoLayoutLM: Geometric Pre-training for Visual Information Extraction

Cited by: 22

Authors:

Luo, Chuwei [1]

Cheng, Changxu [1]

Zheng, Qi [1]

Yao, Cong [1]

Affiliations:

[1] DAMO Academy, Alibaba Group, Hangzhou, People's Republic of China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

DOI:

10.1109/CVPR52729.2023.00685

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual information extraction (VIE) plays an important role in Document Intelligence. It is generally divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most existing models learn the geometric representation only implicitly, which has been found insufficient for RE, a task for which geometric information is especially crucial. Moreover, we reveal that another factor limiting RE performance is the objective gap between the pre-training phase and the fine-tuning phase for RE. To tackle these issues, we propose a multi-modal framework for VIE named GeoLayoutLM. GeoLayoutLM explicitly models geometric relations in pre-training, which we call geometric pre-training. Geometric pre-training is achieved through three specially designed geometry-related pre-training tasks. Additionally, novel relation heads, which are pre-trained by the geometric pre-training tasks and fine-tuned for RE, are carefully designed to enrich and enhance the feature representation. In extensive experiments on standard VIE benchmarks, GeoLayoutLM achieves highly competitive scores on the SER task and significantly outperforms previous state-of-the-art methods on RE (e.g., the F1 score of RE on FUNSD is boosted from 80.35% to 89.45%).

Citation

Pages: 7092-7101

Page count: 10
