Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

Cited by: 0

Authors:

Li, Qiwei [1]

Li, Zuchao [1]

Cai, Xiantao [1]

Du, Bo [1]

Zhao, Hai [2]

Affiliations:

[1] Wuhan Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China

[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Keywords:

Document Understanding; Information Extraction; Graph Structure; Layout Analysis;

DOI:

10.1145/3581783.3612327

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In recent years, multi-modal pre-trained Transformers have driven significant advances in visually-rich document understanding. However, existing models have focused mainly on text and visual features while neglecting the layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that models the layout structure graph to inject document layout knowledge into the model. GraphLayoutLM uses a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model captures the spatial arrangement of text elements, improving document comprehension. We evaluate our model on several benchmarks, including FUNSD, XFUND, and CORD, and it achieves state-of-the-art results on these datasets. Our experimental results demonstrate that the proposed method provides a significant improvement over existing approaches and highlight the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.
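
The abstract does not specify how the graph reordering algorithm works, so the following is only an illustrative sketch of the general idea of reordering text by layout structure. It assumes a hypothetical `TextNode` type with a bounding box and child links, and it orders segments by a depth-first traversal with siblings sorted top-to-bottom, then left-to-right. None of these names or choices come from the paper; they are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextNode:
    """Hypothetical text segment: box is (x0, y0, x1, y1); children are layout descendants."""
    text: str
    box: tuple
    children: list = field(default_factory=list)


def reorder(roots):
    """Return segment texts in depth-first order.

    Siblings are sorted top-to-bottom, then left-to-right (an assumed
    convention, not one taken from the paper).
    """
    ordered = []

    def visit(node):
        ordered.append(node.text)
        for child in sorted(node.children, key=lambda n: (n.box[1], n.box[0])):
            visit(child)

    for root in sorted(roots, key=lambda n: (n.box[1], n.box[0])):
        visit(root)
    return ordered


# Hypothetical two-column form: a plain top-to-bottom scan would interleave
# the two columns, while the layout-driven order keeps each key next to its value.
name = TextNode("Name:", (10, 10, 60, 20), [TextNode("Alice", (10, 30, 60, 40))])
date = TextNode("Date:", (200, 10, 250, 20), [TextNode("2023-01-01", (200, 30, 280, 40))])
print(reorder([date, name]))  # ['Name:', 'Alice', 'Date:', '2023-01-01']
```

In this toy case the structure-driven order keeps "Name:" adjacent to "Alice", which is the kind of sequence adjustment the abstract attributes to graph reordering.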

Pages: 4513-4523

Page count: 11
