Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

Cited by: 0
Authors
Li, Qiwei [1]
Li, Zuchao [1]
Cai, Xiantao [1]
Du, Bo [1]
Zhao, Hai [2]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Keywords
Document Understanding; Information Extraction; Graph Structure; Layout Analysis
DOI
10.1145/3581783.3612327
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, multi-modal pre-trained Transformers have led to significant advances in visually-rich document understanding. However, existing models have focused mainly on text and vision features while neglecting the layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that models the layout structure graph to inject document layout knowledge into the model. GraphLayoutLM uses a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model captures the spatial arrangement of text elements, improving document comprehension. We evaluate our model on several benchmarks, including FUNSD, XFUND and CORD, and it achieves state-of-the-art results on these datasets. Our experimental results demonstrate that the proposed method provides a significant improvement over existing approaches and show the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.
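The idea of adjusting a text sequence according to a layout graph can be illustrated with a small sketch. This is not the GraphLayoutLM algorithm, whose details are not given in this record. It is a minimal illustration that assumes the layout is a tree of text nodes with bounding boxes, and that siblings are read top-to-bottom, then left-to-right. The `TextNode` class, the `reorder` function, and the toy coordinates are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TextNode:
    """A text segment with a bounding box (x0, y0, x1, y1); hypothetical structure."""
    text: str
    box: tuple
    children: list = field(default_factory=list)


def reorder(node):
    """Depth-first traversal of the layout tree.

    Siblings are visited top-to-bottom, then left-to-right (an assumed
    reading order), so each parent is immediately followed by its subtree.
    """
    ordered = [node]
    for child in sorted(node.children, key=lambda n: (n.box[1], n.box[0])):
        ordered.extend(reorder(child))
    return ordered


# Toy form: each key node ("Date:", "Total:") has its value as a child.
header = TextNode("Invoice", (10, 10, 200, 40))
date = TextNode("Date:", (10, 50, 60, 70), [TextNode("2023-01-01", (70, 50, 160, 70))])
total = TextNode("Total:", (10, 100, 60, 120), [TextNode("$42", (70, 100, 110, 120))])

# Children are given in scrambled order, as an OCR engine might emit them.
root = TextNode("<doc>", (0, 0, 210, 130), [total, header, date])

# Drop the artificial root node and keep only the text.
sequence = [n.text for n in reorder(root)][1:]
print(sequence)
```

In this toy case each value ends up directly after its key in the output sequence, which is the kind of layout-consistent ordering a sequence model could then consume.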
Pages: 4513-4523
Page count: 11