Low-Dimensionality Information Extraction Model for Semi-structured Documents

Cited by: 0
Authors
Belhadj, Djedjiga [1]
Belaïd, Abdel [1]
Belaïd, Yolande [1]
Affiliation
[1] Univ Lorraine LORIA, Campus Sci, F-54500 Vandoeuvre Les Nancy, France
Keywords
Word embedding; Multimodal features; Fusion function
DOI
10.1007/978-3-031-44237-7_8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most recent systems for information extraction (IE) from documents are complex models: they need a large number of parameters and therefore have a high memory footprint. In this paper, we propose a low-complexity model that extracts information from semi-structured documents (SSDs). We focus on improving the model's input modelling to achieve low memory consumption and better performance. The SSD is modelled as a graph to exploit its content and layout properties. A Multi-layer Graph Attention network (Multi-GAT) classifier built on the SSD graph is then used to predict the text entities. To eliminate unknown word embeddings in this kind of document, we provide a simple and efficient method for fusing pre-trained sub-word embeddings that requires no additional parameters. Our strategy for combining the multimodal features of text, layout, and image consists of concatenating the outputs of two Dense layers applied to the word embedding, position encoding, and image embedding. Additionally, the graph adjacency matrix is built so as to limit the graph dimension and enhance the classifier's performance. Together, these techniques improve the performance of our model while reducing its complexity and input dimensionality. Our model is evaluated on two artificial invoice datasets as well as one real dataset (SROIE). On the latter, we obtained an F1 score of 98.22%.
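The two input-modelling ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the fusion function, so an element-wise mean is assumed here as one parameter-free choice. The wiring of the two Dense layers (one on the word embedding, one on the concatenated position and image vectors) is also an assumption, and every function name, dimension, and weight below is hypothetical.

```python
import random


def fuse_subwords(subword_vectors):
    """Parameter-free fusion (assumed: element-wise mean of sub-word vectors)."""
    count = len(subword_vectors)
    return [sum(column) / count for column in zip(*subword_vectors)]


def dense(inputs, weights, bias):
    """Dense layer with ReLU; `weights` holds one row per output unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def fuse_modalities(word_vec, pos_vec, img_vec, text_layer, visual_layer):
    """Concatenate the outputs of two Dense layers (hypothetical wiring)."""
    text_out = dense(word_vec, *text_layer)
    # Assumption: position and image vectors share the second Dense layer.
    visual_out = dense(pos_vec + img_vec, *visual_layer)
    return text_out + visual_out


def random_layer(rng, n_in, n_out):
    """Untrained placeholder weights, for illustration only."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)

# Two sub-word pieces of an out-of-vocabulary token, 4 dimensions each.
subwords = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
word_vec = fuse_subwords(subwords)

pos_vec = [0.1, 0.2, 0.3, 0.4]  # toy position encoding
img_vec = [0.5, 0.6]            # toy image embedding

text_layer = random_layer(rng, 4, 3)
visual_layer = random_layer(rng, 6, 3)
node_feature = fuse_modalities(word_vec, pos_vec, img_vec, text_layer, visual_layer)

print(word_vec)           # [2.0, 2.0, 2.0, 2.0]
print(len(node_feature))  # 6
```

Under these assumptions, the fused word vector keeps the dimensionality of a single sub-word embedding, and the node feature passed to the graph classifier has the combined width of the two Dense layers rather than the summed width of all raw inputs.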
Pages: 76-85
Page count: 10
Related papers
  • [1] Low-Dimensionality Information Extraction Model for Semi-structured Documents
    Belhadj, Djedjiga
    Belaïd, Abdel
    Belaïd, Yolande
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, 14184 LNCS : 76 - 85
  • [2] Information extraction from semi-structured web documents
    Yun, Bo-Hyun
    Seo, Chang-Ho
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, 2006, 4092 : 586 - 598
  • [3] Automatic Content Extraction on Semi-Structured Documents
    dos Santos, Jose Eduardo Bastos
    11TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR 2011), 2011 : 1235 - 1239
  • [4] Consideration of the Word's Neighborhood in GATs for Information Extraction in Semi-structured Documents
    Belhadj, Djedjiga
    Belaid, Yolande
    Belaid, Abdel
    DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II, 2021, 12822 : 854 - 869
  • [5] A knowledge-based information extraction system for semi-structured labeled documents
    Yang, JY
    Oh, H
    Doh, KG
    Choi, J
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002, 2002, 2412 : 105 - 110
  • [6] OLERA: OnLine extraction rule analysis for semi-structured documents
    Chang, CH
    Kuo, SC
    PROCEEDINGS OF THE IASTED INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND APPLICATIONS, VOLS 1 AND 2, 2004 : 736 - 742
  • [7] EGA: An algorithm for automatic semi-structured Web documents extraction
    Li, LY
    Tang, SW
    Yang, DQ
    Wang, TJ
    Su, ZH
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, 2004, 2973 : 787 - 798
  • [8] Recognition techniques for extracting information from semi-structured documents
    Della Ventura, A
    Gagliardi, I
    Zonta, B
    DOCUMENT RECOGNITION AND RETRIEVAL VIII, 2001, 4307 : 130 - 137
  • [9] Supplementing domain knowledge to BERT with semi-structured information of documents
    Chen, Jing
    Wei, Zhihua
    Wang, Jiaqi
    Wang, Rui
    Gong, Chuanyang
    Zhang, Hongyun
    Miao, Duoqian
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 235
  • [10] An approach to semantic information retrieval in heterogeneous semi-structured documents
    Mrabet, Yassine
    Bennacer, Nacéra
    Pernelle, Nathalie
    Thiam, Mouhamadou
    CORIA 2010: Actes de la COnference en Recherche d'Information et Applications - Proceedings of the Conference on Information Retrieval and Applications, 2010 : 195 - 210