Low-Dimensionality Information Extraction Model for Semi-structured Documents

Authors
Belhadj, Djedjiga [1]
Belaid, Abdel [1]
Belaid, Yolande [1]
Affiliations
[1] Univ Lorraine LORIA, Campus Sci, F-54500 Vandoeuvre Les Nancy, France
Keywords
Word embedding; Multimodal features; Fusion function
DOI
10.1007/978-3-031-44237-7_8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most recent systems for information extraction (IE) from documents are regarded as complex models because of the large number of parameters they require and the resulting high memory footprint. In this paper, we propose a low-complexity model that extracts information from semi-structured documents (SSDs). We focus on improving the model's input modelling, which yields low memory consumption and better performance. The SSD is modelled as a graph to exploit its content and layout properties. A Multi-layer Graph Attention network (Multi-GAT) classifier built on the SSD graph is then used to predict the text entities. To eliminate unknown word embeddings in this kind of document, we provide a simple and efficient method for fusing pre-trained sub-word embeddings that does not require any additional parameters. Our strategy for combining the multimodal features of text, layout, and image consists of concatenating the outputs of two Dense layers applied to the word embedding, position encoding, and image embedding. Additionally, the graph adjacency matrix is built so as to limit the graph dimension and enhance the classifier's performance. Together, these techniques improve the performance of our model while reducing its complexity and input dimensionality. Our model is evaluated on two artificial invoice datasets as well as one real dataset (SROIE). On the latter, we obtained an F1 score of 98.22%.
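The two input-modelling ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion function or which inputs feed which Dense layer, so the choices here are assumptions made only for illustration: sub-word vectors are fused by a plain average (which adds no trainable parameters), one Dense layer takes the word embedding together with the position encoding, and the other takes the image embedding. All names, dimensions, and values (`fuse_subword_embeddings`, `dense`, the toy embedding table) are hypothetical and are not taken from the paper.

```python
import numpy as np


def fuse_subword_embeddings(pieces, table, dim):
    """Fuse pre-trained sub-word vectors into one word vector.

    Assumption: the fusion is a plain mean, chosen because it needs
    no extra parameters; the paper's actual function may differ.
    """
    vectors = [table[p] for p in pieces if p in table]
    if not vectors:
        # No known sub-word at all: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def dense(x, weights, bias):
    """A single Dense layer with ReLU activation."""
    return np.maximum(0.0, x @ weights + bias)


rng = np.random.default_rng(0)
dim = 4

# Toy pre-trained sub-word table (hypothetical values).
table = {
    "inv": np.array([1.0, 0.0, 2.0, 0.0]),
    "##oice": np.array([3.0, 2.0, 0.0, 0.0]),
}
word_vec = fuse_subword_embeddings(["inv", "##oice"], table, dim)

# Hypothetical normalised bounding box used as the position encoding.
position = np.array([0.1, 0.2, 0.5, 0.25])
# Random stand-in for an image embedding of the word region.
image_vec = rng.normal(size=8)

# Assumed grouping: text + position in one layer, image in the other.
w_text, b_text = rng.normal(size=(dim + 4, 6)), np.zeros(6)
w_img, b_img = rng.normal(size=(8, 6)), np.zeros(6)

text_branch = dense(np.concatenate([word_vec, position]), w_text, b_text)
image_branch = dense(image_vec, w_img, b_img)

# Node feature = concatenation of the two Dense outputs.
node_features = np.concatenate([text_branch, image_branch])
```

Under these assumptions each graph node carries a 12-dimensional feature vector, and a word missing from the vocabulary still receives an embedding as long as at least one of its sub-word pieces is known.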
Pages: 76-85
Number of pages: 10