Low-Dimensionality Information Extraction Model for Semi-structured Documents

Cited by: 0
Authors
Belhadj, Djedjiga [1]
Belaid, Abdel [1]
Belaid, Yolande [1]
Affiliations
[1] Univ Lorraine LORIA, Campus Sci, F-54500 Vandoeuvre Les Nancy, France
Keywords
Word embedding; Multimodal features; Fusion function;
DOI
10.1007/978-3-031-44237-7_8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most recent information extraction (IE) systems for documents are regarded as complex models because of their large number of parameters and the resulting high memory footprint. In this paper, we propose a low-complexity model that extracts information from semi-structured documents (SSDs). We focus on improving the model's input modelling, which provides low memory consumption and better performance. The SSD is modelled as a graph to benefit from its content and layout properties. A Multi-layer Graph Attention network (Multi-GAT) classifier built on the SSD graph is then used to predict the text entities. To eliminate unknown word embeddings in this kind of document, we provide a simple and efficient method for fusing pre-trained sub-word embeddings that does not require any additional parameters. Our strategy for combining the multi-modal features of text, layout, and image consists of concatenating the outputs of two Dense layers applied to the word embedding, position encoding and image embedding. Additionally, the graph adjacency matrix is built so as to limit the graph dimension and enhance classifier performance. All of these techniques improve the performance of our model while reducing its complexity and input dimensionality. Our model is evaluated on two artificial invoice datasets as well as one real dataset (SROIE). For the latter, we obtained an F1 score of 98.22%.
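The two fusion steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract leaves several details open, so the following are assumptions made purely for illustration: the parameter-free sub-word fusion is taken to be an element-wise mean, one Dense layer is applied to the word embedding concatenated with the position encoding and the other to the image embedding, and the activation is ReLU. All function and parameter names (`fuse_subwords`, `dense`, `node_features`, `w_tl`, `w_img`, and so on) are hypothetical.

```python
import numpy as np


def fuse_subwords(subword_vectors):
    """Fuse pre-trained sub-word vectors into one word vector.

    Assumption: the fusion is an element-wise mean, which adds no
    trainable parameters. The abstract does not name the function used.
    """
    return np.mean(np.stack(subword_vectors), axis=0)


def dense(x, weights, bias):
    """A single Dense layer with ReLU activation (activation is assumed)."""
    return np.maximum(x @ weights + bias, 0.0)


def node_features(word_vec, pos_enc, img_emb, params):
    """Build one graph-node feature vector from text, layout and image.

    Assumption: one Dense layer sees [word embedding ; position encoding],
    the other sees the image embedding; their outputs are concatenated.
    """
    text_layout = dense(np.concatenate([word_vec, pos_enc]),
                        params["w_tl"], params["b_tl"])
    visual = dense(img_emb, params["w_img"], params["b_img"])
    return np.concatenate([text_layout, visual])


# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
word_dim, pos_dim, img_dim, hidden = 8, 4, 6, 5
params = {
    "w_tl": rng.normal(size=(word_dim + pos_dim, hidden)),
    "b_tl": np.zeros(hidden),
    "w_img": rng.normal(size=(img_dim, hidden)),
    "b_img": np.zeros(hidden),
}

# An out-of-vocabulary word split into three sub-words.
subwords = [rng.normal(size=word_dim) for _ in range(3)]
word_vec = fuse_subwords(subwords)
features = node_features(word_vec,
                         rng.normal(size=pos_dim),
                         rng.normal(size=img_dim),
                         params)
```

With these toy sizes each node ends up with a vector of length `2 * hidden`, regardless of how many sub-words the token was split into, which is the sense in which such a fusion keeps the input dimensionality low.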
Pages: 76-85
Page count: 10
Related papers
50 in total
  • [21] Unsupervised Extraction of Product Information from Semi-structured Sources
    Walther, Maximilian
    13TH IEEE INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND INFORMATICS (CINTI 2012), 2012, : 257 - 262
  • [22] Chinese resume information extraction based on semi-structured text
    Wentan, Yan
    Yupeng, Qiao
    Chinese Control Conference, CCC, 2017, : 11177 - 11182
  • [23] Bootstrapping Information Extraction from Semi-structured Web Pages
    Carlson, Andrew
    Schafer, Charles
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PART I, PROCEEDINGS, 2008, 5211 : 195 - +
  • [24] Spatial Dependency Parsing for Semi-Structured Document Information Extraction
    Hwang, Wonseok
    Yim, Jinyeong
    Park, Seunghyun
    Yang, Sohee
    Seo, Minjoon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 330 - 343
  • [25] Chinese resume information extraction based on semi-structured text
    Yan Wentan
    Qiao Yupeng
    PROCEEDINGS OF THE 36TH CHINESE CONTROL CONFERENCE (CCC 2017), 2017, : 11177 - 11182
  • [26] Learning Information Extraction Rules for Semi-Structured and Free Text
    Stephen Soderland
    Machine Learning, 1999, 34 : 233 - 272
  • [27] Information Extraction of Strategic Activities based on Semi-structured Text
    Ma, Xubu
    Guo, Ju-E
    Ma, Xubu
    2014 SEVENTH INTERNATIONAL JOINT CONFERENCE ON COMPUTATIONAL SCIENCES AND OPTIMIZATION (CSO), 2014, : 579 - 583
  • [28] Header metadata extraction from semi-structured documents using template matching
    Huang, Zewu
    Jin, Hai
    Yuan, Pingpeng
    Han, Zongfen
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2006: OTM 2006 WORKSHOPS, PT 2, PROCEEDINGS, 2006, 4278 : 1776 - +
  • [29] Advancing the terminological classification of semi-structured documents
    Stratogiannis, Georgios
    Siolas, Georgios
    Stamou, Georgios
    Stafylopatis, Andreas
    Chortaras, Alexandros
    Tagaris, Athanasios
    2015 IEEE 27TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2015), 2015, : 333 - 339
  • [30] Partial retrieval of compressed semi-structured documents
    Gupta, Ashutosh
    Agarwal, Suneeta
    INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS IN TECHNOLOGY, 2010, 38 (04) : 239 - 249