Low-Dimensionality Information Extraction Model for Semi-structured Documents

Cited by: 0
Authors
Belhadj, Djedjiga [1]
Belaid, Abdel [1]
Belaid, Yolande [1]
Affiliations
[1] Univ Lorraine LORIA, Campus Sci, F-54500 Vandoeuvre Les Nancy, France
Keywords
Word embedding; Multimodal features; Fusion function
DOI
10.1007/978-3-031-44237-7_8
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most recent information extraction (IE) systems for documents are regarded as complex models because of the large number of parameters they need and their resulting high memory footprint. In this paper, we propose a low-complexity model that extracts information from semi-structured documents (SSDs). We focus on improving the model's input modelling, which yields lower memory consumption and better performance. The SSD is modelled as a graph to exploit its content and layout properties. A Multi-layer Graph Attention network (Multi-GAT) classifier built on the SSD graph is then used to predict the text entities. To eliminate unknown word embeddings in this kind of document, we provide a simple and efficient method for fusing pre-trained sub-word embeddings that does not require any additional parameters. Our strategy for combining the multimodal features of text, layout, and image consists of concatenating the outputs of two Dense layers applied to the word embedding, position encoding, and image embedding. Additionally, the graph adjacency matrix is built so as to limit the graph dimension and enhance classifier performance. Together, these techniques improve the performance of our model while reducing its complexity and input dimensionality. Our model is evaluated on two artificial invoice datasets and one real dataset (SROIE). On the latter, we obtained an F1 score of 98.22%.
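The two fusion steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the fusion function or which inputs feed which Dense layer, so the sketch assumes a mean over sub-word vectors as the parameter-free fusion, and assumes one Dense layer takes the word embedding with the position encoding while the other takes the image embedding. All function names, dimensions, and weights are hypothetical.

```python
import numpy as np

def fuse_subwords(subword_vectors):
    """Parameter-free fusion (assumed here to be a mean) of the
    pre-trained sub-word vectors that make up one word."""
    return np.mean(np.asarray(subword_vectors, dtype=float), axis=0)

def dense(x, weights, bias):
    """A single Dense layer with ReLU activation."""
    return np.maximum(0.0, x @ weights + bias)

def fuse_modalities(word_vec, pos_vec, img_vec, layer_a, layer_b):
    """Concatenate the outputs of two Dense layers.
    Assumption: layer A sees text + position, layer B sees the image."""
    text_layout = dense(np.concatenate([word_vec, pos_vec]), *layer_a)
    image = dense(img_vec, *layer_b)
    return np.concatenate([text_layout, image])

rng = np.random.default_rng(0)

# A word split into two sub-words, each with a 2-d toy embedding.
word_vec = fuse_subwords([[1.0, 3.0], [3.0, 5.0]])

pos_vec = np.array([0.1, 0.2, 0.3, 0.4])   # toy bounding-box encoding
img_vec = rng.normal(size=8)               # toy image embedding

# Randomly initialised (untrained) weights, for shape illustration only.
layer_a = (rng.normal(size=(6, 3)), np.zeros(3))
layer_b = (rng.normal(size=(8, 3)), np.zeros(3))

node_feature = fuse_modalities(word_vec, pos_vec, img_vec, layer_a, layer_b)
print(word_vec, node_feature.shape)
```

Under these assumptions each graph node receives a 6-dimensional feature vector from 14 raw input dimensions, which shows how Dense projections before concatenation can keep the classifier's input dimensionality low.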
Pages: 76-85
Page count: 10