Learning from similarity and information extraction from structured documents

Cited by: 0
Author
Holecek, Martin [1]
Affiliation
[1] Charles Univ Prague, Fac Math & Phys, Dept Numer Math, Prague, Czech Republic
Keywords
One-shot learning; Information extraction; Siamese networks; Similarity; Attention;
DOI
10.1007/s10032-021-00175-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates aids companies working with business documents, because lowering reliance on costly and error-prone human work significantly improves revenue. Neural networks have been applied to this area before, but so far they have been trained only on relatively small datasets with hundreds of documents. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset with more than 25,000 documents. We expand on our previous work, in which we showed that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve the micro F-1 score of per-word classification on this large real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves information extraction. The experiments confirm that all proposed architecture parts (Siamese networks, employing class information, the query-answer attention module, and skip connections to a similar page) are required to beat the previous results. The best model yields an 8.25% gain in the F-1 score over the previous state-of-the-art results. Qualitative analysis verifies that the new model performs better for all target classes. Additionally, multiple structural observations about the causes of the underperformance of some architectures are revealed, since none of the techniques used in this work is problem-specific and all can be generalized to other tasks and contexts.
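The abstract names micro F-1 of per-word classification as the target metric. The record does not describe the paper's evaluation protocol, so the following is only a minimal sketch of the standard micro-averaged F-1 definition, assuming each word carries a (possibly empty) set of class labels. The function name `micro_f1` and the class names `invoice_id`, `total`, and `date` are hypothetical and chosen for illustration.

```python
from typing import Sequence, Set


def micro_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F-1 over per-word label sets.

    True positives, false positives, and false negatives are pooled over
    all words and all classes before precision and recall are computed.
    """
    tp = fp = fn = 0
    for gold, pred in zip(y_true, y_pred):
        tp += len(gold & pred)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but not in the gold set
        fn += len(gold - pred)   # gold labels that were missed
    if tp == 0:
        return 0.0               # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy page with four words; an empty set means "no target class".
gold = [{"invoice_id"}, set(), {"total"}, {"date"}]
pred = [{"invoice_id"}, {"total"}, {"total"}, set()]
score = micro_f1(gold, pred)
```

Because counts are pooled before averaging, frequent classes weigh more than rare ones, which is one reason a per-class breakdown is usually reported alongside a micro-averaged score.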
Pages: 17
Related papers
50 records in total
  • [1] Learning from similarity and information extraction from structured documents
    Holecek, Martin
    [J]. INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2021, 24 (03) : 149 - 165
  • [2] Learning from similarity and information extraction from structured documents
    Martin Holeček
    [J]. International Journal on Document Analysis and Recognition (IJDAR), 2021, 24 : 149 - 165
  • [3] Information extraction from the structured part of office documents
    Hao, XL
    Wang, JTL
    Ng, PA
    [J]. INFORMATION SCIENCES, 1996, 91 (3-4) : 245 - 274
  • [4] Information extraction from semi-structured web documents
    Yun, Bo-Hyun
    Seo, Chang-Ho
    [J]. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, 2006, 4092 : 586 - 598
  • [5] Extraction of chemical information from documents
    Villar, Hugo O.
    Betancort, Juan
    Hansen, Mark R.
    [J]. ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2010, 240
  • [6] Information Extraction from Legal Documents
    Cheng, Tin Tin
    Cua, Jeffrey Leonard
    Tan, Mark Davies
    Yao, Kenneth Gerard
    Roxas, Rachel Edita
    [J]. 2009 EIGHTH INTERNATIONAL SYMPOSIUM ON NATURAL LANGUAGE PROCESSING, PROCEEDINGS, 2009, : 157 - +
  • [7] Automatic Information Extraction from Electronic Documents Using Machine Learning
    Kamaleson, Nishanthan
    Chu, Dominique
    Otero, Fernando E. B.
    [J]. ARTIFICIAL INTELLIGENCE XXXVIII, 2021, 13101 : 183 - 194
  • [8] Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents
    Nguyen Hong Son
    Hieu M Vu
    Tuan-Anh D Nguyen
    Minh-Tien Nguyen
    [J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [9] Information extraction from structured documents using k-testable tree automaton inference
    Kosala, Raymond
    Blockeel, Hendrik
    Bruynooghe, Maurice
    Van den Bussche, Jan
    [J]. DATA & KNOWLEDGE ENGINEERING, 2006, 58 (02) : 129 - 158
  • [10] Information Extraction from Arabic Law Documents
    Abu Shamma, Samah
    Ayasa, Aseel
    Sleem, Wala'
    Yahya, Adnan
    [J]. 2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020), 2020,