Learning from similarity and information extraction from structured documents

Cited by: 0

Authors:

Holecek, Martin [1]

Affiliations:

[1] Charles Univ Prague, Fac Math & Phys, Dept Numer Math, Prague, Czech Republic

Source:

INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION | 2021

Keywords:

One-shot learning; Information extraction; Siamese networks; Similarity; Attention;

DOI:

10.1007/s10032-021-00175-3

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates aids companies working with business documents, because lowering reliance on costly and error-prone human work significantly improves revenue. Neural networks have been applied to this area before, but so far they have been trained only on relatively small datasets of hundreds of documents. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset with more than 25,000 documents. We expand on our previous work, in which we proved that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve the micro F-1 score of per-word classification on this large real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves information extraction. The experiments confirm that all proposed architecture parts (Siamese networks, employing class information, a query-answer attention module, and skip connections to a similar page) are required to beat the previous results. The best model yields an 8.25% gain in the F-1 score over the previous state-of-the-art results. Qualitative analysis verifies that the new model performs better for all target classes. Additionally, multiple structural observations about the causes of the underperformance of some architectures are revealed. The techniques used in this work are not problem-specific and can be generalized to other tasks and contexts.
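
The abstract reports its results as micro F-1 of per-word classification. The following is a minimal sketch of how such a score could be computed, assuming each word carries a (possibly empty) set of target-class labels and that true positives, false positives, and false negatives are pooled over all words and classes before F-1 is computed. The function name `micro_f1` and the example labels are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence, Set


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F-1 over per-word label sets.

    gold[i] and pred[i] are the true and predicted class labels of word i.
    Counts are pooled across all words and classes (micro averaging).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of words")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed

    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Hypothetical four-word page with illustrative field names.
gold = [{"invoice_number"}, set(), {"total"}, {"date"}]
pred = [{"invoice_number"}, {"total"}, {"total"}, set()]
score = micro_f1(gold, pred)
print(f"micro F-1 = {score:.4f}")
```

Because counts are pooled, frequent classes weigh more heavily in this score than rare ones, which is the usual distinction between micro and macro averaging.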


Pages: 17
