Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving

Authors
Esser, Daniel [1]
Affiliation
[1] Tech Univ Dresden, Comp Networks Grp, D-01062 Dresden, Germany
Keywords
Document Layout Analysis; Information Extraction; Cooperative Extraction; Few-Exemplar-Learning;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic information extraction from scanned business documents is especially valuable in document management and archiving. Although current solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often avoid such systems because of the configuration effort and the long training periods needed to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that meets the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimum of training documents to reach extraction rates comparable to related work and manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we furthermore aim to minimize the amount of user feedback required and to increase the acceptance of such a system. A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results above 85% F1-measure. Results above 75% F1-measure are already reached with a minimal training set of only one document per template.
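The abstract reports its results as F1-measure, the harmonic mean of precision and recall. As a minimal sketch of how such a score could be computed for one document's extracted index fields, the following assumes exact string matching per field; the field names, the sample values, and the `score_fields` helper are hypothetical and are not taken from the paper, whose evaluation protocol is not described in this record.

```python
def f1_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was extracted correctly."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def score_fields(predicted: dict, expected: dict) -> float:
    """Score extracted fields against ground truth (assumption: exact-match comparison)."""
    # A predicted field counts as correct only if its value equals the expected value.
    tp = sum(1 for name, value in predicted.items() if expected.get(name) == value)
    # Every other predicted field is a false positive.
    fp = len(predicted) - tp
    # Expected fields that were missed or extracted with a wrong value are false negatives.
    fn = sum(1 for name, value in expected.items() if predicted.get(name) != value)
    return f1_measure(tp, fp, fn)


# Hypothetical index fields for a single scanned invoice.
expected = {"date": "2013-01-05", "amount": "100.00", "sender": "ACME", "invoice_no": "42"}
predicted = {"date": "2013-01-05", "amount": "99.00", "sender": "ACME"}

score = score_fields(predicted, expected)
```

Here two of three predicted fields are correct (precision 2/3) and two of four expected fields are recovered (recall 1/2), so the score is 4/7. A corpus-level figure would aggregate the counts over all documents before applying the same formula.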
Pages: 22-31
Page count: 10