Information extraction from the structured part of office documents

Cited by: 4
Authors
Hao, XL [1]
Wang, JTL [1]
Ng, PA [1]
Affiliation
[1] NEW JERSEY INST TECHNOL,INST INTEGRATED SYST RES,DEPT INFORMAT & COMP SCI,NEWARK,NJ 07102
Funding
U.S. National Science Foundation
DOI
10.1016/0020-0255(96)00037-0
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach that analyzes a document to form its layout and conceptual structures and then extracts information from the structured part of the document. We represent a document's layout structure as an ordered labeled tree, built by nested segmentation, and its conceptual structure as a set of attribute-type pairs. Layout similarities between the document to be processed and the samples are identified with an approximate tree matching method. Conceptual similarities are identified by analyzing document and sample contents and by measuring the degree of conceptual closeness. Finally, information is extracted by instantiating the attributes specified in the conceptual structure, based on the result of the document structure analysis.
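To make the idea of comparing layout structures as ordered labeled trees more concrete, the following is a minimal sketch of a unit-cost tree edit distance between a sample layout and a document layout. It is not the approximate tree matching method used in the paper, whose details the abstract does not give; it is a generic textbook recursion chosen for brevity. The block labels (`page`, `header`, `date`, `from`, `to`, `body`) and the helper names `tree`, `forest_distance`, and `tree_distance` are assumptions made for illustration only.

```python
from functools import lru_cache

# Hypothetical representation: a tree is (label, children), where children
# is a tuple of trees. Tuples are hashable, so results can be memoized.
def tree(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:
        # Delete the root of the rightmost tree in f; its children move up.
        _, kids = f[-1]
        return 1 + forest_distance(f[:-1] + kids, g)
    if not f:
        # Insert the root of the rightmost tree in g.
        _, kids = g[-1]
        return 1 + forest_distance(f, g[:-1] + kids)
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = 1 + forest_distance(f[:-1] + f_kids, g)
    insert = 1 + forest_distance(f, g[:-1] + g_kids)
    # Match the two rightmost roots (cost 1 if their labels differ), then
    # compare their children and the remaining forests independently.
    match = ((f_label != g_label)
             + forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1]))
    return min(delete, insert, match)

def tree_distance(a, b):
    return forest_distance((a,), (b,))

# Illustrative layouts only: a sample memo and a document with one extra block.
sample = tree("page", tree("header", tree("date"), tree("to")), tree("body"))
document = tree("page",
                tree("header", tree("date"), tree("from"), tree("to")),
                tree("body"))

print(tree_distance(sample, document))
```

Under these assumptions the two layouts differ by a single inserted block, so the distance is 1. In a sample-based setting, a document could be compared against each stored sample this way and assigned to the sample with the smallest distance.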
Pages: 245 - 274
Number of pages: 30
Related papers
50 in total
  • [31] Information Extraction from Handwritten Tables in Historical Documents
    Andres, Jose
    Ramon Prieto, Jose
    Granell, Emilio
    Romero, Veronica
    Andreu Sanchez, Joan
    Vidal, Enrique
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 184 - 198
  • [32] A knowledge-based information extraction system for semi-structured labeled documents
    Yang, JY
    Oh, H
    Doh, KG
    Choi, J
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002, 2002, 2412 : 105 - 110
  • [33] Typed structured documents for information retrieval
    Dharap, C
    Bowman, CM
    PRINCIPLES OF DOCUMENT PROCESSING, 1997, 1293 : 135 - 151
  • [34] A spatially-aware algorithm for location extraction from structured documents
    Sharma, Praval
    Samal, Ashok
    Soh, Leen-Kiat
    Joshi, Deepti
    GEOINFORMATICA, 2023, 27 (04) : 645 - 679
  • [35] A spatially-aware algorithm for location extraction from structured documents
    Praval Sharma
    Ashok Samal
    Leen-Kiat Soh
    Deepti Joshi
    GeoInformatica, 2023, 27 : 645 - 679
  • [36] Recognition techniques for extracting information from semi-structured documents
    Della Ventura, A
    Gagliardi, I
    Zonta, B
    DOCUMENT RECOGNITION AND RETRIEVAL VIII, 2001, 4307 : 130 - 137
  • [37] Structured Information Extraction from Medical Texts in Bulgarian
    Boytcheva, Svetla
    CYBERNETICS AND INFORMATION TECHNOLOGIES, 2012, 12 (04) : 52 - 65
  • [38] Automatic Extraction of Structured Information from Drug Descriptions
    Slavescu, Radu Razvan
    Masca, Constantin
    Slavescu, Kinga Cristina
    MINING INTELLIGENCE AND KNOWLEDGE EXPLORATION, MIKE 2018, 2018, 11308 : 21 - 31
  • [39] Layout based information extraction from HTML documents
    Burget, Radek
    ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 624 - 628
  • [40] Information extraction from free-text business documents
    Abramowicz, W
    Piskorski, J
    ISSUES AND TRENDS OF INFORMATION TECHNOLOGY MANAGEMENT IN CONTEMPORARY ORGANIZATIONS, VOLS 1 AND 2, 2002, : 626 - 630