Information extraction from the structured part of office documents

Cited by: 4
Authors
Hao, XL [1]
Wang, JTL [1]
Ng, PA [1]
Affiliation
[1] NEW JERSEY INST TECHNOL,INST INTEGRATED SYST RES,DEPT INFORMAT & COMP SCI,NEWARK,NJ 07102
Funding
U.S. National Science Foundation
DOI
10.1016/0020-0255(96)00037-0
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach to analyzing a document to form its layout and conceptual structures, and then extracting information from the structured part of the office documents. We represent a document's layout structure as an ordered labeled tree structure using nested segmentation, and its conceptual structure as a set of attribute type pairs. The layout similarities between the document to be processed and samples are identified by employing an approximate tree matching method. The conceptual similarities are identified by analyzing document and sample contents, and by measuring the degree of conceptual closeness. Finally, the information is extracted by instantiating the attributes specified in the conceptual structure based on the result of document structure analysis.
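As a rough illustration of the layout-matching step described in the abstract, the sketch below represents a layout as an ordered labeled tree and picks the sample with the smallest tree edit distance. It uses the generic unit-cost edit distance for ordered trees, which is an assumption and not necessarily the approximate tree matching method used in the paper. The node labels (`page`, `header`, `body`, `line`), the sample names (`memo`, `letter`), and all function names are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.

def node(label, *children):
    """Build an ordered labeled tree node (hypothetical helper)."""
    return (label, tuple(children))

def size(forest):
    """Count all nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)          # insert every remaining node
    if not f2:
        return size(f1)          # delete every remaining node
    (a, c1), (b, c2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children move up one level.
    delete = forest_distance(f1[:-1] + c1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + c2) + 1
    # Map the two rightmost roots onto each other (relabel if they differ).
    match = (forest_distance(f1[:-1], f2[:-1])
             + forest_distance(c1, c2)
             + (a != b))
    return min(delete, insert, match)

def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def closest_sample(document, samples):
    """Return the name of the sample whose layout is nearest to the document."""
    return min(samples, key=lambda name: tree_distance(document, samples[name]))

# Hypothetical layouts produced by some segmentation step.
doc = node("page",
           node("header", node("line")),
           node("body", node("line"), node("line")))
samples = {
    "memo": node("page",
                 node("header", node("line"), node("line")),
                 node("body", node("line"))),
    "letter": node("page",
                   node("header", node("line")),
                   node("body", node("line"), node("line"), node("line")),
                   node("footer", node("line"))),
}

for name, sample in samples.items():
    print(name, tree_distance(doc, sample))
print("closest:", closest_sample(doc, samples))
```

In a fuller pipeline, the attributes attached to the best-matching sample would then guide which fields to instantiate from the document; that step is omitted here because the abstract does not specify it in detail.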
Pages: 245 - 274
Page count: 30
Related papers
  • [1] Learning from similarity and information extraction from structured documents
    Holecek, Martin
    INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2021, 24 (03) : 149 - 165
  • [2] Information extraction from semi-structured web documents
    Yun, Bo-Hyun
    Seo, Chang-Ho
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, 2006, 4092 : 586 - 598
  • [3] Metadata extraction from office documents
    Stumbo, WK
    Handley, JC
    Archiving 2005, Final Program and Proceedings, 2005, : 184 - 187
  • [4] DocTr: Document Transformer for Structured Information Extraction in Documents
    Liao, Haofu
    RoyChowdhury, Aruni
    Li, Weijian
    Bansal, Ankan
    Zhang, Yuting
    Tu, Zhuowen
    Satzoda, Ravi Kumar
    Manmatha, R.
    Mahadevan, Vijay
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 19527 - 19537
  • [5] AN ALGEBRA FOR STRUCTURED OFFICE DOCUMENTS
    GUTING, RH
    ZICARI, R
    CHOY, DM
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1989, 7 (02) : 123 - 157
  • [6] Structured Information Extraction Technology for Official Documents Based on LTP
    Su, Beirong
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MODELING, NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING, CMNM 2024, 2024, : 115 - 121
  • [7] Extraction of chemical information from documents
    Villar, Hugo O.
    Betancort, Juan
    Hansen, Mark R.
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2010, 240
  • [8] Information Extraction from Legal Documents
    Cheng, Tin Tin
    Cua, Jeffrey Leonard
    Tan, Mark Davies
    Yao, Kenneth Gerard
    Roxas, Rachel Edita
    2009 EIGHTH INTERNATIONAL SYMPOSIUM ON NATURAL LANGUAGE PROCESSING, PROCEEDINGS, 2009, : 157 - +