Information extraction from the structured part of office documents

Cited by: 4

Authors:

Hao, XL [1]

Wang, JTL [1]

Ng, PA [1]

Affiliation:

[1] NEW JERSEY INST TECHNOL,INST INTEGRATED SYST RES,DEPT INFORMAT & COMP SCI,NEWARK,NJ 07102

Source:

INFORMATION SCIENCES | 1996 / Vol. 91 / Issue 3-4

Funding:

U.S. National Science Foundation

Keywords:

DOI:

10.1016/0020-0255(96)00037-0

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach to analyzing a document to form its layout and conceptual structures, and then extracting information from the structured part of office documents. We represent a document's layout structure as an ordered labeled tree using nested segmentation, and its conceptual structure as a set of attribute-type pairs. The layout similarities between the document to be processed and the samples are identified by employing an approximate tree matching method. The conceptual similarities are identified by analyzing document and sample contents, and by measuring the degree of conceptual closeness. Finally, the information is extracted by instantiating the attributes specified in the conceptual structure based on the result of document structure analysis.
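
As a rough illustration of the layout-matching idea in the abstract, the sketch below models a layout structure as an ordered labeled tree and compares a document against a sample. The abstract does not say which approximate tree matching method the authors use, so this swaps in a simple top-down (Selkow-style) tree edit distance with unit costs. The names `Node`, `size`, `distance`, and the block labels (`page`, `header`, `sender`, `date`, `body`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One block in a layout tree; children are kept in reading order."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance (Selkow-style, unit costs).

    Assumption: this is a stand-in for the paper's unspecified method.
    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its number of nodes.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of turning the first i children of a into the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete a subtree of a
                d[i][j - 1] + size(b.children[j - 1]),   # insert a subtree of b
                d[i - 1][j - 1]
                + distance(a.children[i - 1], b.children[j - 1]),  # match pair
            )
    return relabel + d[m][n]


# Hypothetical sample layout and a document that lacks the "date" block.
sample = Node("page", [Node("header", [Node("sender"), Node("date")]), Node("body")])
doc = Node("page", [Node("header", [Node("sender")]), Node("body")])

print(distance(sample, doc))
```

In a sample-based setting, one plausible use is to compute this distance against every stored sample and keep the closest one, whose conceptual structure then guides which attributes to instantiate. That selection step is an assumption here, not a description of the authors' system.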


Pages: 245-274

Page count: 30
