Information extraction from the structured part of office documents

Cited by: 4

Authors:

Hao, XL [1]

Wang, JTL [1]

Ng, PA [1]

Affiliation:

[1] NEW JERSEY INST TECHNOL,INST INTEGRATED SYST RES,DEPT INFORMAT & COMP SCI,NEWARK,NJ 07102

Source:

INFORMATION SCIENCES | 1996 / Vol. 91 / Issue 3-4

Funding:

U.S. National Science Foundation

Keywords:

DOI:

10.1016/0020-0255(96)00037-0

Chinese Library Classification number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach to analyzing a document to form its layout and conceptual structures, and then extracting information from the structured part of the document. We represent a document's layout structure as an ordered labeled tree using nested segmentation, and its conceptual structure as a set of attribute-type pairs. The layout similarities between the document to be processed and the samples are identified by an approximate tree matching method. The conceptual similarities are identified by analyzing document and sample contents and by measuring the degree of conceptual closeness. Finally, the information is extracted by instantiating the attributes specified in the conceptual structure, based on the result of the document structure analysis.
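
The abstract does not say which approximate tree matching algorithm is used, so the following is only a minimal sketch of the general idea. It assumes a simple top-down edit distance between ordered labeled trees (in the style of Selkow's tree-to-tree editing), with unit costs for relabeling a node and for inserting or deleting each node of a subtree. The `Node` class, the block labels (`page`, `header`, `body`, and so on), and the `closest_sample` helper are hypothetical names introduced for illustration, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A node of an ordered labeled tree (hypothetical layout block)."""
    label: str
    children: tuple = ()


def size(tree: Node) -> int:
    """Number of nodes in the subtree rooted at `tree`."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down edit distance between two ordered labeled trees.

    Assumed costs: 1 to relabel a node, and 1 per node to insert or
    delete a whole subtree. Child sequences are aligned left to right
    with a standard sequence-edit dynamic program.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def closest_sample(document: Node, samples: dict) -> str:
    """Return the name of the sample whose layout tree is nearest."""
    return min(samples, key=lambda name: tree_distance(document, samples[name]))


# Hypothetical layout trees: two samples and one incoming document.
memo = Node("page", (
    Node("header", (Node("logo"), Node("date"))),
    Node("body", (Node("para"), Node("para"))),
))
form = Node("page", (
    Node("title"),
    Node("table", (Node("row"), Node("row"), Node("row"))),
))
incoming = Node("page", (
    Node("header", (Node("date"),)),
    Node("body", (Node("para"), Node("para"))),
    Node("footer"),
))

print(tree_distance(incoming, memo))
print(closest_sample(incoming, {"memo": memo, "form": form}))
```

Under these assumed costs, a smaller distance means a closer layout match, so the incoming document would be analyzed against the nearest sample. A real system would likely weight block types differently and allow more general edit operations.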


Pages: 245-274

Page count: 30
