Information extraction from the structured part of office documents

Cited by: 4
Authors
Hao, XL [1]
Wang, JTL [1]
Ng, PA [1]
Affiliation
[1] NEW JERSEY INST TECHNOL,INST INTEGRATED SYST RES,DEPT INFORMAT & COMP SCI,NEWARK,NJ 07102
Funding
National Science Foundation (USA)
DOI
10.1016/0020-0255(96)00037-0
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach to analyzing a document to form its layout and conceptual structures, and then extracting information from the structured part of the document. We represent a document's layout structure as an ordered labeled tree built by nested segmentation, and its conceptual structure as a set of attribute-type pairs. The layout similarities between the document to be processed and the samples are identified by an approximate tree matching method. The conceptual similarities are identified by analyzing document and sample contents and by measuring the degree of conceptual closeness. Finally, the information is extracted by instantiating the attributes specified in the conceptual structure, based on the result of the document structure analysis.
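
The layout-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract only says "an approximate tree matching method", so the code below substitutes a generic unit-cost edit distance between ordered labeled trees and picks the sample whose layout tree is closest. The helper names (`node`, `tree_distance`, `closest_sample`) and the memo/letter layouts are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Assumption: a layout tree is (label, children), with children a tuple of
# trees in left-to-right order. Labels such as "header" are hypothetical.
def node(label, *children):
    return (label, tuple(children))

def size(forest):
    """Number of nodes in an ordered forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)          # insert every remaining node of g
    if not g:
        return size(f)          # delete every remaining node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(v_kids, w_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)

def tree_distance(a, b):
    return forest_distance((a,), (b,))

def closest_sample(document, samples):
    """Return the name of the sample whose layout tree is nearest."""
    return min(samples, key=lambda name: tree_distance(samples[name], document))

# Hypothetical sample layouts and an incoming document.
samples = {
    "memo": node("doc",
                 node("header", node("to"), node("from"), node("date")),
                 node("body")),
    "letter": node("doc", node("letterhead"), node("salutation"),
                   node("body"), node("closing")),
}
document = node("doc",
                node("header", node("to"), node("from")),
                node("body"),
                node("signature"))

best = closest_sample(document, samples)
```

Here the document differs from the memo sample by one deleted node (`date`) and one inserted node (`signature`), so it is classified as a memo. The memoized recursion is easy to read but slow on large trees; a production system would more likely use a dynamic-programming formulation of the same distance.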
Pages: 245-274
Page count: 30