ONTOLOGY-BASED INFORMATION EXTRACTION FROM PDF DOCUMENTS WITH XONTO

Cited by: 4
Authors
Oro, Ermelinda [1]
Ruffolo, Massimo [2]
Sacca, Domenico [1]
Affiliations
[1] Univ Calabria, Dept Elect Comp & Syst Sci, I-87036 Arcavacata Di Rende, CS, Italy
[2] Italian Natl Res Council, High Performance Comp & Networking Inst, I-87036 Arcavacata Di Rende, CS, Italy
Keywords
Ontology-based information extraction; knowledge representation and reasoning; ontology; semantics; logic programming; attribute grammar; augmented transition network; PDF document; SYSTEM;
DOI
10.1142/S0218213009000354
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Information extraction is of paramount importance in several real-world applications in the areas of business, competitive and military intelligence, because it makes it possible to acquire information contained in unstructured documents and store it in structured form. Unstructured documents have different internal encodings; one of the most widespread is the visualization-oriented Adobe Portable Document Format (PDF). Although several sophisticated and indeed complex approaches have been proposed, they are still limited in many respects. In particular, existing information extraction systems cannot be applied to PDF documents because of their completely unstructured nature, which poses many issues in defining IE approaches. In this paper the novel ontology-based system named XONTO, which allows the semantic extraction of information from PDF documents, is presented. The XONTO system is founded on the idea of self-describing ontologies, in which objects and classes can be equipped with a set of rules named descriptors. These rules represent patterns that make it possible to automatically recognize and extract ontology objects contained in PDF documents, even when information is arranged in tabular form. In this way a self-describing ontology expresses both the semantics of the information to extract and the rules that, in turn, populate the ontology itself. In the paper, the behavior and structure of the XONTO system are sketched by means of a running example.
Pages: 673-695
Page count: 23
Related papers
  • [1] XONTO: An Ontology-based System for Semantic Information Extraction from PDF Documents
    Oro, Ermelinda
    Ruffolo, Massimo
    20TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, VOL 1, PROCEEDINGS, 2008, : 118 - +
  • [2] Towards a System for Ontology-Based Information Extraction from PDF Documents
    Oro, Ermelinda
    Ruffolo, Massimo
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS: OTM 2008, PT II, PROCEEDINGS, 2008, 5332 : 1482 - 1499
  • [3] Ontology-Based Hazard Information Extraction from Chinese Food Complaint Documents
    Yang, Xiquan
    Gao, Rui
    Han, Zhengfu
    Sui, Xin
    ADVANCES IN SWARM INTELLIGENCE, ICSI 2012, PT II, 2012, 7332 : 155 - 163
  • [4] Automatic ontology-based knowledge extraction from web documents
    Alani, H
    Kim, S
    Millard, DE
    Weal, MJ
    Hall, W
    Lewis, PH
    Shadbolt, NR
    IEEE INTELLIGENT SYSTEMS, 2003, 18 (01) : 14 - 21
  • [5] Ontology-Based Information Retrieval for Historical Documents
    Ramli, Fatihah
    Noah, Shahrul Azman
    Kurniawan, Tri Basuki
    2016 THIRD INTERNATIONAL CONFERENCE ON INFORMATION RETRIEVAL AND KNOWLEDGE MANAGEMENT (CAMP), 2016, : 55 - 59
  • [6] Ontology-Based Information Extraction from Spanish Forum
    Pena, Willy
    Melgar, Andres
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2015), PT I, 2015, 9329 : 351 - 360
  • [7] Ontology-Based Web Information Extraction
    Mo, Qian
    Chen, Yi-hong
    COMMUNICATIONS AND INFORMATION PROCESSING, PT 1, 2012, 288 : 118 - 126
  • [8] Ontology-based information retrieval and extraction
    Lee, CY
    Soo, VW
    ITRE 2005: 3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: RESEARCH AND EDUCATION, PROCEEDINGS, 2005, : 265 - 269
  • [9] An ontology-based index to retrieve documents with geographic information
    Luaces, Miguel R.
    Parama, Jose R.
    Pedreira, Oscar
    Seco, Diego
    SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, PROCEEDINGS, 2008, 5069 : 384 - 400
  • [10] Ontology-based information extraction from the World Wide Web
    Korst, Jan
    Geleijnse, Gijs
    de Jong, Nick
    Verschoor, Michael
    INTELLIGENT ALGORITHMS IN AMBIENT AND BIOMEDICAL COMPUTING, 2006, 7 : 149 - +