Wikipedia HTML Structure Analysis for Ontology Construction

Cited by: 1
Authors
Zarrad, Rim [1]
Doggaz, Narjes [2]
Zagrouba, Ezzedine [3]
Affiliations
[1] Univ Manouba, Higher Inst Documentat, Lab LIMTIC, Ariana, Tunisia
[2] Univ Tunis El Manar, Fac Sci Tunisia, Lab LIPAH, Tunis, Tunisia
[3] Univ Tunis El Manar, Higher Inst Comp Sci, Lab LIMTIC, Tunis, Tunisia
Source
KNOWLEDGE ORGANIZATION | 2018 / Vol. 45 / No. 02
Keywords
taxonomic relations; concepts; extracted semantic relations; Wikipedia; ontology construction
DOI
10.5771/0943-7444-2018-2-108
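Chinese Library Classification number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501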
Abstract
Previously, the main problem of information extraction was to gather enough data. Today, the challenge is not to collect data but to interpret and represent them in order to deduce information. Ontologies are considered suitable solutions for organizing information. The classic methods for ontology construction from textual documents rely on natural language analysis and are generally based on statistical or linguistic approaches. However, these approaches do not consider the document structure, which provides additional knowledge. In fact, the structural organization of documents also conveys meaning. In this context, new approaches focus on document structure analysis to extract knowledge. This paper describes a methodology for ontology construction from web data, especially from Wikipedia articles. It focuses mainly on document structure in order to extract the main concepts and their relations. The proposed methods not only extract taxonomic and non-taxonomic relations but also provide the labels describing the non-taxonomic relations. The extraction of non-taxonomic relations is established by analyzing the title hierarchy in each document. Pattern matching is also applied in order to extract known semantic relations. We also propose to refine the extracted relations in order to keep only those that are relevant. The refinement process is performed by applying the transitive property, checking the nature of the relations, and analyzing taxonomic relations having inverted arguments. Experiments have been performed on French Wikipedia articles related to the medical field. The ontology is evaluated by comparing it to gold standards.
Pages: 108-124
Page count: 17