Heading-based sectional hierarchy identification for HTML documents

Cited by: 0
Authors
Pembe, F. Canan [1]
Gungor, Tunga [1]
Affiliations
[1] Bogazici Univ, Dept Comp Engn, TR-34342 Istanbul, Turkey
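Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract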
Most documents on the Web are prepared in HTML, a format designed primarily for presenting data. As a result, limitations arise when these documents are processed automatically to interpret their content semantically. One such limitation is that HTML does not adequately represent a document's sectional hierarchy (i.e., its sections and subsections) or the headings within that hierarchy. Obtaining this information automatically is difficult because of the underlying format and the cluttered structure of most Web pages. In this paper, we propose a novel approach to extracting the heading-based sectional hierarchies of HTML documents. This is the first part of a research effort in which we aim to use this information in automatic summaries to improve the Web search experience of Internet users.
Pages: 75-80
Number of pages: 6