Heading-based sectional hierarchy identification for HTML documents

Authors
Pembe, F. Canan [1]
Gungor, Tunga [1]
Affiliations
[1] Bogazici Univ, Dept Comp Engn, TR-34342 Istanbul, Turkey
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Most documents on the Web are written in HTML, a format designed primarily for presenting data. As a result, limitations arise when these documents are processed automatically to interpret their content semantically. One such limitation is that HTML does not adequately represent the sectional hierarchy of a document (i.e., its sections and subsections) or the headings in that hierarchy. Obtaining this information automatically is difficult because of the underlying format and the cluttered structure of most Web pages. In this paper, we propose a novel approach to extracting the heading-based sectional hierarchies of HTML documents. This is the first part of a research effort that aims to use this information in automatic summaries to improve the Web search experience of Internet users.
Pages: 75-80
Page count: 6