Logical structure analysis: From HTML to XML

Cited by: 2
Authors
Lee, Min-Hyung [1]
Kim, Yeon-Seok [1]
Lee, Kyong-Ho [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seoul 120749, South Korea
Keywords
logical structure analysis; XML; information extraction; Web document analysis
DOI
10.1016/j.csi.2006.02.001
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents. © 2006 Elsevier B.V. All rights reserved.
Pages: 109-124
Page count: 16
Related papers
50 in total
  • [41] A comparison of SGML, HTML, and XML
    袁琳
    李秉严
    四川图书馆学报, 2001, (03) : 34 - 36
  • [42] HTML Violations and Where to Find Them: A Longitudinal Analysis of Specification Violations in HTML
    Hantke, Florian
    Stock, Ben
    PROCEEDINGS OF THE 2022 22ND ACM INTERNET MEASUREMENT CONFERENCE, IMC 2022, 2022, : 358 - 373
  • [43] Research on HTML-to-XML conversion
    钱程
    阳小兰
    计算机与现代化, 2011, (08) : 39 - 41
  • [44] A comparative analysis of XML and HTML
    曹风华
    电脑与信息技术, 2011, 19 (04) : 69 - 71
  • [45] ODBC XML HTML: The ABCs of Acronyms
    Cole, David
    Presstime, 2003, 25 (09) : 56 - 57
  • [46] Advanced user profile agent using structure analysis of HTML document
    Kwak, JH
    Kim, K
    Lee, CH
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL I AND II, 1999, : 319 - 323
  • [47] Rec.HTML: Declarative HTML
    Reynders, Bob
    Choi, Kwanghoon
    COMPANION PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON THE ART, SCIENCE, AND ENGINEERING OF PROGRAMMING (PROGRAMMING 2021 COMPANION), 2021, : 1 - 5
  • [48] Probabilistic model for structured document mapping application to automatic HTML to XML conversion
    Wisniewski, Guillaume
    Maes, Francis
    Denoyer, Ludovic
    Gallinari, Patrick
    MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION, PROCEEDINGS, 2007, 4571 : 854 - +
  • [49] CONCEPTS EXTRACTION BASED ON HTML DOCUMENTS STRUCTURE
    Zarrad, Rim
    Doggaz, Narjes
    Zagrouba, Ezzeddine
    ICAART: PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE, VOL 1, 2012, : 503 - 506
  • [50] WebVigiL: User profile-based change detection for HTML/XML documents
    Pandrangi, N
    Jacob, J
    Sanka, A
    Chakravarthy, S
    NEW HORIZONS IN INFORMATION MANAGEMENT, 2003, 2712 : 38 - 57