Logical structure analysis: From HTML to XML

Cited by: 2
Authors
Lee, Min-Hyung [1]
Kim, Yeon-Seok [1]
Lee, Kyong-Ho [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seoul 120749, South Korea
Keywords
logical structure analysis; XML; information extraction; Web document analysis
DOI
10.1016/j.csi.2006.02.001
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents. (c) 2006 Elsevier B.V. All rights reserved.
Pages: 109-124
Number of pages: 16