Logical structure analysis: From HTML to XML

Cited by: 2
Authors
Lee, Min-Hyung [1]
Kim, Yeon-Seok [1]
Lee, Kyong-Ho [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seoul 120749, South Korea
Keywords
logical structure analysis; XML; information extraction; Web document analysis
DOI
10.1016/j.csi.2006.02.001
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents. (c) 2006 Elsevier B.V. All rights reserved.
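The three phases named in the abstract can be illustrated with a toy pipeline. This is only a minimal sketch under strong assumptions, not the authors' algorithm: HTML block tags stand in for visual cues, and `DOCUMENT_MODEL`, `BlockCollector`, and the output element names (`document`, `section`, `heading`, `paragraph`) are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: these tags approximate visually distinct blocks.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}

# Hypothetical document model: maps a visual cue to a logical element name.
DOCUMENT_MODEL = {"h1": "title", "h2": "heading", "h3": "heading",
                  "p": "paragraph", "li": "paragraph"}


class BlockCollector(HTMLParser):
    """Stand-in for visual grouping: collect (tag, text) blocks in order."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Normalize whitespace and skip empty blocks.
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append((tag, text))
            self._tag = None


def visual_grouping(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def identify_elements(blocks, model):
    # Label each block with the logical element the model assigns to it.
    return [(model[tag], text) for tag, text in blocks]


def logical_grouping(elements):
    # Nest paragraphs under the most recent heading to form sections.
    root = ET.Element("document")
    section = None
    for name, text in elements:
        if name == "title":
            ET.SubElement(root, "title").text = text
        elif name == "heading":
            section = ET.SubElement(root, "section")
            ET.SubElement(section, "heading").text = text
        else:
            parent = section if section is not None else root
            ET.SubElement(parent, name).text = text
    return root


def html_to_xml(html):
    elements = identify_elements(visual_grouping(html), DOCUMENT_MODEL)
    return ET.tostring(logical_grouping(elements), encoding="unicode")
```

Calling `html_to_xml` on a small page with one `h1`, one `h2`, and a few `p` blocks yields a `document` element whose paragraphs are nested under the section opened by the preceding heading. A real document model would be specific to a document class, as the abstract describes.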
Pages: 109-124
Page count: 16
Related papers
50 in total
  • [21] Lurching toward Babel: HTML, CSS, and XML
    Korpela, J
    COMPUTER, 1998, 31 (07) : 103 - +
  • [22] Research on content reuse of HTML based on XML
    Li, QS
    Chen, P
    COMPUTER SCIENCE AND TECHNOLOGY IN NEW CENTURY, 2001, : 521 - 525
  • [23] A typed representation for HTML and XML documents in Haskell
    Thiemann, P
    JOURNAL OF FUNCTIONAL PROGRAMMING, 2002, 12 (4-5) : 435 - 468
  • [24] Using XML metadata to enable the automatic generation and processing of HTML FORMS from XML documents
    Dubey, AK
    Chueh, HC
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2001, : 894 - 894
  • [25] Wikipedia HTML Structure Analysis for Ontology Construction
    Zarrad, Rim
    Doggaz, Narjes
    Zagrouba, Ezzedine
    KNOWLEDGE ORGANIZATION, 2018, 45 (02) : 108 - 124
  • [26] Automatic translation of HTML laws and regulations into an XML repository
    Psaila, G
    Brugali, D
    ISAS/CITSA 2004: International Conference on Cybernetics and Information Technologies, Systems and Applications and 10th International Conference on Information Systems Analysis and Synthesis, Vol 1, Proceedings: COMMUNICATIONS, INFORMATION TECHNOLOGIES AND COMPUTING, 2004, : 252 - 256
  • [27] Multipurpose Web publishing using HTML, XML, and CSS
    Lie, HW
    Saarela, J
    COMMUNICATIONS OF THE ACM, 1999, 42 (10) : 95 - 101
  • [28] A heuristic approach for converting HTML documents to XML documents
    Lim, SJ
    Ng, YK
    COMPUTATIONAL LOGIC - CL 2000, 2000, 1861 : 1182 - 1196
  • [29] TC-GXML A Transcoder for HTML to XML Grammar
    Singh, Raghuraj
    Verma, Prabhat
    Singh, Avinash Kumar
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA STORAGE AND DATA ENGINEERING (DSDE 2010), 2010, : 34 - 38
  • [30] The SGML FAQ book: Understanding the foundation of HTML and XML
    Lunemann, RS
    TECHNICAL COMMUNICATION, 1998, 45 (03) : 408 - 409