WebView: A tool for retrieving internal structures and extracting information from HTML documents

Cited by: 4
Authors
Lim, SJ [1]
Ng, YK [1]
Affiliations
[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
DOI
10.1109/DASFAA.1999.765738
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
HTML [11, 12] is a well-accepted and widely used language for creating platform-independent documents to be posted on the Web, and HTML documents are semistructured in nature according to the HTML specification. We propose a tool, called WebView, which constructs the semistructured data graph (SDG) of an HTML document H to capture the internal structure of data embedded in H and in its (in)directly linked documents. On top of the SDG, WebView provides query processing capability for evaluating Set-like queries that are posted against the SDG, i.e., the source document(s), for extracting information from the SDG. Existing methods for extracting structured information from certain HTML documents with static internal structure, such as wrappers and integrators for data warehousing, can benefit from WebView.
Pages: 71 - 80
Page count: 10
Related papers
50 in total
  • [1] Extracting structures of HTML documents
    Lim, SJ
    Ng, YK
    [J]. TWELFTH INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING (ICOIN-12), PROCEEDINGS, 1998, : 420 - 426
  • [2] Categorizing and extracting information from multilingual HTML documents
    Lim, SJ
    Ng, YK
    [J]. 9TH INTERNATIONAL DATABASE ENGINEERING & APPLICATION SYMPOSIUM, PROCEEDINGS, 2005, : 415 - 422
  • [3] Extracting logical structures from HTML tables
    Kim, Yeon-Seok
    Lee, Kyong-Ho
    [J]. COMPUTER STANDARDS & INTERFACES, 2008, 30 (05) : 296 - 308
  • [4] Extracting structures of HTML documents using a high-level stack machine
    Lim, SJ
    Ng, YK
    [J]. INFORMATION NETWORKING IN ASIA, 2001, 3 : 177 - 188
  • [5] Layout based information extraction from HTML documents
    Burget, Radek
    [J]. ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 624 - 628
  • [6] Reusing of Information Constructed in HTML Documents: A Conversion of HTML into OWL
    Hwangbo, Hoon
    Lee, Hongchul
    [J]. 2008 INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS, VOLS 1-4, 2008, : 769 - 773
  • [7] Extracting Logical Hierarchical Structure of HTML Documents Based on Headings
    Manabe, Tomohiro
    Tajima, Keishi
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 8 (12) : 1606 - 1617
  • [8] Automatic discovery of semantic structures in HTML documents
    Mukherjee, S
    Yang, GZ
    Tan, WF
    Ramakrishnan, IV
    [J]. SEVENTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2003, : 245 - 249
  • [9] A case-based recognition of semantic structures in HTML documents - An automated transformation from HTML to XML
    Umehara, M
    Iwanuma, K
    Nabeshima, H
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002, 2002, 2412 : 141 - 147
  • [10] Employing clustering techniques for automatic information extraction from HTML documents
    Ashraf, Fatima
    Oezyer, Tansel
    Alhajj, Reda
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS, 2008, 38 (05) : 660 - 673