WebView: A tool for retrieving internal structures and extracting information from HTML documents

Cited by: 4

Authors:

Lim, SJ [1]

Ng, YK [1]

Affiliation:

[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA

Source:

6TH INTERNATIONAL CONFERENCE ON DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS | 1999

DOI:

10.1109/DASFAA.1999.765738

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

HTML [11, 12] is a well-accepted and widely used language for creating platform-independent documents to be posted on the Web, and HTML documents are semistructured in nature according to the HTML specification. We propose a tool, called WebView, which constructs the semistructured data graph (SDG) of an HTML document H to capture the internal structure of the data embedded in H and in the documents that H links to, directly or indirectly. On top of the SDG, WebView provides query processing for evaluating Set-like queries posed against the SDG, i.e., against the source document(s), in order to extract information from them. Existing methods for extracting structured information from certain HTML documents with static internal structure, such as wrappers and integrators for data warehousing, can benefit from WebView.
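The idea in the abstract, building a graph over an HTML document and then querying that graph, can be illustrated with a minimal Python sketch. This is not WebView's SDG construction or its query language, neither of which is specified in this record. The names `Node`, `GraphBuilder`, `build_graph`, `select`, `text_of`, and `demo` are hypothetical, the sample page is invented, and the sketch only records `href` values rather than following links to other documents.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never take a closing tag; they must not stay open on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class Node:
    """One vertex of the illustrative graph (hypothetical structure)."""
    label: str                                  # tag name, or "#text" for a text leaf
    attrs: dict = field(default_factory=dict)   # tag attributes, e.g. href
    text: str = ""                              # only set on "#text" leaves
    children: list = field(default_factory=list)


class GraphBuilder(HTMLParser):
    """Builds a parent-to-child graph from the tag nesting of one document."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def build_graph(html):
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def select(node, label, where=lambda n: True):
    """Yield every descendant of `node` with the given label that satisfies `where`."""
    for child in node.children:
        if child.label == label and where(child):
            yield child
        yield from select(child, label, where)


def text_of(node):
    """Concatenate the text leaves below `node`."""
    if node.label == "#text":
        return node.text
    return " ".join(t for t in (text_of(c) for c in node.children) if t)


def demo():
    # Invented sample page, used only to exercise the sketch.
    page = """
    <html><body>
      <h1>Faculty</h1>
      <ul>
        <li><a href="alice.html">Alice</a> databases</li>
        <li><a href="bob.html">Bob</a> web data</li>
      </ul>
    </body></html>
    """
    graph = build_graph(page)
    # Roughly: select the link targets of list items whose text mentions "web".
    return [a.attrs["href"]
            for li in select(graph, "li", lambda n: "web" in text_of(n))
            for a in select(li, "a")]
```

Calling `demo()` selects the link target of the one list item whose text contains "web". A tool that also covers linked documents, as the abstract describes, would need to fetch each recorded `href` and attach the resulting subgraph, which this sketch leaves out.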

Pages: 71-80

Page count: 10

50 records in total

[1] Extracting structures of HTML documents
Lim, SJ
Ng, YK
[J]. TWELFTH INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING (ICOIN-12), PROCEEDINGS, 1998: 420-426
[2] Categorizing and extracting information from multilingual HTML documents
Lim, SJ
Ng, YK
[J]. 9TH INTERNATIONAL DATABASE ENGINEERING & APPLICATION SYMPOSIUM, PROCEEDINGS, 2005: 415-422
[3] Extracting logical structures from HTML tables
Kim, Yeon-Seok
Lee, Kyong-Ho
[J]. COMPUTER STANDARDS & INTERFACES, 2008, 30 (05): 296-308
[4] Extracting structures of HTML documents using a high-level stack machine
Lim, SJ
Ng, YK
[J]. INFORMATION NETWORKING IN ASIA, 2001, 3: 177-188
[5] Layout based information extraction from HTML documents
Burget, Radek
[J]. ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007: 624-628
[6] Reusing of Information Constructed in HTML Documents: A Conversion of HTML into OWL
Hwangbo, Hoon
Lee, Hongchul
[J]. 2008 INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS, VOLS 1-4, 2008: 769-773
[7] Extracting Logical Hierarchical Structure of HTML Documents Based on Headings
Manabe, Tomohiro
Tajima, Keishi
[J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 8 (12): 1606-1617
[8] Automatic discovery of semantic structures in HTML documents
Mukherjee, S
Yang, GZ
Tan, WF
Ramakrishnan, IV
[J]. SEVENTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2003: 245-249
[9] A case-based recognition of semantic structures in HTML documents - An automated transformation from HTML to XML
Umehara, M
Iwanuma, K
Nabeshima, H
[J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002, 2002, 2412: 141-147
[10] Employing clustering techniques for automatic information extraction from HTML documents
Ashraf, Fatima
Oezyer, Tansel
Alhajj, Reda
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS, 2008, 38 (05): 660-673
