WebView: A tool for retrieving internal structures and extracting information from HTML documents

Cited by: 4

Authors:

Lim, SJ [1]

Ng, YK [1]

Affiliations:

[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA

Source:

6TH INTERNATIONAL CONFERENCE ON DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS | 1999

DOI:

10.1109/DASFAA.1999.765738

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

HTML [11, 12] is a well-accepted and widely used language for creating platform-independent documents to be posted on the Web, and HTML documents are semistructured in nature according to the HTML specification. We propose a tool, called WebView, which constructs the semistructured data graph (SDG) of an HTML document H to capture the internal structure of data embedded in H and in its (in)directly linked documents. On top of the SDG, WebView provides query processing capability for evaluating Set-like queries that are posted against the SDG, i.e., the source document(s), for extracting information from the SDG. Existing methods for extracting structured information from certain HTML documents with static internal structure, such as wrappers and integrators for data warehousing, can benefit from WebView.

Pages: 71-80

Number of pages: 10
