WebView: A tool for retrieving internal structures and extracting information from HTML documents

Cited by: 4
Authors
Lim, SJ [1 ]
Ng, YK [1 ]
Affiliations
[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
DOI
10.1109/DASFAA.1999.765738
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
HTML [11, 12] is a well-accepted and widely used language for creating platform-independent documents to be posted on the Web, and HTML documents are semistructured in nature according to the HTML specification. We propose a tool, called WebView, which constructs the semistructured data graph (SDG) of an HTML document H to capture the internal structure of data embedded in H and in its (in)directly linked documents. On top of the SDG, WebView provides query processing capability for evaluating Set-like queries that are posted against the SDG, i.e., the source document(s), for extracting information from the SDG. Existing methods for extracting structured information from certain HTML documents with static internal structure, such as wrappers and integrators for data warehousing, can benefit from WebView.
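The idea in the abstract, building a graph of a document's internal structure and then querying it, can be illustrated with a minimal Python sketch. This is not WebView's implementation: the node layout, the `GraphBuilder` class, and the `build_graph` and `select` helpers are hypothetical names chosen for illustration, the query is a plain tag-plus-predicate filter rather than the paper's query language, and linked documents are not followed.

```python
from html.parser import HTMLParser


class GraphBuilder(HTMLParser):
    """Builds a simple tree of element nodes from an HTML string.

    Hypothetical stand-in for a semistructured data graph; the paper's
    actual SDG definition is not reproduced here.
    """

    # Elements that never have a closing tag, so they are not pushed.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "attrs": {}, "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach text to the element that is currently open.
        self.stack[-1]["text"] += data.strip()


def build_graph(html):
    """Parse an HTML string and return the root node of its tree."""
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def select(node, tag, where=lambda n: True):
    """Return every descendant with the given tag that satisfies `where`."""
    hits = []
    for child in node["children"]:
        if child["tag"] == tag and where(child):
            hits.append(child)
        hits.extend(select(child, tag, where))
    return hits


# Made-up sample page, used only to exercise the sketch.
page = (
    "<html><body><h1>Catalog</h1>"
    "<ul><li class='new'>Alpha</li><li>Beta</li></ul>"
    "</body></html>"
)
graph = build_graph(page)
names = [n["text"] for n in select(graph, "li")]
new_items = [
    n["text"]
    for n in select(graph, "li", lambda n: n["attrs"].get("class") == "new")
]
```

Here `select(graph, "li", ...)` plays the role of a query posted against the graph: it names a node type and a condition, and returns the matching nodes in document order.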
Pages: 71-80
Page count: 10
Related papers
50 in total
  • [21] A Method Research of Extracting Web Information Based on HTML 5 New Standard
    Liu, Qing-hua
    Feng, Li-yun
    [J]. INTERNATIONAL CONFERENCE ON ELECTRICAL, CONTROL AND AUTOMATION ENGINEERING (ECAE 2013), 2013, : 520 - 524
  • [22] Detecting similar HTML documents using a fuzzy set information retrieval approach
    Yerra, R
    Ng, YK
    [J]. 2005 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, VOLS 1 AND 2, 2005, : 693 - 699
  • [23] A semi-automatic indexing system based on embedded information in HTML documents
    Vallez, Mari
    Pedraza-Jimenez, Rafael
    Codina, Lluis
    Blanco, Saul
    Rovira, Cristofol
    [J]. LIBRARY HI TECH, 2015, 33 (02) : 195 - 210
  • [24] A Method of Readability Assessment for Web Documents Using Text Features and HTML Structures
    Yamasaki, Takahiro
    Tokiwa, Kin-Ichiroh
    [J]. ELECTRONICS AND COMMUNICATIONS IN JAPAN, 2014, 97 (10) : 1 - 10
  • [25] Creating HTML or Markdown documents from within Stata using webdoc
    Jann, Ben
    [J]. STATA JOURNAL, 2017, 17 (01) : 3 - 38
  • [26] Multimedia information extraction from HTML product catalogues
    Labsky, Martin
    Praks, Pavel
    Svatek, Vojtech
    Svab, Ondrej
    [J]. DATESO 2005 - DATABASES, TEXTS, SPECIFICATIONS, OBJECTS, 2005, : 84 - 93
  • [27] Information extraction from HTML pages and its integration
    Itai, K
    Takasu, A
    Adachi, J
    [J]. 2003 SYMPOSIUM ON APPLICATIONS AND THE INTERNET WORKSHOPS, PROCEEDINGS, 2003, : 276 - 281
  • [28] Automatically extracting ontologically specified data from HTML tables of unknown structure
    Embley, DW
    Tao, C
    Liddle, SW
    [J]. CONCEPTUAL MODELING - ER 2002, 2002, 2503 : 322 - 337
  • [29] Information extraction from HTML tables base on domain ontology
    Hsiao, SL
    Chou, SC
    Chang, LP
    [J]. IKE'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE ENGINEERING, VOLS 1 AND 2, 2003, : 70 - 76
  • [30] HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM
    Kawamura, Kazuki
    Yamamoto, Akihiro
    [J]. DISCOVERY SCIENCE (DS 2021), 2021, 12986 : 29 - 43