An Approach to Assess the Quality of Web Pages in the Deep Web

Cited by: 0
Authors
Nie, Tiezheng [1]
Yu, Ge [1]
Shen, Derong [1]
Kou, Yue [1]
Yue, Dejun [1]
Affiliations
[1] Northeastern University, College of Information Science and Engineering, Shenyang 110819, People's Republic of China
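Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract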
Web pages contain a large amount of structured data, which is useful for many advanced applications. Existing work has mainly focused on extracting structured data from web pages with individual wrappers, but has ignored the quality of the underlying web pages, which in fact seriously affects the extraction results. We therefore define the quality of a web page as the data quality that a wrapper can achieve when extracting from it. This paper proposes a novel approach to assessing the quality of web pages in the deep web. In our approach, we first define the schema of web data with a hierarchical model. Web pages are then treated as XML documents and parsed into DOM trees. The data units and attribute values in a web page are annotated with their schema semantics and with the XPath of their position in the DOM tree. Based on these annotations, we build an assessment model for the quality of web pages along two dimensions: the structure complexity and the text complexity of nodes in the DOM tree. Our model partitions quality into three levels, and web pages within the same level are compared using the proposed formulas. Moreover, because most existing wrappers cannot handle hierarchically structured data, we design an XQuery-based wrapper to extract data from web pages and validate our quality model. The wrapper uses the annotation information to generate XQuery statements that extract the web data. The experimental results demonstrate that our approach accurately assesses the data quality of web pages. It is helpful for data quality control in deep web applications.
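The annotation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sample page, the prefix-based labelling rule, and the two scoring functions (`structure_complexity` as the number of XPath steps, `text_complexity` as the number of whitespace-separated tokens) are assumptions chosen for illustration, since the abstract does not give the paper's formulas or its three quality levels.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page (the paper treats pages as XML documents).
PAGE = (
    "<html><body>"
    "<div><h2>Data Mining</h2>"
    "<span>Author: J. Han</span>"
    "<span>Price: 59.00</span></div>"
    "</body></html>"
)


def xpath_index(root):
    """Map every element to a positional XPath, e.g. /html/body[1]/div[1]/span[2]."""
    paths = {root: "/" + root.tag}
    # iter() walks in document order, so a parent's path exists before its children.
    for parent in root.iter():
        counts = {}
        for child in parent:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{counts[child.tag]}]"
    return paths


def annotate(root, labels):
    """Label nodes whose text starts with a known prefix (an assumed, toy rule)."""
    paths = xpath_index(root)
    found = []
    for element in root.iter():
        text = (element.text or "").strip()
        for prefix, attribute in labels.items():
            if text.startswith(prefix):
                found.append({"attribute": attribute, "xpath": paths[element], "text": text})
    return found


def structure_complexity(xpath):
    """Assumed proxy: number of location steps in the node's XPath."""
    return xpath.count("/")


def text_complexity(text):
    """Assumed proxy: number of whitespace-separated tokens in the node text."""
    return len(text.split())


root = ET.fromstring(PAGE)
annotations = annotate(root, {"Author:": "author", "Price:": "price"})
for item in annotations:
    print(item["attribute"], item["xpath"],
          structure_complexity(item["xpath"]), text_complexity(item["text"]))
```

Each annotation pairs a schema attribute with the XPath of the node that holds its value, which is the kind of information an XQuery-generating wrapper would need. The two scores stand in for the two dimensions of the assessment model only in spirit.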
Pages: 514-525
Page count: 12