An approach to identify duplicated Web pages

Citations: 62
Authors
Di Lucca, GA [1]
Di Penta, M [1]
Fasolino, AR [1]
Affiliation
[1] Univ Naples Federico II, Dipartimento Informat & Sistemist, I-80125 Naples, Italy
Keywords
Web engineering; Web site analysis; Web site metrics; source code clones; clone analysis; software metrics
DOI
10.1109/CMPSAC.2002.1045051
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A relevant consequence of the unceasing expansion of the Web and e-commerce is the growing demand for new Web sites and Web applications. As a result, Web sites and applications are usually developed without a formalized process: Web pages are coded directly and incrementally, and new pages are obtained by duplicating existing ones. Duplicated Web pages, which share the same structure and differ only in the data they include, can be considered clones. Identifying clones may reduce the effort devoted to testing, maintaining and evolving Web sites and applications. Moreover, clone detection across different Web sites can reveal cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in Web sites and applications implemented with HTML and ASP technology. The proposed approach has been assessed by analyzing several Web sites and Web applications. The results obtained are reported in the paper with respect to some case studies.
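The abstract does not say which similarity metrics the authors use, so the following is only an illustrative sketch of the general idea that pages with the same structure but different data can be flagged as clone candidates. It assumes that structure is approximated by the sequence of HTML tags and that similarity is measured with Python's `difflib.SequenceMatcher`. The names `tag_sequence`, `structural_similarity`, `is_clone_candidate` and the 0.9 threshold are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the order of opening and closing tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the list of tag names found in an HTML string (assumed structure proxy)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False)
    return matcher.ratio()


def is_clone_candidate(html_a, html_b, threshold=0.9):
    """Flag a pair of pages whose structure is nearly identical (threshold is an assumption)."""
    return structural_similarity(html_a, html_b) >= threshold


if __name__ == "__main__":
    page_a = "<html><body><h1>Shoes</h1><p>Price: 10</p></body></html>"
    page_b = "<html><body><h1>Hats</h1><p>Price: 25</p></body></html>"
    page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"
    print(is_clone_candidate(page_a, page_b))  # same tags, different data
    print(is_clone_candidate(page_a, page_c))  # different structure
```

Comparing only tag sequences deliberately ignores text content, which matches the abstract's notion of pages "differing only in the data they include"; a real tool would also need to handle server-side ASP code, which this sketch does not attempt.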
Pages: 481-486
Page count: 4
Related papers
50 in total
  • [1] An FW-BF Based Approach on Elimination of Duplicated Web Pages
    Ma, Leiming
    Xia, Zhengyou
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2016, 2016, 9937 : 183 - 191
  • [2] Analysis of Duplicated Web Pages Identification Methods in Search Engine
    Duan, Fei
    Zheng, Yan
    [J]. 2010 2ND INTERNATIONAL WORKSHOP ON DATABASE TECHNOLOGY AND APPLICATIONS PROCEEDINGS (DBTA), 2010,
  • [3] Semantic Keywords-Based Duplicated Web Pages Removing
    Weng, Yunhe
    Li, Lei
    Zhong, Yixin
    [J]. IEEE NLP-KE 2008: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 2008, : 318 - 324
  • [4] An Approach for Restructuring of Web Pages
    Prasanna, Chennupati. R.
    Kishore, M. Venkata
    Rao, P. Srinivasa
    Sandeep, L. Mohana
    Lakshmi, D. Rajya
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2010, 10 (02): : 102 - 104
  • [5] An Improved Algorithm of STC for the Deletion of Duplicated Web pages Based on Repeated Strings
    Wang Huijiao
    Yin Bo
    Hou Jie
    [J]. THIRD INTERNATIONAL CONFERENCE ON GENETIC AND EVOLUTIONARY COMPUTING, 2009, : 414 - 417
  • [6] An Approach to Assess the Quality of Web Pages in the Deep Web
    Nie, Tiezheng
    Yu, Ge
    Shen, Derong
    Kou, Yue
    Yue, Dejun
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2011, 2011, 6637 : 514 - 525
  • [7] A rendering approach for stereoscopic web pages
    Zhang, Jianlong
    Wang, Wenmin
    Wang, Ronggang
    Chen, Qinshui
    [J]. STEREOSCOPIC DISPLAYS AND APPLICATIONS XXV, 2014, 9011
  • [8] Experiment Research on Duplicated Web Pages of Chinese Elimination Algorithm Based on Improved TextTiling
    Tan, Ran
    Dai, ZhiRong
    Xue, Yanxin
    [J]. ADVANCES IN COMPUTER SCIENCE, ENVIRONMENT, ECOINFORMATICS, AND EDUCATION, PT III, 2011, 216 : 378 - +
  • [9] Research on New Algorithm of Topic-Oriented Crawler and Duplicated Web Pages Detection
    Zhang, Yong-Heng
    Zhang, Feng
    [J]. INTELLIGENT COMPUTING THEORIES AND APPLICATIONS, ICIC 2012, 2012, 7390 : 35 - 42
  • [10] An approach to predict the task efficiency of web pages
    Sangita Saha
    Apurbalal Senapati
    Ranjan Maity
    [J]. Multimedia Tools and Applications, 2023, 82 : 25217 - 25233