An approach to identify duplicated Web pages

Cited by: 62

Authors:

Di Lucca, GA [1]

Di Penta, M [1]

Fasolino, AR [1]

Affiliation:

[1] Univ Naples Federico II, Dipartimento Informat & Sistemist, I-80125 Naples, Italy

Source:

26TH ANNUAL INTERNATIONAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE, PROCEEDINGS | 2002

Keywords:

Web engineering; Web site analysis; Web site metrics; source code clones; clone analysis; software metrics;

DOI:

10.1109/CMPSAC.2002.1045051

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

A relevant consequence of the unceasing expansion of the Web and e-commerce is the growing demand for new Web sites and Web applications. As a result, Web sites and applications are usually developed without a formalized process: Web pages are coded directly and incrementally, and new pages are obtained by duplicating existing ones. Duplicated Web pages, which share the same structure and differ only in the data they include, can be considered clones. Identifying clones may reduce the effort devoted to testing, maintaining and evolving Web sites and applications. Moreover, clone detection across different Web sites can reveal cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in Web sites and applications implemented with HTML and ASP technology. The proposed approach has been assessed by analyzing several Web sites and Web applications. The results obtained are reported in the paper with respect to some case studies.
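
The abstract does not say which similarity metrics the approach uses. Purely as an illustration of the general idea (pages with the same structure but different data), the sketch below compares the tag sequences of two HTML pages and ignores their text content. The use of Python's `difflib.SequenceMatcher`, the helper names `tag_sequence`, `structural_similarity` and `is_clone_candidate`, and the 0.9 threshold are all assumptions made for this example, not details taken from the paper.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening and closing tag names in document order; text is ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the list of tag names found in an HTML string (hypothetical helper)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Similarity ratio in [0, 1] between the two tag sequences.

    This is an assumed stand-in metric, not the metric defined in the paper.
    """
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b),
                              autojunk=False)
    return matcher.ratio()


def is_clone_candidate(html_a, html_b, threshold=0.9):
    """Flag a page pair as a possible clone; the threshold is an arbitrary assumption."""
    return structural_similarity(html_a, html_b) >= threshold


# Two pages with the same structure and different data, plus an unrelated page.
page_a = "<html><body><h1>Item 1</h1><p>Price: 10</p></body></html>"
page_b = "<html><body><h1>Item 2</h1><p>Price: 25</p></body></html>"
page_c = "<html><body><ul><li>Contact us</li></ul></body></html>"

print(is_clone_candidate(page_a, page_b))  # same tag sequence, so flagged
print(is_clone_candidate(page_a, page_c))  # different structure, so not flagged
```

Comparing only tag names keeps the sketch insensitive to the data a page displays, which matches the abstract's notion of a clone; a real tool would also have to decide how to treat attributes, scripts and server-side ASP code.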


Pages: 481-486

Page count: 4
