An approach to identify duplicated Web pages

Cited by: 62
Authors
Di Lucca, GA [1]
Di Penta, M [1]
Fasolino, AR [1]
Affiliations
[1] Univ Naples Federico II, Dipartimento Informat & Sistemist, I-80125 Naples, Italy
Keywords
Web engineering; Web site analysis; Web site metrics; source code clones; clone analysis; software metrics
DOI
10.1109/CMPSAC.2002.1045051
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A relevant consequence of the unceasing expansion of the Web and e-commerce is the growing demand for new Web sites and Web applications. As a result, Web sites and applications are usually developed without a formalized process: Web pages are coded directly and incrementally, with new pages obtained by duplicating existing ones. Duplicated Web pages, which have the same structure and differ only in the data they include, can be considered clones. Identifying clones may reduce the effort devoted to testing, maintaining, and evolving Web sites and applications. Moreover, clone detection across different Web sites can reveal cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in Web sites and applications implemented with HTML and ASP technology. The proposed approach has been assessed by analyzing several Web sites and Web applications. The results obtained are reported in the paper with respect to some case studies.
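The abstract does not say which similarity metrics the authors use, so the following is only an illustrative sketch of the general idea. It assumes that "same structure, different data" can be approximated by comparing the sequences of HTML tags of two pages with an edit distance. The names `tag_sequence`, `edit_distance`, and `structural_similarity` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records tag names in document order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the list of opening and closing tags found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences (assumed metric, not the paper's)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def structural_similarity(html_a, html_b):
    """Score in [0, 1]; 1.0 means the two tag sequences are identical."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


page_a = "<html><body><h1>Shoes</h1><p>Price: 10</p></body></html>"
page_b = "<html><body><h1>Hats</h1><p>Price: 25</p></body></html>"
page_c = "<html><body><ul><li>Contact</li></ul></body></html>"

print(structural_similarity(page_a, page_b))  # same structure, different data
print(structural_similarity(page_a, page_c))  # different structure
```

Under this assumption, pages whose score reaches a chosen threshold would be treated as clone candidates; the threshold itself would have to be tuned per site.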
Pages: 481-486
Page count: 4