AN INVESTIGATION OF CLUSTERING ALGORITHMS IN THE IDENTIFICATION OF SIMILAR WEB PAGES

Cited by: 0
Authors
De Lucia, Andrea [1]
Risi, Michele [1]
Scanniello, Giuseppe [2]
Tortora, Genoveffa [1]
Affiliations
[1] Univ Salerno, Dipartimento Matemat & Informat, Salerno, Italy
[2] Univ Basilicata, Dipartimento Matemat & Informat, Potenza, Italy
Source
JOURNAL OF WEB ENGINEERING | 2009 / Vol. 8 / Issue 04
Keywords
clone analysis; clustering algorithms; latent semantic indexing; Levenshtein string edit distances; program comprehension; reverse engineering;
DOI
Not available
Chinese Library Classification number
TP31 [Computer software];
Subject classification code
081202 ; 0835 ;
Abstract
In this paper we investigate the effect of using clustering algorithms in the reverse engineering field to identify pages that are similar either at the structural level or at the content level. To this end, we have used two instances of a general process that differ only in the measure used to compare web pages. In particular, two web pages are compared at the structural level by using the Levenshtein edit distance and at the content level by using Latent Semantic Indexing. The static pages of two web applications and one static web site have been used to compare the results achieved by the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. However, the investigation has also suggested some heuristics to quickly identify the best partition of web pages into clusters among the possible partitions, both at the structural and at the content level.
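To make the structural comparison in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each page has already been reduced to a sequence of tag names, computes a length-normalized Levenshtein edit distance between those sequences, and groups pages with a simple threshold-based single-link rule. The abstract does not name the clustering algorithms that were studied, so the grouping rule, the `threshold` value, the function names, and the sample pages are all hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Scale the edit distance to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def single_link_clusters(pages, threshold):
    """Group pages connected by a chain of pairs within the threshold.

    This is an illustrative stand-in (assumption), not one of the
    algorithms evaluated in the paper.
    """
    names = list(pages)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if normalized_distance(pages[p], pages[q]) <= threshold:
                parent[find(p)] = find(q)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pages, each reduced to its sequence of tag names.
pages = {
    "home.html": ["html", "body", "div", "ul", "li", "li"],
    "news.html": ["html", "body", "div", "ul", "li", "li", "li"],
    "contact.html": ["html", "body", "form", "input", "input"],
}
clusters = single_link_clusters(pages, threshold=0.3)
print(clusters)
```

With these sample pages, `home.html` and `news.html` differ by one tag and land in the same cluster, while `contact.html` stays on its own. A content-level variant would keep the same grouping step and swap the distance for one derived from Latent Semantic Indexing.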
Pages: 346 - 370
Page count: 25
Related papers
50 in total
  • [1] Comparing clustering algorithms for the identification of similar pages in web applications
    De Lucia, Andrea
    Risi, Michele
    Scanniello, Giuseppe
    Tortora, Genoveffa
    [J]. WEB ENGINEERING, PROCEEDINGS, 2007, 4607 : 415 - +
  • [2] Clustering algorithms and latent semantic indexing to identify similar pages in web applications
    De Lucia, Andrea
    Risi, Michele
    Tortora, Genoveffa
    Scanniello, Giuseppe
    [J]. WSE 2007: NINTH IEEE INTERNATIONAL SYMPOSIUM ON WEB SITE EVOLUTION, PROCEEDINGS, 2007, : 65 - +
  • [3] Identifying similar pages in Web applications using a competitive clustering algorithm
    De Lucia, Andrea
    Scanniello, Giuseppe
    Tortora, Genoveffa
    [J]. JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION-RESEARCH AND PRACTICE, 2007, 19 (05): : 281 - 296
  • [4] Extending link-based algorithms for similar web pages with neighborhood structure
    Lin, Zhenjiang
    Lyu, Michael R.
    King, Irwin
    [J]. PROCEEDINGS OF THE IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE: WI 2007, 2007, : 263 - 266
  • [5] Clustering Web Pages into Hierarchical Categories
    Yao, Zhongmei
    Choi, Ben
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT INFORMATION TECHNOLOGIES, 2007, 3 (02) : 17 - 35
  • [6] Clustering Web pages into hierarchial categories
    Louisiana Tech University, Ruston, LA, United States
    [J]. Int. J. Intell. Inf. Technologies, 2007, 2 (17-35):
  • [7] Clustering Web pages based on their structure
    Crescenzi, V
    Merialdo, P
    Missier, P
    [J]. DATA & KNOWLEDGE ENGINEERING, 2005, 54 (03) : 279 - 299
  • [8] Block Clustering for Web Pages Categorization
    Charrad, Malika
    Lechevallier, Yves
    ben Ahmed, Mohamed
    Saporta, Gilbert
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING, PROCEEDINGS, 2009, 5788 : 260 - +
  • [9] A Review on Web Pages Clustering Techniques
    Patel, Dipak
    Zaveri, Mukesh
    [J]. TRENDS IN NETWORKS AND COMMUNICATIONS, 2011, 197 : 700 - 710
  • [10] Web pages reordering and clustering based on web patterns
    Kudelka, Milos
    Snasel, Vaclav
    Lehecka, Ondrej
    El-Qawasmeh, Eyas
    Pokorny, Jaroslav
    [J]. SOFSEM 2008: THEORY AND PRACTICE OF COMPUTER SCIENCE, 2008, 4910 : 731 - +