Estimating clustering indexes in data streams

Authors
Buriol, Luciana S. [1]
Frahling, Gereon [2]
Leonardi, Stefano [3]
Sohler, Christian [4]
Affiliations
[1] Univ Fed Rio Grande do Sul, Porto Alegre, RS, Brazil
[2] Google Res, New York, NY USA
[3] Univ Roma La Sapienza, Rome, Italy
[4] Univ Paderborn, Paderborn, Germany
DOI
Not available
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
We present random sampling algorithms that, with probability at least 1 - δ, compute a (1 ± ε)-approximation of the clustering coefficient and of the number of bipartite clique subgraphs of a graph given as an incidence stream of edges. The space used by our algorithm to estimate the clustering coefficient is inversely related to the clustering coefficient of the network itself. The space used by our algorithm to count K_{3,3} bipartite cliques is proportional to the ratio between the number of K_{1,3} and K_{3,3} subgraphs in the graph. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size. They therefore provide a basic tool to analyze the structure of dense clusters in large graphs, with many applications in the discovery of web communities, the analysis of the structure of large social networks, and the probing of frequent patterns in large graphs. We implemented both algorithms and evaluated their performance on networks from different application domains and of different sizes. The largest instance is a web graph consisting of more than 135 million nodes and 1 billion edges. Both algorithms compute accurate results in reasonable time on the tested instances.
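To give a feel for why sampling cost can depend on the clustering coefficient rather than on graph size, here is a minimal sketch of plain wedge sampling. It is not the authors' incidence-stream algorithm: it assumes the whole graph fits in memory, and it assumes the global (transitivity) definition of the clustering coefficient, i.e. the fraction of length-2 paths (wedges) whose endpoints are adjacent. The function names `build_adjacency` and `estimate_transitivity` are illustrative only.

```python
import random


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def estimate_transitivity(adj, num_samples, seed=0):
    """Estimate (closed wedges / all wedges) by uniform wedge sampling.

    Assumption: in-memory graph with sortable node labels; this is a
    simplified stand-in, not the streaming method from the abstract.
    """
    rng = random.Random(seed)
    # Only nodes of degree >= 2 can be the center of a wedge.
    centers = [v for v, nbrs in adj.items() if len(nbrs) >= 2]
    if not centers:
        return 0.0
    # A node of degree d is the center of d*(d-1)/2 wedges; weighting by
    # that count makes every wedge equally likely to be drawn.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in rng.choices(centers, weights=weights, k=num_samples):
        a, b = rng.sample(sorted(adj[v]), 2)  # two distinct neighbors of v
        if b in adj[a]:
            closed += 1  # the wedge a - v - b is closed by edge (a, b)
    return closed / num_samples


if __name__ == "__main__":
    # Triangle 0-1-2 plus a pendant edge 0-3: 3 of the 5 wedges are closed.
    graph = build_adjacency([(0, 1), (1, 2), (0, 2), (0, 3)])
    print(estimate_transitivity(graph, num_samples=20000))
```

Each sample is a Bernoulli trial that succeeds with probability equal to the coefficient, so a standard Chernoff-style argument suggests that the number of samples needed for a fixed relative error grows roughly with the inverse of the coefficient, which is in line with the space dependence stated in the abstract.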
Pages: 618 / +
Page count: 3