Estimating clustering indexes in data streams

Cited by: 0
Authors
Buriol, Luciana S. [1]
Frahling, Gereon [2]
Leonardi, Stefano [3]
Sohler, Christian [4]
Affiliations
[1] Univ Fed Rio Grande do Sul, Porto Alegre, RS, Brazil
[2] Google Res, New York, NY USA
[3] Univ Roma La Sapienza, Rome, Italy
[4] Univ Paderborn, Paderborn, Germany
DOI: not available
Chinese Library Classification: TP31 [Computer Software]
Discipline classification codes: 081202; 0835
Abstract
We present random sampling algorithms that, with probability at least 1 - delta, compute a (1 +/- epsilon)-approximation of the clustering coefficient and of the number of bipartite clique subgraphs of a graph given as an incidence stream of edges. The space used by our algorithm to estimate the clustering coefficient is inversely related to the clustering coefficient of the network itself. The space used by our algorithm to compute the number of K_{3,3} bipartite cliques is proportional to the ratio between the number of K_{1,3} and K_{3,3} subgraphs in the graph. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size. They therefore provide a basic tool to analyze the structure of dense clusters in large graphs and have many applications in the discovery of web communities, the analysis of the structure of large social networks, and the probing of frequent patterns in large graphs. We implemented both algorithms and evaluated their performance on networks from different application domains and of different sizes. The largest instance is a webgraph consisting of more than 135 million nodes and 1 billion edges. Both algorithms compute accurate results in reasonable time on the tested instances.
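As a rough illustration of why the sample size depends on the coefficient itself, the sketch below estimates a clustering coefficient by sampling wedges (paths of length two) uniformly and counting how many are closed by a third edge. It is not the authors' streaming algorithm: it keeps the whole graph in memory, it assumes "clustering coefficient" means the global transitivity ratio (closed wedges divided by all wedges), and the names build_adjacency, exact_transitivity, and sampled_transitivity are hypothetical.

```python
import random
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def exact_transitivity(adj):
    """Fraction of wedges (a - center - b) whose endpoints a and b are adjacent."""
    wedges = 0
    closed = 0
    for nbrs in adj.values():
        for a, b in combinations(sorted(nbrs), 2):
            wedges += 1
            if b in adj[a]:
                closed += 1
    return closed / wedges if wedges else 0.0


def sampled_transitivity(adj, num_samples, rng):
    """Estimate the same ratio from uniformly sampled wedges.

    Assumption: the full graph is in memory; a streaming version would have
    to pick wedges without storing the adjacency map.
    """
    centers = [v for v in adj if len(adj[v]) >= 2]
    if not centers:
        return 0.0
    # A node of degree d is the center of d * (d - 1) / 2 wedges, so weighting
    # centers by that count and then picking a uniform neighbor pair gives a
    # uniform wedge.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for center in rng.choices(centers, weights=weights, k=num_samples):
        a, b = rng.sample(sorted(adj[center]), 2)
        if b in adj[a]:
            closed += 1
    return closed / num_samples


if __name__ == "__main__":
    # Triangle 0-1-2 with a pendant node 3 attached to node 2.
    graph = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
    print(exact_transitivity(graph))
    print(sampled_transitivity(graph, 2000, random.Random(0)))
```

Each sampled wedge is a coin flip that comes up "closed" with probability equal to the coefficient, so by a standard Chernoff-style argument the number of samples needed for a fixed relative error grows roughly in inverse proportion to the coefficient. That is consistent with the space behaviour the abstract describes, though the paper's actual bounds should be taken from the paper itself.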
Pages: 618 / +
Page count: 3