Streaming Euclidean k-median and k-means with o(log n) Space

Cited by: 0
Authors
Cohen-Addad, Vincent [1 ]
Woodruff, David P. [2 ]
Zhou, Samson [3 ]
Affiliations
[1] Google Research, Mountain View, CA 94043 USA
[2] Carnegie Mellon University, Pittsburgh, PA 15213 USA
[3] Texas A&M University, College Station, TX USA
Keywords
streaming model; clustering; sublinear algorithms; coresets
DOI
10.1109/FOCS57990.2023.00057
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline code
081202
Abstract
We consider the classic Euclidean k-median and k-means objectives on data streams, where the goal is to provide a $(1+\epsilon)$-approximation to the optimal k-median or k-means solution while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, and sensitivity sampling. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least $\Omega(\log(n\Delta))$ words of memory, where $n$ is the size of the input and $\Delta$ is the aspect ratio. A natural question is whether one can beat this logarithmic dependence on $n$ and $\Delta$. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a $(1+\epsilon)$-approximation to the more general $(k,z)$-clustering problem, using $\tilde{O}(dk/\epsilon^2) \cdot (2^{z\log z}) \cdot \min(1/\epsilon^z, k) \cdot \mathrm{poly}(\log\log(n\Delta))$ words of memory. Our techniques can also be used to achieve two-pass algorithms for k-median and k-means clustering on dynamic streams using $\tilde{O}(1/\epsilon^2) \cdot \mathrm{poly}(d, k, \log\log(n\Delta))$ words of memory.
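For reference, the standard $(k,z)$-clustering objective that the abstract generalizes to can be stated as follows; the symbols $X$, $C$, and $\mathrm{cost}_z$ are notation chosen for this sketch rather than taken from the paper. For a point set $X \subset \mathbb{R}^d$ and a candidate center set $C$ with $|C| = k$,
\[
\mathrm{cost}_z(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert_2^{\,z},
\]
where $z = 1$ recovers k-median and $z = 2$ recovers k-means. A $(1+\epsilon)$-approximation is then a set $C$ of $k$ centers satisfying $\mathrm{cost}_z(X, C) \le (1+\epsilon) \cdot \min_{|C'| = k} \mathrm{cost}_z(X, C')$.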
Pages: 883-908
Number of pages: 26