Streaming Euclidean k-median and k-means with o(log n) Space

Cited by: 0
Authors
Cohen-Addad, Vincent [1 ]
Woodruff, David P. [2 ]
Zhou, Samson [3 ]
Affiliations
[1] Google Research, Mountain View, CA 94043 USA
[2] Carnegie Mellon University, Pittsburgh, PA 15213 USA
[3] Texas A&M University, College Station, TX USA
Keywords
streaming model; clustering; sublinear algorithms; coresets
DOI
10.1109/FOCS57990.2023.00057
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline code
081202
Abstract
We consider the classic Euclidean k-median and k-means objectives on data streams, where the goal is to provide a $(1+\epsilon)$-approximation to the optimal k-median or k-means solution while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, and sensitivity sampling. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least $\Omega(\log(n\Delta))$ words of memory, where $n$ is the size of the input and $\Delta$ is the aspect ratio. A natural question is whether one can beat this logarithmic dependence on $n$ and $\Delta$. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a $(1+\epsilon)$-approximation to the more general $(k,z)$-clustering problem, using $\tilde{O}(dk/\epsilon^2) \cdot (2^{z\log z}) \cdot \min(1/\epsilon^z, k) \cdot \mathrm{poly}(\log\log(n\Delta))$ words of memory. Our techniques can also be used to achieve two-pass algorithms for k-median and k-means clustering on dynamic streams using $\tilde{O}(1/\epsilon^2) \cdot \mathrm{poly}(d, k, \log\log(n\Delta))$ words of memory.
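For reference, the standard $(k,z)$-clustering objective that the abstract generalizes to can be stated as follows; the symbols $X$, $C$, and $\mathrm{cost}_z$ are notation chosen for this sketch rather than taken from the paper. For a point set $X \subset \mathbb{R}^d$ and a candidate center set $C$ with $|C| = k$,
\[
\mathrm{cost}_z(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert_2^{\,z},
\]
where $z = 1$ recovers k-median and $z = 2$ recovers k-means. A $(1+\epsilon)$-approximation is then a set $C$ of $k$ centers satisfying $\mathrm{cost}_z(X, C) \le (1+\epsilon) \cdot \min_{|C'| = k} \mathrm{cost}_z(X, C')$.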
Pages: 883-908
Number of pages: 26