Large-scale k-means clustering via variance reduction

Cited by: 20
Authors
Zhao, Yawei [1]
Ming, Yuewei [1]
Liu, Xinwang [1]
Zhu, En [1]
Zhao, Kaikai [2]
Yin, Jianping [3]
Affiliations
[1] Natl Univ Def Technol, Changsha, Hunan, Peoples R China
[2] Naval Aeronaut Univ, Yantai, Shandong, Peoples R China
[3] Dongguan Univ Technol, Dongguan, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
k-Means clustering; Large-scale clustering; Variance reduction;
DOI
10.1016/j.neucom.2018.03.059
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As the volume of data such as images on the web grows, it is challenging to perform k-means clustering efficiently on millions or even billions of images. One reason is that k-means requires a full batch of training data to update the cluster centers at every iteration, which is time-consuming. Conventionally, k-means is accelerated by using one instance or a mini-batch of instances to update the centers, which leads to poor performance due to stochastic noise. In this paper, we decrease such stochastic noise and accelerate k-means by using a variance reduction technique. Specifically, we propose a position correction mechanism to correct the drift of the cluster centers, and propose a variance-reduced k-means named VRKM. Furthermore, we optimize VRKM by reducing its computational cost, and propose a new variant of the variance-reduced k-means named VRKM++. Compared with VRKM, VRKM++ does not have to compute the batch gradient and is more efficient. Extensive empirical studies show that our methods VRKM and VRKM++ outperform the state-of-the-art method, and obtain about 2× and 4× speedups for large-scale clustering, respectively. The source code is available at https://www.github.com/YaweiZhao/VRKM.sofia-ml. (C) 2018 Elsevier B.V. All rights reserved.
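To illustrate the general idea of variance reduction applied to stochastic k-means updates, the following is a minimal sketch in the style of SVRG. It is not the authors' VRKM or VRKM++ implementation: the abstract does not specify the position correction mechanism, so the update rule, the function names (`nearest`, `point_grad`, `svrg_kmeans`), and the parameters (`epochs`, `lr`, `inner_steps`) are all assumptions made for illustration. Each epoch stores a snapshot of the centers and the full-batch gradient at that snapshot, then corrects every single-instance gradient with them. Unlike VRKM++ as described above, this sketch still computes the batch gradient once per epoch.

```python
import random


def nearest(x, centers):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - c) ** 2 for a, c in zip(x, centers[j])),
    )


def point_grad(x, centers):
    """Gradient of 0.5 * ||x - c_j||^2 w.r.t. every center.

    Only the row of the nearest center is non-zero.
    """
    j = nearest(x, centers)
    grad = [[0.0] * len(x) for _ in centers]
    grad[j] = [c - a for a, c in zip(x, centers[j])]
    return grad


def objective(data, centers):
    """Mean squared distance from each point to its nearest center."""
    total = 0.0
    for x in data:
        total += min(sum((a - c) ** 2 for a, c in zip(x, ctr)) for ctr in centers)
    return total / len(data)


def svrg_kmeans(data, centers, epochs=5, lr=0.1, inner_steps=None, seed=0):
    """SVRG-style stochastic k-means (illustrative assumption, not VRKM)."""
    rng = random.Random(seed)
    n, k, d = len(data), len(centers), len(data[0])
    inner_steps = inner_steps or n
    centers = [list(c) for c in centers]
    for _ in range(epochs):
        # Snapshot of the centers and the full-batch gradient at the snapshot.
        snapshot = [list(c) for c in centers]
        full = [[0.0] * d for _ in range(k)]
        for x in data:
            g = point_grad(x, snapshot)
            for j in range(k):
                for t in range(d):
                    full[j][t] += g[j][t] / n
        # Stochastic steps, each corrected by the snapshot gradients.
        for _ in range(inner_steps):
            x = data[rng.randrange(n)]
            g_now = point_grad(x, centers)
            g_old = point_grad(x, snapshot)
            for j in range(k):
                for t in range(d):
                    centers[j][t] -= lr * (g_now[j][t] - g_old[j][t] + full[j][t])
    return centers
```

When the assignment of the sampled instance is the same under the current centers and the snapshot, the instance itself cancels out of `g_now - g_old`, so the noise in each step shrinks as the centers approach the snapshot. That cancellation is the variance reduction effect this sketch is meant to show.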
Pages: 184-194
Number of pages: 11
    Tian, Yifan
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2019, 7 (02) : 568 - 579