Large-scale k-means clustering via variance reduction

Cited by: 20
Authors
Zhao, Yawei [1]
Ming, Yuewei [1]
Liu, Xinwang [1]
Zhu, En [1]
Zhao, Kaikai [2]
Yin, Jianping [3]
Affiliations
[1] National University of Defense Technology, Changsha, Hunan, People's Republic of China
[2] Naval Aeronautical University, Yantai, Shandong, People's Republic of China
[3] Dongguan University of Technology, Dongguan, Guangdong, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
k-Means clustering; Large-scale clustering; Variance reduction
DOI
10.1016/j.neucom.2018.03.059
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
As the volume of data such as web images grows, performing k-means clustering efficiently on millions or even billions of images becomes challenging. One reason is that k-means uses the full batch of training data to update the cluster centers at every iteration, which is time-consuming. Conventionally, k-means is accelerated by using a single instance or a mini-batch of instances to update the centers, which degrades performance because of stochastic noise. In this paper, we reduce this stochastic noise and accelerate k-means by using a variance reduction technique. Specifically, we propose a position correction mechanism to correct the drift of the cluster centers, and a variance-reduced k-means method named VRKM. Furthermore, we optimize VRKM by reducing its computational cost and propose a new variant named VRKM++. Compared with VRKM, VRKM++ does not have to compute the batch gradient and is more efficient. Extensive empirical studies show that VRKM and VRKM++ outperform the state-of-the-art method, obtaining about 2x and 4x speedups for large-scale clustering, respectively. The source code is available at https://www.github.com/YaweiZhao/VRKM.sofia-ml. (C) 2018 Elsevier B.V. All rights reserved.
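The general idea of applying variance reduction to stochastic k-means updates can be illustrated with a generic SVRG-style scheme. The sketch below is not the authors' VRKM or VRKM++ (this record does not give their update rules or the position correction mechanism); it only shows, under stated assumptions, how a periodically recomputed full gradient can correct noisy single-instance center updates. All function names, the step size `lr`, and the epoch structure are hypothetical choices made for illustration.

```python
import numpy as np

def sample_gradient(centers, x):
    """Gradient of 0.5 * min_k ||x - c_k||^2 with respect to all centers."""
    grad = np.zeros_like(centers)
    k = np.argmin(((centers - x) ** 2).sum(axis=1))  # nearest center
    grad[k] = centers[k] - x                         # only that row is nonzero
    return grad

def full_gradient(centers, X):
    """Average of sample_gradient over the whole data set."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    grad = np.zeros_like(centers)
    for k in range(len(centers)):
        members = X[assign == k]
        if len(members):
            grad[k] = (centers[k] - members).sum(axis=0)
    return grad / len(X)

def objective(centers, X):
    """Mean of 0.5 * squared distance to the nearest center."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return 0.5 * dists.min(axis=1).mean()

def svrg_kmeans(X, k, epochs=10, inner_steps=None, lr=0.1, seed=0):
    """Generic SVRG-style k-means (illustrative assumption, not VRKM)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    inner_steps = inner_steps or n
    # Initialize centers with k distinct data points.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(epochs):
        snapshot = centers.copy()          # reference point for this epoch
        mu = full_gradient(snapshot, X)    # one full pass per epoch
        for _ in range(inner_steps):
            x = X[rng.integers(n)]
            # Variance-reduced gradient: the noisy single-instance gradient
            # is corrected by its value at the snapshot plus the full gradient.
            g = sample_gradient(centers, x) - sample_gradient(snapshot, x) + mu
            centers -= lr * g
    return centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.5, (200, 2)),
                   rng.normal(10.0, 0.5, (200, 2))])
    centers = svrg_kmeans(X, k=2)
    print(centers, objective(centers, X))
```

In this sketch the full gradient is still computed once per epoch. The abstract states that VRKM++ avoids computing the batch gradient, so a faithful implementation would differ at that step; the authors' repository linked in the abstract is the reference for the actual method.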
Pages: 184-194
Page count: 11