One-pass MapReduce-based clustering method for mixed large scale data

Authors
Mohamed Aymen Ben HajKacem
Chiheb-Eddine Ben N’cir
Nadia Essoussi
Affiliation
[1] Université de Tunis, Institut Supérieur de Gestion de Tunis, LARODEC
Keywords
K-prototypes; One-pass MapReduce; Large scale data; Mixed data; Pruning strategy
DOI
Not available
Abstract
Big data is often characterized by a huge volume and mixed types of attributes, namely numeric and categorical. K-prototypes has been adapted to the MapReduce framework and has thus become a solution for clustering mixed large scale data. However, k-prototypes requires computing the distance between every cluster center and every data point. Many of these distance computations are redundant, because data points usually stay in the same cluster after the first few iterations. Moreover, k-prototypes is not well suited to running within the MapReduce framework: its iterative nature cannot be modeled through MapReduce, since at each iteration the whole data set must be read from and written to disk, which results in a high number of input/output (I/O) operations. To deal with these issues, we propose a new one-pass accelerated MapReduce-based k-prototypes clustering method for mixed large scale data. The proposed method reads and writes data only once, which largely reduces the I/O operations compared to the existing MapReduce implementation of k-prototypes. Furthermore, the proposed method relies on a pruning strategy that accelerates the clustering process by reducing the redundant distance computations between cluster centers and data points. Experiments performed on simulated and real data sets show that the proposed method is scalable and improves on the efficiency of existing k-prototypes methods.
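The distance computations the abstract refers to can be illustrated with a minimal sketch of the standard k-prototypes dissimilarity (squared Euclidean distance on the numeric attributes plus a weight gamma times the number of categorical mismatches) and of the naive assignment step that evaluates all k distances for each point. The function names, the tuple layout of a point, and the default gamma are assumptions made for illustration. This is not the authors' MapReduce implementation, and the pruning strategy is left out because the abstract does not specify it.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: a point (or center) is (numeric attributes, categorical attributes).
Mixed = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(point: Mixed, center: Mixed, gamma: float = 1.0) -> float:
    """Standard k-prototypes dissimilarity between a point and a cluster center."""
    num_p, cat_p = point
    num_c, cat_c = center
    # Numeric part: squared Euclidean distance.
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_c))
    # Categorical part: number of attributes whose values differ.
    mismatches = sum(1 for a, b in zip(cat_p, cat_c) if a != b)
    return numeric + gamma * mismatches


def assign(points: List[Mixed], centers: List[Mixed], gamma: float = 1.0) -> List[int]:
    """Naive assignment: every point is compared with every center (k distances each)."""
    labels = []
    for p in points:
        dists = [mixed_distance(p, c, gamma) for c in centers]
        labels.append(min(range(len(centers)), key=dists.__getitem__))
    return labels


points = [((1.0, 2.0), ("a", "x")), ((9.0, 8.0), ("b", "y"))]
centers = [((1.0, 2.0), ("a", "x")), ((9.0, 9.0), ("b", "y"))]
labels = assign(points, centers)
```

Each call to `assign` costs one distance evaluation per point per center. That is the work the abstract describes as largely redundant once points stop changing clusters.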
Pages: 619-636
Page count: 17