One-pass MapReduce-based clustering method for mixed large scale data

Authors
Mohamed Aymen Ben HajKacem
Chiheb-Eddine Ben N’cir
Nadia Essoussi
Affiliations
[1] Université de Tunis, Institut Supérieur de Gestion de Tunis, LARODEC
Keywords
K-prototypes; One-pass MapReduce; Large scale data; Mixed data; Pruning strategy;
DOI
Not available
Abstract
Big data is often characterized by huge volume and mixed attribute types, namely numeric and categorical. K-prototypes has been adapted to the MapReduce framework and has thus become a solution for clustering mixed large-scale data. However, k-prototypes requires computing the distances between every cluster center and every data point. Many of these distance computations are redundant, because data points usually stay in the same cluster after the first few iterations. Moreover, k-prototypes is poorly suited to the MapReduce framework: its iterative nature is hard to model in MapReduce because, at each iteration, the whole data set must be read from and written to disk, which results in a high number of input/output (I/O) operations. To deal with these issues, we propose a new one-pass accelerated MapReduce-based k-prototypes clustering method for mixed large-scale data. The proposed method reads and writes the data only once, which greatly reduces I/O operations compared to the existing MapReduce implementation of k-prototypes. Furthermore, the proposed method uses a pruning strategy to accelerate the clustering process by reducing redundant distance computations between cluster centers and data points. Experiments on simulated and real data sets show that the proposed method is scalable and improves on the efficiency of existing k-prototypes methods.
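To make the cost described in the abstract concrete, the following is a minimal sketch of the standard k-prototypes assignment step, assuming the usual mixed dissimilarity (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches). The names `mixed_distance`, `assign`, and `gamma` are illustrative assumptions; the sketch does not reproduce the paper's pruning strategy or its MapReduce design, which the abstract does not specify.

```python
from typing import List, Sequence, Tuple

# A mixed record: (numeric attributes, categorical attributes).
Record = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(point: Record, center: Record, gamma: float = 1.0) -> float:
    """Standard k-prototypes dissimilarity (assumed here, not taken from the paper):
    squared Euclidean distance on numeric parts + gamma * categorical mismatches."""
    num_p, cat_p = point
    num_c, cat_c = center
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_c))
    categorical = sum(1 for a, b in zip(cat_p, cat_c) if a != b)
    return numeric + gamma * categorical


def assign(points: List[Record], centers: List[Record], gamma: float = 1.0) -> List[int]:
    """Assign each point to its nearest center.

    This naive version computes every point-to-center distance, i.e.
    len(points) * len(centers) evaluations per iteration. These are the
    computations that a pruning strategy would aim to reduce."""
    labels = []
    for p in points:
        distances = [mixed_distance(p, c, gamma) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels


if __name__ == "__main__":
    points = [((1.0, 2.0), ("a", "x")), ((9.0, 8.0), ("b", "y"))]
    centers = [((1.0, 2.0), ("a", "x")), ((9.0, 9.0), ("b", "y"))]
    print(assign(points, centers))
```

In an iterative implementation this assignment step is repeated until the clusters stabilize, which is why points that no longer change cluster account for most of the redundant work.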
Pages: 619-636
Page count: 17