Effect of Corpus Size Selection on Performance of Map-Reduce Based Distributed K-Means for Big Textual Data Clustering

Cited by: 5
Authors
Ketu, Shwet [1]
Prasad, Bakshi Rohit [1]
Agarwal, Sonali [1]
Affiliations
[1] Indian Inst Informat Technol, Allahabad, Uttar Pradesh, India
Keywords
Big Textual Data; MapReduce; Clustering; K-Means; Distributed K-Means;
DOI
10.1145/2818567.2818653
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In the current era, we are experiencing tremendous growth in database sizes, types, users, working environments, and data access speeds. This situation has given rise to the term Big Data, which denotes large and complex datasets used for extracting meaningful knowledge. One of the main challenges in processing Big Data is its huge volume, a characteristic it shares with large collections of textual data. Handling such voluminous textual data with conventional data mining techniques such as clustering becomes impractical because the algorithms cannot cope with the long computation time. This work focuses on big text data clustering using a MapReduce-based Distributed K-Means algorithm combined with a corpus selection technique to significantly reduce overall computation time. Four benchmark datasets are used to explore the relationship between corpus size and computation time. The corpus selection technique is found to be significantly effective in reducing overall processing time.
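As a rough illustration of the kind of algorithm the abstract names, the sketch below expresses one K-Means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). It is a single-process approximation under stated assumptions, not the authors' implementation: the function names `map_assign`, `reduce_mean`, `kmeans_iteration`, and `kmeans` are hypothetical, documents are assumed to be already converted to numeric vectors, and the paper's corpus selection technique is not modeled.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point, centroids):
    """Map step (hypothetical name): emit (nearest centroid index, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step (hypothetical name): average the points emitted under one key."""
    total = None
    count = 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += n
    return [t / count for t in total]


def kmeans_iteration(points, centroids):
    """One map -> shuffle -> reduce pass, simulated in a single process."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_assign(point, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # A centroid that received no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if i in groups else list(c)
        for i, c in enumerate(centroids)
    ]


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In a real MapReduce deployment the map calls would run in parallel over data partitions and the framework would perform the grouping by key; the loop over `points` here only mimics that flow.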
Pages: 256-260
Number of pages: 5
Related papers
50 in total
  • [1] Design of MAP-REDUCE and K-MEANS based Network Clustering protocol for Sensor Networks
    Patole, Jyoti R.
    Abraham, Jibi
    [J]. 2012 THIRD INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION & NETWORKING TECHNOLOGIES (ICCCNT), 2012,
  • [2] k-Means Clustering of Lines for Big Data
    Marom, Yair
    Feldman, Dan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [3] The fast clustering algorithm for the big data based on K-means
    Xie, Ting
    Zhang, Taiping
    [J]. INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2020, 18 (06)
  • [4] A Novel K-Means based Clustering Algorithm for Big Data
    Sinha, Ankita
    Jana, Prasanta K.
    [J]. 2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 1875 - 1879
  • [5] Improved k-Means Clustering Algorithm for Big Data Based on Distributed Smartphone Neural Engine Processor
    Awad, Fouad H.
    Hamad, Murtadha M.
    [J]. ELECTRONICS, 2022, 11 (06)
  • [6] Big Data Clustering with Kernel k-Means: Resources, Time and Performance
    Tsapanos, Nikolaos
    Tefas, Anastasios
    Nikolaidis, Nikolaos
    Pitas, Ioannis
    [J]. INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2018, 27 (04)
  • [7] NOVEL CBIR System Using Spark MAP-Reduce with a Firefly Macqueen's K-Means Clustering Algorithm
    Sunitha, T.
    Sivarani, T. S.
    [J]. IETE JOURNAL OF RESEARCH, 2023, 69 (10) : 6955 - 6969
  • [8] Performance Enhancement of Distributed K-Means Clustering for Big Data Analytics Through In-memory Computation
    Ketu, Shwet
    Agarwal, Sonali
    [J]. 2015 EIGHTH INTERNATIONAL CONFERENCE ON CONTEMPORARY COMPUTING (IC3), 2015, : 318 - 324
  • [9] Design of Intelligent K-Means Based on Spark for Big Data Clustering
    Kusuma, Ilham
    Ma'sum, M. Anwar
    Habibie, Novian
    Jatmiko, Wisnu
    Suhartanto, Heru
    [J]. 2016 INTERNATIONAL WORKSHOP ON BIG DATA AND INFORMATION SECURITY (IWBIS), 2016, : 89 - 95
  • [10] How to Use K-means for Big Data Clustering?
    Mussabayev, Rustam
    Mladenovic, Nenad
    Jarboui, Bassem
    Mussabayev, Ravil
    [J]. PATTERN RECOGNITION, 2023, 137