Effect of Corpus Size Selection on Performance of Map-Reduce Based Distributed K-Means for Big Textual Data Clustering

Cited by: 5

Authors:

Ketu, Shwet [1]

Prasad, Bakshi Rohit [1]

Agarwal, Sonali [1]

Affiliation:

[1] Indian Inst Informat Technol, Allahabad, Uttar Pradesh, India

Source:

6TH INTERNATIONAL CONFERENCE ON COMPUTER & COMMUNICATION TECHNOLOGY (ICCCT-2015) | 2015

Keywords:

Big Textual data; MapReduce; Clustering; K-Means; Distributed K-Means

DOI:

10.1145/2818567.2818653

Chinese Library Classification number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

In the current era, we are experiencing tremendous growth in database sizes, types, users, working environments and data access speeds. This situation gave rise to the term Big Data, which refers to large and complex datasets used for extracting meaningful knowledge. One of the main challenges in processing Big Data is its huge volume, which is also a common characteristic of large collections of textual data. Handling such voluminous textual data with conventional data mining techniques such as clustering becomes impractical because these algorithms cannot cope with the large computation time. This research work focuses on big text data clustering using a MapReduce-based Distributed K-Means algorithm combined with a corpus selection technique to significantly reduce overall computation time. Four benchmark datasets have been used to explore the relationship between corpus size and computation time. The corpus selection technique is found to be significantly effective in reducing overall processing time.

Pages: 256-260

Page count: 5

50 items in total

[1] Design of MAP-REDUCE and K-MEANS based Network Clustering protocol for Sensor Networks
Patole, Jyoti R.
Abraham, Jibi
[J]. 2012 THIRD INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION & NETWORKING TECHNOLOGIES (ICCCNT), 2012,
[2] k-Means Clustering of Lines for Big Data
Marom, Yair
Feldman, Dan
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
[3] The fast clustering algorithm for the big data based on K-means
Xie, Ting
Zhang, Taiping
[J]. INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2020, 18 (06)
[4] A Novel K-Means based Clustering Algorithm for Big Data
Sinha, Ankita
Jana, Prasanta K.
[J]. 2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 1875 - 1879
[5] Improved k-Means Clustering Algorithm for Big Data Based on Distributed Smartphone Neural Engine Processor
Awad, Fouad H.
Hamad, Murtadha M.
[J]. ELECTRONICS, 2022, 11 (06)
[6] Big Data Clustering with Kernel k-Means: Resources, Time and Performance
Tsapanos, Nikolaos
Tefas, Anastasios
Nikolaidis, Nikolaos
Pitas, Ioannis
[J]. INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2018, 27 (04)
[7] NOVEL CBIR System Using Spark MAP-Reduce with a Firefly Macqueen's K-Means Clustering Algorithm
Sunitha, T.
Sivarani, T. S.
[J]. IETE JOURNAL OF RESEARCH, 2023, 69 (10) : 6955 - 6969
[8] Performance Enhancement of Distributed K-Means Clustering for Big Data Analytics Through In-memory Computation
Ketu, Shwet
Agarwal, Sonali
[J]. 2015 EIGHTH INTERNATIONAL CONFERENCE ON CONTEMPORARY COMPUTING (IC3), 2015, : 318 - 324
[9] Design of Intelligent K-Means Based on Spark for Big Data Clustering
Kusuma, Ilham
Ma'sum, M. Anwar
Habibie, Novian
Jatmiko, Wisnu
Suhartanto, Heru
[J]. 2016 INTERNATIONAL WORKSHOP ON BIG DATA AND INFORMATION SECURITY (IWBIS), 2016, : 89 - 95
[10] How to Use K-means for Big Data Clustering?
Mussabayev, Rustam
Mladenovic, Nenad
Jarboui, Bassem
Mussabayev, Ravil
[J]. PATTERN RECOGNITION, 2023, 137