A Cheap Feature Selection Approach for the K-Means Algorithm

Cited by: 19

Authors:

Capo, Marco [1]

Perez, Aritz [1]

Lozano, Jose A. [1,2]

Affiliations:

[1] Basque Ctr Appl Math, Bilbao 48009, Spain

[2] Univ Basque Country, UPV EHU, Intelligent Syst Grp, Dept Comp Sci & Artificial Intelligence, San Sebastian 20018, Spain

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2021 / Vol. 32 / No. 5

Keywords:

Dimensionality reduction; K-means clustering; feature selection; parallelization; unsupervised learning; MEANS CLUSTERING-ALGORITHM;

DOI:

10.1109/TNNLS.2020.3002576

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The increase in the number of features that need to be analyzed in a wide variety of areas, such as genome sequencing, computer vision, or sensor networks, represents a challenge for the K-means algorithm. In this regard, different dimensionality reduction approaches for the K-means algorithm have been designed recently, leading to algorithms that have proved to generate competitive clusterings. Unfortunately, most of these techniques tend to have fairly high computational costs and/or might not be easy to parallelize. In this article, we propose a fully parallelizable feature selection technique intended for the K-means algorithm. The proposal is based on a novel feature relevance measure that is closely related to the K-means error of a given clustering. Given a disjoint partition of the features, the technique consists of obtaining a clustering for each subset of features and selecting the m features with the highest relevance measure. The computational cost of this approach is just O(m · max{n · K, log m}) per subset of features. We additionally provide a theoretical analysis of the quality of the solution obtained via our proposal and empirically analyze its performance with respect to well-known feature selection and feature extraction techniques. This analysis shows that our proposal consistently obtains results with lower K-means error than all the considered feature selection techniques (Laplacian scores, maximum variance, multicluster feature selection, and random selection) while also requiring similar or lower computational times than these approaches. Moreover, when compared with feature extraction techniques, such as random projections, the proposed approach also shows a noticeable improvement in both error and computational time.
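The procedure outlined in the abstract (cluster each feature subset separately, score the features, keep the top m) can be illustrated with a minimal sketch. The abstract does not define the relevance measure, so the score below is an assumed stand-in: the per-feature drop in squared error when one global mean is replaced by K cluster means. The function names `kmeans`, `relevance`, and `select_features` are hypothetical, the top-m selection over all features is one reading of the abstract, and none of this should be taken as the authors' implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def relevance(X, labels, k):
    """ASSUMED stand-in score (not the paper's measure): per-feature drop in
    squared error when moving from one global mean to k cluster means."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for j in range(k):
        members = X[labels == j]
        if len(members):
            within += ((members - members.mean(axis=0)) ** 2).sum(axis=0)
    return total - within

def select_features(X, k, blocks, m):
    """Cluster each disjoint block of feature indices on its own, score its
    features, then return the sorted indices of the m best-scoring features."""
    scores = np.empty(X.shape[1])
    for block in blocks:  # blocks are independent, so this loop could run in parallel
        labels = kmeans(X[:, block], k)
        scores[block] = relevance(X[:, block], labels, k)
    return np.sort(np.argsort(scores)[-m:])
```

Because each block is clustered without reference to the others, the loop in `select_features` is the natural place to distribute work across processes, which is the property the abstract describes as fully parallelizable.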


Pages: 2195-2208

Number of pages: 14

