A Cheap Feature Selection Approach for the K-Means Algorithm

Cited by: 19
Authors
Capo, Marco [1 ]
Perez, Aritz [1 ]
Lozano, Jose A. [1 ,2 ]
Affiliations
[1] Basque Ctr Appl Math, Bilbao 48009, Spain
[2] Univ Basque Country, UPV EHU, Intelligent Syst Grp, Dept Comp Sci & Artificial Intelligence, San Sebastian 20018, Spain
Keywords
Dimensionality reduction; K-means clustering; feature selection; parallelization; unsupervised learning; MEANS CLUSTERING-ALGORITHM;
DOI
10.1109/TNNLS.2020.3002576
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The growing number of features that must be analyzed in areas such as genome sequencing, computer vision, and sensor networks poses a challenge for the K-means algorithm. Several dimensionality reduction approaches for K-means have been designed recently, yielding algorithms that produce competitive clusterings. Unfortunately, most of these techniques have fairly high computational costs and/or are not easy to parallelize. In this article, we propose a fully parallelizable feature selection technique intended for the K-means algorithm. The proposal is based on a novel feature relevance measure that is closely related to the K-means error of a given clustering. Given a disjoint partition of the features, the technique obtains a clustering for each subset of features and selects the m features with the highest relevance measure. The computational cost of this approach is just O(m · max{n · K, log m}) per subset of features. We additionally provide a theoretical analysis of the quality of the solution obtained via our proposal and empirically compare its performance with well-known feature selection and feature extraction techniques. This analysis shows that our proposal consistently obtains clusterings with lower K-means error than all the considered feature selection techniques (Laplacian scores, maximum variance, multicluster feature selection, and random selection) while requiring similar or lower computational time than these approaches. Moreover, compared with feature extraction techniques such as random projections, the proposed approach also shows a noticeable improvement in both error and computational time.
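The abstract's scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the paper's exact relevance measure is not given in the abstract, so as a stand-in that is closely tied to the K-means error, each feature is scored by its between-cluster variance under the clustering of its feature subset. All function names (`select_features`, `_kmeans`) are hypothetical.

```python
import numpy as np

def _kmeans(X, K, n_init=5, iters=50, seed=0):
    """Plain Lloyd's algorithm with a few random restarts; returns labels."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), K, replace=False)].astype(float)
        for _ in range(iters):
            labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            for c in range(K):
                pts = X[labels == c]
                if len(pts):                      # skip empty clusters
                    centers[c] = pts.mean(0)
        err = ((X - centers[labels]) ** 2).sum()  # K-means error of this run
        if err < best_err:
            best, best_err = labels, err
    return best

def select_features(X, K, m, n_subsets=2, seed=0):
    """Cluster each disjoint feature subset, score its features, keep the top m."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    subsets = np.array_split(rng.permutation(d), n_subsets)  # disjoint partition
    relevance = np.empty(d)
    for S in subsets:  # each subset is independent, hence fully parallelizable
        labels = _kmeans(X[:, S], K, seed=seed)
        for j in S:
            col = X[:, j]
            total = ((col - col.mean()) ** 2).sum()
            within = sum(((col[labels == c] - col[labels == c].mean()) ** 2).sum()
                         for c in np.unique(labels))
            relevance[j] = total - within  # between-cluster variance of feature j
    return np.sort(np.argsort(relevance)[-m:])   # indices of the m top features
```

On synthetic data where only the first two of six features separate two well-spaced clusters, `select_features(X, K=2, m=2)` should recover those two indices; since each feature subset is scored independently, the subset loop is the natural unit of parallelization described in the abstract.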
Pages: 2195-2208
Number of pages: 14
Related Papers
50 in total
  • [1] Feature Selection Algorithm Based on K-means Clustering
    Tang, Xue
    Dong, Min
    Bi, Sheng
    Pei, Maofeng
    Cao, Dan
    Xie, Cheche
    Chi, Sunhuang
    [J]. 2017 IEEE 7TH ANNUAL INTERNATIONAL CONFERENCE ON CYBER TECHNOLOGY IN AUTOMATION, CONTROL, AND INTELLIGENT SYSTEMS (CYBER), 2017, : 1522 - 1527
  • [2] Genetic-based K-means algorithm for selection of feature variables
    Yu, Zhiwen
    Wong, Hau-San
    [J]. 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 2, PROCEEDINGS, 2006, : 744 - +
  • [3] Feature selection for k-means clustering stability: theoretical analysis and an algorithm
    Mavroeidis, Dimitrios
    Marchiori, Elena
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2014, 28 (04) : 918 - 960
  • [4] Deterministic Feature Selection for k-Means Clustering
    Boutsidis, Christos
    Magdon-Ismail, Malik
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2013, 59 (09) : 6099 - 6110
  • [5] Feature Selection Embedded Robust K-Means
    Zhang, Qian
    Peng, Chong
    [J]. IEEE ACCESS, 2020, 8 : 166164 - 166175
  • [6] Kernel Penalized K-means: A feature selection method based on Kernel K-means
    Maldonado, Sebastian
    Carrizosa, Emilio
    Weber, Richard
    [J]. INFORMATION SCIENCES, 2015, 322 : 150 - 160
  • [7] Research and Application of Improved K-means Algorithm Based on Fuzzy Feature Selection
    Li, Xiuyun
    Yang, Jie
    Wang, Qing
    Fan, Jinjin
    Liu, Peng
    [J]. FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 1, PROCEEDINGS, 2008, : 401 - 405
  • [8] Gravitational search algorithm and K-means for simultaneous feature selection and data clustering: a multi-objective approach
    Prakash, Jay
    Singh, Pramod Kumar
    [J]. SOFT COMPUTING, 2019, 23 (06) : 2083 - 2100