A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data

Cited: 391
Authors
Song, Qinbao [1 ]
Ni, Jingjie [1 ]
Wang, Guangtao [1 ]
Affiliation
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature subset selection; filter method; feature clustering; graph-based clustering; STATISTICAL COMPARISONS; INFORMATION; CLASSIFIERS; RELEVANCE;
DOI
10.1109/TKDE.2011.181
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
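The two-step procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes symmetric uncertainty (SU) as the feature-feature and feature-class correlation measure, discrete feature values, and hypothetical names such as `fast_select`. Clusters are formed by building an MST over the complete feature graph (Kruskal's algorithm) and cutting edges weaker than both endpoints' class relevance; each remaining tree contributes its most class-relevant feature.

```python
# Hedged sketch of the two-step FAST idea: MST-based feature clustering,
# then one representative feature per cluster. All names are illustrative.
from collections import Counter
from itertools import combinations
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetric uncertainty: 2 * IG(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0  # two constant features: treat as fully redundant
    joint = entropy(list(zip(xs, ys)))
    return 2.0 * (hx + hy - joint) / (hx + hy)

def fast_select(features, target):
    """features: dict name -> list of discrete values; target: class labels."""
    # Relevance of each feature to the target class (T-relevance).
    rel = {f: su(v, target) for f, v in features.items()}
    names = [f for f in features if rel[f] > 0]  # drop zero-relevance features

    # Step 1: minimum spanning tree (Kruskal) of the complete feature graph,
    # with pairwise SU between features as the edge weight.
    edges = sorted((su(features[a], features[b]), a, b)
                   for a, b in combinations(names, 2))
    parent = {f: f for f in names}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    mst = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            mst.append((w, a, b))

    # Step 2: cut MST edges weaker than both endpoints' class relevance;
    # each remaining tree is a cluster, represented by its best feature.
    parent = {f: f for f in names}
    for w, a, b in mst:
        if not (w < rel[a] and w < rel[b]):  # keep (don't cut) this edge
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
    clusters = {}
    for f in names:
        clusters.setdefault(find(f), []).append(f)
    return sorted(max(c, key=lambda f: rel[f]) for c in clusters.values())
```

On a toy data set with a duplicated relevant feature and an irrelevant one, the redundant copy and the noise feature are both discarded, leaving a single representative, which matches the behavior the abstract describes (smaller subsets of independent, useful features).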
Pages: 1-14 (14 pages)
Related Papers
50 in total
  • [21] A differential evolution based feature combination selection algorithm for high-dimensional data
    Guan, Boxin
    Zhao, Yuhai
    Yin, Ying
    Li, Yuan
    [J]. INFORMATION SCIENCES, 2021, 547 : 870 - 886
  • [22] A Fast Hybrid Feature Selection Based on Correlation-Guided Clustering and Particle Swarm Optimization for High-Dimensional Data
    Song, Xian-Fang
    Zhang, Yong
    Gong, Dun-Wei
    Gao, Xiao-Zhi
    [J]. IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (09) : 9573 - 9586
  • [23] A hybrid algorithm for feature subset selection in high-dimensional datasets using FICA and IWSSr algorithm
    Moradkhani, Mostafa
    Amiri, Ali
    Javaherian, Mohsen
    Safari, Hossein
    [J]. APPLIED SOFT COMPUTING, 2015, 35 : 123 - 135
  • [24] An incremental updating method for clustering-based high-dimensional data indexing
    Wang, B
    Gan, JQ
    [J]. COMPUTATIONAL INTELLIGENCE AND SECURITY, PT 1, PROCEEDINGS, 2005, 3801 : 495 - 502
  • [25] Clustering algorithm of high-dimensional data based on units
School of Information Engineering, Hubei Institute for Nationalities, Enshi 445000, China
[J]. Jisuanji Yanjiu yu Fazhan, 2007, (9) : 1618 - 1623
  • [26] Feature selection for high-dimensional data
    Destrero A.
    Mosci S.
    De Mol C.
    Verri A.
    Odone F.
    [J]. Computational Management Science, 2009, 6 (1) : 25 - 40
  • [27] Feature selection for high-dimensional data
    Bolón-Canedo V.
    Sánchez-Maroño N.
    Alonso-Betanzos A.
    [J]. Progress in Artificial Intelligence, 2016, 5 (2) : 65 - 75
  • [28] Feature selection algorithm based on optimized genetic algorithm and the application in high-dimensional data processing
    Feng, Guilian
    [J]. PLOS ONE, 2024, 19 (05):
  • [29] Accurate and fast feature selection workflow for high-dimensional omics data
    Perez-Riverol, Yasset
    Kuhn, Max
    Vizcaino, Juan Antonio
    Hitz, Marc-Phillip
    Audain, Enrique
    [J]. PLOS ONE, 2017, 12 (12):
  • [30] Clustering-based feature selection
    School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006, China
[J]. Tien Tzu Hsueh Pao, 2008, SUPPL. : 157 - 160