Similarity of feature selection methods: An empirical study across data intensive classification tasks

Cited by: 59

Authors:

Dessi, Nicoletta ^{[1]}

Pes, Barbara ^{[1]}

Affiliations:

[1] Univ Cagliari, Dipartimento Matemat & Informat, I-09124 Cagliari, Italy

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2015 / Vol. 42 / Issue 10

Keywords:

Data mining; Knowledge discovery; Feature selection; Similarity measures; GENE SELECTION; FEATURE-EXTRACTION; PREDICTION; CANCER; ALGORITHMS; REDUCTION; SYSTEM; TUMOR;

DOI:

10.1016/j.eswa.2015.01.069

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In the past two decades, the dimensionality of datasets involved in machine learning and data mining applications has increased explosively. Feature selection has therefore become a necessary step to make the analysis more manageable and to extract useful knowledge about a given domain. A large variety of feature selection techniques are available in the literature, and comparing them is a very difficult task. So far, few studies have investigated, from a theoretical and/or experimental point of view, the degree of similarity/dissimilarity among the available techniques, namely the extent to which they tend to produce similar results within specific application contexts. This kind of similarity analysis is of crucial importance when two or more methods are combined in an ensemble fashion: the ensemble paradigm is beneficial only if the methods involved are capable of giving different and complementary representations of the considered domain. This paper contributes in this direction by proposing an empirical approach to evaluate the degree of consistency among the outputs of different selection algorithms in the context of high dimensional classification tasks. Using a suitable similarity index, we systematically compared the feature subsets selected by eight popular selection methods, representative of different selection approaches, and derived a similarity trend for feature subsets of increasing size. Through extensive experimentation involving sixteen datasets from three challenging domains (Internet advertisements, text categorization and micro-array data classification), we obtained useful insight into the pattern of agreement of the considered methods. In particular, our results revealed that multivariate selection approaches systematically produce feature subsets that overlap only to a small extent with those selected by the other methods. (C) 2015 Elsevier Ltd. All rights reserved.
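
The abstract describes comparing the feature subsets chosen by different selectors and tracking their similarity as the subset size grows, but it does not name the similarity index used. The sketch below is only an illustration under that gap: it assumes the Jaccard index as a stand-in, and the method names and rankings are hypothetical, not data from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def similarity_trend(rankings, sizes):
    """For each subset size k, average the pairwise Jaccard index
    between the top-k features of every pair of rankings."""
    pairs = list(combinations(rankings.values(), 2))
    trend = {}
    for k in sizes:
        scores = [jaccard(r1[:k], r2[:k]) for r1, r2 in pairs]
        trend[k] = sum(scores) / len(scores)
    return trend


# Hypothetical rankings (feature indices, best first) from three made-up selectors.
rankings = {
    "method_a": [0, 1, 2, 3, 4, 5],
    "method_b": [1, 0, 3, 2, 5, 4],
    "method_c": [5, 4, 0, 1, 2, 3],
}

trend = similarity_trend(rankings, sizes=[2, 4, 6])
for k, value in trend.items():
    print(f"top-{k}: mean pairwise similarity = {value:.3f}")
```

In this toy input, the third ranking overlaps little with the other two at small subset sizes, which is the kind of pattern a similarity trend is meant to expose. Any other set-overlap index could be swapped in for `jaccard` without changing the rest of the sketch.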

Pages: 4632 - 4642

Number of pages: 11

50 items in total

[1] Empirical evaluation of feature selection methods in classification
Cehovin, Luka
Bosnic, Zoran
INTELLIGENT DATA ANALYSIS, 2010, 14 (03) : 265 - 281
[2] Data-driven Feature Selection Methods for Text Classification: an Empirical Evaluation
Fragoso, Rogerio C. P.
Pinheiro, Roberto H. W.
Cavalcanti, George D. C.
JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2019, 25 (04) : 334 - 360
[3] Empirical study of feature selection methods based on individual feature evaluation for classification problems
Arauzo-Azofra, Antonio
Aznarte, Jose Luis
Benitez, Jose M.
EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (07) : 8170 - 8177
[4] Feature selection for classification tasks: Expert knowledge or traditional methods?
Camilo Corrales, David
Lasso, Emmanuel
Ledezma, Agapito
Carlos Corrales, Juan
JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2018, 34 (05) : 2825 - 2835
[5] An empirical study on the joint impact of feature selection and data resampling on imbalance classification
Zhang, Chongsheng
Soda, Paolo
Bi, Jingjun
Fan, Gaojuan
Almpanidis, George
Garcia, Salvador
Ding, Weiping
APPLIED INTELLIGENCE, 2023, 53 (05) : 5449 - 5461
[6] An empirical study on the joint impact of feature selection and data resampling on imbalance classification
Chongsheng Zhang
Paolo Soda
Jingjun Bi
Gaojuan Fan
George Almpanidis
Salvador García
Weiping Ding
Applied Intelligence, 2023, 53 : 5449 - 5461
[7] An empirical evaluation for feature selection methods in phishing email classification
2013, CRL Publishing (28):
[8] Feature selection methods for multiphase reactors data classification
Tarca, LA
Grandjean, BPA
Larachi, F
INDUSTRIAL & ENGINEERING CHEMISTRY RESEARCH, 2005, 44 (04) : 1073 - 1084
[9] Impact of feature selection methods on data classification for IDS
Jiang, Shuai
Xu, Xiaolong
2019 INTERNATIONAL CONFERENCE ON CYBER-ENABLED DISTRIBUTED COMPUTING AND KNOWLEDGE DISCOVERY (CYBERC), 2019, : 174 - 180
[10] Correction to: An empirical study on the joint impact of feature selection and data resampling on imbalance classification
Chongsheng Zhang
Paolo Soda
Jingjun Bi
Gaojuan Fan
George Almpanidis
Salvador García
Weiping Ding
Applied Intelligence, 2023, 53 : 8506 - 8506
