Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size

Cited by: 0
Authors
Maciejewski, Henryk [1]
Affiliation
[1] Wroclaw Univ Technol, Inst Comp Engn Control & Robot, Ul Janiszewskiego 11-17, PL-50370 Wroclaw, Poland
Keywords
CANCER
DOI
10.1007/978-3-319-13881-7_44
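Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract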
In this work we demonstrate how small sample size affects the risk that feature selection algorithms select irrelevant features from high-dimensional data. We develop a simple analytical model to quantify this risk and verify the model by means of simulation. These results (i) explain the inherent instability of feature selection from high-dimensional, small-sample-size data and (ii) can be used to estimate the minimum sample size required for good stability of the selected features. Such results are useful when dealing with data from high-throughput studies.
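The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the author's analytical model or simulation design; it is a minimal illustration under assumed settings: two classes of standard normal data, a hypothetical number of relevant features shifted by a fixed effect size, and univariate ranking by the absolute two-sample t-statistic. The function name `irrelevant_fraction` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def irrelevant_fraction(n_per_class, n_features=1000, n_relevant=20,
                        effect=1.0, n_trials=50, seed=0):
    """Average share of irrelevant features among the top-ranked ones.

    Assumed setup (illustrative only): the first `n_relevant` features
    differ between the two classes by `effect` standard deviations;
    all remaining features are pure noise.
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        # Two classes, n_per_class samples each, independent N(0, 1) features.
        a = rng.standard_normal((n_per_class, n_features))
        b = rng.standard_normal((n_per_class, n_features))
        b[:, :n_relevant] += effect  # only these features carry signal

        # Equal-variance two-sample t-statistic for every feature.
        pooled_var = (a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2.0
        t = (b.mean(axis=0) - a.mean(axis=0)) / np.sqrt(
            pooled_var * 2.0 / n_per_class)

        # Select as many features as are truly relevant, ranked by |t|.
        top = np.argsort(-np.abs(t))[:n_relevant]
        # Indices >= n_relevant belong to noise features.
        fractions.append(np.mean(top >= n_relevant))
    return float(np.mean(fractions))


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, irrelevant_fraction(n))
```

Under these assumptions, the share of noise features among the selected ones should shrink as the number of samples per class grows, which is the qualitative behaviour the abstract refers to. Sweeping `n_per_class` until that share falls below a chosen tolerance gives a rough, simulation-based stand-in for a minimum sample size estimate.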
Pages: 399-405
Number of pages: 7
Related papers
50 in total
  • [41] Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
    Hoyle, David C.
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2008, 9 : 2733 - 2759
  • [42] Surrogate Sample-Assisted Particle Swarm Optimization for Feature Selection on High-Dimensional Data
    Song, Xianfang
    Zhang, Yong
    Gong, Dunwei
    Liu, Hui
    Zhang, Wanqiu
    [J]. IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, 2023, 27 (03) : 595 - 609
  • [43] Separability tests for high-dimensional, low-sample size multivariate repeated measures data
    Simpson, Sean L.
    Edwards, Lloyd J.
    Styner, Martin A.
    Muller, Keith E.
    [J]. JOURNAL OF APPLIED STATISTICS, 2014, 41 (11) : 2450 - 2461
  • [44] Sample Size Considerations of Prediction-Validation Methods in High-Dimensional Data for Survival Outcomes
    Pang, Herbert
    Jung, Sin-Ho
    [J]. GENETIC EPIDEMIOLOGY, 2013, 37 (03) : 276 - 282
  • [45] DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets
    Alibeigi, Mina
    Hashemi, Sattar
    Hamzeh, Ali
    [J]. DATA & KNOWLEDGE ENGINEERING, 2012, 81-82 : 67 - 103
  • [46] Simultaneous Feature and Model Selection for High-Dimensional Data
    Perolini, Alessandro
    Guerif, Sebastien
    [J]. 2011 23RD IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2011), 2011 : 47 - 50
  • [47] On the scalability of feature selection methods on high-dimensional data
    V. Bolón-Canedo
    D. Rego-Fernández
    D. Peteiro-Barral
    A. Alonso-Betanzos
    B. Guijarro-Berdiñas
    N. Sánchez-Maroño
    [J]. Knowledge and Information Systems, 2018, 56 : 395 - 442
  • [48] Bayesian variable selection in clustering high-dimensional data
    Tadesse, MG
    Sha, N
    Vannucci, M
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2005, 100 (470) : 602 - 617
  • [49] On landmark selection and sampling in high-dimensional data analysis
    Belabbas, Mohamed-Ali
    Wolfe, Patrick J.
    [J]. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2009, 367 (1906) : 4295 - 4312
  • [50] Clonal Selection Classification Algorithm for High-Dimensional Data
    Liu, Ruochen
    Zhang, Ping
    Jiao, Licheng
    [J]. LIFE SYSTEM MODELING AND INTELLIGENT COMPUTING, PT II, 2010, 98 : 89 - 95