Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size

Cited by: 0
Authors
Maciejewski, Henryk [1]
Affiliations
[1] Wroclaw Univ Technol, Inst Comp Engn Control & Robot, Ul Janiszewskiego 11-17, PL-50370 Wroclaw, Poland
Keywords
CANCER;
DOI
10.1007/978-3-319-13881-7_44
Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification code
020208; 070103; 0714;
Abstract
In this work we demonstrate how small sample size affects the risk that feature selection algorithms select irrelevant features when dealing with high-dimensional data. We develop a simple analytical model to quantify this risk and verify the model by means of simulation. These results (i) explain the inherent instability of feature selection from high-dimensional, small-sample-size data and (ii) can be used to estimate the minimum sample size required for good stability of the selected features. Such results are useful when dealing with data from high-throughput studies.
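The abstract does not reproduce the analytical model, so the following is only an illustrative Monte Carlo sketch of the effect it describes, not the author's method. It assumes two classes of Gaussian data, a handful of relevant features whose class means differ by `effect`, and a simple filter that keeps the `n_selected` features with the largest absolute t-like statistic. The function name and every parameter are hypothetical choices made for this example.

```python
import random


def _mean_var(xs):
    """Return the sample mean and unbiased sample variance of a list."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def irrelevant_fraction(n_per_class, n_features=200, n_relevant=10,
                        effect=1.0, n_selected=10, n_trials=20, seed=0):
    """Estimate the mean fraction of selected features that are irrelevant.

    Assumed setup (not taken from the paper): features 0..n_relevant-1
    differ between the two classes by `effect` standard deviations; all
    other features are pure noise. n_per_class must be at least 2.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trials):
        scores = []
        for j in range(n_features):
            shift = effect if j < n_relevant else 0.0
            a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
            b = [rng.gauss(shift, 1.0) for _ in range(n_per_class)]
            mean_a, var_a = _mean_var(a)
            mean_b, var_b = _mean_var(b)
            # Two-sample t-like statistic with equal group sizes.
            std_err = ((var_a + var_b) / n_per_class) ** 0.5
            scores.append((abs(mean_a - mean_b) / std_err, j))
        # Keep the top-ranked features and count the pure-noise ones.
        top = sorted(scores, reverse=True)[:n_selected]
        n_irrelevant = sum(1 for _, j in top if j >= n_relevant)
        fractions.append(n_irrelevant / n_selected)
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, round(irrelevant_fraction(n), 3))
```

Under these assumptions the estimated fraction of irrelevant features among those selected should fall as `n_per_class` grows, which is the qualitative behaviour the abstract attributes to small sample size.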
Pages: 399-405
Number of pages: 7