Sample size requirements for learning to classify with high-dimensional biomarker panels

Cited by: 6
Authors
McKeigue, Paul [1]
Affiliations
[1] Univ Edinburgh, Usher Inst Populat Hlth Sci & Informat, Old Med Sch,Teviot Pl, Edinburgh EH8 9AG, Midlothian, Scotland
Keywords
Sample size; linear classifier; Bayesian; high-dimensional; EVENTS; NUMBER;
DOI
10.1177/0962280217738807
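Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Subject classification code
Abstract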
A common problem in biomedical research is to calculate the sample size required to learn a classifier using a (possibly high-dimensional) panel of biomarkers. This paper describes a simple method based on a Gaussian approximation for calculating the predictive performance of the learned classifier given the size of the biomarker panel, the size of the training sample, and the optimal predictive performance (expressed as C-statistic C_opt) of the biomarker panel that could be obtained if a training sample of unlimited size were available. Under the assumption that the biomarker effect sizes have the same correlation structure as the biomarkers, the required sample size does not depend upon these correlations, but only upon C_opt and upon the sparsity of the distribution of effect sizes, defined by the proportion of biomarkers that have nonzero effects. To learn a classifier that extracts 80% of the predictive information, the required case sample size varies from about 0.1 cases per variable for a panel with C_opt = 0.9 and a sparse distribution of effect sizes (such that 1% of biomarkers have nonzero effect sizes) to nine cases per variable for a panel with C_opt = 0.75 and a diffuse distribution of effect sizes.
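To make the cases-per-variable figures in the abstract concrete, the following minimal Python sketch scales them to a panel size. The panel size of 1000 biomarkers is a hypothetical value chosen for illustration and does not come from the paper. The sketch also converts a C-statistic to a standardized mean difference using the textbook equal-variance binormal relation C = Phi(d / sqrt(2)); this is offered only as a way to read the two C-statistic values on a common scale and is not claimed to be the paper's own calculation. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def c_to_std_difference(c):
    """Standardized mean difference d implied by a C-statistic c.

    Assumes an equal-variance binormal score model, where
    c = Phi(d / sqrt(2)). This is a standard relation, used here
    as an assumption rather than as the paper's method.
    """
    return sqrt(2) * NormalDist().inv_cdf(c)


def std_difference_to_c(d):
    """Inverse of c_to_std_difference under the same assumption."""
    return NormalDist().cdf(d / sqrt(2))


def required_cases(n_biomarkers, cases_per_variable):
    """Scale a cases-per-variable figure to a whole number of cases."""
    # Round before taking the ceiling to absorb floating-point noise.
    return ceil(round(n_biomarkers * cases_per_variable, 9))


if __name__ == "__main__":
    panel_size = 1000  # hypothetical panel size, for illustration only
    # Cases-per-variable figures quoted in the abstract.
    for label, c_opt, cpv in [("sparse", 0.9, 0.1), ("diffuse", 0.75, 9)]:
        d = c_to_std_difference(c_opt)
        n = required_cases(panel_size, cpv)
        print(f"{label}: C_opt={c_opt}, d={d:.3f}, cases={n}")
```

Under these assumptions, the hypothetical 1000-biomarker panel would need on the order of 100 cases in the sparse, C_opt = 0.9 scenario and 9000 cases in the diffuse, C_opt = 0.75 scenario, which illustrates how strongly the requirement depends on sparsity and on the optimal C-statistic.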
Pages: 904-910
Number of pages: 7