Sample size requirements for training high-dimensional risk predictors

Cited by: 7
Authors
Dobbin, Kevin K. [1]
Song, Xiao [1]
Affiliation
[1] Univ Georgia, Coll Publ Hlth, Athens, GA 30602 USA
Keywords
Conditional score; Cox regression; High-dimensional data; Risk prediction; Sample size; Training set; PROPORTIONAL HAZARDS MODEL; SURVIVAL ANALYSIS; COX REGRESSION; ERROR RATE; SIGNATURE; VALIDATION; ESTIMATOR; RULE
DOI
10.1093/biostatistics/kxt022
Chinese Library Classification (CLC) number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
Pages: 639-652
Number of pages: 14
Related papers
50 in total
  • [1] Sample size requirements for learning to classify with high-dimensional biomarker panels
    McKeigue, Paul
    [J]. STATISTICAL METHODS IN MEDICAL RESEARCH, 2019, 28 (03) : 904 - 910
  • [2] Reproducibility and Sample Size in High-Dimensional Data
    Seo, Won Seok
    Choi, Jeea
    Jeong, Hyeong Chul
    Cho, HyungJun
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2010, 23 (06) : 1067 - 1080
  • [3] Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size
    Maciejewski, Henryk
    [J]. STOCHASTIC MODELS, STATISTICS AND THEIR APPLICATIONS, 2015, 122 : 399 - 405
  • [4] General power and sample size calculations for high-dimensional genomic data
    van Iterson, Maarten
    van de Wiel, Mark A.
    Boer, Judith M.
    de Menezes, Renee X.
    [J]. STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 2013, 12 (04) : 449 - 467
  • [5] Nearly Optimal Sample Size in Hypothesis Testing for High-Dimensional Regression
    Javanmard, Adel
    Montanari, Andrea
    [J]. 2013 51ST ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2013 : 1427 - 1434
  • [6] Scale adjustments for classifiers in high-dimensional, low sample size settings
    Chan, Yao-Ban
    Hall, Peter
    [J]. BIOMETRIKA, 2009, 96 (02) : 469 - 478
  • [7] Sample size planning for survival prediction with focus on high-dimensional data
    Goette, Heiko
    Zwiener, Isabella
    [J]. STATISTICS IN MEDICINE, 2013, 32 (05) : 787 - 807
  • [8] On robust regression with high-dimensional predictors
    El Karoui, Noureddine
    Bean, Derek
    Bickel, Peter J.
    Lim, Chinghway
    Yu, Bin
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2013, 110 (36) : 14557 - 14562
  • [9] Sensitivity analysis approaches to high-dimensional screening problems at low sample size
    Becker, W. E.
    Tarantola, S.
    Deman, G.
    [J]. JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2018, 88 (11) : 2089 - 2110
  • [10] Significance analysis of high-dimensional, low-sample size partially labeled data
    Lu, Qiyi
    Qiao, Xingye
    [J]. JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2016, 176 : 78 - 94