Sample size requirements for training high-dimensional risk predictors

Cited by: 7
Authors
Dobbin, Kevin K. [1]
Song, Xiao [1]
Affiliations
[1] Univ Georgia, Coll Publ Hlth, Athens, GA 30602 USA
Keywords
Conditional score; Cox regression; High-dimensional data; Risk prediction; Sample size; Training set; PROPORTIONAL HAZARDS MODEL; SURVIVAL ANALYSIS; COX REGRESSION; ERROR RATE; SIGNATURE; VALIDATION; ESTIMATOR; RULE;
DOI
10.1093/biostatistics/kxt022
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
Pages: 639-652
Number of pages: 14