Sample Size Considerations of Prediction-Validation Methods in High-Dimensional Data for Survival Outcomes

Cited by: 31
Authors
Pang, Herbert [1]
Jung, Sin-Ho [1]
Affiliations
[1] Duke Univ, Sch Med, Dept Biostat & Bioinformat, Durham, NC USA
Keywords
gene expression; GWAS; high-dimensional data; prediction validation; sample size; survival; false discovery rate; microarray data analysis; power; shrinkage; selection; forests
DOI
10.1002/gepi.21721
Chinese Library Classification number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
A variety of prediction methods are used to relate high-dimensional genomic data to a clinical outcome through a prediction model. Once a prediction model has been developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been evaluated intensively by many investigators, there has been no comprehensive study of the performance of the validation methods, especially for a survival clinical outcome. Understanding the properties of the various validation methods allows researchers to perform more powerful validations while controlling the type I error. In addition, a sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both the simulations and a real data example, we found that 10-fold cross-validation with permutation gave the best power while keeping the type I error close to the nominal level. Based on this finding, we also developed a sample size calculation method for designing a validation study with a user-chosen combination of prediction and validation methods. Microarray and genome-wide association study data are used as illustrations. The power calculation method presented here can be used to design any biomedical study involving high-dimensional data and survival outcomes.
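The idea of 10-fold cross-validation with permutation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`concordance_index`, `fit_predict`, `cv_cindex`, `permutation_pvalue`) are hypothetical, the predictor is a deliberately crude correlation-screening stand-in for a real prediction method, and using the cross-validated concordance index as the test statistic is an assumption, since the abstract does not name the statistic.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell-type C-index: among comparable pairs (the earlier time is an
    observed event), the share where the earlier failure has higher risk."""
    comparable = event[:, None].astype(bool) & (time[:, None] < time[None, :])
    n_comp = comparable.sum()
    if n_comp == 0:
        return 0.5
    higher = risk[:, None] > risk[None, :]
    tied = risk[:, None] == risk[None, :]
    return ((higher & comparable).sum() + 0.5 * (tied & comparable).sum()) / n_comp

def fit_predict(X_tr, t_tr, e_tr, X_te, k=10):
    """Stand-in predictor (assumption): screen features by correlation with
    -log(time) among uncensored training samples, keep the top k, and score
    test samples by a signed sum of those features."""
    obs = e_tr.astype(bool)
    Xc = X_tr[obs] - X_tr[obs].mean(axis=0)
    y = -np.log(t_tr[obs])
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    top = np.argsort(-np.abs(corr))[:k]
    return X_te[:, top] @ np.sign(corr[top])

def cv_cindex(X, time, event, n_folds=10, seed=0):
    """Pool held-out risk scores over the folds, then compute one C-index.
    Feature screening is redone inside every training fold."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(time)) % n_folds  # balanced random folds
    risk = np.empty(len(time))
    for f in range(n_folds):
        te = folds == f
        risk[te] = fit_predict(X[~te], time[~te], event[~te], X[te])
    return concordance_index(time, event, risk)

def permutation_pvalue(X, time, event, n_perm=200, seed=0):
    """Permute (time, event) pairs against X and rerun the whole
    cross-validation each time to build the null distribution."""
    rng = np.random.default_rng(seed)
    observed = cv_cindex(X, time, event, seed=seed)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(time))
        if cv_cindex(X, time[idx], event[idx], seed=seed) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(X, time, event)` with a samples-by-features array `X`, positive follow-up times, and a 0/1 event indicator returns the observed cross-validated C-index and its permutation p-value. Repeating the full procedure on each permuted data set, including feature selection, is what keeps the null distribution honest about selection bias.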
Pages: 276-282
Number of pages: 7
Related papers
50 in total
  • [1] Sample size planning for survival prediction with focus on high-dimensional data
    Goette, Heiko
    Zwiener, Isabella
    [J]. STATISTICS IN MEDICINE, 2013, 32 (05) : 787 - 807
  • [2] Reproducibility and Sample Size in High-Dimensional Data
    Seo, Won Seok
    Choi, Jeea
    Jeong, Hyeong Chul
    Cho, HyungJun
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2010, 23 (06) : 1067 - 1080
  • [3] Sparse kernel methods for high-dimensional survival data
    Evers, Ludger
    Messow, Claudia-Martina
    [J]. BIOINFORMATICS, 2008, 24 (14) : 1632 - 1638
  • [4] A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction
    Spooner, Annette
    Chen, Emily
    Sowmya, Arcot
    Sachdev, Perminder
    Kochan, Nicole A.
    Trollor, Julian
    Brodaty, Henry
    [J]. SCIENTIFIC REPORTS, 2020, 10 (01)
  • [6] General power and sample size calculations for high-dimensional genomic data
    van Iterson, Maarten
    van de Wiel, Mark A.
    Boer, Judith M.
    de Menezes, Renee X.
    [J]. STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 2013, 12 (04) : 449 - 467
  • [7] Penalized regression calibration: A method for the prediction of survival outcomes using complex longitudinal and high-dimensional data
    Signorelli, Mirko
    Spitali, Pietro
    Szigyarto, Cristina Al-Khalili
    Tsonaka, Roula
    [J]. STATISTICS IN MEDICINE, 2021, 40 (27) : 6178 - 6196
  • [8] Comparison of the Cluster Validation Methods for High-dimensional (Gene Expression) Data
    Jeong, Yunkyoung
    Baek, Jangsun
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2007, 20 (01) : 167 - 181
  • [9] Reliability of Cross-Validation for SVMs in High-Dimensional, Low Sample Size Scenarios
    Klement, Sascha
    Mamlouk, Amir Madany
    Martinetz, Thomas
    [J]. ARTIFICIAL NEURAL NETWORKS - ICANN 2008, PT I, 2008, 5163 : 41 - 50
  • [10] An evaluation of resampling methods for assessment of survival risk prediction in high-dimensional settings
    Subramanian, Jyothi
    Simon, Richard
    [J]. STATISTICS IN MEDICINE, 2011, 30 (06) : 642 - 653