Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data

Cited by: 2
Authors
O'Shea, Robert J. [1]
Tsoka, Sophia [2]
Cook, Gary J. R. [1, 3, 4]
Goh, Vicky [1, 5]
Affiliations
[1] Kings Coll London, Sch Biomed Engn & Imaging Sci, Dept Canc Imaging, 5th Floor,Becket House,1 Lambeth Palace Rd, London SE1 7EU, England
[2] Kings Coll London, Sch Nat & Math Sci, Dept Informat, London, England
[3] Kings Coll London, London, England
[4] St Thomas Hosp, Guys & St Thomas PET Ctr, London, England
[5] Guys & St Thomas NHS Fdn Trust, Dept Radiol, London, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Artificial intelligence; gene regulatory networks; models; statistical; computational biology; genomics; GENE-EXPRESSION OMNIBUS; MODEL SELECTION; LASSO; SUBSET; REGULARIZATION; OPTIMIZATION;
DOI
10.1177/11769351211056298
Chinese Library Classification (CLC) number
R73 [Oncology]
Subject classification code
100214
Abstract
BACKGROUND: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions, approaches which are susceptible to misrepresentation and incompleteness, respectively. The objectives of this analysis are to (1) provide a real-world, data-driven approach for comparing the performance of genomic model inference algorithms, (2) compare the performance of LASSO, elastic net, best-subset selection, L0L1 penalisation and L0L2 penalisation in real genomic data and (3) compare algorithmic preselection according to performance in our benchmark datasets to algorithmic selection by internal cross-validation.
METHODS: Five large (n ≈ 4000) genomic datasets were extracted from Gene Expression Omnibus. 'Gold-standard' regression models were trained on subspaces of these datasets (n ≈ 4000, p = 500). Penalised regression models were trained on small samples from these subspaces (n ∈ {25, 75, 150}, p = 500) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed. Penalty 'preselection' according to test performance in the other 4 datasets was compared to selection by internal cross-validation error minimisation.
RESULTS: L1L2-penalisation achieved the highest cosine similarity between estimated coefficients and those of gold-standard models. L0L2-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal:noise conditions. L0L2 also attained the highest overall median variable selection F1 score. Penalty preselection significantly outperformed selection by internal cross-validation in each of 3 examined metrics.
CONCLUSIONS: This analysis explores a novel approach for comparing model selection approaches in real genomic data from 5 cancers. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of L0L2 penalisation for structural selection and L1L2 penalisation for coefficient recovery in genomic data. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks.
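The three metrics named in the abstract (cosine similarity between coefficient vectors, variable selection F1 score, and proportion of variance explained in test responses) can be illustrated with a minimal sketch. The abstract does not state how the authors computed them, so the definitions below are the textbook ones and are an assumption; the function names, the `tol` threshold and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(beta_hat, beta_gold):
    """Cosine of the angle between estimated and gold-standard coefficients."""
    dot = sum(a * b for a, b in zip(beta_hat, beta_gold))
    norm_hat = math.sqrt(sum(a * a for a in beta_hat))
    norm_gold = math.sqrt(sum(b * b for b in beta_gold))
    if norm_hat == 0.0 or norm_gold == 0.0:
        return 0.0  # assumed convention when either vector is all zeros
    return dot / (norm_hat * norm_gold)


def selection_f1(beta_hat, beta_gold, tol=0.0):
    """F1 score of the selected variables against the gold-standard support.

    A variable counts as selected when its coefficient magnitude exceeds
    `tol` (a hypothetical threshold, not taken from the paper).
    """
    selected = {j for j, b in enumerate(beta_hat) if abs(b) > tol}
    truth = {j for j, b in enumerate(beta_gold) if abs(b) > tol}
    true_pos = len(selected & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


def explained_variance(y_true, y_pred):
    """Proportion of variance in test responses explained (R squared)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy vectors standing in for gold-standard and small-sample estimates.
beta_gold = [2.0, 0.0, -1.0, 0.0]
beta_hat = [1.5, 0.5, 0.0, 0.0]

print(cosine_similarity(beta_hat, beta_gold))
print(selection_f1(beta_hat, beta_gold))
print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

In this toy case the estimate recovers one of the two gold-standard variables and adds one false positive, so precision and recall are both one half.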
Pages: 15
Related papers
50 in total
  • [41] Cultural group selection is plausible, but the predictions of its hypotheses should be tested with real-world data
    Turchin, Peter
    Currie, Thomas E.
    BEHAVIORAL AND BRAIN SCIENCES, 2016, 39 : e55
  • [42] A Sparse PLS for Variable Selection when Integrating Omics Data
    Le Cao, Kim-Anh
    Rossouw, Debra
    Robert-Granie, Christele
    Besse, Philippe
    STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 2008, 7 (01)
  • [43] Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data
    Dousti Mousavi, Niloufar
    Yang, Jie
    Aldirawi, Hani
    GENES, 2023, 14 (02)
  • [44] Variable Selection for Mixed Data Clustering: Application in Human Population Genomics
    Matthieu Marbac
    Mohammed Sedki
    Étienne Patin
    Journal of Classification, 2020, 37 : 124 - 142
  • [45] Variable Selection for Mixed Data Clustering: Application in Human Population Genomics
    Marbac, Matthieu
    Sedki, Mohammed
    Patin, Étienne
    JOURNAL OF CLASSIFICATION, 2020, 37 (01) : 124 - 142
  • [46] Variable selection in semiparametric hazard regression for multivariate survival data
    Liu, Jicai
    Zhang, Riquan
    Zhao, Weihua
    Lv, Yazhao
    JOURNAL OF MULTIVARIATE ANALYSIS, 2015, 142 : 26 - 40
  • [47] Bayesian Variable Selection Regression of Multivariate Responses for Group Data
    Liquet, B.
    Mengersen, K.
    Pettitt, A. N.
    Sutton, M.
    BAYESIAN ANALYSIS, 2017, 12 (04) : 1039 - 1067
  • [48] Variable selection in regression models including functional data predictors
    Liu K.
    Wang S.
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2019, 45 (10) : 1990 - 1994
  • [49] Interquantile shrinkage and variable selection for longitudinal data in regression models
    Wan, Chuang
    Zhong, Wei
    Li, Chenjing
    Song, Xinyuan
    SCIENCE CHINA-MATHEMATICS, 2025,
  • [50] Variable selection in censored quantile regression with high dimensional data
    Yali Fan
    Yanlin Tang
    Zhongyi Zhu
    Science China (Mathematics), 2018, 61 (04) : 641 - 658