AucPR: An AUC-based approach using penalized regression for disease prediction with high-dimensional omics data

Cited by: 9
Authors
Yu, Wenbao [1]
Park, Taesung [1, 2]
Affiliations
[1] Seoul Natl Univ, Dept Stat, Seoul 151742, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Bioinformat, Seoul 151742, South Korea
Source
BMC GENOMICS | 2014 / Vol. 15
Funding
National Research Foundation of Singapore;
Keywords
MARKER SELECTION; GENOME-WIDE; ROC CURVE; GENE; EXPRESSION; CLASSIFICATION; AREA; CANCER; TUMOR; REGULARIZATION;
DOI
10.1186/1471-2164-15-S10-S1
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Motivation: When multiple markers are available, it is common to seek an optimal combination of markers for disease classification and prediction. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing AUC-based work in a high-dimensional context depends mainly on a non-parametric, smooth approximation of the AUC; no work has used a parametric AUC-based approach for high-dimensional data. Results: We propose an AUC-based approach using penalized regression (AucPR), a parametric method for obtaining a linear combination of markers that maximizes the AUC. To obtain the AUC maximizer in a high-dimensional context, we transform a classical parametric AUC maximizer, which is used in a low-dimensional context, into a regression framework and thus apply penalized regression directly. Two kinds of penalization, lasso and elastic net, are considered. The parametric approach avoids some of the difficulties of a conventional non-parametric AUC-based approach, such as the lack of an appropriate concave objective function and the need for a prudent choice of the smoothing parameter. We apply the proposed AucPR to gene selection and classification using four real microarray data sets and synthetic data. In numerical studies, AucPR performs better than penalized logistic regression and the non-parametric AUC-based method in terms of AUC and sensitivity at a given specificity, particularly when there are many correlated genes. Conclusion: We propose AucPR, a powerful, parametric, and easily implementable linear classifier for gene selection and disease prediction with high-dimensional data. AucPR is recommended for its good prediction performance. Besides gene expression microarray data, AucPR can be applied to other types of high-dimensional omics data, such as miRNA and protein data.
Pages: 12