Robust biomarker discovery for hepatocellular carcinoma from high-throughput data by multiple feature selection methods

Cited by: 17
Authors
Zhang, Zishuang [1]
Liu, Zhi-Ping [1,2]
Affiliations
[1] Shandong Univ, Sch Control Sci & Engn, Dept Biomed Engn, Jinan 250061, Shandong, Peoples R China
[2] Shandong Univ, Ctr Intelligent Med, Jinan 250061, Shandong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Biomarker discovery; Omics data; Feature selection; Akaike information criterion; Hepatocellular carcinoma; IDENTIFICATION; DISEASES;
DOI
10.1186/s12920-021-00957-4
Chinese Library Classification
Q3 [Genetics];
Discipline classification codes
071007; 090102;
Abstract
Background: Hepatocellular carcinoma (HCC) is one of the most common cancers. The discovery of specific genes serving as biomarkers is of great significance for cancer diagnosis and prognosis. The high-throughput omics data generated by The Cancer Genome Atlas (TCGA) consortium provide a valuable resource for the discovery of HCC biomarker genes. Numerous methods have been proposed to select cancer biomarkers. However, these methods have not investigated how robust the identification is across different feature selection techniques.
Methods: We use six different recursive feature elimination methods to select gene signatures of HCC from TCGA liver cancer data. The genes shared by the six selected subsets are proposed as robust biomarkers. The Akaike information criterion (AIC) is employed to explain the optimization process of feature selection, which provides a statistical interpretation of feature selection in machine learning methods. We use several methods to validate the screened biomarkers.
Results: In this paper, we propose a robust method for discovering biomarker genes for HCC from gene expression data. Specifically, we implement recursive feature elimination with cross-validation (RFE-CV) based on six different classification algorithms. The overlaps among the gene sets discovered by the different methods are referred to as the identified biomarkers. We interpret the machine-learning feature selection process statistically using AIC. Furthermore, the features selected by backward stepwise logistic regression under AIC minimization are completely contained in the identified biomarkers. The classification results verify the superiority of the interpretable robust biomarker discovery method.
Conclusions: The overlaps among the gene subsets selected by the RFE-CV of six classifiers are found to contain different quantitative features. The AIC values in model selection provide a theoretical foundation for the feature selection process of biomarker discovery via machine learning. Moreover, genes contained in more of the optimally selected subsets make better biological sense. Intersecting the biomarkers selected by different classifiers improves the quality of feature selection. This is a general method suitable for screening biomarkers of complex diseases from high-throughput data.
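The consensus step described in the abstract (run RFE-CV with several classifiers, then keep only the genes every run selects) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn is available, uses synthetic data instead of TCGA expression data, and substitutes three arbitrarily chosen classifiers because the abstract does not name the six that were used. The names `gene_names`, `estimators`, and `robust_genes` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix (samples x genes); not TCGA data.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])  # hypothetical labels

# Assumed classifiers; the abstract does not say which six it uses.
estimators = {
    "logistic": LogisticRegression(max_iter=2000),
    "linear_svm": LinearSVC(dual=False, max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
selected = {}
for name, estimator in estimators.items():
    # RFECV drops one feature per step and keeps the subset size
    # with the best cross-validated accuracy.
    selector = RFECV(estimator, step=1, cv=cv, scoring="accuracy")
    selector.fit(X, y)
    selected[name] = set(gene_names[selector.support_])

# Genes chosen by every classifier form the consensus (robust) set.
robust_genes = set.intersection(*selected.values())
print(sorted(robust_genes))
```

The intersection may be small or even empty when the classifiers disagree strongly, so the size of each per-classifier subset in `selected` is worth inspecting alongside the consensus set.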
Pages: 12