Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers

Cited by: 0
Authors
Lavanya, C. [1]
Pooja, S. [1]
Kashyap, Abhay H. [2]
Rahaman, Abdur [2]
Niranjan, Swarna [3]
Niranjan, Vidya [1, 4]
Affiliations
[1] RV Coll Engn, Dept Biotechnol, Bengaluru, Karnataka, India
[2] RV Coll Engn, Dept Comp Sci & Engn, Bengaluru, Karnataka, India
[3] RV Coll Engn, Dept AIML, Bengaluru, Karnataka, India
[4] RV Coll Engn, Dept Biotechnol, Mysore Rd,RV Vidyaniketan Post, Bangalore 560059, Karnataka, India
Keywords
Lung cancer; biomarkers; supervised machine learning; random forest classifier; RNA-Seq
DOI
Not available
Chinese Library Classification number
R73 [Oncology]
Discipline classification code
100214
Abstract
Lung cancer is considered the most common and deadliest type of cancer. It is mainly of 2 types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes occurring in tumours across different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help uncover and classify biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both, or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); additionally, 2 ensemble algorithms were considered: XGBoost and AdaBoost.
Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was judged the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set for the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC, based on the transcriptome analysis. After fine-tuning, the classifier gave a precision of 91.3% and a recall of 91%. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
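The abstract names two steps that are easy to misread without an example: Near Miss under-sampling and "weighted" metrics. The sketch below illustrates both in plain Python. It is not the authors' code: the abstract does not say which Near Miss variant or library was used, so version 1 (keep the majority-class samples with the smallest average distance to their k nearest minority-class samples) is an assumption, and the function names and every (LogFC, P value) number in the toy data are invented for illustration. Training the classifiers themselves is left out.

```python
import math
from collections import Counter


def near_miss_1(X, y, k=3):
    """NearMiss-1 style under-sampling (assumed variant, see lead-in).

    Every class is reduced to the size of the smallest class. For each larger
    class, the samples kept are those with the smallest average Euclidean
    distance to their k nearest samples of the smallest class.
    """
    counts = Counter(y)
    minority, n_min = min(counts.items(), key=lambda kv: kv[1])
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    k = min(k, n_min)

    def avg_dist(i):
        # Average distance from sample i to its k nearest minority samples.
        nearest = sorted(math.dist(X[i], m) for m in minority_pts)[:k]
        return sum(nearest) / k

    keep = []
    for cls in counts:
        idx = [i for i, label in enumerate(y) if label == cls]
        if cls != minority:
            idx = sorted(idx, key=avg_dist)[:n_min]
        keep.extend(idx)
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with class-support weights."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        weight = n / total
        precision += weight * (tp / predicted if predicted else 0.0)
        recall += weight * (tp / n)
    return precision, recall


# Invented toy data: each row is [LogFC, P value]; labels follow the abstract.
X = [
    [2.1, 0.01], [1.9, 0.02],                      # NSCLC (smallest class)
    [-1.8, 0.03], [-2.2, 0.01], [-1.5, 0.04],      # SCLC
    [0.1, 0.60], [0.0, 0.75], [-0.2, 0.50],        # neither
    [0.3, 0.90], [1.5, 0.05], [-0.1, 0.80],        # neither
]
y = ["NSCLC"] * 2 + ["SCLC"] * 3 + ["neither"] * 6

X_res, y_res = near_miss_1(X, y, k=2)
class_sizes = Counter(y_res)  # every class now has as many samples as NSCLC

# Weighted metrics on a small invented prediction vector.
w_precision, w_recall = weighted_precision_recall(
    ["NSCLC", "NSCLC", "SCLC", "SCLC"],
    ["NSCLC", "SCLC", "SCLC", "SCLC"],
)
```

Weighting by class support means larger classes count more in the averaged precision and recall, and the weighted recall computed this way equals overall accuracy. In practice a library implementation (for example, the `NearMiss` sampler in imbalanced-learn) would normally replace the hand-written under-sampler.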
Pages: 15
Related papers
50 in total
  • [1] Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
    Lavanya, C.
    Pooja, S.
    Kashyap, Abhay H.
    Rahaman, Abdur
    Niranjan, Swarna
    Niranjan, Vidya
    CANCER INFORMATICS, 2023, 22
  • [2] Lung cancer prediction using random forest
    Rajini A.
    Jabbar M.A.
    Recent Advances in Computer Science and Communications, 2021, 14 (05) : 1650 - 1657
  • [3] Recommending research collaborations using link prediction and random forest classifiers
    Raf Guns
    Ronald Rousseau
    Scientometrics, 2014, 101 : 1461 - 1473
  • [4] Recommending research collaborations using link prediction and random forest classifiers
    Guns, Raf
    Rousseau, Ronald
    SCIENTOMETRICS, 2014, 101 (02) : 1461 - 1473
  • [5] Staging of non-small cell lung cancer using random forest classifiers based on radiomics
    Aouadi, S.
    Hammoud, R.
    Torfeh, T.
    Al-Hammadi, N.
    RADIOTHERAPY AND ONCOLOGY, 2020, 152 : S845 - S845
  • [6] PhospredRF: Prediction of Protein Phosphorylation Sites using a consensus of Random Forest classifiers
    Banerjee, Sagnik
    Basu, Subhadip
    Ghosh, Debjyoti
    Nasipuri, Mita
    2015 INTERNATIONAL CONFERENCE AND WORKSHOP ON COMPUTING AND COMMUNICATION (IEMCON), 2015,
  • [7] Detection of Skin Cancer Using SVM, Random Forest and kNN Classifiers
    Murugan, A.
    Nair, S. Anu H.
    Kumar, K. P. Sanal
    JOURNAL OF MEDICAL SYSTEMS, 2019, 43 (08)
  • [8] Detection of Skin Cancer Using SVM, Random Forest and kNN Classifiers
    A. Murugan
    S.Anu H. Nair
    K. P. Sanal Kumar
    Journal of Medical Systems, 2019, 43
  • [9] Breast Cancer Recurrence Prediction Using Random Forest Model
    Al-Quraishi, Tahsien
    Abawajy, Jemal H.
    Chowdhury, Morshed U.
    Rajasegarar, Sutharshan
    Abdalrada, Ahmad Shaker
    RECENT ADVANCES ON SOFT COMPUTING AND DATA MINING (SCDM 2018), 2018, 700 : 318 - 329
  • [10] Random Forest for Breast Cancer Prediction
    Octaviani, T. L.
    Rustam, Z.
    PROCEEDINGS OF THE 4TH INTERNATIONAL SYMPOSIUM ON CURRENT PROGRESS IN MATHEMATICS AND SCIENCES (ISCPMS2018), 2019, 2168