Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers

Authors
Lavanya, C. [1]
Pooja, S. [1]
Kashyap, Abhay H. [2]
Rahaman, Abdur [2]
Niranjan, Swarna [3]
Niranjan, Vidya [1, 4]
Affiliations
[1] RV Coll Engn, Dept Biotechnol, Bengaluru, Karnataka, India
[2] RV Coll Engn, Dept Comp Sci & Engn, Bengaluru, Karnataka, India
[3] RV Coll Engn, Dept AIML, Bengaluru, Karnataka, India
[4] RV Coll Engn, Dept Biotechnol, Mysore Rd,RV Vidyaniketan Post, Bangalore 560059, Karnataka, India
Keywords
Lung cancer; biomarkers; supervised machine learning; random forest classifier; RNA-Seq
DOI
Not available
Chinese Library Classification (CLC) number
R73 [Oncology]
Discipline classification code
100214
Abstract
Lung cancer is considered the most common and the deadliest cancer type. It is mainly of two types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes that occur in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help uncover and classify biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both, or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the NearMiss under-sampling algorithm was applied. For classification, the research primarily focused on four supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); additionally, two ensemble algorithms were considered: XGBoost and AdaBoost. Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be the possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be the possible biomarkers causing SCLC from the transcriptome analysis. After fine-tuning, the model gave a precision of 91.3% and a recall of 91%. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
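The abstract names Near Miss under-sampling as the step used to balance the classes before classification. The sketch below illustrates the general idea of the NearMiss-1 variant using only the Python standard library. It is not the authors' code or any library's implementation: the function name `near_miss_1`, the neighbour count `k`, and every data value are assumptions made for illustration. Each gene is represented by a hypothetical (LogFC, P Value) pair, mirroring the two features mentioned in the abstract.

```python
import math
from collections import Counter


def near_miss_1(X, y, k=3):
    """Simplified NearMiss-1 sketch (illustrative, not a library implementation).

    Every class larger than the smallest class is cut down to the size of
    the smallest class, keeping the samples whose mean distance to their
    k nearest minority-class samples is smallest.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)  # first class with the fewest samples
    n_keep = counts[minority]
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    k = min(k, len(minority_pts))

    def mean_dist_to_minority(x):
        # Mean Euclidean distance to the k closest minority samples.
        dists = sorted(math.dist(x, m) for m in minority_pts)
        return sum(dists[:k]) / k

    # Minority samples are always kept.
    keep = [i for i, label in enumerate(y) if label == minority]
    for cls in counts:
        if cls == minority:
            continue
        idx = [i for i, label in enumerate(y) if label == cls]
        idx.sort(key=lambda i: mean_dist_to_minority(X[i]))
        keep.extend(idx[:n_keep])
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Synthetic toy data (NOT from the study): (LogFC, P Value) per gene.
X = [(2.1, 0.01), (1.9, 0.02), (-1.5, 0.03), (-1.7, 0.01),
     (0.1, 0.80), (0.0, 0.90), (-0.2, 0.70), (0.3, 0.60),
     (1.0, 0.20), (-0.9, 0.25)]
y = ["NSCLC", "NSCLC", "SCLC", "SCLC",
     "neither", "neither", "neither", "neither", "neither", "neither"]

X_res, y_res = near_miss_1(X, y, k=3)
print(Counter(y_res))
```

In practice the balanced output would then be passed to the classifiers being compared. Because under-sampling discards majority-class samples, it shrinks an already small dataset, which fits the abstract's remark that imbalance and limited features restrict further improvement.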
Pages: 15