Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models

Cited by: 7
Authors
Feng, Catherine H. [1]
Disis, Mary L. [2]
Cheng, Chao [3, 4, 5]
Zhang, Lanjing [6, 7, 8, 9]
Affiliations
[1] Montgomery High Sch, Skillman, NJ USA
[2] Univ Washington, UW Med Canc Vaccine Inst, Seattle, WA USA
[3] Baylor Coll Med, Dept Med, Sect Epidemiol & Populat Sci, Houston, TX USA
[4] Baylor Coll Med, Dept Med, Houston, TX 77030 USA
[5] Baylor Coll Med, Dan L Duncan Comprehens Canc Ctr, Houston, TX USA
[6] Rutgers State Univ, Dept Biol Sci, Newark, NJ 07102 USA
[7] Med Ctr Princeton, Dept Pathol, Plainsboro, NJ 08536 USA
[8] Rutgers Canc Inst New Jersey, New Brunswick, NJ 08901 USA
[9] Rutgers State Univ, Ernest Mario Sch Pharm, Dept Chem Biol, Piscataway, NJ 08854 USA
Keywords
MICROSATELLITE INSTABILITY; POOR SURVIVAL; GENE-EXPRESSION; TUMOR DEPOSIT; COLON-CANCER; PROGNOSIS; MUTATION; SMAD4;
DOI
10.1038/s41374-021-00662-x
Chinese Library Classification (CLC) number
R-3 [Medical research methods]; R3 [Basic medicine];
Discipline classification code
1001;
Abstract
Colorectal cancer (CRC) is one of the most common cancers worldwide and a leading cause of cancer deaths. Better classification of multicategory CRC outcomes using clinical and omic data may help adjust treatment regimens to an individual's risk. Here, we selected the features that were useful for classifying a four-category survival outcome of CRC using either the clinical and transcriptomic data, or the clinical, transcriptomic, microsatellite instability and selected oncogenic-driver data (all data), of TCGA. We also optimized multimetric feature selection to develop the best multinomial logistic regression (MLR) and random forest (RF) models that had the highest accuracy, precision, recall and F1 score, respectively. We identified 2073 differentially expressed genes in the TCGA RNASeq dataset. MLR overall outperformed RF in the multimetric feature selection. In both RF and MLR models, precision, recall and F1 score increased as the feature number increased and peaked at 600-1000 features, while the models' accuracy remained stable. The best model was the MLR model with 825 features selected by the sum of squared coefficients using all data; it attained an accuracy of 0.855, an F1 score of 0.738 and a precision of 0.832, which were higher than those obtained with clinical and transcriptomic data. The top-ranked features in the best-performing MLR model using clinical and transcriptomic data differed from those using all data. However, pathologic staging, HBS1L, TSPYL4, and TP53TG3B were among the top-20 ranked features in the best models using either clinical and transcriptomic data or all data. Thus, we developed a multimetric feature-selection based MLR model that outperformed RF models in classifying the four-category outcome of CRC patients. Interestingly, adding microsatellite instability and oncogenic-driver data to clinical and transcriptomic data improved the models' performance. Precision and recall of tuned algorithms may change significantly as the feature number changes, but accuracy appears insensitive to these changes.

It is unclear how to best classify cancer outcomes using 'omic data. We developed a multimetric feature-selection based multinomial logistic regression model that outperformed random forest models in classifying the 4-category outcome of colorectal cancer. Adding microsatellite instability and oncogenic-driver data to clinical and transcriptomic data improves the models' performance, with pathologic staging, HBS1L, TSPYL4, and TP53TG3B as important features. Interestingly, precision and recall of tuned algorithms change as the feature number changes, but accuracy does not.
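The two ideas in the abstract that lend themselves to a small illustration are ranking features by the sum of squared coefficients and scoring a multicategory classifier on several metrics at once. The sketch below is not the authors' code. It assumes that the squared coefficients are summed across outcome categories and that precision, recall and F1 are macro-averaged; the record states neither. The feature names, coefficient values and label vectors are invented for illustration only.

```python
from collections import Counter

def rank_by_squared_coefficients(coef, names):
    """Rank features by the sum of squared coefficients.

    coef[k][j] is the coefficient of feature j for outcome category k
    (assumed layout of a fitted multinomial logistic regression).
    """
    scores = {name: sum(row[j] ** 2 for row in coef)
              for j, name in enumerate(names)}
    return sorted(scores, key=scores.get, reverse=True)

def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical coefficients: 4 outcome categories x 3 features.
names = ["stage", "gene_a", "gene_b"]
coef = [
    [0.5, 0.1, -0.2],
    [-1.0, 0.0, 0.3],
    [0.2, -0.1, 0.4],
    [0.3, 0.0, -0.5],
]
ranking = rank_by_squared_coefficients(coef, names)

# Two hypothetical prediction vectors on an imbalanced 4-category outcome.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
pred_majority = [0] * 10                          # always the majority class
pred_spread = [0, 0, 1, 2, 3, 1, 1, 1, 2, 3]      # finds the rare classes
m_majority = macro_metrics(y_true, pred_majority)
m_spread = macro_metrics(y_true, pred_spread)
print(ranking)
print(m_majority)
print(m_spread)
```

In this toy case both prediction vectors get 6 of 10 samples right, so accuracy is identical, yet macro recall and F1 differ sharply because one model never predicts the rarer categories. That is one way accuracy can stay flat while precision and recall move, which is consistent with the pattern the abstract reports, though it is not necessarily the mechanism in the study.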
Pages: 236-244
Number of pages: 9
Related papers
50 in total
  • [1] Multinomial logistic regression-based feature selection for hyperspectral data
    Pal, Mahesh
    [J]. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2012, 14 (01) : 214 - 220
  • [2] Random feature selection using random subspace logistic regression
    Wichitaksorn, Nuttanan
    Kang, Yingyue
    Zhang, Faqiang
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 217
  • [3] Bayesian model selection for logistic regression models with random intercept
    Wagner, Helga
    Duller, Christine
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2012, 56 (05) : 1256 - 1274
  • [4] Multinomial Logistic Regression and Random Forest Classifiers in Digital Mapping of Soil Classes in Western Haiti
    Jeune, Wesly
    Francelino, Marcio Rocha
    de Souza, Eliana
    Fernandes Filho, Elpidio Inacio
    Rocha, Genelicio Crusoe
    [J]. REVISTA BRASILEIRA DE CIENCIA DO SOLO, 2018, 42
  • [5] A Comparison of Logistic Regression, Random Forest Models in Predicting the Risk of Diabetes
    Zhang, Baoxin
    Lu, Li
    Hou, Jiaqi
    [J]. THIRD INTERNATIONAL SYMPOSIUM ON IMAGE COMPUTING AND DIGITAL MEDICINE (ISICDM 2019), 2019 : 231 - 234
  • [6] The probabilistic reduction approach to specifying multinomial logistic regression models in health outcomes research
    Bergtold, Jason S.
    Onukwugha, Eberechukwu
    [J]. JOURNAL OF APPLIED STATISTICS, 2014, 41 (10) : 2206 - 2221
  • [7] MULTISTAGE LOGISTIC REGRESSION MODEL FOR ANALYZING SURVIVAL FROM COLORECTAL CANCER
    Ahmad, Yuhaniz
    Zain, Zakiyah
    Aziz, Nazrina
    [J]. INTERNATIONAL JOURNAL OF TECHNOLOGY, 2018, 9 (08) : 1618 - 1627
  • [8] Logistic Regression Models for a Fast CBIR Method Based on Feature Selection
    Ksantini, R.
    Ziou, D.
    Colin, B.
    Dubeau, F.
    [J]. 20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2007 : 2790 - 2795
  • [9] Evaluation of feature selection methods utilizing random forest and logistic regression for lung tissue categorization using HRCT images
    Vishraj, Rashmi
    Gupta, Savita
    Singh, Sukhwinder
    [J]. EXPERT SYSTEMS, 2023, 40 (08)
  • [10] An improved forecast of precipitation type using correlation-based feature selection and multinomial logistic regression
    Moon, Seung-Hyun
    Kim, Yong-Hyuk
    [J]. ATMOSPHERIC RESEARCH, 2020, 240