Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data

Cited by: 25

Authors:

Deng, Fei [1]

Huang, Jibing [1]

Yuan, Xiaoling [2]

Cheng, Chao [3,4]

Zhang, Lanjing [5,6,7,8]

Affiliations:

[1] Shanghai Inst Technol, Sch Elect & Elect Engn, Shanghai, Peoples R China

[2] Shanghai Jiao Tong Univ, Shanghai Ninth Peoples Hosp, Dept Infect Dis, Sch Med Shanghai, Shanghai, Peoples R China

[3] Baylor Coll Med, Dept Med, Houston, TX 77030 USA

[4] Baylor Coll Med, Inst Clin & Translat Res, Houston, TX 77030 USA

[5] Med Ctr Princeton, Dept Pathol, Plainsboro, NJ 08536 USA

[6] Rutgers State Univ, Dept Biol Sci, Newark, NJ 07103 USA

[7] Rutgers Canc Inst New Jersey, New Brunswick, NJ 08901 USA

[8] Rutgers State Univ, Ernest Mario Sch Pharm, Dept Chem Biol, Piscataway, NJ 08901 USA

Source:

LABORATORY INVESTIGATION | 2021 / Vol. 101 / Issue 04

Keywords:

BREAST-CANCER; DECISION TREE; COLORECTAL-CANCER; TUMOR DEPOSIT; RANDOM FOREST; DIAGNOSIS; SURVIVAL; DEATH; TERM;

DOI:

10.1038/s41374-020-00525-x

Chinese Library Classification (CLC) number:

R-3 [Medical research methods]; R3 [Basic medicine];

Subject classification code:

1001;

Abstract:

Most biomedical datasets, including those from 'omics, population studies, and surveys, are rectangular in shape and have few missing data. Their sample sizes have recently grown significantly, and rigorous analyses of these large datasets demand considerably more efficient and more accurate algorithms. Machine learning (ML) algorithms, including random forest (RF), decision tree (DT), artificial neural network (ANN), and support vector machine (SVM), have been used to classify outcomes in biomedical datasets. However, their performance and efficiency in classifying multi-category outcomes of rectangular data are poorly understood. We therefore compared the performance and efficiency of these four ML algorithms. As an example, we created a large rectangular dataset from the female breast cancers in the Surveillance, Epidemiology, and End Results-18 (SEER-18) database that were diagnosed in 2004 and followed up until December 2016. The outcome was the five-category cause of death, namely alive, non-breast cancer, breast cancer, cardiovascular disease, and other cause. We analyzed the 54 dichotomized features from ~45,000 patients using MATLAB (version 2018a) and tenfold cross-validation. The accuracy in classifying the five-category cause of death with DT, RF, ANN, and SVM was 69.21%, 70.23%, 70.16%, and 69.06%, respectively, each higher than the 68.12% accuracy of multinomial logistic regression. Based on the features' information entropy, we optimized dimension reduction (i.e., reducing the number of features in the models). We found that 32 or more features were required to maintain similar accuracy, while the running time decreased from 55.57 s with 54 features to 25.99 s with 32 features for RF, from 12.92 s to 10.48 s for ANN, and from 175.50 s to 67.81 s for SVM. In summary, RF, DT, ANN, and SVM had similar accuracy in classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will increase the model's efficiency while maintaining classification accuracy.
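
The abstract does not spell out how the entropy-based dimension reduction was carried out, and the study itself used MATLAB. As a rough illustration only, the following Python sketch ranks dichotomized features by information gain, H(Y) - H(Y | X), against a multi-category outcome and keeps the top-ranked ones. The feature names, the toy data, and the cutoff of one feature are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Information gain H(Y) - H(Y | X) of one feature column X for outcome Y."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    # Conditional entropy: entropy within each feature value, weighted by its share.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (feature names sorted by decreasing gain, dict of gains)."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True), gains


# Toy data (hypothetical): six patients, a three-category outcome,
# and two dichotomized (0/1) features.
labels = ["alive", "alive", "breast", "breast", "cvd", "cvd"]
columns = {
    "f_informative": [0, 0, 1, 1, 1, 1],  # separates "alive" from the rest
    "f_noise": [0, 1, 0, 1, 0, 1],        # unrelated to the outcome
}

order, gains = rank_features(columns, labels)
top_k = order[:1]  # keep the k highest-gain features; k = 1 is arbitrary here
print(order, gains, top_k)
```

In a setting like the one described, the retained features would then be passed to each classifier, and accuracy and running time would be compared across different values of k.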

Pages: 430-441

Number of pages: 12
