Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data

Cited by: 25
Authors
Deng, Fei [1]
Huang, Jibing [1]
Yuan, Xiaoling [2]
Cheng, Chao [3, 4]
Zhang, Lanjing [5, 6, 7, 8]
Affiliations
[1] Shanghai Inst Technol, Sch Elect & Elect Engn, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai Ninth Peoples Hosp, Dept Infect Dis, Sch Med Shanghai, Shanghai, Peoples R China
[3] Baylor Coll Med, Dept Med, Houston, TX 77030 USA
[4] Baylor Coll Med, Inst Clin & Translat Res, Houston, TX 77030 USA
[5] Med Ctr Princeton, Dept Pathol, Plainsboro, NJ 08536 USA
[6] Rutgers State Univ, Dept Biol Sci, Newark, NJ 07103 USA
[7] Rutgers Canc Inst New Jersey, New Brunswick, NJ 08901 USA
[8] Rutgers State Univ, Ernest Mario Sch Pharm, Dept Chem Biol, Piscataway, NJ 08901 USA
Keywords
BREAST-CANCER; DECISION TREE; COLORECTAL-CANCER; TUMOR DEPOSIT; RANDOM FOREST; DIAGNOSIS; SURVIVAL; DEATH; TERM;
DOI
10.1038/s41374-020-00525-x
Chinese Library Classification (CLC) number
R-3 [Medical research methods]; R3 [Basic medicine];
Discipline classification code
1001
Abstract
Most biomedical datasets, including those of 'omics, population studies, and surveys, are rectangular in shape and have few missing data. Recently, their sample sizes have grown significantly. Rigorous analyses of these large datasets demand considerably more efficient and more accurate algorithms. Machine learning (ML) algorithms, including random forests (RF), decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM), have been used to classify outcomes in biomedical datasets. However, their performance and efficiency in classifying multi-category outcomes of rectangular data are poorly understood. Therefore, we compared these metrics among the 4 ML algorithms. As an example, we created a large rectangular dataset using the female breast cancers in the Surveillance, Epidemiology, and End Results-18 (SEER-18) database that were diagnosed in 2004 and followed up until December 2016. The outcome was the five-category cause of death, namely alive, non-breast cancer, breast cancer, cardiovascular disease, and other cause. We analyzed the 54 dichotomized features from ~45,000 patients using MATLAB (version 2018a) and tenfold cross-validation. The accuracy in classifying five-category cause of death with DT, RF, ANN, and SVM was 69.21%, 70.23%, 70.16%, and 69.06%, respectively, each higher than the accuracy of 68.12% with multinomial logistic regression. Based on the features' information entropy, we optimized dimension reduction (i.e., reducing the number of features in the models). We found that 32 or more features were required to maintain similar accuracy, while the running time decreased from 55.57 s for 54 features to 25.99 s for 32 features in RF, from 12.92 s to 10.48 s in ANN, and from 175.50 s to 67.81 s in SVM. In summary, we show that RF, DT, ANN, and SVM had similar accuracy for classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will increase the model's efficiency while maintaining classification accuracy.
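The entropy-based feature ranking mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code; the function names (`entropy`, `information_gain`, `top_k_features`), the toy feature names, and the toy data are all assumptions made for illustration. It ranks dichotomized features by information gain with respect to a categorical outcome and keeps the top k.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest information gain.

    `columns` maps a (hypothetical) feature name to its list of values.
    """
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): two dichotomized features, two outcome classes.
labels = ["alive", "alive", "breast cancer", "breast cancer"]
columns = {
    "informative": [0, 0, 1, 1],  # perfectly separates the two outcomes
    "noise": [0, 1, 0, 1],        # independent of the outcome
}
selected = top_k_features(columns, labels, k=1)
```

In a workflow like the one described, the reduced feature set would then be passed to each classifier and accuracy and running time re-measured under cross-validation; the cutoff of 32 features reported in the abstract is an empirical result of that study, not something this sketch derives.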
Pages: 430-441
Page count: 12