Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3
Authors
Gurcan, Fatih [1]
Soylu, Ahmet [2]
Affiliations
[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye
[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway
Keywords
cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; MULTICLASS;
DOI
10.3390/cancers16193417
Chinese Library Classification number
R73 [Oncology];
Discipline classification code
100214;
Abstract
Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors aim to evaluate different resampling methods that can balance the data and enhance the performance of various classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, this study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.

Abstract: Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance. Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison. Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes. Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
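To make the idea of resampling concrete, the following is a minimal sketch of SMOTE-style oversampling, which is the oversampling half of SMOTEENN (SMOTE followed by Edited Nearest Neighbours cleaning). It is not the authors' implementation, which this record does not describe. The function name `smote_oversample`, its parameters, and the toy data are all hypothetical, and the cleaning step is omitted.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified SMOTE-style sketch (hypothetical helper): for each synthetic
    point, pick a random minority sample, pick one of its k nearest minority
    neighbours, and draw a point on the line segment between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 8 majority points and 3 minority points.
majority = [(float(x), float(x % 3)) for x in range(8)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice one would more likely rely on an established library than on a hand-written routine; the sketch only illustrates why synthetic minority samples stay within the region spanned by the observed minority samples.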
Pages: 19
Related papers
50 in total
  • [41] Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features
    Budhi, Gregorius Satia
    Chiong, Raymond
    Wang, Zuli
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (09) : 13079 - 13097
  • [42] Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features
    Gregorius Satia Budhi
    Raymond Chiong
    Zuli Wang
    Multimedia Tools and Applications, 2021, 80 : 13079 - 13097
  • [43] Learning on Class Imbalanced Data to Classify Peer-to-Peer Applications in IP Traffic using Resampling Techniques
    Zhong, Weicai
    Raahemi, Bijan
    Liu, Jing
    IJCNN: 2009 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1- 6, 2009, : 1573 - +
  • [44] Discovering new genetic markers for breast cancer diagnosis via advanced machine learning techniques
    Wang, Lingzhen
    Aguilar, Robert
    JOURNAL OF IMMUNOLOGY, 2024, 212 (01):
  • [45] Development of PDAC diagnosis and prognosis evaluation models based on machine learning
    Xiao, Yingqi
    Sun, Shixin
    Zheng, Naxin
    Zhao, Jing
    Li, Xiaohan
    Xu, Jianmin
    Li, Haolian
    Du, Chenran
    Zeng, Lijun
    Zhang, Juling
    Yin, Xiuyun
    Huang, Yuan
    Yang, Xuemei
    Yuan, Fang
    Jia, Xingwang
    Li, Boan
    Li, Bo
    BMC CANCER, 2025, 25 (01)
  • [46] Learning from Class-imbalanced Data with a Model-Agnostic Framework for Machine Intelligent Diagnosis
    Wu, Jingyao
    Zhao, Zhibin
    Sun, Chuang
    Yan, Ruqiang
    Chen, Xuefeng
    RELIABILITY ENGINEERING & SYSTEM SAFETY, 2021, 216
  • [47] Application of advanced machine learning techniques to improve prognosis in primary breast angiosarcoma
    Kamal, H.
    Alshwayyat, S.
    Alshwayyat, T. A.
    Mahadeen, A. Ziad
    Alshwayyat, M.
    Alkharabsheh, A.
    ANNALS OF ONCOLOGY, 2024, 35 : S351 - S351
  • [48] Advanced machine learning techniques for cardiovascular disease early detection and diagnosis
    Baghdadi, Nadiah A.
    Abdelaliem, Sally Mohammed Farghaly
    Malki, Amer
    Gad, Ibrahim
    Ewis, Ashraf
    Atlam, Elsayed
    JOURNAL OF BIG DATA, 2023, 10 (01)
  • [49] Advanced machine learning techniques for cardiovascular disease early detection and diagnosis
    Nadiah A. Baghdadi
    Sally Mohammed Farghaly Abdelaliem
    Amer Malki
    Ibrahim Gad
    Ashraf Ewis
    Elsayed Atlam
    Journal of Big Data, 10
  • [50] Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis
    Nauta, Meike
    Walsh, Ricky
    Dubowski, Adam
    Seifert, Christin
    DIAGNOSTICS, 2022, 12 (01)