Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3
Authors
Gurcan, Fatih [1]
Soylu, Ahmet [2]
Affiliations
[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye
[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway
Keywords
cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; MULTICLASS
DOI
10.3390/cancers16193417
Chinese Library Classification number
R73 [Oncology]
Discipline classification code
100214
Abstract
Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors aim to evaluate different resampling methods that can balance the data and enhance the performance of various classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, this study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.

Abstract
Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.
Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (SEER Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison.
Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.
Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
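To illustrate the idea behind a hybrid resampler such as SMOTEENN, the following is a minimal standard-library sketch, not the authors' implementation and not the imbalanced-learn API. It assumes a binary problem with at least two minority samples; the function names (`smote_oversample`, `enn_clean`), the toy data, the value `k=3`, and the majority-vote cleaning rule are all assumptions chosen for illustration. The first step interpolates synthetic minority points between a minority sample and one of its nearest minority neighbours; the second step drops any point whose label disagrees with most of its nearest neighbours.

```python
import random
from collections import Counter


def _sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_oversample(X, y, minority_label, k=3, seed=0):
    """SMOTE-style step: add synthetic minority points until the
    minority class is as large as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    counts = Counter(y)
    n_needed = max(counts.values()) - counts[minority_label]
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: _sq_dist(base, m),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        X_new.append(synthetic)
        y_new.append(minority_label)
    return X_new, y_new


def enn_clean(X, y, k=3):
    """ENN-style step: keep a point only if the majority of its
    k nearest neighbours share its label (hypothetical helper)."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: _sq_dist(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in nearest)
        if votes.most_common(1)[0][0] == yi:
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y


# Toy imbalanced data: six points of class 0, two points of class 1.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
     (2.0, 2.0), (2.1, 2.2)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = smote_oversample(X, y, minority_label=1, k=3)
X_clean, y_clean = enn_clean(X_bal, y_bal, k=3)
print(Counter(y), Counter(y_bal), Counter(y_clean))
```

In practice one would more likely use a maintained library implementation and apply resampling only to the training split, so that synthetic points never leak into the evaluation data.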
Pages: 19
Related papers
50 in total
  • [1] Diagnosis of Breast Cancer on Imbalanced Dataset Using Various Sampling Techniques and Machine Learning Models
    Gupta, Ruchita
    Bhargava, Rupal
    Jayabalan, Manoj
    2021 14TH INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ESYSTEMS ENGINEERING (DESE), 2021, : 162 - 167
  • [2] A multiple resampling method for learning from imbalanced data sets
    Estabrooks, A
    Jo, TH
    Japkowicz, N
    COMPUTATIONAL INTELLIGENCE, 2004, 20 (01) : 18 - 36
  • [3] A comparative analysis of machine learning techniques for imbalanced data
    Mrad, Ali Ben
    Lahiani, Amine
    Mefteh-Wali, Salma
    Mselmi, Nada
    ANNALS OF OPERATIONS RESEARCH, 2024,
  • [4] A Comparison of Resampling Techniques for Medical Data Using Machine Learning
    Alahmari, Fahad
    JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2020, 19 (01)
  • [5] A Robust Enhanced Ensemble Learning Method for Breast Cancer Data Diagnosis on Imbalanced Data
    Wang, Zhenzhen
    Xie, Junde
    Zhang, Jia
    IEEE ACCESS, 2024, 12 : 189776 - 189788
  • [6] Integration of multimodal imaging data with machine learning for improved diagnosis and prognosis in neuroimaging
    Bhattacharya, Saurabh
    Prusty, Sashikanta
    Pande, Sanjay P.
    Gulhane, Monali
    Lavate, Santosh H.
    Rakesh, Nitin
    Veerasamy, Saravanan
    FRONTIERS IN HUMAN NEUROSCIENCE, 2025, 19
  • [7] Enhanced Cervical Cancer Diagnosis Using Advanced Transfer Learning Techniques
    Shandilya, Gunjan
    Anand, Vatsala
    Chauhan, Rahul
    Pokhariya, Hemant Singh
    Gupta, Sheifali
    2024 2ND WORLD CONFERENCE ON COMMUNICATION & COMPUTING, WCONF 2024, 2024,
  • [8] Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data
    Yinyao Ma
    Hanlin Lv
    Yanhua Ma
    Xiao Wang
    Longting Lv
    Xuxia Liang
    Lei Wang
    BioData Mining, 18 (1)
  • [9] Evolutionary Online Machine Learning from Imbalanced Data
    Stein, Anthony
    2016 IEEE 1ST INTERNATIONAL WORKSHOPS ON FOUNDATIONS AND APPLICATIONS OF SELF* SYSTEMS (FAS*W), 2016, : 281 - 286
  • [10] Data Driven Prognosis of Cervical Cancer Using Class Balancing and Machine Learning Techniques
    Arora M.
    Dhawan S.
    Singh K.
    EAI Endorsed Transactions on Energy Web, 2020, 7 (30) : 1 - 9