Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3
Authors
Gurcan, Fatih [1]
Soylu, Ahmet [2]
Affiliations
[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye
[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway
Keywords
cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; MULTICLASS;
DOI
10.3390/cancers16193417
Chinese Library Classification number
R73 [Oncology];
Discipline classification code
100214;
Abstract
Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors evaluate resampling methods that balance the data and enhance the performance of the classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, the study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings can help researchers and healthcare professionals make more accurate predictions and ultimately improve patient care, and could pave the way for more reliable machine learning applications in the medical field.

Abstract:
Background/Objectives: This study evaluates the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.
Methods: Five datasets were analyzed: three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (SEER Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories and ten classifiers from four categories were compared.
Results: Hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). Among classifiers, Random Forest performed best with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.
Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets. The findings serve as a foundation for future studies on integrating machine learning techniques into cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
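The hybrid resampling idea named in the abstract (SMOTE oversampling followed by Edited Nearest Neighbours cleaning) can be illustrated with a minimal from-scratch sketch. This is not the authors' code and not a library implementation: the function names, the toy two-dimensional Gaussian data, the binary labels, k = 3, and the majority-vote cleaning rule are all assumptions chosen for illustration (stricter variants require every neighbour to agree).

```python
import random
from collections import Counter


def knn(point, data, k, exclude=None):
    """Indices of the k rows in data closest to point (squared Euclidean)."""
    dists = [
        (sum((a - b) ** 2 for a, b in zip(point, row)), i)
        for i, row in enumerate(data)
        if i != exclude
    ]
    return [i for _, i in sorted(dists)[:k]]


def smote(X, y, minority, k=3, seed=0):
    """Balance a binary dataset by interpolating between minority neighbours.

    Assumes at least two minority rows, so every row has a neighbour.
    """
    rng = random.Random(seed)
    rows = [x for x, label in zip(X, y) if label == minority]
    n_needed = len(y) - 2 * len(rows)  # majority count minus minority count
    synthetic = []
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(rows))
        j = rng.choice(knn(rows[i], rows, k, exclude=i))
        gap = rng.random()  # position along the segment between the two rows
        synthetic.append([a + gap * (b - a) for a, b in zip(rows[i], rows[j])])
    return X + synthetic, y + [minority] * len(synthetic)


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    keep = []
    for i, (row, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in knn(row, X, k, exclude=i))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def make_toy_data(seed=0):
    """Hypothetical imbalanced data: 90 majority rows and 10 minority rows."""
    rng = random.Random(seed)
    majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
    minority = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(10)]
    return majority + minority, [0] * 90 + [1] * 10


X, y = make_toy_data()
X_sm, y_sm = smote(X, y, minority=1)                  # step 1: oversample
X_res, y_res = edited_nearest_neighbours(X_sm, y_sm)  # step 2: clean
print("original:", Counter(y))
print("after SMOTE:", Counter(y_sm))
print("after SMOTE + ENN:", Counter(y_res))
```

In practice the resampled data would be used only for training a classifier such as Random Forest, with evaluation done on untouched held-out data; resampling before the train/test split would leak synthetic copies of test information into training.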
Pages: 19