Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis

Cited by: 3
Authors
Gurcan, Fatih [1]
Soylu, Ahmet [2]
Affiliations
[1] Karadeniz Tech Univ, Fac Econ & Adm Sci, Dept Management Informat Syst, TR-61080 Trabzon, Turkiye
[2] Norwegian Univ Sci & Technol, Fac Informat Technol & Elect Engn, Dept Comp Sci, N-2815 Gjovik, Norway
Keywords
cancer diagnosis and prognosis; class imbalance; machine learning; resampling techniques; random forest; predictive modeling; MULTICLASS;
DOI
10.3390/cancers16193417
Chinese Library Classification number
R73 [Oncology];
Discipline classification code
100214;
Abstract
Simple Summary: This research focuses on improving cancer diagnosis and prognosis by addressing a common problem in data analysis known as class imbalance, where some patient groups are underrepresented. The authors aim to evaluate different resampling methods that can balance the data and enhance the performance of various classification algorithms used to predict cancer outcomes. By testing a wide range of techniques across multiple cancer datasets, this study identifies the best-performing classifier, Random Forest, along with the most effective resampling method, SMOTEENN. These findings provide valuable insights for researchers and healthcare professionals, enabling them to make more accurate predictions and ultimately improve patient care. This research could pave the way for the development of more reliable machine learning applications in the medical field.
Abstract:
Background/Objectives: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance.
Methods: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (SEER Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison.
Results: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes.
Conclusions: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
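To make the idea behind the hybrid method named above more concrete, here is a minimal pure-Python sketch of SMOTE-style oversampling followed by Edited Nearest Neighbours cleaning. It is not the authors' implementation, and the record does not say which software they used. The function names (`smote`, `edited_nearest_neighbours`, `smote_enn`), the choice of k = 3, the majority-vote cleaning rule, the binary-label setting, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority points until both classes have equal size (binary case)."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    if len(minority_pts) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    n_needed = (len(y) - len(minority_pts)) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_pts)
        # Rank minority points by distance to `base`; position 0 is `base`
        # itself (or an exact duplicate), so the neighbours are ranked[1:k+1].
        ranked = sorted(minority_pts, key=lambda p: math.dist(base, p))
        neighbour = rng.choice(ranked[1:k + 1])
        gap = rng.random()  # random position on the segment base -> neighbour
        X_out.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Drop each sample whose label loses a majority vote among its k nearest neighbours."""
    keep = []
    for i, x in enumerate(X):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(x, X[j]))
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Oversample first, then clean both classes (assumed order: SMOTE, then ENN)."""
    X_over, y_over = smote(X, y, minority, k=k, seed=seed)
    return edited_nearest_neighbours(X_over, y_over, k=k)


# Toy data (hypothetical): seven majority points, one of which sits inside
# the minority cluster, and four minority points.
majority_pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
                (0.5, 0.5), (0.2, 0.8), (5.2, 5.4)]
minority_pts = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
X = majority_pts + minority_pts
y = [0] * len(majority_pts) + [1] * len(minority_pts)

X_res, y_res = smote_enn(X, y, minority=1)
print("before:", Counter(y))
print("after: ", Counter(y_res))
```

On this toy data the oversampling step brings the minority class up to seven points, and the cleaning step then discards the one majority point planted inside the minority cluster. Real datasets would normally call for a vectorised library implementation and for resampling only the training split, so that synthetic points never leak into evaluation data.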
Pages: 19
Related papers
50 in total
  • [31] Active Learning From Imbalanced Data: A Solution of Online Weighted Extreme Learning Machine
    Yu, Hualong
    Yang, Xibei
    Zheng, Shang
    Sun, Changyin
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2019, 30 (04) : 1088 - 1103
  • [32] Machine Learning and Synthetic Minority Oversampling Techniques for Imbalanced Data: Improving Machine Failure Prediction
    Wah, Yap Bee
    Ismail, Azlan
    Azid, Nur Niswah Naslina
    Jaafar, Jafreezal
    Aziz, Izzatdin Abdul
    Hasan, Mohd Hilmi
    Zain, Jasni Mohamad
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (03) : 4821 - 4841
  • [33] Medical Diagnostic Models an Implementation of Machine Learning Techniques for Diagnosis in Breast Cancer Patients
    Borah, Rupam
    Dhimal, Sunil
    Sharma, Kalpana
    ADVANCED COMPUTATIONAL AND COMMUNICATION PARADIGMS, VOL 1, 2018, 475 : 395 - 405
  • [34] Predicting severely imbalanced data disk drive failures with machine learning models
    Ahmed, Jishan
    Green II, Robert C.
    MACHINE LEARNING WITH APPLICATIONS, 2022, 9
  • [35] A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models
    Zheng, Ming
    Wang, Fei
    Hu, Xiaowen
    Miao, Yuhao
    Cao, Huo
    Tang, Mingjing
    AXIOMS, 2022, 11 (11)
  • [36] Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer
    Chen, Yangzi
    Wang, Bohong
    Zhao, Yizi
    Shao, Xinxin
    Wang, Mingshuo
    Ma, Fuhai
    Yang, Laishou
    Nie, Meng
    Jin, Peng
    Yao, Ke
    Song, Haibin
    Lou, Shenghan
    Wang, Hang
    Yang, Tianshu
    Tian, Yantao
    Han, Peng
    Hu, Zeping
    NATURE COMMUNICATIONS, 2024, 15 (01)
  • [37] Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer
    Yangzi Chen
    Bohong Wang
    Yizi Zhao
    Xinxin Shao
    Mingshuo Wang
    Fuhai Ma
    Laishou Yang
    Meng Nie
    Peng Jin
    Ke Yao
    Haibin Song
    Shenghan Lou
    Hang Wang
    Tianshu Yang
    Yantao Tian
    Peng Han
    Zeping Hu
    Nature Communications, 15
  • [38] Brain Cancer Diagnosis and Enhancing Prognosis with Machine Learning and Imaging
    Miao, K. H.
    Miao, J. H.
    JOURNAL OF INVESTIGATIVE MEDICINE, 2024, 72 (01)
  • [39] A Comparative Study of Shallow Machine Learning Models and Deep Learning Models for Landslide Susceptibility Assessment Based on Imbalanced Data
    Xu, Shiluo
    Song, Yingxu
    Hao, Xiulan
    FORESTS, 2022, 13 (11)
  • [40] Machine Learning based Video Coding using Data-driven Techniques and Advanced Models
    Kwong, Sam
    PROCEEDINGS OF THE 2019 IEEE 18TH INTERNATIONAL CONFERENCE ON COGNITIVE INFORMATICS & COGNITIVE COMPUTING (ICCI*CC 2019), 2019 : 4 - 4