The effect of rebalancing techniques on the classification performance in cyberbullying datasets

Cited by: 0
Authors
Marwa Khairy
Tarek M. Mahmoud
Tarek Abd-El-Hafeez
Affiliations
[1] Minia University, Faculty of Computers and Information
[2] University of Sadat City, Faculty of Computers and Artificial Intelligence, Computer Science Department
[3] Minia University, Computer Science Department, Faculty of Science
[4] Deraya University
Keywords
Classification; Cyberbullying; Undersampling; Oversampling; SMOTE; TOMEK
DOI
Not available
Abstract
Cyberbullying detection systems increasingly rely on machine learning techniques. However, class imbalance in cyberbullying datasets, where samples labeled normal outnumber those labeled abnormal, presents a significant challenge for classification algorithms. This issue is particularly problematic in two-class datasets, where conventional machine learning methods tend to perform poorly on minority-class samples due to the influence of the majority class. To address this problem, researchers have proposed various oversampling and undersampling techniques. In this paper, we investigate how effectively such techniques address class imbalance in cyberbullying datasets. We conduct an experimental study that includes a preprocessing step to enhance the performance of the machine learning algorithms. We then examine the impact of imbalanced data on classification performance for four cyberbullying datasets. To study classification performance on balanced cyberbullying datasets, we employ four resampling techniques: random undersampling, random oversampling, SMOTE, and SMOTE + TOMEK. We evaluate the impact of each rebalancing technique on classification performance using eight well-known classification algorithms. Our findings demonstrate that the performance of resampling techniques depends on the dataset size, the imbalance ratio, and the classifier used. The experiments show that no single technique always outperforms the others.
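To make the resampling ideas named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The toy feature vectors, the function names (`random_undersample`, `random_oversample`, `smote_like`), and the parameters `k` and `seed` are all assumptions made for illustration. The SMOTE-style function only shows the core interpolation idea, and the Tomek-link cleaning step of SMOTE + TOMEK is omitted.

```python
import random
from collections import Counter


def group_by_class(X, y):
    """Collect feature vectors per class label."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, seed=0):
    """Randomly duplicate samples so every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority_label, k=3, seed=0):
    """SMOTE-style sketch: add synthetic minority points by interpolating
    between a minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = max(Counter(y).values()) - len(minority)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Nearest minority neighbours by squared Euclidean distance.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
        y_out.append(minority_label)
    return X_out, y_out


# Hypothetical toy data: 8 "normal" samples (label 0), 3 "bullying" samples (label 1).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1],
     [0.2, 0.2], [0.1, 0.0], [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
y = [0] * 8 + [1] * 3

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
X_smote, y_smote = smote_like(X, y, minority_label=1)

print("original:     ", dict(Counter(y)))
print("undersampled: ", dict(Counter(y_under)))
print("oversampled:  ", dict(Counter(y_over)))
print("SMOTE-style:  ", dict(Counter(y_smote)))
```

Undersampling discards majority-class data, random oversampling repeats minority samples exactly, and the SMOTE-style variant creates new points between existing minority samples. In practice, a maintained library such as imbalanced-learn would likely be used instead of hand-written resamplers.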
Pages: 1049-1065
Page count: 16