The effect of rebalancing techniques on the classification performance in cyberbullying datasets

Authors
Marwa Khairy
Tarek M. Mahmoud
Tarek Abd-El-Hafeez
Affiliations
[1] Minia University, Faculty of Computers and Information
[2] University of Sadat City, Faculty of Computers and Artificial Intelligence, Computer Science Department
[3] Minia University, Computer Science Department, Faculty of Science
[4] Deraya University
Keywords
Classification; Cyberbullying; Undersampling; Oversampling; SMOTE; TOMEK
DOI
Not available
Abstract
Cyberbullying detection systems increasingly rely on machine learning techniques. However, class imbalance in cyberbullying datasets, where the proportion of samples labeled normal is higher than that of samples labeled abnormal, poses a significant challenge for classification algorithms. The issue is particularly problematic in two-class datasets, where conventional machine learning methods tend to perform poorly on minority-class samples because of the influence of the majority class. To address this problem, researchers have proposed various oversampling and undersampling techniques. In this paper, we investigate how effectively such techniques address class imbalance in cyberbullying datasets. We conduct an experimental study that includes a preprocessing step to improve the performance of the machine learning algorithms. We then examine the impact of imbalanced data on classification performance for four cyberbullying datasets. To study classification performance on balanced cyberbullying datasets, we employ four resampling techniques: random undersampling, random oversampling, SMOTE, and SMOTE + TOMEK. We evaluate the impact of each rebalancing technique on classification performance using eight well-known classification algorithms. Our findings indicate that the performance of resampling techniques depends on the dataset size, the imbalance ratio, and the classifier used. The experiments show that no single technique always performs better than the others.
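To make the resampling techniques named in the abstract concrete, the following is a minimal pure-Python sketch of random undersampling, random oversampling, and SMOTE-style interpolation. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional data, and the parameters `k` and `seed` are all hypothetical, and the Tomek-link cleaning step of SMOTE + TOMEK is not shown.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows per class label (insertion order preserved)."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, n_min):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, seed=0):
    """Randomly duplicate samples so every class grows to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance).
    Assumes the minority class has at least two rows."""
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    counts = Counter(y)
    n_needed = max(counts.values()) - counts[minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


# Toy imbalanced data (hypothetical): 8 "normal" rows (0), 3 "abnormal" rows (1).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [0.4, 0.1], [0.2, 0.4], [0.0, 0.5], [0.5, 0.0],
     [2.0, 2.1], [2.2, 1.9], [1.8, 2.3]]
y = [0] * 8 + [1] * 3

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
X_smote, y_smote = smote(X, y, minority=1)

print(Counter(y_under), Counter(y_over), Counter(y_smote))
```

The sketch highlights the trade-off the abstract alludes to: undersampling discards majority-class data, oversampling repeats existing minority rows, and SMOTE creates new points that lie between existing minority rows. In practice, a maintained library such as imbalanced-learn would normally be used instead of hand-written resamplers.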
Pages: 1049-1065
Number of pages: 16