The effect of rebalancing techniques on the classification performance in cyberbullying datasets

Authors
Marwa Khairy
Tarek M. Mahmoud
Tarek Abd-El-Hafeez
Affiliations
[1] Minia University, Faculty of Computers and Information
[2] University of Sadat City, Faculty of Computers and Artificial Intelligence, Computer Science Department
[3] Minia University, Computer Science Department, Faculty of Science
[4] Deraya University
Keywords
Classification; Cyberbullying; Undersampling; Oversampling; SMOTE; TOMEK
Abstract
Cyberbullying detection systems increasingly rely on machine learning techniques. However, class imbalance in cyberbullying datasets, where samples labeled normal outnumber those labeled abnormal, presents a significant challenge for classification algorithms. This issue is particularly problematic in two-class datasets, where conventional machine learning methods tend to perform poorly on minority-class samples due to the influence of the majority class. To address this problem, researchers have proposed various oversampling and undersampling techniques. In this paper, we investigate the effectiveness of such techniques in addressing class imbalance in cyberbullying datasets. We conduct an experimental study that includes a preprocessing step to enhance the performance of the machine learning algorithms. We then examine the impact of imbalanced data on classification performance for four cyberbullying datasets. To study classification performance on balanced cyberbullying datasets, we employ four resampling techniques: random undersampling, random oversampling, SMOTE, and SMOTE + TOMEK. We evaluate the impact of each rebalancing technique on classification performance using eight well-known classification algorithms. Our findings demonstrate that the performance of resampling techniques depends on the dataset size, imbalance ratio, and classifier used. The experiments show that no single technique always outperforms the others.
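As a rough illustration of the two simplest rebalancing techniques named in the abstract, the sketch below implements random oversampling and random undersampling in plain Python. It is not the authors' code: the function names, the toy posts, and the 8-to-2 class split are assumptions made up for this example, and SMOTE and SMOTE + TOMEK are not shown because they require feature vectors and nearest-neighbour search.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        pairs.extend((x, y) for x in xs + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_undersample(samples, labels, seed=0):
    """Randomly discard samples of the larger classes
    until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        pairs.extend((x, y) for x in kept)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


# Hypothetical toy data: 8 normal posts (label 0), 2 bullying posts (label 1).
texts = [f"normal post {i}" for i in range(8)] + ["abusive post 0", "abusive post 1"]
labels = [0] * 8 + [1] * 2

over_x, over_y = random_oversample(texts, labels)
under_x, under_y = random_undersample(texts, labels)

print("original:    ", Counter(labels))
print("oversampled: ", Counter(over_y))
print("undersampled:", Counter(under_y))
```

Oversampling keeps every original sample but repeats minority ones, while undersampling discards majority samples, which is one plausible reason the abstract reports that the best technique depends on dataset size and imbalance ratio. In practice, resampling is normally applied to the training split only, and libraries such as imbalanced-learn provide ready-made versions of all four techniques.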
Pages: 1049-1065
Number of pages: 16