KerMinSVM for imbalanced datasets with a case study on Arabic comics classification

Cited by: 5
Authors
Nayal, Ammar [1 ]
Jomaa, Hadi [1 ]
Awad, Marlette [1 ]
Affiliation
[1] Amer Univ Beirut, Dept Elect & Comp Engn, Beirut, Lebanon
Funding
National Research Foundation, Singapore
Keywords
Imbalance datasets; Support vector machines; Arabic comics analysis; Natural language processing; Supervised classification
DOI
10.1016/j.engappai.2017.01.001
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Many studies have classified large text documents using a range of classifiers, from simple distance-based classifiers such as K-Nearest-Neighbor (KNN) to more advanced ones such as Support Vector Machines. Traditional approaches fail on short texts because the limited number of words makes the feature representation sparse. Another common problem in text classification is class imbalance (CI). CI occurs when one class of the data contains most of the samples while the other class contains only a few. Standard classifiers, when applied to imbalanced data, yield high accuracy for the majority class and low accuracy for the minority one. Motivated by the need to classify the content of Arabic comics, we propose KerMinSVM, a kernel extension of our previously proposed MinSVM, coupled with a new feature dimensionality reduction scheme based on word root frequency ratios (WRFR). KerMinSVM was tested on multiple imbalanced benchmark datasets, and the results were verified using three measures: accuracy, F-measure, and statistical analysis. WRFR was applied to a manually constructed Arabic comics text dataset to detect strong content in children's comic books. Test results revealed that our proposed framework outperformed most of the methods for imbalanced datasets and short text classification.
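The abstract's point about class imbalance can be illustrated with a small sketch. It is not taken from the paper: the 95/5 class split, the always-majority "classifier", and the helper names `accuracy` and `f_measure` are assumptions chosen only to show why accuracy alone can be misleading and why an F-measure on the minority class is reported alongside it.

```python
# Illustrative sketch only: the data and helper names are hypothetical,
# not the paper's benchmark datasets or its KerMinSVM method.

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f_measure(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined,
        # so the score is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Assumed toy data: 95 majority samples (label 0), 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)        # high, despite missing every minority sample
f1_minority = f_measure(y_true, y_pred, positive=1)  # zero for the minority class
print(acc, f1_minority)
```

Under these assumptions the always-majority predictor scores well on accuracy while its minority-class F-measure is zero, which is the failure mode that imbalance-aware methods aim to avoid.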
Pages: 159-169
Page count: 11