Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

Cited by: 4
Authors
Hadjadj, Hassina [1]
Sayoud, Halim [1]
Affiliations
[1] USTHB Univ, Bab Ezzouar, Algeria
Keywords
Arabic Language; Authorship Attribution; BayesNet; Imbalanced Datasets; Principal Component Analysis (PCA); SMO-SVM; Synthetic Minority Over-Sampling Technique (SMOTE)
DOI
10.4018/IJCINI.20211001.oa33
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Dealing with imbalanced data is a major challenge in data mining and in machine learning tasks. This investigation addresses the problem of class imbalance in the authorship attribution (AA) task, with a specific application to Arabic text data. The article proposes a new hybrid approach based on principal components analysis (PCA) and the synthetic minority over-sampling technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset contains seven Arabic books written by seven different scholars, which are segmented into text segments of the same size, with an average length of 2,900 words per text. The experimental results show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character bigrams. In addition, the proposed method improves AA performance on imbalanced datasets, mainly with function words.
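The over-sampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the general SMOTE idea of creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for illustration, and the PCA and SMO-SVM stages are omitted.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: three feature vectors (e.g. two PCA components)
# for the text segments of an under-represented author.
minority_segments = [[0.10, 0.30], [0.20, 0.25], [0.15, 0.40]]
new_segments = smote(minority_segments, n_new=4, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by that author's segments, which is what distinguishes SMOTE from simply duplicating minority samples.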
Pages: 17
Related papers
50 in total
  • [41] Preprocessing of Imbalanced Breast Cancer Data using Feature Selection Combined with Over-Sampling Technique for classification
    Jojan, Janjira
    Srivihok, Anongnart
    [J]. 2013 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS), 2013, : 407 - 412
  • [42] Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment
    Liu, Zhen-Tao
    Wu, Bao-Han
    Li, Dan-Yun
    Xiao, Peng
    Mao, Jun-Wei
    [J]. SENSORS, 2020, 20 (08)
  • [43] The selection of wart treatment method based on Synthetic Minority Over-sampling Technique and Axiomatic Fuzzy Set theory
    Jia, Wenjuan
    Xia, Hao
    Jia, Lijuan
    Deng, Yingjie
    Liu, Xiaodong
    [J]. BIOCYBERNETICS AND BIOMEDICAL ENGINEERING, 2020, 40 (01) : 517 - 526
  • [44] Precise transformer fault diagnosis via random forest model enhanced by synthetic minority over-sampling technique
    Prasojo, Rahman Azis
    Putra, Muhammad Akmal A.
    Ekojono
    Apriyani, Meyti Eka
    Rahmanto, Anugrah Nur
    Ghoneim, Sherif S. M.
    Mahmoud, Karar
    Lehtonen, Matti
    Darwish, Mohamed M. F.
    [J]. ELECTRIC POWER SYSTEMS RESEARCH, 2023, 220
  • [45] Applying Synthetic Minority Over-sampling Technique and Support Vector Machine to Develop a Classifier for Parkinson's disease
    Byeon, Haewon
    Kim, Byungsoo
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2021, 12 (03) : 96 - 101
  • [46] A Back Propagation Neural Network Model with the Synthetic Minority Over-Sampling Technique for Construction Company Bankruptcy Prediction
    Thanh-Long, Ngo
    Tran-Minh
    Hong-Chuong, Le
    [J]. INTERNATIONAL JOURNAL OF SUSTAINABLE CONSTRUCTION ENGINEERING AND TECHNOLOGY, 2022, 13 (03) : 68 - 79
  • [47] LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data
    Nakamura, Munehiro
    Kajiwara, Yusuke
    Otsuka, Atsushi
    Kimura, Haruhiko
    [J]. BIODATA MINING, 2013, 6
  • [48] ANALYSIS AND SIMULATION OF ACCURACY OF CREDIT STATUS CLASSIFICATION WITH BOOTSTRAP AGGREGATING (BAGGING) AND SYNTHETIC MINORITY OVER-SAMPLING (SMOTE)
    Efendi, Achmad
    Amrullah, Ahmad A. N.
    Fitriani, Rahma
    Rahayudi, Bayu
    [J]. INTERNATIONAL JOURNAL OF AGRICULTURAL AND STATISTICAL SCIENCES, 2021, 17 : 925 - 938
  • [49] Multi-view feature fusion and density-based minority over-sampling technique for amyloid protein prediction under imbalanced data
    Yang, Runtao
    Liu, Jiaming
    Zhang, Qian
    Zhang, Lina
    [J]. APPLIED SOFT COMPUTING, 2024, 150
  • [50] Synthetic minority over-sampling technique-enhanced machine learning models for predicting recurrence of postoperative chronic subdural hematoma
    Ni, Zhihui
    Zhu, Yehao
    Qian, Yiwei
    Li, Xinbo
    Xing, Zhenqiu
    Zhou, Yinan
    Chen, Yu
    Huang, Lijie
    Yang, Jianjing
    Zhuge, Qichuan
    [J]. FRONTIERS IN NEUROLOGY, 2024, 15