Automatic Arabic text categorization: A comprehensive comparative study

Cited by: 45
Authors
Hmeidi, Ismail [1]
Al-Ayyoub, Mahmoud [1]
Abdulla, Nawaf A. [1]
Almodawar, Abdalrahman A. [1]
Abooraig, Raddad [1]
Mahyoub, Nizar A. [1]
Affiliations
[1] Jordan Univ Sci & Technol, Irbid 22110, Jordan
Keywords
Arabic text categorization; classification; decision table; decision tree; K-Nearest Neighbour; light stemming; naive Bayes; RapidMiner; root-based stemming; Support Vector Machine; Weka; HYBRID APPROACH; PERFORMANCE; WORDS;
DOI
10.1177/0165551514558172
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text categorization or classification (TC) is concerned with placing text documents in their proper category according to their contents. Owing to the many applications of TC and the large volume of text documents uploaded to the Internet daily, an automated method is needed because performing this process manually is difficult and tedious. The usefulness of TC is manifested in different fields and needs. For instance, the ability to automatically classify an article or an email into its right class (Arts, Economics, Politics, Sports, etc.) would be appreciated by individual users as well as companies. This paper is concerned with TC of Arabic articles. It contains a comparison of the five best-known algorithms for TC. It also studies the effects of utilizing different Arabic stemmers (light and root-based stemmers) on the effectiveness of these classifiers. Furthermore, a comparison between different data mining software tools (Weka and RapidMiner) is presented. The results illustrate the good accuracy provided by the SVM classifier, especially when used with the light10 stemmer. This outcome can be used in the future as a baseline to compare with other unexplored classifiers and Arabic stemmers.
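To make the pairing of a stemmer with a classifier more concrete, the following is a minimal sketch in Python. It is not the paper's experimental setup: it does not use Weka or RapidMiner, the affix lists are hypothetical and far smaller than the light10 rule set, the four training sentences are invented toy data, and naive Bayes (one of the classifiers named in the keywords) stands in for the SVM only because it fits in a few lines of standard-library code.

```python
import math
from collections import Counter, defaultdict

# Hypothetical affix lists for illustration only; NOT the light10 rule set.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def light_stem(token):
    """Strip at most one listed prefix and one listed suffix,
    never shortening the token below three letters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token


def tokenize(text):
    """Split on whitespace and stem every token."""
    return [light_stem(t) for t in text.split()]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = tokenize(doc)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy corpus: two sports sentences, two economics sentences.
train_docs = [
    "الفريق فاز في المباراة",
    "اللاعب سجل هدف",
    "البنك رفع الفائدة",
    "السوق ارتفع",
]
train_labels = ["Sports", "Sports", "Economics", "Economics"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("فاز اللاعبون في مباراة"))
```

In this sketch the stemmer maps surface variants such as "اللاعب" and "اللاعبون" to the same token, so the classifier can match an unseen inflected form against the training vocabulary. A real light stemmer applies a longer, ordered rule list and would avoid some of the over-stripping this toy version allows.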
Pages: 114 - 124
Page count: 11
Related papers
50 in total
  • [1] Arabic Text Categorization: a Comparative Study of Different Representation Modes
    Elberrichi, Zakaria
    Abidi, Karima
    [J]. INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2012, 9 (05) : 465 - 470
  • [2] Automatic Arabic Text Categorization using Bayesian Learning
    Kadhim, Mahmood H.
    Omar, Nazlia
    [J]. 2012 7TH INTERNATIONAL CONFERENCE ON COMPUTING AND CONVERGENCE TECHNOLOGY (ICCCT2012), 2012, : 415 - 419
  • [3] A Comparative Study of Statistical Feature Reduction Methods for Arabic Text Categorization
    Harrag, Fouzi
    El-Qawasmeh, Eyas
    Al-Salman, Abdul Malik S.
    [J]. NETWORKED DIGITAL TECHNOLOGIES, PT 2, 2010, 88 : 676 - +
  • [4] A Comparative Study of Some Automatic Arabic Text Diacritization Systems
    Mijlad, Ali
    El Younoussi, Yacine
    [J]. ADVANCES IN HUMAN-COMPUTER INTERACTION, 2022, 2022
  • [5] Automatic text categorization: Case study
    Corrêa, RF
    Ludermir, TB
    [J]. VII BRAZILIAN SYMPOSIUM ON NEURAL NETWORKS, PROCEEDINGS, 2002, : 150 - 150
  • [6] A comparative study on text representation schemes in text categorization
    Song, FX
    Liu, SH
    Yang, JY
    [J]. PATTERN ANALYSIS AND APPLICATIONS, 2005, 8 (1-2) : 199 - 209
  • [7] A comparative study on text representation schemes in text categorization
    Fengxi Song
    Shuhai Liu
    Jingyu Yang
    [J]. Pattern Analysis and Applications, 2005, 8 : 199 - 209
  • [8] Comprehensive and Evolution Study Focusing on Comparative Analysis of Automatic Text Summarization
    Patel, Rima
    Thakkar, Amit
    Makwana, Kamlesh
    Patel, Jay
    [J]. INFORMATION AND COMMUNICATION TECHNOLOGY FOR INTELLIGENT SYSTEMS (ICTIS 2017) - VOL 2, 2018, 84 : 383 - 389
  • [9] Machine learning for Arabic text categorization
    Duwairi, Rehab M.
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2006, 57 (08) : 1005 - 1010
  • [10] SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
    Einea, Omar
    Elnagar, Ashraf
    Al Debsi, Ridhwan
    [J]. DATA IN BRIEF, 2019, 25