Effect of Word Segmentation on Arabic Text Classification

Authors
Al-Thubaity, Abdulmohsen [1]
Al-Subaie, Abdullah [1]
Affiliations
[1] King Abdulaziz City Sci & Technol, Natl Ctr Comp Technol & Appl Math, Riyadh, Saudi Arabia
Keywords
Arabic text classification; text preprocessing; classification performance
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The preprocessing stage is one of the factors that affect the accuracy of text classification. Text preprocessing involves several steps, such as removing stop words, punctuation, and numerals. For Arabic text classification, stemming and root extraction have been proposed as additional preprocessing steps, with the resulting stems and roots used as classification features. In this study, we propose word segmentation as an additional preprocessing step. We used a dataset comprising 4,900 newspaper articles evenly distributed into seven classes, and we conducted our experiments on segmented and nonsegmented versions of this dataset. We used chi-squared to select the top-ranked features, LTC as the representation scheme, and SVM as the classifier. By measuring accuracy, precision, recall, and F-measure, we evaluated the use of word orthography as a feature for Arabic text classification before and after segmentation. In all of the experiments we conducted, classification performance on the segmented dataset exceeded that on the nonsegmented dataset for the same number of features. Furthermore, the segmented dataset can match the classification performance of the nonsegmented dataset using fewer features.
Pages: 127-131
Page count: 5
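The feature selection and weighting steps named in the abstract (chi-squared ranking followed by LTC weighting) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one common definition of LTC, log(tf + 1) * log(N / df) with cosine normalization, and it assumes that a term's overall chi-squared score is its maximum over the classes. The function names and the toy English tokens are hypothetical and stand in for segmented or nonsegmented Arabic word forms.

```python
import math
from collections import Counter


def chi_squared(docs, labels, term, cls):
    """Chi-squared score of `term` for class `cls` from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if has_term and label == cls:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif label == cls:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-squared score.

    Assumption: a term's score is its maximum over all classes.
    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = sorted({t for doc in docs for t in doc})
    classes = sorted(set(labels))
    score = {t: max(chi_squared(docs, labels, t, c) for c in classes) for t in vocab}
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]


def ltc_vectors(docs, features):
    """Weight each document with LTC (assumed form): log(tf + 1) * log(N / df),
    then divide by the Euclidean norm of the document vector."""
    n = len(docs)
    wanted = set(features)
    df = Counter(t for doc in docs for t in set(doc) if t in wanted)
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in wanted)
        raw = [
            math.log(tf[t] + 1) * math.log(n / df[t]) if tf[t] else 0.0
            for t in features
        ]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vectors


# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["match", "goal", "team"],
    ["team", "goal", "goal"],
    ["market", "bank", "team"],
    ["bank", "market", "price"],
]
labels = ["sports", "sports", "economy", "economy"]

features = select_features(docs, labels, k=3)
vectors = ltc_vectors(docs, features)
```

The resulting vectors would then be passed to an SVM classifier, which is omitted here. Running the same steps on segmented and nonsegmented token lists while holding k fixed mirrors the comparison described in the abstract.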