Effect of Word Segmentation on Arabic Text Classification

Authors
Al-Thubaity, Abdulmohsen [1]
Al-Subaie, Abdullah [1]
Affiliation
[1] King Abdulaziz City Sci & Technol, Natl Ctr Comp Technol & Appl Math, Riyadh, Saudi Arabia
Keywords
Arabic text classification; text preprocessing; classification performance
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Preprocessing is one of the factors that affect the accuracy of text classification. It typically involves several steps, such as removing stop words, punctuation, and numerals. For Arabic text classification, stemming and root extraction have been proposed as additional preprocessing steps, with the resulting stems and roots used as classification features. In this study, we propose word segmentation as a further preprocessing step. We used a dataset of 4,900 newspaper articles evenly distributed across seven classes, and we ran our experiments on segmented and nonsegmented versions of this dataset. We used chi-squared to select the top-ranked features, LTC as the representation scheme, and SVM as the classifier. By measuring accuracy, precision, recall, and F-measure, we evaluated the use of word orthography as a feature for Arabic text classification before and after segmentation. In all of our experiments, classification performance on the segmented dataset exceeded that on the nonsegmented dataset for the same number of features. Furthermore, the segmented dataset can match the classification performance of the nonsegmented dataset while using fewer features.
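The abstract names LTC as the representation scheme without spelling it out. The following is a minimal sketch, assuming the usual SMART reading of "ltc": logarithmic term frequency, inverse document frequency, and cosine normalization. The function name `ltc_vectors` and the toy English tokens are hypothetical and are not taken from the paper; the chi-squared selection and SVM steps are not shown.

```python
import math
from collections import Counter


def ltc_vectors(docs):
    """Compute LTC weights for a list of tokenized documents.

    docs: list of token lists (hypothetical toy input).
    Returns one dict per document mapping term -> weight.
    Assumed definition: (1 + ln tf) * ln(N / df), then cosine-normalized.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # l: logarithmic tf, t: idf.
        raw = {
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # c: cosine normalization so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append(
            {term: (w / norm if norm > 0 else 0.0) for term, w in raw.items()}
        )
    return vectors


# Toy corpus standing in for tokenized (segmented or nonsegmented) articles.
docs = [
    ["economy", "market", "market"],
    ["sport", "match", "market"],
    ["sport", "goal"],
]
vectors = ltc_vectors(docs)
```

Under this weighting, a term that appears in fewer documents (here "economy") outweighs a more widespread one ("market") even when the latter occurs more often within the document. Segmenting words into smaller units would change the token lists passed in, and therefore the term and document frequencies that feed these weights.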
Pages: 127-131
Page count: 5