Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification

Cited by: 2
Authors
Endalie, Demeke [1]
Haile, Getamesay [1]
Abebe, Wondmagegn Taye [2]
Affiliations
[1] Jimma Inst Technol, Fac Comp & Informat, Jimma, Oromia, Ethiopia
[2] Jimma Inst Technol, Fac Civil & Environm Engn, Jimma, Oromia, Ethiopia
Keywords
Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain; Text classification
DOI
10.7717/peerj-cs.961
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and therefore have to deal with a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a hybrid feature selection method that combines document frequency (DF) with a genetic algorithm (GA) for Amharic text classification. We evaluate the method on Amharic news documents obtained from the Ethiopian News Agency (ENA), covering 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combined with the Extra Tree Classifier (ETC), it improves classification accuracy by up to 1% over a hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), by 2.47% over GA, and by 3.86% over a hybrid of DF, IG, and CHI.
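The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of DF filtering followed by GA-based subset search. It makes several assumptions that are not taken from the paper: the function names (`document_frequency_filter`, `fitness`, `genetic_select`), the DF threshold, the GA settings, and the toy English documents are all hypothetical. The fitness function is also a stand-in: it uses a dependency-free leave-one-out nearest-neighbour check rather than the Extra Tree Classifier named in the abstract.

```python
import random
from collections import Counter

def document_frequency_filter(docs, min_df=2):
    """Stage 1: keep terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)

def fitness(mask, vocab, docs, labels):
    """Stand-in fitness (an assumption, not the paper's): leave-one-out
    accuracy of a 1-nearest-neighbour rule based on term overlap,
    minus a small penalty for larger feature subsets."""
    selected = {t for t, keep in zip(vocab, mask) if keep}
    if not selected:
        return 0.0
    sets = [set(d) & selected for d in docs]
    correct = 0
    for i, s in enumerate(sets):
        nearest = max((j for j in range(len(docs)) if j != i),
                      key=lambda j: len(s & sets[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(docs) - 0.01 * len(selected) / len(vocab)

def genetic_select(vocab, docs, labels, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Stage 2: search over boolean feature masks with a simple GA.
    Assumes len(vocab) >= 2 so a one-point crossover cut exists."""
    rng = random.Random(seed)
    n = len(vocab)
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]

    def score(mask):
        return fitness(mask, vocab, docs, labels)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[:pop_size // 2]          # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit with probability mutation_rate
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=score)
    return [t for t, keep in zip(vocab, best) if keep]

# Toy usage with made-up tokenized documents (not the ENA corpus).
docs = [["sport", "goal", "team"], ["sport", "team", "win"],
        ["bank", "loan", "rate"], ["bank", "rate", "market"]]
labels = ["sport", "sport", "economy", "economy"]
vocab = document_frequency_filter(docs, min_df=2)
selected = genetic_select(vocab, docs, labels)
print(vocab, selected)
```

In this sketch the DF stage cheaply shrinks the vocabulary before the GA runs, which keeps the chromosome length, and so the search space, manageable. In a realistic setting the fitness function would be replaced by cross-validated accuracy of the chosen classifier.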
Pages: 14
Related papers (50 in total)
  • [1] Hybrid Feature Selection for Amharic News Document Classification
    Endalie, Demeke
    Haile, Getamesay
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2021, 2021
  • [2] Weighted Document Frequency for Feature Selection in Text Classification
    Li, Baoli
    Yan, Qiuling
    Xu, Zhenqiang
    Wang, Guicai
    [J]. PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 132 - 135
  • [3] Investigating Optimal Feature Selection Method to Improve the Performance of Amharic Text Document Classification
    Alemu, Tamir Anteneh
    Tegegnie, Alemu Kumilachew
    [J]. AFRICAN JOURNAL OF LIBRARY ARCHIVES AND INFORMATION SCIENCE, 2019, 29 (02): : 103 - 113
  • [4] Sampling and feature selection in a genetic algorithm for document clustering
    Casillas, A
    de Lena, MTG
    Martínez, R
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2004, 2945 : 601 - 612
  • [5] Feature selection for document type classification
    Taghva, Kazem
    Vergara, Jason
    [J]. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: NEW GENERATIONS, 2008, : 179 - 182
  • [6] Designing a hybrid dimension reduction for improving the performance of Amharic news document classification
    Endalie, Demeke
    Tegegne, Tesfa
    [J]. PLOS ONE, 2021, 16 (05):
  • [7] Feature selection for the classification of large document collections
    Brank, Janez
    Mladenic, Dunja
    Grobelnik, Marko
    Milic-Frayling, Natasa
    [J]. JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2008, 14 (10) : 1562 - 1596
  • [8] The impact of feature selection on medical document classification
    Parlak, Bekir
    Uysal, Alper Kursat
    [J]. 2016 11TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI), 2016,
  • [9] Feature selection for document classification based on topology
    El Barbary, O. G.
    Salama, A. S.
    [J]. EGYPTIAN INFORMATICS JOURNAL, 2018, 19 (02) : 129 - 132
  • [10] Discriminative Feature Analysis and Selection for Document Classification
    Chinta, Punya Murthy
    Murty, M. Narasimha
    [J]. NEURAL INFORMATION PROCESSING, ICONIP 2012, PT I, 2012, 7663 : 366 - 374