Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification

Cited by: 2
Authors
Endalie, Demeke [1]
Haile, Getamesay [1]
Abebe, Wondmagegn Taye [2]
Affiliations
[1] Jimma Inst Technol, Fac Comp & Informat, Jimma, Oromia, Ethiopia
[2] Jimma Inst Technol, Fac Civil & Environm Engn, Jimma, Oromia, Ethiopia
Keywords
Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain; Text classification
DOI
10.7717/peerj-cs.961
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words, which produces a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate the method on Amharic news documents obtained from the Ethiopian News Agency (ENA), spanning 13 categories. In our experiments, the proposed method outperformed the other feature selection methods used for Amharic news document classification. Combined with the Extra Tree Classifier (ETC), it improves classification accuracy by up to 1% over the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), by 2.47% over GA, and by 3.86% over the hybrid of DF, IG, and CHI.
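The abstract does not say how DF and GA are combined, so the following is only a minimal sketch under stated assumptions: DF first prunes rare terms, then a GA searches bit masks over the surviving terms. The function names, the toy English corpus, the GA parameters, and the fitness function (leave-one-out term-overlap accuracy minus a small size penalty) are all hypothetical stand-ins; the paper itself reports results with an Extra Tree Classifier, which is not reproduced here.

```python
import random
from collections import Counter

def df_filter(docs, min_df):
    """Keep terms that occur in at least min_df documents (sorted for determinism)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, n in df.items() if n >= min_df)

def fitness(mask, vocab, docs, labels):
    """Toy fitness (assumption): leave-one-out accuracy of a nearest-neighbour
    rule based on term overlap, minus 0.01 per selected term."""
    selected = {t for t, bit in zip(vocab, mask) if bit}
    reduced = [selected & set(tokens) for tokens in docs]
    correct = 0
    for i, terms in enumerate(reduced):
        best_j, best_overlap = None, 0
        for j, other in enumerate(reduced):
            overlap = len(terms & other)
            if j != i and overlap > best_overlap:
                best_j, best_overlap = j, overlap
        # A document with no overlapping neighbour counts as misclassified.
        if best_j is not None and labels[best_j] == labels[i]:
            correct += 1
    return correct / len(docs) - 0.01 * len(selected)

def genetic_search(vocab, docs, labels, pop_size=8, generations=20,
                   p_mut=0.1, seed=0):
    """Simple GA over bit masks: elitism, tournament selection,
    one-point crossover, bit-flip mutation. Assumes len(vocab) >= 2."""
    rng = random.Random(seed)
    n = len(vocab)

    def score(mask):
        return fitness(mask, vocab, docs, labels)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = pop[:2]  # carry the two best masks over unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=score)  # tournament of size 2
            b = max(rng.sample(pop, 2), key=score)
            cut = rng.randrange(1, n)               # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=score)
    return [t for t, bit in zip(vocab, best) if bit]

# Hypothetical toy corpus (English tokens stand in for Amharic news text).
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["vote", "party", "election"],
    ["election", "party", "minister"],
]
labels = ["sport", "sport", "politics", "politics"]

candidates = df_filter(docs, min_df=2)               # stage 1: DF pruning
selected = genetic_search(candidates, docs, labels)  # stage 2: GA search
print(candidates, selected)
```

In a realistic setting the fitness function would more plausibly be the cross-validated accuracy of a real classifier on the masked feature matrix; the overlap rule is used here only to keep the sketch dependency-free.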
Pages: 14