Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification

Cited by: 2
Authors
Endalie, Demeke [1]
Haile, Getamesay [1]
Abebe, Wondmagegn Taye [2]
Affiliations
[1] Jimma Inst Technol, Fac Comp & Informat, Jimma, Oromia, Ethiopia
[2] Jimma Inst Technol, Fac Civil & Environm Engn, Jimma, Oromia, Ethiopia
Keywords
Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain; Text classification
DOI
10.7717/peerj-cs.961
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words, and therefore deal with a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a hybrid feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate the method on Amharic news documents obtained from the Ethiopian News Agency (ENA), covering 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combined with the Extra Tree Classifier (ETC), it improves classification accuracy by up to 1% over a hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), by 2.47% over GA, and by 3.86% over a hybrid of DF, IG, and CHI.
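The abstract names the two stages of the method (a DF filter followed by a GA search over the surviving terms) but gives no implementation details. The following is a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the GA settings (population size, mutation rate, truncation selection, single-point crossover), the per-feature penalty, and the English toy corpus. The fitness function is a simple term-overlap classifier used as a stand-in; the paper itself pairs the selected features with an Extra Tree Classifier.

```python
import random

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df

def df_filter(docs, min_df=2):
    """Stage 1: keep terms appearing in at least min_df documents."""
    df = document_frequency(docs)
    return sorted(t for t, n in df.items() if n >= min_df)

def accuracy(docs, labels, features):
    """Stand-in fitness (assumption): leave-one-out accuracy of a
    classifier that votes by shared selected terms."""
    if not features:
        return 0.0
    feats = set(features)
    correct = 0
    for i, tokens in enumerate(docs):
        query = feats & set(tokens)
        scores = {}
        for j, other in enumerate(docs):
            if j != i:
                overlap = len(query & set(other))
                scores[labels[j]] = scores.get(labels[j], 0) + overlap
        predicted = max(sorted(scores), key=scores.get)
        correct += predicted == labels[i]
    return correct / len(docs)

def ga_select(docs, labels, vocab, pop_size=12, generations=15,
              p_mut=0.1, seed=0):
    """Stage 2: GA over bit masks; bit k says whether vocab[k] is kept.
    Requires at least two terms in vocab (for single-point crossover)."""
    rng = random.Random(seed)

    def fitness(mask):
        chosen = [t for t, bit in zip(vocab, mask) if bit]
        # Hypothetical penalty that favours smaller feature subsets.
        return accuracy(docs, labels, chosen) - 0.01 * len(chosen)

    pop = [[rng.randint(0, 1) for _ in vocab] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(vocab))  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability p_mut.
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [t for t, bit in zip(vocab, best) if bit]

# Toy English corpus standing in for tokenized news documents.
docs = [
    "team wins football match".split(),
    "football team loses match".split(),
    "coach praises football team".split(),
    "bank raises interest rate".split(),
    "bank cuts interest rate".split(),
    "market reacts to interest rate".split(),
]
labels = ["sport"] * 3 + ["economy"] * 3

vocab = df_filter(docs, min_df=2)
selected = ga_select(docs, labels, vocab)
print("after DF filter:", vocab)
print("after GA search:", selected)
```

Running the DF filter first shrinks the search space the GA has to explore, which is the usual motivation for combining a cheap filter with a more expensive wrapper-style search. In practice the stand-in fitness would be replaced by the cross-validated accuracy of the actual classifier.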
Pages: 14