Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification

Cited by: 2
Authors
Endalie, Demeke [1]
Haile, Getamesay [1]
Abebe, Wondmagegn Taye [2]
Affiliations
[1] Jimma Inst Technol, Fac Comp & Informat, Jimma, Oromia, Ethiopia
[2] Jimma Inst Technol, Fac Civil & Environm Engn, Jimma, Oromia, Ethiopia
Keywords
Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain; Text classification
DOI
10.7717/peerj-cs.961
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words, which yields a large number of features. Selecting appropriate features therefore becomes important when the initial feature set is large. In this paper, we present a feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate the method on Amharic news documents obtained from the Ethiopian News Agency (ENA), covering 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combined with the Extra Tree Classifier (ETC), it improves classification accuracy by up to 1% over a hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), by 2.47% over GA, and by 3.86% over a hybrid of DF, IG, and CHI.
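The two-stage idea in the abstract (a DF filter followed by a GA search over feature subsets) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the GA settings, and the toy English tokens are all hypothetical, and the fitness function is a simple nearest-neighbour stand-in for the classifier accuracy (the paper reports using the Extra Tree Classifier).

```python
import random
from collections import Counter

def df_filter(docs, min_df=2):
    """Stage 1: keep terms occurring in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, n in df.items() if n >= min_df)

def fitness(mask, vocab, docs, labels):
    """Toy fitness (assumption): leave-one-out nearest-neighbour accuracy
    on shared selected terms, minus a small penalty for subset size."""
    selected = {t for t, bit in zip(vocab, mask) if bit}
    if not selected:
        return 0.0
    reduced = [selected & set(tokens) for tokens in docs]
    correct = 0
    for i, terms in enumerate(reduced):
        others = (k for k in range(len(docs)) if k != i)
        j = max(others, key=lambda k: len(terms & reduced[k]))
        correct += labels[j] == labels[i]
    return correct / len(docs) - 0.01 * len(selected) / len(vocab)

def ga_select(vocab, docs, labels, pop_size=20, generations=15,
              p_mut=0.05, seed=0):
    """Stage 2: GA over bit masks (1 = keep the term)."""
    rng = random.Random(seed)
    n = len(vocab)
    score = lambda m: fitness(m, vocab, docs, labels)
    # Start from the full DF-filtered set plus random subsets.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        next_pop = [pop[0][:], pop[1][:]]  # elitism: keep the two best
        while len(next_pop) < pop_size:
            a = max(rng.sample(pop, 3), key=score)  # tournament selection
            b = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n) if n > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit
                     for bit in child]  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=score)
    return [t for t, bit in zip(vocab, best) if bit]

# Toy corpus (hypothetical English tokens standing in for Amharic news text).
docs = [
    ["match", "goal", "team", "the"],
    ["team", "coach", "goal", "the"],
    ["goal", "league", "team"],
    ["bank", "market", "price", "the"],
    ["market", "price", "trade", "the"],
    ["price", "bank", "market"],
]
labels = ["sport"] * 3 + ["economy"] * 3

vocab = df_filter(docs, min_df=2)
selected = ga_select(vocab, docs, labels)
print("After DF filter:", vocab)
print("After GA search:", selected)
```

Because the initial population includes the full DF-filtered set and the two best masks survive each generation, the subset returned here never scores worse than keeping every DF-filtered term under this toy fitness. With a real corpus, the fitness would more plausibly be cross-validated accuracy of the chosen classifier.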
Pages: 14