An enhanced feature selection method for text classification

Cited: 0
Authors
Kang, Jinbeom [1 ]
Lee, Eunshil [1 ]
Hong, Kwanghee [1 ]
Park, Jeahyun [1 ]
Kim, Taehwan [1 ]
Park, Juyoung [1 ]
Choi, Joongmin [1 ]
Yang, Jaeyoung [1 ]
Affiliations
[1] Hanyang Univ, Dept Comp Sci & Engn, Ansan, Kyunggi Do, South Korea
Keywords
feature selection; impurity of words; unbalanced distribution; machine learning; text classification;
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Feature selection in machine learning is the task of identifying a set of representative terms, or features, from a document collection for use in text classification. Existing feature selection methods, including information gain and the chi-square (χ²) test, focus on features that are useful across all topics, and consequently lack the power to select features that truly represent a particular topic (or class). These methods also assume that the distribution of documents over classes is balanced. This assumption degrades classification accuracy because real-world document collections rarely have a balanced distribution, and it is difficult to prepare a training set with an equal number of documents per class. To resolve this problem, we propose a new feature selection method for text classification based on the purity of a word, which emphasizes its representativeness for a particular class. Our method also assumes an unbalanced distribution of documents over multiple classes, and combines feature values with weight factors that reflect the number of training documents in each class. In summary, we obtain feature candidates using word purity and then select features under the unbalanced document distribution. Experiments demonstrate that our method outperforms existing methods.
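The abstract describes the method only at a high level: score each word by how strongly it concentrates in one class ("purity"), while weighting by class size so that an unbalanced document distribution does not drown out small classes. The paper's actual formulas are not given here, so the following is a minimal illustrative sketch under those assumptions; the function name `purity_scores` and the inverse-class-size weighting are choices made for this example, not the authors' definitions.

```python
# Illustrative sketch of a purity-style feature score with class-size
# weighting; an assumption-based example, not the paper's exact method.
from collections import Counter, defaultdict

def purity_scores(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels.

    A word's 'purity' is taken here as the largest share of its
    size-normalized occurrence rate that falls in a single class.
    Normalizing by class size compensates for unbalanced collections
    (an assumed weighting; the paper's factors may differ).
    """
    class_counts = Counter(labels)          # documents per class
    word_class = defaultdict(Counter)       # word -> class -> doc frequency
    for tokens, y in zip(docs, labels):
        for w in set(tokens):               # count each word once per doc
            word_class[w][y] += 1
    scores = {}
    for w, per_class in word_class.items():
        # occurrence rate per class, weighted by inverse class size
        rates = {c: per_class[c] / class_counts[c] for c in per_class}
        scores[w] = max(rates.values()) / sum(rates.values())  # in (0, 1]
    return scores
```

A score of 1.0 means the word occurs in only one class; words spread evenly across classes score near 1/k for k classes, so ranking by this score and keeping the top words yields class-representative feature candidates.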
Pages: 36-41 (6 pages)
Related papers
50 in total
  • [1] Efficient Method for Feature Selection in Text Classification
    Sun, Jian
    Zhang, Xiang
    Liao, Dan
    Chang, Victor
    [J]. 2017 INTERNATIONAL CONFERENCE ON ENGINEERING AND TECHNOLOGY (ICET), 2017,
  • [2] Text feature selection method for hierarchical classification
    Zhu, Cui-Ling
    Ma, Jun
    Zhang, Dong-Mei
    [J]. Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2011, 24(01): 103-110
  • [3] A new feature selection method for text classification
    Uchyigit, Gulden
    Clark, Keith
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2007, 21(02): 423-438
  • [4] Feature Selection Method of Text Tendency Classification
    Li, Yanling
    Dai, Guanzhong
    Li, Gang
    [J]. FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008: 34+
  • [5] A New Filter Feature Selection Method for Text Classification
    Cekik, Rasim
    [J]. IEEE Access, 2024, 12: 139316-139335
  • [6] Statera: A Balanced Feature Selection Method for Text Classification
    Gama Bispo, Braian Varjao
    Rios, Tatiane Nogueira
    [J]. 2018 7TH BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 2018: 260-265
  • [7] A Hybrid Feature Selection Method For Vietnamese Text Classification
    Nguyen Tri Hai
    Tuan Dinh Le
    Nguyen Hoang Nghia
    Vu Thanh Nguyen
    [J]. 2015 SEVENTH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2015: 91-96
  • [8] A parallel feature selection method study for text classification
    Li, Zhao
    Lu, Wei
    Sun, Zhanquan
    Xing, Weiwei
    [J]. NEURAL COMPUTING & APPLICATIONS, 2017, 28: S513-S524
  • [10] A novel probabilistic feature selection method for text classification
    Uysal, Alper Kursat
    Gunal, Serkan
    [J]. KNOWLEDGE-BASED SYSTEMS, 2012, 36: 226-235