A New Big Data Feature Selection Approach for Text Classification

Cited by: 4
Authors
Amazal, Houda [1 ]
Kissi, Mohamed [1 ]
Affiliations
[1] Univ Hassan II Casablanca, Fac Sci & Technol, Comp Sci Lab, Mohammadia, Morocco
Keywords
PARALLEL FEATURE-SELECTION; ALGORITHM; MODEL
DOI
10.1155/2021/6645345
CLC number
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
Feature selection (FS) is a fundamental task in text classification. Text feature selection aims to represent documents using only the most relevant features. This process can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on developing efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated on single machines. As the dimensionality of textual data grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed feature selection approach based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores term frequency when selecting features. The proposal introduces a distributed FS method, namely Maximum Term Frequency-Mutual Information (MTF-MI), which combines term frequency and mutual information to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naive Bayes classifier on three datasets. Through a series of tests, the results reveal that the proposed MTF-MI method improves classification results compared with four state-of-the-art methods in terms of the macro-F1 and micro-F1 measures.
Pages: 10
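
The abstract describes MTF-MI only at a high level, so the single-machine Python sketch below illustrates the general idea rather than the authors' exact method: terms are scored by mutual information with the class labels, the scores are weighted by a normalized maximum term frequency (an assumed combination, since the paper's MTF-MI formula is not reproduced here), the top-scoring terms are kept, and a multinomial Naive Bayes classifier is evaluated with macro-F1 and micro-F1. The 20 Newsgroups corpus and the feature counts are placeholders for the paper's three datasets, and the Hadoop/MapReduce distribution is omitted.

# Sketch only: an assumed TF-weighted mutual-information selection, not the
# published MTF-MI formula or its MapReduce implementation.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Small stand-in corpus for the paper's three datasets.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos", "talk.politics.misc"],
                          remove=("headers", "footers", "quotes"))
X = CountVectorizer(max_features=20000, stop_words="english").fit_transform(data.data)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Mutual information between each term and the class labels.
mi = mutual_info_classif(X_train, y_train, discrete_features=True, random_state=0)

# Maximum term frequency of each term over the training documents, scaled to [0, 1].
max_tf = X_train.max(axis=0).toarray().ravel()
max_tf = max_tf / max(max_tf.max(), 1)

# Assumed combination: MI weighted by normalized maximum term frequency.
scores = mi * max_tf
top_k = np.argsort(scores)[::-1][:2000]  # keep the 2000 best-scoring terms

# Multinomial Naive Bayes on the selected features, reported with macro-F1 and
# micro-F1 as in the paper's evaluation.
clf = MultinomialNB().fit(X_train[:, top_k], y_train)
pred = clf.predict(X_test[:, top_k])
print("macro-F1:", f1_score(y_test, pred, average="macro"))
print("micro-F1:", f1_score(y_test, pred, average="micro"))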