A New Big Data Feature Selection Approach for Text Classification

Cited by: 4
Authors
Amazal, Houda [1]
Kissi, Mohamed [1]
Affiliations
[1] Univ Hassan II Casablanca, Fac Sci & Technol, Comp Sci Lab, Mohammadia, Morocco
Keywords
PARALLEL FEATURE-SELECTION; ALGORITHM; MODEL;
DOI
10.1155/2021/6645345
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Subject classification code
081202 ; 0835 ;
Abstract
Feature selection (FS) is a fundamental task in text classification. Text feature selection aims to represent documents using only the most relevant features, which can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on developing efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated on single machines. As the dimensionality of textual data grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach to feature selection based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores term frequency when selecting features. The proposed distributed FS method, Maximum Term Frequency-Mutual Information (MTF-MI), combines term frequency and mutual information to improve the quality of the selected features. The approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naive Bayes classifier on three datasets. The results show that MTF-MI improves classification performance over four state-of-the-art methods in terms of macro-F1 and micro-F1.
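To make the MI drawback mentioned in the abstract concrete, the following is a minimal sketch of plain MI-based feature scoring written in a map/reduce style. It is not the paper's MTF-MI method, whose formula the abstract does not give. It assumes the common pointwise form MI(t, c) = log2(N * A / (df(t) * n_c)), where A is the number of documents of class c that contain term t, and takes the maximum over classes. The function names (`map_doc`, `reduce_counts`, `mi_scores`, `select_top_k`) and the toy corpus are hypothetical, and the code runs in a single process rather than on Hadoop.

```python
import math
from collections import Counter
from itertools import chain


def map_doc(label, tokens):
    """Map step: emit ((term, label), 1) once per distinct term in a document."""
    return [((term, label), 1) for term in set(tokens)]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def mi_scores(docs):
    """Return {term: max over classes of pointwise MI(term, class)}.

    Assumed form (not taken from the paper): log2(N * A / (df(t) * n_c)).
    """
    n_docs = len(docs)
    class_df = Counter(label for label, _ in docs)  # documents per class
    term_class_df = reduce_counts(
        chain.from_iterable(map_doc(label, tokens) for label, tokens in docs)
    )
    term_df = Counter()  # documents per term, over all classes
    for (term, _), count in term_class_df.items():
        term_df[term] += count

    scores = {}
    for (term, label), a in term_class_df.items():
        mi = math.log2(a * n_docs / (term_df[term] * class_df[label]))
        scores[term] = max(scores.get(term, -math.inf), mi)
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus of (class label, tokens) pairs.
docs = [
    ("sport", ["goal", "match", "team"]),
    ("sport", ["team", "win", "goal"]),
    ("tech", ["code", "team", "bug"]),
    ("tech", ["code", "release"]),
]
scores = mi_scores(docs)
top_terms = select_top_k(scores, 2)
```

In this toy corpus, "match" occurs in one document and "goal" in two, yet both receive the same score because each appears in only one class. This is the insensitivity to term frequency that the abstract cites as the motivation for combining MI with a term-frequency component.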
Pages: 10
Related papers
50 in total
  • [1] A new approach to feature selection in text classification
    Wang, Y
    Wang, XJ
    [J]. PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 3814 - 3819
  • [2] An online approach for feature selection for classification in big data
    Nazar, Nasrin Banu
    Senthilkumar, Radha
    [J]. TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2017, 25 (01) : 163 - 171
  • [3] Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach
    Peralta, Daniel
    del Rio, Sara
    Ramirez-Gallego, Sergio
    Triguero, Isaac
    Benitez, Jose M.
    Herrera, Francisco
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2015, 2015
  • [4] A new feature selection method for text classification
    Uchyigit, Gulden
    Clark, Keith
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2007, 21 (02) : 423 - 438
  • [5] Effective Text Classification by a Supervised Feature Selection Approach
    Basu, Tanmay
    Murthy, C. A.
    [J]. 12TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2012), 2012, : 918 - 925
  • [6] A new approach to feature selection for text categorization
    Li, SS
    Zong, CQ
    [J]. PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), 2005, : 626 - 630
  • [7] A New Approach of Feature Selection for Text Categorization
    Cui, Zifeng
    [J]. Wuhan University Journal of Natural Sciences, 2006, (05) : 1335 - 1339
  • [8] A Filter Based Feature Set Selection Approach for Big Data Classification of Patient Records
    Vinod, D. Franklin
    Vasudevan, V.
    [J]. 2016 INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS, AND OPTIMIZATION TECHNIQUES (ICEEOT), 2016, : 3684 - 3687
  • [9] Two new feature selection metrics for text classification
    Sahin, Durmus Ozkan
    Kilic, Erdal
    [J]. AUTOMATIKA, 2019, 60 (02) : 162 - 171
  • [10] Feature Selection in Text Classification
    Sahin, Durmus Ozkan
    Ates, Nurullah
    Kilic, Erdal
    [J]. 2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1777 - 1780