The Hybrid Filter Feature Selection Methods for Improving High-Dimensional Text Categorization

Cited by: 13
Authors
Le Nguyen Hoai Nam [1]
Ho Bao Quoc [1]
Affiliations
[1] VNUHCM Univ Sci, Sch Informat Technol, Dept Informat Syst, Ho Chi Minh City, Vietnam
Keywords
Filter feature selection; high dimensionality; text categorization; clustering quality; CLASSIFICATION; ALGORITHM; ACCURACY; CATEGORY;
DOI
10.1142/S021848851750009X
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The bag-of-words technique is often used to represent a document in text categorization. However, for a large set of documents, where the dimension of the bag-of-words vector is very high, text categorization becomes a serious challenge because of sparse data, over-fitting, and irrelevant features. A filter feature selection method reduces the number of features by eliminating irrelevant features from the bag-of-words vector. In this paper, we analyze the strengths and weaknesses of two filter feature selection approaches: the frequency-based approach and the cluster-based approach. Based on this analysis, we propose two hybrid filter feature selection methods, the Frequency-Cluster Feature Selection (FCFS) and the Detailed Frequency-Cluster Feature Selection (DtFCFS), to further improve the performance of filter feature selection in text categorization. The FCFS combines the frequency-based approach and the cluster-based approach, while the DtFCFS, a detailed version of the FCFS, is a comprehensively hybrid cluster-based method. We conduct experiments on four benchmark datasets (the Reuters-21578 and Newsgroup datasets for news classification, the Ohsumed dataset for medical document classification, and the LingSpam dataset for email classification) to compare the proposed methods with six related well-known methods: the Comprehensive Measurement Feature Selection (CMFS), the Optimal Orthogonal Centroid Feature Selection (OCFS), the Crossed Centroid Feature Selection (CIIC), the Information Gain (IG), the Chi-square (CHI), and the Deviation from Poisson Feature Selection (DFPFS). In terms of the Micro-F1, the Macro-F1, and the dimension reduction rate, the DtFCFS is superior to the other methods, while the FCFS performs competitively with, and in some cases better than, the strongest of the compared methods, especially on the Macro-F1.
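To make the idea of a filter feature selection method concrete, the following is a minimal sketch of the Chi-square (CHI) baseline named in the abstract, applied to a toy bag-of-words corpus. It is not the FCFS or DtFCFS method, whose scoring functions the abstract does not specify. The function names (chi_square, select_features), the toy documents, and the choice to aggregate per-class scores by taking the maximum are all assumptions made for illustration.

```python
from collections import Counter

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def select_features(docs, labels, k):
    """Score each term by its maximum chi-square over classes; keep the top k.

    Taking the maximum over classes is one common convention (an assumption here).
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()        # term -> number of documents containing it
    class_doc_freq = Counter()  # (term, class) -> documents of that class containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = class_doc_freq[(term, cls)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["oil", "price", "the"],
    ["oil", "market", "the"],
    ["match", "goal", "the"],
    ["goal", "team", "the"],
]
labels = ["econ", "econ", "sport", "sport"]

selected, scores = select_features(docs, labels, k=2)
print(selected)
```

In this toy corpus, "the" occurs in every document and therefore receives a score of zero, which is the kind of irrelevant feature a filter method is meant to discard before a classifier is trained.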
Pages: 235-265
Number of pages: 31
Related papers
  • [1] Enhanced Filter Feature Selection Methods for Arabic Text Categorization
    Ghareb, Abdullah Saeed
    Abu Bakar, Azuraliza
    Al-Radaideh, Qasem A.
    Hamdan, Abdul Razak
    [J]. INTERNATIONAL JOURNAL OF INFORMATION RETRIEVAL RESEARCH, 2018, 8 (02) : 1 - 24
  • [2] Benchmark for filter methods for feature selection in high-dimensional classification data
    Bommert, Andrea
    Sun, Xudong
    Bischl, Bernd
    Rahnenfuehrer, Joerg
    Lang, Michel
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2020, 143
  • [3] A filter feature selection for high-dimensional data
    Janane, Fatima Zahra
    Ouaderhman, Tayeb
    Chamlal, Hasna
    [J]. JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY, 2023, 17
  • [4] Improving Text Categorization by Multicriteria Feature Selection
    Doan, Son
    Horiguchi, Susumu
    [J]. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2005, 9 (05) : 570 - 575
  • [5] A hybrid feature selection method for text categorization
    Montanes, E.
    Quevedo, J. R.
    Combarro, E. F.
    Diaz, I.
    Ranilla, J.
    [J]. INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2007, 15 (02) : 133 - 151
  • [6] Hybrid Filter and Genetic Algorithm-Based Feature Selection for Improving Cancer Classification in High-Dimensional Microarray Data
    Ali, Waleed
    Saeed, Faisal
    [J]. PROCESSES, 2023, 11 (02)
  • [7] Benchmark of filter methods for feature selection in high-dimensional gene expression survival data
    Bommert, Andrea
    Welchowski, Thomas
    Schmid, Matthias
    Rahnenfuehrer, Joerg
    [J]. BRIEFINGS IN BIOINFORMATICS, 2022, 23 (01)
  • [8] A hybrid feature selection method for high-dimensional data
    Taheri, Nooshin
    Nezamabadi-pour, Hossein
    [J]. 2014 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND KNOWLEDGE ENGINEERING (ICCKE), 2014 : 141 - 145
  • [9] Hybrid Feature Selection for High-Dimensional Manufacturing Data
    Sun, Yajuan
    Yu, Jianlin
    Li, Xiang
    Wu, Ji Yan
    Lu, Wen Feng
    [J]. 2021 26TH IEEE INTERNATIONAL CONFERENCE ON EMERGING TECHNOLOGIES AND FACTORY AUTOMATION (ETFA), 2021
  • [10] A hybrid feature selection scheme for high-dimensional data
    Ganjei, Mohammad Ahmadi
    Boostani, Reza
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 113