Enhancing Effectiveness of Dimension Reduction in Text Classification

Cited by: 4
Authors
Seyyedi, Seyyed Hossein [1]
Minaei-Bidgoli, Behrouz [2]
Affiliations
[1] Islamic Azad Univ, Qazvin Branch, Fac Comp & Informat Technol Engn, Qazvin 3419915195, Iran
[2] Iran Univ Sci & Technol, Sch Comp Engn, Tehran 1684613114, Iran
Keywords
Classification; spam detection; high-dimensionality; feature selection; feature extraction; FEATURE-SELECTION METHOD; LATENT SEMANTIC ANALYSIS; GENETIC ALGORITHM; NEURAL-NETWORK; INFORMATION GAIN; SPAM DETECTION; CATEGORIZATION; FEATURES; SYSTEM; IDENTIFICATION;
DOI
10.1142/S0218213017500087
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Nowadays, text is one of the prevalent forms of data, and text classification is a widely used data mining task with various application fields. One mass-produced instance of text is email. As a communication medium, despite having many advantages, email suffers from a serious problem. The number of spam emails has steadily increased in recent years, leading to considerable irritation. Therefore, spam detection has emerged as a separate field of text classification. A primary challenge of text classification, which is more severe in spam detection and impedes the process, is the high dimensionality of the feature space. Various dimension reduction methods have been proposed that produce a lower-dimensional space than the original. These methods fall mainly into two groups: feature selection and feature extraction. This research deals with dimension reduction in the text classification task and, in particular, performs experiments in the spam detection field. We employ Information Gain (IG) and Chi-square Statistic (CHI) as well-known feature selection methods. We also propose a new feature extraction method called Sprinkled Semantic Feature Space (SSFS). Furthermore, this paper presents a new hybrid method called IG SSFS. In IG SSFS, we combine the selection and extraction processes to reap the benefits of both. To evaluate these methods in the spam detection field, experiments are conducted on some well-known email datasets. According to the results, SSFS demonstrated superior effectiveness over the basic selection methods in terms of improving classifiers' performance, and IG SSFS further enhanced the performance while consuming less processing time.
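The abstract names Information Gain (IG) and Chi-square Statistic (CHI) as the baseline feature selection methods but gives no formulas. The following is a minimal sketch of the standard textbook definitions, applied to a binary "term present / term absent" feature. It is not the authors' implementation and it does not cover SSFS or IG SSFS. The function names (`entropy`, `term_scores`, `select_top_k`) and the four-document toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def term_scores(docs, labels, term):
    """Return (information_gain, chi_square) for a binary term-presence feature.

    docs is a list of token sets, labels the matching list of class labels.
    Both scores use textbook definitions (an assumption, not the paper's code).
    """
    n = len(docs)
    classes = sorted(set(labels))
    present, absent = Counter(), Counter()  # per-class document counts
    for tokens, label in zip(docs, labels):
        if term in tokens:
            present[label] += 1
        else:
            absent[label] += 1
    n_present = sum(present.values())
    n_absent = n - n_present

    # IG = H(class) - H(class | term presence)
    h_class = entropy([present[c] + absent[c] for c in classes])
    h_cond = 0.0
    if n_present:
        h_cond += (n_present / n) * entropy([present[c] for c in classes])
    if n_absent:
        h_cond += (n_absent / n) * entropy([absent[c] for c in classes])
    ig = h_class - h_cond

    # Chi-square: sum over contingency cells of (observed - expected)^2 / expected
    chi = 0.0
    for c in classes:
        n_c = present[c] + absent[c]
        for observed, n_t in ((present[c], n_present), (absent[c], n_absent)):
            expected = n_c * n_t / n
            if expected > 0:
                chi += (observed - expected) ** 2 / expected
    return ig, chi

def select_top_k(docs, labels, k, by="ig"):
    """Rank the vocabulary by IG (by="ig") or chi-square and keep the top k terms."""
    vocab = sorted({t for d in docs for t in d})
    idx = 0 if by == "ig" else 1
    ranked = sorted(vocab, key=lambda t: term_scores(docs, labels, t)[idx],
                    reverse=True)
    return ranked[:k]

# Hypothetical toy corpus: two spam and two legitimate ("ham") messages.
docs = [
    {"win", "free", "prize"},
    {"free", "offer", "now"},
    {"meeting", "agenda", "now"},
    {"project", "meeting", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]

if __name__ == "__main__":
    print(term_scores(docs, labels, "free"))
    print(select_top_k(docs, labels, k=2, by="chi"))
```

In this toy corpus, a term such as "free" that appears in every spam message and no legitimate one separates the classes perfectly, whereas "now" appears equally in both classes and carries no class information under either score. Feature selection keeps only the highest-ranked terms, which is how it shrinks the feature space before classification.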
Pages: 21
Related papers
50 in total
  • [21] Effects of Dimension Reduction In Mammograms Classification
    Oral, Canan
    Sezgin, Hatice
    [J]. 2013 8TH INTERNATIONAL CONFERENCE ON ELECTRICAL AND ELECTRONICS ENGINEERING (ELECO), 2013, : 630 - 633
  • [22] Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering
    Kadhim, Ammar Ismael
    Cheah, Yu-N
    Ahamed, Nurul Hashimah
    [J]. PROCEEDINGS 2014 4TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE WITH APPLICATIONS IN ENGINEERING AND TECHNOLOGY ICAIET 2014, 2014, : 69 - 73
  • [23] Towards Dimension Reduction: A Balanced Relative Discrimination Feature Ranking Technique for Efficient Text Classification (BRDC)
    Nasir, Muhammad
    Samsudin, Noor Azah
    Sharif, Wareesa
    Baowidan, Souad
    Arshad, Humaira
    Mushtaq, Muhammad Faheem
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (07) : 676 - 688
  • [24] Enhancing text classification using synopses extraction
    Ma, LP
    Shepherd, J
    Zhang, YC
    [J]. FOURTH INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS ENGINEERING, PROCEEDINGS, 2003, : 115 - 124
  • [25] A Parallel Algorithm for Bayesian Text Classification Based on Noise Elimination and Dimension Reduction in Spark Computing Environment
    Tang, Zhuo
    Xiao, Wei
    Lu, Bin
    Zuo, Youfei
    Zhou, Yuan
    Li, Keqin
    [J]. CLOUD COMPUTING - CLOUD 2019, 2019, 11513 : 222 - 239
  • [26] Feature Transformation and Reduction for Text Classification
    Ferreira, Artur J.
    Figueiredo, Mario A. T.
    [J]. PATTERN RECOGNITION IN INFORMATION SYSTEMS, 2010, : 72 - 81
  • [27] Abstracting for Dimensionality Reduction in Text Classification
    McAllister, Richard A.
    Angryk, Rafal A.
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2013, 28 (02) : 115 - 138
  • [28] Feature reduction methods for text classification
    Wu, Di
    Zhang, Yaping
    Wang, Xin
    [J]. Journal of Computational Information Systems, 2008, 4 (02) : 495 - 502
  • [29] Enhancing text analysis via dimensionality reduction
    Underhill, David G.
    McDowell, Luke K.
    Marchette, David J.
    Solka, Jeffrey L.
    [J]. IRI 2007: PROCEEDINGS OF THE 2007 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION, 2007, : 348 - +
  • [30] A Novel Algebra to Articulate Feature in Text Dimension Reduction
    Guo, Xin
    Xiang, Yang
    Chen, Qian
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING (GRC), 2013, : 132 - 136