Enhancing Effectiveness of Dimension Reduction in Text Classification

Cited by: 4
Authors
Seyyedi, Seyyed Hossein [1]
Minaei-Bidgoli, Behrouz [2]
Affiliations
[1] Islamic Azad Univ, Qazvin Branch, Fac Comp & Informat Technol Engn, Qazvin 3419915195, Iran
[2] Iran Univ Sci & Technol, Sch Comp Engn, Tehran 1684613114, Iran
Keywords
Classification; spam detection; high-dimensionality; feature selection; feature extraction; FEATURE-SELECTION METHOD; LATENT SEMANTIC ANALYSIS; GENETIC ALGORITHM; NEURAL-NETWORK; INFORMATION GAIN; SPAM DETECTION; CATEGORIZATION; FEATURES; SYSTEM; IDENTIFICATION;
DOI
10.1142/S0218213017500087
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text is now one of the prevalent forms of data, and text classification is a widely used data mining task with many application fields. One mass-produced instance of text is email. Despite its many advantages as a communication medium, email suffers from a serious problem: the number of spam emails has steadily increased in recent years, causing considerable irritation. Spam detection has therefore emerged as a separate field of text classification. A primary challenge of text classification, which is more severe in spam detection and impedes the process, is the high dimensionality of the feature space. Various dimension reduction methods have been proposed that produce a lower-dimensional space than the original. These methods fall mainly into two groups: feature selection and feature extraction. This research deals with dimension reduction in the text classification task and performs its experiments in the spam detection field. We employ Information Gain (IG) and Chi-square Statistic (CHI) as well-known feature selection methods. We also propose a new feature extraction method called Sprinkled Semantic Feature Space (SSFS). Furthermore, this paper presents a new hybrid method called IG SSFS, which combines the selection and extraction processes to reap the benefits of both. To evaluate these methods in the spam detection field, experiments are conducted on several well-known email datasets. According to the results, SSFS was more effective than the basic selection methods at improving classifiers' performance, and IG SSFS further enhanced performance while consuming less processing time.
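As an illustration of the feature selection step named in the abstract, the following is a minimal sketch of Information Gain (IG) scoring for binary term presence, followed by top-k term selection. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `select_top_k`), the toy documents, and the `spam`/`ham` labels are all assumptions made for this example, and the paper's SSFS extraction step is not reproduced here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus class entropy conditioned on term presence.

    docs is a list of sets of terms (hypothetical toy representation).
    """
    n = len(docs)
    h_class = entropy(list(Counter(labels).values()))
    # Split the class labels by whether the term occurs in the document.
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    h_cond = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            h_cond += (size / n) * entropy(list(part.values()))
    return h_class - h_cond


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy corpus, invented for illustration only.
docs = [
    {"free", "offer", "click"},
    {"free", "winner", "click"},
    {"meeting", "agenda", "notes"},
    {"meeting", "offer", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]

top_terms = select_top_k(docs, labels, 2)
print(top_terms)
```

In this toy corpus, a term such as `offer` that appears equally often in both classes scores zero, while a term confined to one class scores highest. Selecting only the top-ranked terms is what shrinks the feature space before a classifier, or a subsequent extraction step, is applied.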
Pages: 21