Two-step based hybrid feature selection method for spam filtering

Cited by: 9
Authors
Wang, Youwei [1]
Liu, Yuanning [1]
Zhu, Xiaodong [1]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature selection; spam filtering; particle swarm optimization; convergence rate; Support Vector Machine; Naive Bayesian; CLASSIFICATION; ALGORITHM;
DOI
10.3233/IFS-141240
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection, which can reduce the dimensionality of the vector space without sacrificing classifier performance, is commonly used in spam filtering. Because many classifiers cannot handle high-dimensional feature spaces, noisy, irrelevant and redundant features should be removed. In this paper, a two-step based hybrid feature selection method, called TFSM, is proposed. First, the most discriminative features are selected by an existing document frequency based feature selection method (called ODFFS). Second, the remaining features are selected by combining ODFFS with a newly proposed term frequency based feature selection method (called NTFFS). Moreover, we propose a new meta-heuristic optimization method, called GOPSO, to improve the convergence rate of standard particle swarm optimization. In the experiments, Support Vector Machine (SVM) and Naive Bayesian (NB) classifiers are used on four corpora: PU2, PU3, Enron-spam and Trec2007. The experimental results show that TFSM is significantly superior to information gain, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain and the improved term frequency inverse document frequency method on all four corpora when SVM and NB are applied respectively.
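The two-step structure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the ODFFS or NTFFS formulas, so the scoring functions below (`df_score`, `tf_score`) and the way they are combined are hypothetical stand-ins chosen only to show the control flow; they are not the authors' definitions, and the names `two_step_select`, `k_first` and `k_total` are likewise assumptions.

```python
from collections import Counter


def df_score(docs, labels):
    """Stand-in document-frequency score (assumption, not ODFFS):
    |fraction of spam docs containing the term - fraction of ham docs containing it|."""
    n_spam = sum(1 for y in labels if y == 1)
    n_ham = len(labels) - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # set(tokens) counts each term at most once per document
        (spam_df if y == 1 else ham_df).update(set(tokens))
    vocab = set(spam_df) | set(ham_df)
    return {t: abs(spam_df[t] / max(n_spam, 1) - ham_df[t] / max(n_ham, 1))
            for t in vocab}


def tf_score(docs, labels):
    """Stand-in term-frequency score (assumption, not NTFFS):
    |share of spam tokens that are the term - share of ham tokens that are the term|."""
    spam_tf, ham_tf = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (spam_tf if y == 1 else ham_tf).update(tokens)
    spam_total = max(sum(spam_tf.values()), 1)
    ham_total = max(sum(ham_tf.values()), 1)
    vocab = set(spam_tf) | set(ham_tf)
    return {t: abs(spam_tf[t] / spam_total - ham_tf[t] / ham_total)
            for t in vocab}


def two_step_select(docs, labels, k_first, k_total):
    """Step 1: take the k_first terms with the highest document-frequency score.
    Step 2: fill up to k_total terms using a combined DF + TF score."""
    df = df_score(docs, labels)
    tf = tf_score(docs, labels)

    # Step 1: rank by DF score; ties are broken alphabetically for determinism.
    first = sorted(df, key=lambda t: (-df[t], t))[:k_first]
    chosen = set(first)

    # Step 2: combine max-normalised scores (a hypothetical combination rule).
    df_max = max(df.values(), default=0.0) or 1.0
    tf_max = max(tf.values(), default=0.0) or 1.0
    combined = {t: df[t] / df_max + tf[t] / tf_max
                for t in df if t not in chosen}
    n_second = max(k_total - len(first), 0)
    second = sorted(combined, key=lambda t: (-combined[t], t))[:n_second]
    return first + second


if __name__ == "__main__":
    # Toy corpus: label 1 = spam, label 0 = legitimate mail.
    docs = [["free", "money", "free"], ["free", "offer"],
            ["meeting", "notes"], ["meeting", "money"]]
    labels = [1, 1, 0, 0]
    print(two_step_select(docs, labels, k_first=2, k_total=3))
```

The selected terms would then be used to build the feature vectors passed to a classifier such as SVM or NB. The split between `k_first` and `k_total` is a free parameter in this sketch; the abstract does not state how the paper sets it.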
Pages: 2785-2796
Number of pages: 12