A study of spam filtering using support vector machines

Cited by: 69
Authors
Amayri, Ola [1]
Bouguila, Nizar [1]
Affiliations
[1] Concordia Univ, Concordia Inst Informat Syst Engn, Montreal, PQ, Canada
Keywords
Spam filtering; Support vector machines; String kernels; Feature mapping; Online active; Classification
DOI
10.1007/s10462-010-9166-x
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Electronic mail is displacing traditional communication systems because it is convenient, economical, fast, and easy to use. A major bottleneck in electronic communication is the massive dissemination of unwanted, harmful emails known as spam. A major concern is therefore the development of suitable filters that can adequately capture those emails and achieve a high performance rate. Machine learning (ML) researchers have developed many approaches to tackle this problem. Within machine learning, support vector machines (SVMs) have made a large contribution to the development of spam email filtering. Based on SVMs, different schemes have been proposed through text classification (TC) approaches. A crucial problem when using SVMs is the choice of kernel, as it directly affects the separation of emails in the feature space. This paper presents a thorough investigation of several distance-based kernels and specifies spam filtering behaviors using SVMs. The majority of kernels used in recent studies concern continuous data and neglect the structure of the text. In contrast to classical kernels, we propose the use of various string kernels for spam filtering and show how effectively string kernels suit the spam filtering problem. Data preprocessing is also a vital part of text classification, where the objective is to generate feature vectors usable by SVM kernels. We detail feature mapping variants in TC that yield improved performance for the standard SVM in the filtering task. Furthermore, to cope with real-time scenarios, we propose an online active framework for spam filtering. We present empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in real time. We show that the active online method using string kernels achieves higher precision and recall rates.
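To make the idea of a string kernel concrete, the following is a minimal sketch of a p-spectrum kernel, one common member of the string kernel family. The abstract does not say which string kernels the paper evaluates, so the choice of kernel, the function names, and the default p = 3 are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text: str, p: int = 3) -> Counter:
    """Count every contiguous substring of length p in text.

    Hypothetical helper; texts shorter than p yield an empty Counter.
    """
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(a: str, b: str, p: int = 3) -> int:
    """p-spectrum kernel: inner product of the two substring-count vectors."""
    fa, fb = spectrum_features(a, p), spectrum_features(b, p)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller feature map
    return sum(count * fb[sub] for sub, count in fa.items())


def normalized_kernel(a: str, b: str, p: int = 3) -> float:
    """Cosine-normalized kernel, so message length does not dominate."""
    denom = sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom else 0.0
```

Because the kernel operates directly on character sequences, no tokenization or feature-vector construction is required. A Gram matrix built from `normalized_kernel` over a set of emails could, for instance, be handed to an SVM implementation that accepts precomputed kernels.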
Pages: 73-108
Page count: 36
Related papers
50 in total
  • [1] A study of spam filtering using support vector machines
    Ola Amayri
    Nizar Bouguila
    [J]. Artificial Intelligence Review, 2010, 34 : 73 - 108