Words versus character N-grams for anti-spam filtering

Cited by: 53
Authors
Kanaris, Ioannis [1]
Kanaris, Konstantinos [1]
Houvardas, Ioannis [1]
Stamatatos, Efstathios [1]
Affiliations
[1] Univ Aegean, Dept Informat & Commun Syst Engn, Karlovassi 83200, Samos, Greece
Keywords
anti-spam filtering; machine learning; n-grams
DOI
10.1142/S0218213007003692
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
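The contrast the abstract draws between word tokens and character n-grams can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `char_ngrams` and `word_tokens`, the deliberately naive regex tokenizer, and the two sample strings are all assumptions made for illustration only.

```python
import re
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_tokens(text: str) -> Counter:
    """Count lowercase alphabetic runs (a deliberately naive tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical messages: the second inserts punctuation inside words,
# the kind of obfuscation the abstract says is used to confuse tokenizers.
plain = "cheap meds online"
obfuscated = "che.ap me-ds onl!ne"

# Features the two messages have in common under each representation.
shared_words = set(word_tokens(plain)) & set(word_tokens(obfuscated))
shared_bigrams = set(char_ngrams(plain, 2)) & set(char_ngrams(obfuscated, 2))

print("shared word tokens:", sorted(shared_words))
print("shared character 2-grams:", sorted(shared_bigrams))
```

With these sample strings the naive tokenizer finds no word in common between the two messages, while a number of character 2-grams such as "ch" and "me" survive the inserted punctuation. The sketch only shows why a tokenizer-free representation can be more robust to this kind of obfuscation; it says nothing about the classifiers, corpora, or variable-length n-gram extraction method evaluated in the paper.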
Pages: 1047-1067
Number of pages: 21
Related papers
50 in total
  • [41] Feature selection on Chinese text classification using character n-grams
    Wei, Zhihua
    Miao, Duoqian
    Chauchat, Jean-Hugues
    Zhong, Caiming
    ROUGH SETS AND KNOWLEDGE TECHNOLOGY, 2008, 5009 : 500 - +
  • [42] A comparative performance study of feature selection methods for the anti-spam filtering domain
    Mendez, J. R.
    Fdez-Riverola, F.
    Diaz, F.
    Iglesias, E. L.
    Corchado, J. M.
    ADVANCES IN DATA MINING: APPLICATIONS IN MEDICINE, WEB MINING, MARKETING, IMAGE AND SIGNAL MINING, 2006, 4065 : 106 - 120
  • [43] Factorial design analysis applied to the performance of SMS anti-spam filtering systems
    Aragao, Marcelo V. C.
    Frigieri, Edielson Prevato
    Ynoguti, Carlos A.
    Paiva, Anderson P.
    EXPERT SYSTEMS WITH APPLICATIONS, 2016, 64 : 589 - 604
  • [44] Evolutionary Multi-objective Scheduling for Anti-Spam Filtering Throughput Optimization
    Ruano-Ordas, David
    Basto-Fernandes, Vitor
    Yevseyeva, Iryna
    Ramon Mendez, Jose
    HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2017, 2017, 10334 : 137 - 148
  • [45] Effect of Cost Parameters Adjustment on the Accuracy of Bayesian Anti-Spam Filtering System
    Cui C.
    Lü D.
    Jiang S.-F.
    Beijing Ligong Daxue Xuebao/Transaction of Beijing Institute of Technology, 2019, 39 (02): : 142 - 146
  • [46] Research on advanced filtering algorithm for anti-spam based on a Bayesian classification model
    Zhen, L
    Liang, T
    Kun, S
    Zhou, MT
    Wavelet Analysis and Active Media Technology Vols 1-3, 2005, : 81 - 86
  • [47] Anti-Spam Filtering Using Neural Networks and Baysian Classifiers
    Yang, Yue
    Elfayoumy, Sherif
    2007 INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION, 2007, : 545 - +
  • [48] Relative N-Gram Signatures: Document Visualization at the Level of Character N-Grams
    Jankowska, Magdalena
    Keselj, Vlado
    Milios, Evangelos
    2012 IEEE CONFERENCE ON VISUAL ANALYTICS SCIENCE AND TECHNOLOGY (VAST), 2012, : 103 - 112
  • [49] Measuring similarity between Karel programs using character and word n-grams
    G. Sidorov
    M. Ibarra Romero
    I. Markov
    R. Guzman-Cabrera
    L. Chanona-Hernández
    F. Velásquez
    Programming and Computer Software, 2017, 43 : 47 - 50
  • [50] Integrating visual words as bunch of n-grams for effective biomedical image classification
    Pedrosa, Glauco V.
    Rahman, Md Mahmudur
    Antani, Sameer K.
    Demner-Fushman, Dina
    Long, L. Rodney
    Traina, Agma J. M.
    2014 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2014, : 431 - 436