Words versus character N-grams for anti-spam filtering

Cited by: 53
Authors
Kanaris, Ioannis [1]
Kanaris, Konstantinos [1]
Houvardas, Ioannis [1]
Stamatatos, Efstathios [1]
Affiliations
[1] Univ Aegean, Dept Informat & Commun Syst Engn, Karlovassi 83200, Samos, Greece
Keywords
anti-spam filtering; machine learning; n-grams
DOI
10.1142/S0218213007003692
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
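The contrast the abstract draws between word tokens and character n-grams can be illustrated with a minimal sketch. This is not the authors' implementation: the naive tokenizer, the fixed n-gram length of 3, the `overlap` measure, and the two sample messages are all assumptions chosen for illustration, and the variable-length n-gram method mentioned in the abstract is not reproduced here.

```python
import re
from collections import Counter


def word_tokens(text):
    """Naive tokenizer (an assumption): split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def char_ngrams(text, n):
    """Return every overlapping character n-gram of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def overlap(a, b):
    """Fraction of the features in `a` (with multiplicity) that also occur in `b`."""
    ca, cb = Counter(a), Counter(b)
    shared = sum((ca & cb).values())
    return shared / max(1, sum(ca.values()))


# Hypothetical messages: the second one inserts punctuation to confuse a tokenizer.
clean = "buy cheap viagra now"
obfuscated = "buy ch.eap v-iagra now"

word_overlap = overlap(word_tokens(clean), word_tokens(obfuscated))
gram_overlap = overlap(char_ngrams(clean, 3), char_ngrams(obfuscated, 3))
print(f"word-token overlap:  {word_overlap:.2f}")
print(f"char 3-gram overlap: {gram_overlap:.2f}")
```

In this toy case the inserted punctuation splits the obfuscated words into unfamiliar tokens, while most character 3-grams of the clean message still appear in the obfuscated one. The cost is a larger feature set, which matches the dimensionality trade-off the abstract mentions.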
Pages: 1047-1067
Page count: 21
Related papers
50 in total
  • [1] Detection of Opinion Spam with Character n-grams
    Hernandez Fusilier, Donato
    Montes-y-Gomez, Manuel
    Rosso, Paolo
    Guzman Cabrera, Rafael
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 285 - 294
  • [2] Spam detection using character N-grams
    Kanaris, Ioannis
    Kanaris, Konstantinos
    Stamatatos, Efstathios
    ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 3955 : 95 - 104
  • [3] Overview of textual anti-spam filtering techniques
    Subramaniam, Thamarai
    Jalab, Hamid A.
    Taqa, Alaa Y.
    INTERNATIONAL JOURNAL OF THE PHYSICAL SCIENCES, 2010, 5 (12) : 1869 - 1882
  • [4] Using visual features for anti-SPAM filtering
    Wu, CT
    Cheng, KT
    Zhu, Q
    Wu, KL
    2005 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), VOLS 1-5, 2005 : 2925 - 2928
  • [5] Anti-spam filtering using neural networks
    Elfayoumy, S
    Yang, Y
    Ahuja, S
    IC-AI '04 & MLMTA'04, VOL 1 AND 2, PROCEEDINGS, 2004 : 984 - 989
  • [6] A suffix tree approach to anti-spam email filtering
    Rajesh Pampapathi
    Boris Mirkin
    Mark Levene
    Machine Learning, 2006, 65 : 309 - 338
  • [7] Research in Anti-Spam Method Based on Bayesian Filtering
    Wu, Jiansheng
    Deng, Tao
    PACIIA: 2008 PACIFIC-ASIA WORKSHOP ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION, VOLS 1-3, PROCEEDINGS, 2008 : 1838 - 1842
  • [8] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
    Lecluze, Charlotte
    Rigouste, Lois
    Giguet, Emmanuel
    Lucas, Nadine
    CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 473 - 481
  • [9] An evaluation of naive Bayesian anti-spam filtering techniques
    Deshpande, Vikas P.
    Erbacher, Robert F.
    Harris, Chris
    2007 IEEE INFORMATION ASSURANCE WORKSHOP, 2007, : 333 - +
  • [10] Combining SVM classifiers for email anti-spam filtering
    Blanco, Angela
    Maria Ricket, Alba
    Martin-Merino, Manuel
    COMPUTATIONAL AND AMBIENT INTELLIGENCE, 2007, 4507 : 903 - +