Words versus character N-grams for anti-spam filtering

Cited by: 53

Authors:

Kanaris, Ioannis [1]

Kanaris, Konstantinos [1]

Houvardas, Ioannis [1]

Stamatatos, Efstathios [1]

Affiliation:

[1] Univ Aegean, Dept Informat & Commun Syst Engn, Karlovassi 83200, Samos, Greece

Source:

INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS | 2007 / Vol. 16 / No. 06

Keywords:

anti-spam filtering; machine learning; n-grams

DOI:

10.1142/S0218213007003692

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
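The contrast the abstract draws between word tokens and character n-grams can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the fixed n-gram length, and the example strings are all assumptions made for illustration, and the paper's variable-length n-gram selection method is not reproduced here.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous substring of length n (no tokenizer needed)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_tokens(text: str) -> Counter:
    """Naive whitespace tokenizer, assumed here only for contrast."""
    return Counter(text.split())


def shared_features(a: Counter, b: Counter) -> set:
    """Features present in both representations."""
    return set(a) & set(b)


if __name__ == "__main__":
    plain = "viagra"
    obfuscated = "vi@gra"  # hypothetical obfuscation with a special character

    # The word-level representations have no feature in common.
    print(shared_features(word_tokens(plain), word_tokens(obfuscated)))

    # The character 2-gram representations still overlap.
    print(shared_features(char_ngrams(plain, 2), char_ngrams(obfuscated, 2)))
```

In this toy case the obfuscated word becomes an entirely different word token, while several of its character 2-grams are unchanged. The price is a larger feature set, which is the dimensionality increase the abstract mentions.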

Pages: 1047 - 1067

Page count: 21

50 records in total

[1] Detection of Opinion Spam with Character n-grams
Hernandez Fusilier, Donato
Montes-y-Gomez, Manuel
Rosso, Paolo
Guzman Cabrera, Rafael
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 285 - 294
[2] Spam detection using character N-grams
Kanaris, Ioannis
Kanaris, Konstantinos
Stamatatos, Efstathios
ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 3955 : 95 - 104
[3] Overview of textual anti-spam filtering techniques
Subramaniam, Thamarai
Jalab, Hamid A.
Taqa, Alaa Y.
INTERNATIONAL JOURNAL OF THE PHYSICAL SCIENCES, 2010, 5 (12) : 1869 - 1882
[4] Using visual features for anti-SPAM filtering
Wu, CT
Cheng, KT
Zhu, Q
Wu, KL
2005 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), VOLS 1-5, 2005 : 2925 - 2928
[5] Anti-spam filtering using neural networks
Elfayoumy, S
Yang, Y
Ahuja, S
IC-AI '04 & MLMTA'04, VOL 1 AND 2, PROCEEDINGS, 2004 : 984 - 989
[6] A suffix tree approach to anti-spam email filtering
Rajesh Pampapathi
Boris Mirkin
Mark Levene
Machine Learning, 2006, 65 : 309 - 338
[7] Research in Anti-Spam Method Based on Bayesian Filtering
Wu, Jiansheng
Deng, Tao
PACIIA: 2008 PACIFIC-ASIA WORKSHOP ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION, VOLS 1-3, PROCEEDINGS, 2008 : 1838 - 1842
[8] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
Lecluze, Charlotte
Rigouste, Lois
Giguet, Emmanuel
Lucas, Nadine
CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 473 - 481
[9] An evaluation of naive Bayesian anti-spam filtering techniques
Deshpande, Vikas P.
Erbacher, Robert F.
Harris, Chris
2007 IEEE INFORMATION ASSURANCE WORKSHOP, 2007 : 333 - +
[10] Combining SVM classifiers for email anti-spam filtering
Blanco, Angela
Maria Ricket, Alba
Martin-Merino, Manuel
COMPUTATIONAL AND AMBIENT INTELLIGENCE, 2007, 4507 : 903 - +
