Words versus character N-grams for anti-spam filtering

Cited by: 53

Authors:

Kanaris, Ioannis [1]

Kanaris, Konstantinos [1]

Houvardas, Ioannis [1]

Stamatatos, Efstathios [1]

Affiliations:

[1] Univ Aegean, Dept Informat & Commun Syst Engn, Karlovassi 83200, Samos, Greece

Source:

INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS | 2007 / Volume 16 / Issue 06

Keywords:

anti-spam filtering; machine learning; n-grams

DOI:

10.1142/S0218213007003692

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
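The contrast the abstract draws between word tokens and character n-grams can be illustrated with a minimal Python sketch. It is not the authors' method: the naive tokenizer, the fixed n = 3, the Jaccard overlap measure, and the two sample strings are all assumptions chosen for illustration, and the variable-length n-gram extraction proposed in the paper is not reproduced here.

```python
import re


def word_tokens(text):
    """Naive tokenizer (an assumption): split on any non-alphanumeric run."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def char_ngrams(text, n):
    """All overlapping character n-grams of fixed length n; no tokenizer needed."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(a, b):
    """Set overlap between two feature lists, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical messages: the second inserts punctuation to break up words.
plain = "buy cheap viagra now"
obfuscated = "buy che.ap v-iagra now"

# Similarity of the two messages under each representation.
word_sim = jaccard(word_tokens(plain), word_tokens(obfuscated))
gram_sim = jaccard(char_ngrams(plain, 3), char_ngrams(obfuscated, 3))

print(f"word-token overlap:      {word_sim:.2f}")
print(f"character 3-gram overlap: {gram_sim:.2f}")
```

In this toy case the inserted punctuation splits the obfuscated words into unfamiliar tokens, while most character 3-grams survive, so the n-gram representation keeps more of the overlap between the two messages. The price is a larger feature set, which matches the dimensionality increase the abstract mentions.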


Pages: 1047 - 1067

Page count: 21

50 items in total

[41] Feature selection on Chinese text classification using character n-grams
Wei, Zhihua
Miao, Duoqian
Chauchat, Jean-Hugues
Zhong, Caiming
ROUGH SETS AND KNOWLEDGE TECHNOLOGY, 2008, 5009 : 500 - +
[42] A comparative performance study of feature selection methods for the anti-spam filtering domain
Mendez, J. R.
Fdez-Riverola, F.
Diaz, F.
Iglesias, E. L.
Corchado, J. M.
ADVANCES IN DATA MINING: APPLICATIONS IN MEDICINE, WEB MINING, MARKETING, IMAGE AND SIGNAL MINING, 2006, 4065 : 106 - 120
[43] Factorial design analysis applied to the performance of SMS anti-spam filtering systems
Aragao, Marcelo V. C.
Frigieri, Edielson Prevato
Ynoguti, Carlos A.
Paiva, Anderson P.
EXPERT SYSTEMS WITH APPLICATIONS, 2016, 64 : 589 - 604
[44] Evolutionary Multi-objective Scheduling for Anti-Spam Filtering Throughput Optimization
Ruano-Ordas, David
Basto-Fernandes, Vitor
Yevseyeva, Iryna
Ramon Mendez, Jose
HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2017, 2017, 10334 : 137 - 148
[45] Effect of Cost Parameters Adjustment on the Accuracy of Bayesian Anti-Spam Filtering System
Cui C.
Lü D.
Jiang S.-F.
Beijing Ligong Daxue Xuebao/Transaction of Beijing Institute of Technology, 2019, 39 (02) : 142 - 146
[46] Research on advanced filtering algorithm for anti-spam based on a Bayesian classification model
Zhen, L
Liang, T
Kun, S
Zhou, MT
Wavelet Analysis and Active Media Technology Vols 1-3, 2005, : 81 - 86
[47] Anti-Spam Filtering Using Neural Networks and Baysian Classifiers
Yang, Yue
Elfayoumy, Sherif
2007 INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION, 2007, : 545 - +
[48] Relative N-Gram Signatures: Document Visualization at the Level of Character N-Grams
Jankowska, Magdalena
Keselj, Vlado
Milios, Evangelos
2012 IEEE CONFERENCE ON VISUAL ANALYTICS SCIENCE AND TECHNOLOGY (VAST), 2012, : 103 - 112
[49] Measuring similarity between Karel programs using character and word n-grams
G. Sidorov
M. Ibarra Romero
I. Markov
R. Guzman-Cabrera
L. Chanona-Hernández
F. Velásquez
Programming and Computer Software, 2017, 43 : 47 - 50
[50] Integrating visual words as bunch of n-grams for effective biomedical image classification
Pedrosa, Glauco V.
Rahman, Md Mahmudur
Antani, Sameer K.
Demner-Fushman, Dina
Long, L. Rodney
Traina, Agma J. M.
2014 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2014, : 431 - 436
