Spam filtering using Kolmogorov complexity analysis

Cited by: 2
Authors
Richard, G. [2]
Doncescu, A. [1]
Affiliations
[1] Univ Toulouse, CNRS, LAAS, Toulouse, France
[2] Univ Toulouse, IRIT, Toulouse, France
Keywords
spam; Kolmogorov complexity; compression; clustering; k-nearest neighbours
DOI
10.1504/IJWGS.2008.018500
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
One of the most undesirable side effects of e-commerce technology is the rise of spamming as an e-marketing technique. Spam e-mails (unsolicited commercial e-mails) are a burden for everyone with an electronic mailbox. Detecting and filtering spam is therefore a challenging task, and many approaches have been developed to identify spam before it reaches the end user's mailbox. In this paper, we focus on a relatively new approach whose foundations rely on the work of A. Kolmogorov. The main idea is to give a formal meaning to the notion of 'information content' and to provide a measure of this content. With such a quantitative measure, it becomes possible to define a distance, which is a major tool for classification. To validate our approach, we proceed in two steps. First, we apply the classical compression distance to a mix of spam and legitimate e-mails to check whether they can be properly clustered without any supervision. This turned out to be the case, which highlights a kind of underlying structure in spam e-mails. Second, we implemented a k-nearest neighbours algorithm that achieves an accuracy of 85%. Coupled with other anti-spam techniques, compression-based methods could be of great help in the spam filtering challenge.
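The two ingredients named in the abstract, a compression-based distance and a k-nearest neighbours vote, can be illustrated with a minimal sketch. It assumes that the 'classical compression distance' is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The abstract does not say which compressor the authors used, so zlib is only a stand-in here, and the names compressed_size, ncd and knn_classify are hypothetical.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings (assumed definition)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(message: bytes, labelled: list[tuple[bytes, str]], k: int = 3) -> str:
    """Return the majority label among the k labelled messages closest to `message`."""
    nearest = sorted(labelled, key=lambda item: ncd(message, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Tiny illustrative corpus (made-up messages, not data from the paper).
corpus = [
    (b"Buy cheap pills now! Limited offer, click here to order today.", "spam"),
    (b"You have won a free prize! Click here to claim your reward now.", "spam"),
    (b"The project meeting is moved to Thursday at 10am, agenda attached.", "ham"),
    (b"Please review the attached draft of the report before Friday.", "ham"),
]
label = knn_classify(corpus[0][0], corpus, k=1)
```

A real compressor only approximates Kolmogorov complexity, so the distance is noisy on very short messages. The 85% accuracy quoted in the abstract refers to the authors' own experiments, not to this sketch.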
Pages: 136-148
Number of pages: 13