Spam detection using character N-grams

被引:0
|
作者
Kanaris, Ioannis [1 ]
Kanaris, Konstantinos
Stamatatos, Efstathios
机构
[1] Univ Aegean, Dept Informat & Commun Syst Engn, GR-83200 Karlovassi, Greece
[2] Univ Aegean, Dept Math, GR-83200 Karlovassi, Greece
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation which avoids the sparse data problem that arises in n-grams on the word-level. Moreover, it is language-independent and does not require any lemmatizer or 'deep' text preprocessing. Based on experiments on Ling-Spam corpus we evaluate the proposed representation in combination with support vector machines. Both binary and term-frequency representations achieve high precision rates while maintaining recall on equally high level, which is a crucial factor for anti-spam filters, a cost sensitive application.
引用
收藏
页码:95 / 104
页数:10
相关论文
共 50 条
  • [1] Detection of Opinion Spam with Character n-grams
    Hernandez Fusilier, Donato
    Montes-y-Gomez, Manuel
    Rosso, Paolo
    Guzman Cabrera, Rafael
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 285 - 294
  • [2] Words versus character N-grams for anti-spam filtering
    Kanaris, Ioannis
    Kanaris, Konstantinos
    Houvardas, Ioannis
    Stamatatos, Efstathios
    [J]. INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2007, 16 (06) : 1047 - 1067
  • [3] Authorship Attribution in Portuguese Using Character N-grams
    Markov, Ilia
    Baptista, Jorge
    Pichardo-Lagunas, Obdulia
    [J]. ACTA POLYTECHNICA HUNGARICA, 2017, 14 (03) : 59 - 78
  • [4] Plagiarism Detection Using Stopword n-grams
    Stamatatos, Efstathios
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2011, 62 (12): : 2512 - 2527
  • [5] A first approach to CLIR using character n-grams alignment
    Vilares, Jesus
    Oakes, Michael P.
    Tait, John I.
    [J]. EVALUATION OF MULTILINGUAL AND MULTI-MODAL INFORMATION RETRIEVAL, 2007, 4730 : 111 - +
  • [6] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
    Lecluze, Charlotte
    Rigouste, Lois
    Giguet, Emmanuel
    Lucas, Nadine
    [J]. CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 473 - 481
  • [7] Clone Detection for Ecore Metamodels using N-grams
    Babur, Onder
    [J]. PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON MODEL-DRIVEN ENGINEERING AND SOFTWARE DEVELOPMENT, 2018, : 411 - 419
  • [8] Predicting Political Donations Using Twitter Hashtags and Character N-Grams
    Conrad, Colin
    Keselj, Vlado
    [J]. 2016 IEEE 18TH CONFERENCE ON BUSINESS INFORMATICS (CBI), VOL. 2, 2016, : 1 - 7
  • [9] Embedded malware detection using Markov n-grams
    Shafiq, M. Zubair
    Khayam, Syed Ali
    Farooq, Muddassar
    [J]. DETECTION OF INTRUSIONS AND MALWARE, AND VULNERABILITY ASSESSMENT, 2008, 5137 : 88 - +
  • [10] Author Assertion of Furtive Write Print Using Character N-Grams
    Hassan, Feryal H.
    Chaurasia, Mousmi A.
    [J]. FUTURE INFORMATION TECHNOLOGY, 2011, 13 : 274 - 278