Spam E-Mail Classification by Utilizing N-Gram Features of Hyperlink Texts

Cited by: 0

Authors:

Bozkir, A. Selman [1]

Sahin, Esra [1]

Aydos, Murat [1]

Sezer, Ebru Akcapinar [1]

Orhan, Fatih [2]

Affiliations:

[1] Hacettepe Univ, Dept Comp Engn, Ankara, Turkey

[2] COMODO Grp, Clifton, NJ USA

Source:

2017 11TH IEEE INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT 2017) | 2017

Keywords:

Spam Email; Machine Learning; Active Learning; N-Grams; Bag of Words;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

With the advent of the Internet and the falling cost of digital communication, spam has become a key problem in several types of media (e.g. email, social media and microblogs). In recent years, email spam in particular has become an exponentially growing threat that affects both individuals and the business world. Hence, a large number of studies have been proposed to combat spam emails. In this study, instead of the subject or body components of emails, the use of hyperlink texts alone, together with a word-level n-gram indexing scheme, is proposed for the first time to generate features for a spam/ham email classifier. Since link texts in emails do not exceed sentence length, n-gram indexing has been limited to trigrams. Throughout the study, a novel large-scale dataset provided by COMODO Inc, covering 50,000 link texts belonging to spam and ham emails, has been used for feature extraction and performance evaluation. To build the required vocabularies, unigram, bigram and trigram models have been generated. Next, three different machine learning methods, including one active learner (Support Vector Machines, SVM-Pegasos and Naive Bayes), have been employed to classify each link. According to the experimental results, classification using a trigram-based bag-of-words representation reaches up to 98.75% accuracy, outperforming the unigram and bigram schemes. Apart from its high accuracy, the proposed approach also preserves customers' privacy, since it does not require any analysis of the body contents of emails.
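The pipeline described in the abstract (word-level n-grams up to trigrams over link texts, a bag-of-words representation, then a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the `word_ngrams` helper, the `NaiveBayesLinkClassifier` class, the add-one smoothing and the four toy link texts are all assumptions made for illustration, and only Naive Bayes (one of the three methods named in the abstract) is shown.

```python
import math
import re
from collections import Counter


def word_ngrams(text, max_n=3):
    """Return all word-level n-grams of a link text, for n = 1..max_n."""
    # Assumed tokenizer: lowercase alphanumeric runs (not taken from the paper).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NaiveBayesLinkClassifier:
    """Multinomial Naive Bayes over a bag of n-grams, with add-one smoothing."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        self.counts = {}             # label -> Counter of n-gram frequencies
        self.doc_counts = Counter()  # label -> number of training link texts
        self.vocab = set()           # all n-grams seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = word_ngrams(text, self.max_n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = word_ngrams(text, self.max_n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(gram_counts.values()) + len(self.vocab)
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy link texts invented for illustration; they are not from the COMODO dataset.
train_texts = [
    "click here to claim your prize",
    "claim your free prize now",
    "view the meeting agenda",
    "download the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesLinkClassifier(max_n=3).fit(train_texts, train_labels)
print(clf.predict("claim your prize"))
print(clf.predict("project meeting report"))
```

Setting `max_n` to 1 or 2 gives the unigram and bigram vocabularies that the abstract compares against the trigram scheme. A sketch this small says nothing about the reported accuracy figures.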

Citation

Pages: 308-312

Page count: 5

