Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

Cited by: 7
Authors
Janez-Martino, Francisco [1]
Alaiz-Rodriguez, Rocio
Gonzalez-Castro, Victor
Fidalgo, Eduardo
Alegre, Enrique
Affiliations
[1] Univ Leon, Dept Elect Syst & Automat, Leon, Spain
Keywords
Spam detection; Multi-classification; Image-based spam; Hidden text; Text classification; Word embedding; SELECTION; FEATURES; DOMAINS; MODEL;
DOI
10.1016/j.asoc.2023.110226
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF) and Logistic Regression (LR). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and Spanish spam email in 2 ms and 2.2 ms on average, respectively. © 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
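The labelling step mentioned in the abstract (grouping emails by topic with agglomerative hierarchical clustering) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not state which features, distance or linkage were used for clustering, so the TF-IDF weighting, cosine distance and average linkage below are assumptions, as are the helper names `tfidf_vectors`, `cosine_distance` and `agglomerative` and the four toy emails.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF dict per (non-empty) document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    return 1.0 - sum(w * b.get(term, 0.0) for term, w in a.items())


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per document."""
    clusters = [[i] for i in range(len(vectors))]  # start: every email alone
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                dist = sum(
                    cosine_distance(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy emails (invented for illustration, not taken from SPEMC-15K).
emails = [
    "win a free lottery prize now",
    "claim your free lottery prize today",
    "cheap pills online pharmacy discount",
    "discount pharmacy pills shipped online",
]
labels = agglomerative(tfidf_vectors(emails), n_clusters=2)
print(labels)
```

On this toy input the two lottery messages end up in one cluster and the two pharmacy messages in the other. The paper works with 11 classes and roughly 15K emails per language; the quadratic search used here is only meant to make the merge rule explicit and would be too slow at that scale.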
Pages: 16