Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

Cited by: 7
Authors
Janez-Martino, Francisco [1]
Alaiz-Rodriguez, Rocio
Gonzalez-Castro, Victor
Fidalgo, Eduardo
Alegre, Enrique
Affiliations
[1] Univ Leon, Dept Elect Syst & Automat, Leon, Spain
Keywords
Spam detection; Multi-classification; Image-based spam; Hidden text; Text classification; Word embedding; SELECTION; FEATURES; DOMAINS; MODEL;
DOI
10.1016/j.asoc.2023.110226
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spam emails are unsolicited, annoying and sometimes harmful messages that may contain malware, phishing or hoaxes. Unlike most studies, which address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach to classify spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them into 11 classes using agglomerative hierarchical clustering. We evaluate 16 pipelines, which combine four text representation techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT, with four classifiers: Support Vector Machine, Naive Bayes (NB), Random Forest and Logistic Regression (LR). Experimental results show that the highest performance on the English dataset is achieved with TF-IDF and LR, with an F1 score of 0.953 and an accuracy of 94.6%, while on the Spanish dataset TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, TF-IDF with LR leads to the fastest classification, processing an English and a Spanish spam email in 2 ms and 2.2 ms on average, respectively.
© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
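The labelling step described in the abstract (grouping emails by topic with agglomerative hierarchical clustering over a text representation) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the abstract does not state the linkage criterion, the distance measure or the TF-IDF variant, so average linkage, cosine distance, a smoothed IDF, whitespace tokenisation and the four toy messages below are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]  # assumed tokeniser
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (one common variant, assumed here).
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]  # start with singletons

    def linkage(pair):
        a, b = pair
        dists = [
            cosine_distance(vectors[i], vectors[j])
            for i in clusters[a]
            for j in clusters[b]
        ]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a is unaffected

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy messages (hypothetical, not taken from SPEMC-15K).
docs = [
    "cheap pills online pharmacy discount",
    "discount pharmacy pills order online",
    "verify your bank account password now",
    "your bank account is locked verify password",
]
labels = agglomerative(tfidf_vectors(docs), n_clusters=2)
print(labels)
```

On this toy input the two pharmacy messages end up in one cluster and the two banking messages in the other, because messages from different groups share no terms. The pairwise search is quadratic per merge, so a sketch like this only suits small samples; a dataset of roughly 15K emails would call for an optimised library implementation.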
Pages: 16
Related papers
50 in total
  • [1] Classifying Spam Emails using Text and Readability Features
    Shams, Rushdi
    Mercer, Robert E.
    2013 IEEE 13TH INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2013, : 657 - 666
  • [2] Classifying Spam Emails Using Artificial Intelligent Techniques
    Roy, Sanjiban Sekhar
    Viswanatham, V. Madhu
    INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH IN AFRICA, 2016, 22 : 152 - 161
  • [3] Topic-Based Hierarchical Segmentation
    Chien, Jen-Tzung
    Chueh, Chuang-Hua
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01): : 55 - 66
  • [4] A Novel Approach for Climate Classification Using Agglomerative Hierarchical Clustering
    Uppalapati, Sanketh
    Garg, Vishal
    Pudi, Vikram
    Mathur, Jyotirmay
    Gupta, Raj
    Bhatia, Aviruch
    ENERGY INFORMATICS, EI.A 2023, PT I, 2024, 14467 : 152 - 167
  • [5] An incremental document clustering algorithm based on a hierarchical agglomerative approach
    Joo, KH
    Lee, SJ
    DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY, PROCEEDINGS, 2005, 3816 : 321 - 332
  • [6] Topic-Based Hard Clustering of Documents Using Generative Models
    Ponti, Giovanni
    Tagarelli, Andrea
    FOUNDATIONS OF INTELLIGENT SYSTEMS, PROCEEDINGS, 2009, 5722 : 231 - 240
  • [7] An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA
    Shalom, S. A. Arul
    Dash, Manoranjan
    Tue, Minh
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT II, PROCEEDINGS, 2010, 6119 : 35 - +
  • [8] Fast and Effective Clustering of Spam Emails Based on Structural Similarity
    Sheikhalishahi, Mina
    Saracino, Andrea
    Mejri, Mohamed
    Tawbi, Nadia
    Martinelli, Fabio
    FOUNDATIONS AND PRACTICE OF SECURITY (FPS 2015), 2016, 9482 : 195 - 211
  • [9] CIBS: A biomedical text summarizer using topic-based sentence clustering
    Moradi, Milad
    JOURNAL OF BIOMEDICAL INFORMATICS, 2018, 88 : 53 - 61
  • [10] Topic-Based Clustering of Japanese Sentences Using Sentence-BERT
    Tsumuraya, Kenshin
    Amano, Miki
    Uehara, Minoru
    Adachi, Yoshihiro
    2022 TENTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING WORKSHOPS, CANDARW, 2022, : 255 - 260