Information Extraction from Spam Emails using Stylistic and Semantic Features to Identify Spammers

Cited by: 0
Authors
Halder, Soma [1]
Tiwari, Richa [1]
Sprague, Alan [1]
Affiliations
[1] Univ Alabama Birmingham, Birmingham, AL 35229 USA
Keywords
Spam; semantics; stylistics; natural language processing; IP address
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Traditional anti-spam methods filter spam emails and keep them out of the inbox, but take no measures to trace spammers and penalize them. We use natural language processing techniques to cluster spam emails from the same spammer based on the content and style of the email. Spam emails from different sources are studied using stylistic features, semantic features, and a combination of both. Three sets of clusterings are performed: one based on stylistic features, one based on semantic features, and one based on the combined features. These clusters are then compared and evaluated. We observe that spam emails from the same source share similarities and cluster together. These emails contain URLs of the web pages that the spammer is trying to promote. Clusters are mapped to the Internet Protocol (IP) addresses of these URLs, and the whois information for those IP addresses helps identify the source of the spam.
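
The pipeline the abstract describes (feature extraction, clustering, then mapping clusters to the promoted URLs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' method: the feature definitions, the cosine-similarity threshold, the greedy single-pass clustering, the sample emails, and all function names are hypothetical. IP resolution and whois lookups are left out because they need network access.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+")

def stylistic_features(text):
    """Hypothetical style profile: word-length and punctuation counts."""
    body = URL_RE.sub(" ", text)
    feats = Counter("style:len=%d" % len(w) for w in re.findall(r"[A-Za-z]+", body))
    feats.update("style:punct=" + c for c in body if c in "!?.,;:$")
    return feats

def semantic_features(text):
    """Hypothetical content profile: bag of lowercased words, URLs removed."""
    body = URL_RE.sub(" ", text).lower()
    return Counter("sem:" + w for w in re.findall(r"[a-z]{3,}", body))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # lists of email indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def url_hosts(text):
    """Hostnames of the URLs promoted in an email body."""
    hosts = {urlparse(u).hostname for u in URL_RE.findall(text)}
    return sorted(h for h in hosts if h)

# Made-up sample emails; .example hosts are reserved and never resolve.
emails = [
    "Cheap pills online! Buy cheap pills now at http://pills.example/buy",
    "Buy cheap pills online today, best pills: http://pills.example/offer",
    "Refinance your mortgage loan, low mortgage rates http://loans.example/apply",
    "Low rates! Refinance mortgage loan today http://loans.example/now",
]

style_vecs = [stylistic_features(e) for e in emails]
sem_vecs = [semantic_features(e) for e in emails]
combined_vecs = [s + m for s, m in zip(style_vecs, sem_vecs)]

# Three clusterings, one per feature set, so they can be compared.
stylistic_clusters = cluster(style_vecs, 0.5)
semantic_clusters = cluster(sem_vecs, 0.5)
combined_clusters = cluster(combined_vecs, 0.5)

# Map each semantic cluster to the hosts its emails promote. A real system
# would then resolve each host to an IP address and query whois for it.
hosts_per_cluster = [
    sorted({h for i in members for h in url_hosts(emails[i])})
    for members in semantic_clusters
]
```

The 0.5 threshold is arbitrary. In practice the feature sets would likely need normalization or weighting before being combined, and a standard clustering algorithm would replace the greedy pass.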
Pages: 104-107
Page count: 4
Related papers
50 in total
  • [1] Classifying Spam Emails using Text and Readability Features
    Shams, Rushdi
    Mercer, Robert E.
    2013 IEEE 13TH INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2013, : 657 - 666
  • [2] Spam filtering based on supervised latent semantic features extraction
    Zeng, Qingpeng
    Wu, Shuixiu
    Wang, Mingwen
    Journal of Computational Information Systems, 2008, 4 (03): : 1299 - 1306
  • [3] A Novel Data Mining Approach for Detecting Spam Emails using Robust Chi-Square Features
    Sharma, Mugdha
    Kaur, Jasmeen
    PROCEEDING OF THE THIRD INTERNATIONAL SYMPOSIUM ON WOMEN IN COMPUTING AND INFORMATICS (WCI-2015), 2015, : 49 - 53
  • [4] Large-Scale Information Extraction from Emails with Data Constraints
    Gupta, Rajeev
    Kondapally, Ranganath
    Guha, Siddharth
    BIG DATA ANALYTICS (BDA 2019), 2019, 11932 : 124 - 139
  • [5] Using Structural and Semantic Information to Identify Software Components
    Sas, Cezar
    Capiluppi, Andrea
    2021 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING (SANER 2021), 2021, : 546 - 550
  • [6] Semantic information generation from classification and information extraction
    Silva, TDS
    de Freitas, FLG
    Teske, RC
    Bittencourt, G
    WEB ENGINEERING, PROCEEDINGS, 2004, 3140 : 573 - 574
  • [7] Semantic Structuring of and Information Extraction from Medical Documents Using the UMLS
    Denecke, K.
    METHODS OF INFORMATION IN MEDICINE, 2008, 47 (05) : 425 - 434
  • [8] Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails
    Zhang, Weinan
    Ahmed, Amr
    Yang, Jie
    Josifovski, Vanja
    Smola, Alex J.
    KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015, : 2257 - 2266
  • [9] Extraction of Semantic Features from Transaction Dialogues
    Mustapha, Aida
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 348 - 359
  • [10] Semantic information extraction from Tamil documents
    Pandian, S. Lakshmana
    Devakumar, J.
    Geetha, T.V.
    International Journal of Metadata, Semantics and Ontologies, 2008, 3 (03) : 226 - 232