A Local-Concentration-Based Feature Extraction Approach for Spam Filtering

Cited by: 35
Authors
Zhu, Yuanchun [1, 2]
Tan, Ying [1, 2]
Affiliations
[1] Peking Univ, Sch Elect Engn & Comp Sci, Key Lab Machine Percept, Minist Educ, Beijing 100871, Peoples R China
[2] Peking Univ, Sch Elect Engn & Comp Sci, Dept Machine Intelligence, Beijing 100871, Peoples R China
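Funding
National High Technology Research and Development Program of China (863 Program); National Natural Science Foundation of China
Keywords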
Artificial immune system (AIS); bag-of-words (BoW); feature extraction; global concentration (GC); local concentration (LC); spam filtering;
DOI
10.1109/TIFS.2010.2103060
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Inspired by the biological immune system, we propose a local concentration (LC)-based feature extraction approach for anti-spam. The LC approach is considered able to effectively extract position-correlated information from messages by transforming each area of a message into a corresponding LC feature. Two implementation strategies of the LC approach are designed, one using a fixed-length sliding window and the other a variable-length sliding window. To incorporate the LC approach into the whole process of spam filtering, a generic LC model is designed. In the LC model, two types of detector sets are first generated by using term selection methods and a well-defined tendency threshold. Then a sliding window is adopted to divide the message into individual areas. After segmentation of the message, the concentration of detectors is calculated and taken as the feature for each local area. Finally, the features of all local areas are combined into a feature vector of the message. To evaluate the proposed LC model, several experiments are conducted on five benchmark corpora using cross-validation. It is shown that the LC approach cooperates well with three term selection methods, which endows it with flexible applicability in the real world. Compared to the global-concentration-based approach and the prevalent bag-of-words approach, the LC approach has better performance in terms of both accuracy and F1 measure. It is also demonstrated that the LC approach is robust against variation in message length.
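The pipeline described in the abstract (detector sets, sliding window, per-window concentration, concatenated feature vector) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the tendency score is taken as the difference between a term's document frequency in spam and in legitimate mail, the concentration is the fraction of tokens in a window that match a detector set, the window is fixed-length and non-overlapping, and the term selection step is omitted. The function names and the default `threshold` and `window_size` values are hypothetical, not the authors'.

```python
from collections import Counter


def build_detector_sets(spam_docs, ham_docs, threshold=0.2):
    """Split terms into spam and ham detectors by an assumed tendency score.

    Assumption: tendency(term) = P(term | spam) - P(term | ham), estimated
    from document frequencies. Terms whose tendency magnitude does not
    exceed the threshold are discarded.
    """
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    spam_detectors, ham_detectors = set(), set()
    for term in set(spam_df) | set(ham_df):
        tendency = spam_df[term] / len(spam_docs) - ham_df[term] / len(ham_docs)
        if tendency > threshold:
            spam_detectors.add(term)
        elif tendency < -threshold:
            ham_detectors.add(term)
    return spam_detectors, ham_detectors


def concentration(window, detectors):
    """Fraction of tokens in the window that match the detector set."""
    if not window:
        return 0.0
    return sum(1 for t in window if t in detectors) / len(window)


def lc_features(tokens, spam_detectors, ham_detectors, window_size=3):
    """Slide a fixed-length, non-overlapping window over the message and
    emit one (spam, ham) concentration pair per local area."""
    features = []
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        features.append(concentration(window, spam_detectors))
        features.append(concentration(window, ham_detectors))
    return features
```

With a fixed-length window as sketched, the length of the feature vector grows with the message, so a classifier would need the vector padded or truncated to a common dimension. The variable-length strategy mentioned in the abstract would instead scale the window to the message; neither detail is specified in this record, so both are left out of the sketch.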
Pages: 486-497
Page count: 12