Textual case-based reasoning for spam filtering: a comparison of feature-based and feature-free approaches

Cited by: 6
Authors
Delany, Sarah Jane [2]
Bridge, Derek [1]
Affiliations
[1] Univ Coll Cork, Cork, Ireland
[2] Dublin Inst Technol, Dublin, Ireland
Keywords
spam filtering; case-based reasoning; case-base editing; case-based maintenance; feature selection; distance measures; text compression;
DOI
10.1007/s10462-007-9041-6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach.
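The feature-free idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact distance formula, so the code below assumes the widely used normalized compression distance (NCD) and uses Python's `zlib` (a DEFLATE/Lempel-Ziv compressor) as a stand-in for GZip. The function names `compressed_size`, `ncd`, and `classify`, the value of `k`, and the majority-vote rule are all illustrative assumptions, not the ECUE system's actual implementation.

```python
import zlib
from collections import Counter


def compressed_size(text: str) -> int:
    """Number of bytes after compressing the UTF-8 encoded text with zlib."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (an assumed choice of measure).

    Texts that share structure compress well together, so the
    concatenation costs little more than the larger text alone.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """k-nearest-neighbour vote over (text, label) cases, ranked by NCD.

    No tokenisation or feature selection is performed: the raw text of
    each case is compared directly with the raw text of the query.
    """
    neighbours = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    # Ties are broken in favour of the label seen first, i.e. the nearer case.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy case base, invented for illustration only.
    cases = [
        ("Win a free prize now, click here to claim your free prize today", "spam"),
        ("Cheap loans approved instantly, click here for your free quote", "spam"),
        ("Agenda for the project review meeting on Tuesday afternoon", "ham"),
        ("Please find attached the minutes from the last project meeting", "ham"),
    ]
    print(classify("Click here to claim your free prize", cases, k=1))
```

Every call to `ncd` compresses the query together with one stored case, so classification cost grows with the size of the case base. This is consistent with the abstract's remarks that the feature-free systems classify more slowly and that case-base editing reduces classification time.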
Pages: 75-87
Number of pages: 13