Augmenting naive Bayes classifiers with statistical language models

Cited by: 128
Authors
Peng, FC
Schuurmans, D
Wang, SJ
Affiliations
[1] Univ Massachusetts, Dept Comp Sci, Ctr Intelligent Informat Retrieval, Amherst, MA 01003 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2E8, Canada
[3] Univ Waterloo, Sch Comp Sci, Waterloo, ON N2L 3G1, Canada
Source
INFORMATION RETRIEVAL | 2004, Vol. 7, Issue 3-4
Keywords
naive Bayes; text classification; n-gram language models; smoothing;
DOI
10.1023/B:INRT.0000011209.19643.e2
CLC number
TP [Automation technology, computer technology];
Discipline code
0812 ;
Abstract
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier that allows for a local Markov dependence among observations, a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes - allowing a local Markov chain dependence in the observed variables - while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems - authorship attribution, text genre classification, and topic detection - in several languages - Greek, English, Japanese and Chinese. We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model.
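The core idea of the abstract - replacing naive Bayes's per-feature independence with a local Markov chain, so that each class is scored by an n-gram language model - can be illustrated with a minimal sketch. The class and method names below are hypothetical, and for brevity the sketch uses character bigrams with plain Laplace (add-one) smoothing rather than the more sophisticated language-model smoothing techniques the paper studies; it is an illustration of the CAN structure, not the authors' implementation.

```python
from collections import defaultdict
import math


class ChainAugmentedNB:
    """Illustrative chain-augmented naive Bayes: one character bigram
    language model per class, Laplace-smoothed, combined with the
    class prior (hypothetical sketch, not the paper's code)."""

    def __init__(self):
        self.bigram = {}                    # class -> prev_char -> char -> count
        self.context = {}                   # class -> prev_char -> total count
        self.vocab = set()                  # observed characters
        self.doc_count = defaultdict(int)   # class -> number of training docs
        self.total_docs = 0

    def fit(self, docs, labels):
        for text, y in zip(docs, labels):
            self.doc_count[y] += 1
            self.total_docs += 1
            big = self.bigram.setdefault(y, defaultdict(lambda: defaultdict(int)))
            ctx = self.context.setdefault(y, defaultdict(int))
            padded = "^" + text             # "^" marks the start of the chain
            for prev, cur in zip(padded, padded[1:]):
                big[prev][cur] += 1
                ctx[prev] += 1
                self.vocab.add(cur)

    def _log_prob(self, text, y):
        # log P(y) + sum_i log P(c_i | c_{i-1}, y), Laplace-smoothed
        v = len(self.vocab)
        score = math.log(self.doc_count[y] / self.total_docs)
        padded = "^" + text
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigram[y][prev][cur] + 1     # add-one smoothing
            den = self.context[y][prev] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_count, key=lambda y: self._log_prob(text, y))
```

Swapping the add-one counts in `_log_prob` for absolute discounting, Witten-Bell, or Good-Turing estimates - the smoothing methods the paper imports from language modeling - changes only this one scoring step, which is why the CAN structure makes those techniques straightforward to apply.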
Pages: 317 - 345
Page count: 29
Related papers
50 items in total
  • [11] Bayesian Naive Bayes classifiers to text classification
    Xu, Shuo
    JOURNAL OF INFORMATION SCIENCE, 2018, 44 (01) : 48 - 59
  • [12] Comparative analysis of the impact of discretization on the classification with Naive Bayes and semi-Naive Bayes classifiers
    Mizianty, Marcin
    Kurgan, Lukasz
    Ogiela, Marek
    SEVENTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, PROCEEDINGS, 2008, : 823 - +
  • [13] An Ensemble of Naive Bayes Classifiers for Uncertain Categorical Data
    de Holanda Maia, Marcelo Rodrigues
    Plastino, Alexandre
    Freitas, Alex A.
    2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 1222 - 1227
  • [14] SSV criterion based discretization for naive Bayes classifiers
    Grabczewski, K
    ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING - ICAISC 2004, 2004, 3070 : 574 - 579
  • [15] Naive Bayes classifiers that perform well with continuous variables
    Bouckaert, RR
    AI 2004: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2004, 3339 : 1089 - 1094
  • [16] A Novel Naive Bayes Voting Strategy for Combining Classifiers
    De Stefano, C.
    Fontanella, F.
    di Freca, A. Scotto
    13TH INTERNATIONAL CONFERENCE ON FRONTIERS IN HANDWRITING RECOGNITION (ICFHR 2012), 2012, : 467 - 472
  • [17] On why discretization works for naive-Bayes classifiers
    Yang, Y
    Webb, GI
    AI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2003, 2903 : 440 - 452
  • [18] Tool wear monitoring using naive Bayes classifiers
    Karandikar, Jaydeep
    McLeay, Tom
    Turner, Sam
    Schmitz, Tony
    INTERNATIONAL JOURNAL OF ADVANCED MANUFACTURING TECHNOLOGY, 2015, 77 (9-12): : 1613 - 1626
  • [19] Naive Bayes classifiers learned from incomplete data
    Chen, Jingnian
    Huang, Houkuan
    Tian, Fengzhan
    Qiao, Zhufeng
    Jisuanji Gongcheng/Computer Engineering, 2006, 32 (17): : 86 - 88
  • [20] Heterodimeric protein complex identification by naive Bayes classifiers
    Maruyama, Osamu
    BMC BIOINFORMATICS, 2013, 14