Weighted average pointwise mutual information for feature selection in text categorization

被引:0
|
作者
Schneider, KM [1 ]
机构
[1] Univ Passau, Dept Gen Linguist, D-94030 Passau, Germany
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Mutual information is a common feature score in feature selection for text categorization. Mutual information suffers from two theoretical problems: It assumes independent word variables, and longer documents are given higher weights in the estimation of the feature scores, which is in contrast to common evaluation measures that do not distinguish between long and short documents. We propose a variant of mutual information, called Weighted Average Pointwise Mutual Information (WAPMI) that avoids both problems. We provide theoretical as well as extensive empirical evidence in favor of WAPMI. Furthermore, we show that WAPMI has a nice property that other feature metrics lack, namely it allows to select the best feature set size automatically by maximizing an objective function, which can be done using a simple heuristic without resorting to costly methods like EM and model selection.
引用
收藏
页码:252 / 263
页数:12
相关论文
共 50 条
  • [41] Applying cascaded feature selection to SVM text categorization
    Masuyama, T
    Nakagawa, H
    [J]. 13TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2002, : 241 - 245
  • [42] Feature selection for support vector machines in text categorization
    Liu, Y
    Lu, HM
    Lu, ZX
    Wang, P
    [J]. MLMTA'03: INTERNATIONAL CONFERENCE ON MACHINE LEARNING; MODELS, TECHNOLOGIES AND APPLICATIONS, 2003, : 129 - 134
  • [43] Feature Selection with Structural Sparse Mode for Text Categorization
    Zheng, Wenbin
    Tang, Dan
    Zhang, Haiqing
    Tang, Hong
    [J]. 2017 NINTH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN-MACHINE SYSTEMS AND CYBERNETICS (IHMSC 2017), VOL 1, 2017, : 359 - 362
  • [44] A discriminative and semantic feature selection method for text categorization
    Zong, Wei
    Wu, Feng
    Chu, Lap-Keung
    Sculli, Domenic
    [J]. INTERNATIONAL JOURNAL OF PRODUCTION ECONOMICS, 2015, 165 : 215 - 222
  • [45] Comparison and Improvement of feature selection method for text categorization
    Shan, Li-Li
    Liu, Bing-Quan
    Sun, Cheng-Jie
    [J]. Harbin Gongye Daxue Xuebao/Journal of Harbin Institute of Technology, 2011, 43 (SUPPL. 1): : 319 - 324
  • [46] Using typical testors for feature selection in text categorization
    Pons-Porratal, Aurora
    Gil-Garcia, Reynaldo
    Berlanga-Liavori, Rafael
    [J]. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS, 2007, 4756 : 643 - +
  • [47] Feature subset selection in SOM based text categorization
    Bassiouny, S
    Nagi, M
    Hussein, MF
    [J]. IC-AI '04 & MLMTA'04 , VOL 1 AND 2, PROCEEDINGS, 2004, : 860 - 866
  • [48] PKIP: Feature selection in text categorization for item banks
    Nuntiyagul, A
    Naruedomkul, K
    Cercone, N
    Wongsawang, D
    [J]. ICTAI 2005: 17TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2005, : 212 - 216
  • [49] Five new feature selection metrics in text categorization
    Song, Fengxi
    Zhang, David
    Xu, Yong
    Wang, Jizhong
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2007, 21 (06) : 1085 - 1101
  • [50] An Improved Strategy of the Feature Selection Algorithm for the Text Categorization
    Yang, Jieming
    Lu, Yixin
    Liu, Zhiying
    [J]. 2019 20TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2019, : 3 - 7