A probabilistic model derived term weighting scheme for text classification

Cited by: 17
Authors
Feng, Guozhong [1 ,2 ,3 ]
Li, Shaoting [4 ]
Sun, Tieli [1 ]
Zhang, Bangzuo [1 ]
Affiliations
[1] Northeast Normal Univ, Sch Comp Sci & Informat Technol, Key Lab Intelligent Informat Proc Jilin Univ, Changchun 130117, Jilin, Peoples R China
[2] Northeast Normal Univ, Sch Math & Stat, Key Lab Appl Stat MOE, Changchun 130024, Jilin, Peoples R China
[3] Northeast Normal Univ, Inst Computat Biol, Changchun 130117, Jilin, Peoples R China
[4] Dongbei Univ Finance & Econ, Sch Stat, Dalian 116025, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Latent feature selection indicator; Matching score function; Naive Bayes; Supervised term weighting; Text classification; CATEGORIZATION; BAYES;
DOI
10.1016/j.patrec.2018.03.003
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Term weighting is a text representation strategy that assigns an appropriate value to each term when the content of a textual document is transformed into a vector in the term space, with the aim of improving text classification performance. Supervised weighting methods, which use the membership of training documents in predefined classes, are naturally expected to provide better results than unsupervised ones. In this paper, a new weighting scheme is proposed via a matching score function based on a probabilistic model. We introduce a latent variable to indicate whether a term contains text classification information, specify conjugate priors, and exploit the conjugacy by integrating out the latent indicator and the parameters. Non-discriminating terms can then be assigned weights close to 0. Experimental results using kNN and SVM classifiers illustrate the effectiveness of the proposed approach on both small and large text data sets. © 2018 Published by Elsevier B.V.
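The abstract does not give the matching score function itself, so the following is only a rough sketch of the general idea it describes: a latent indicator per term, conjugate priors, and parameters integrated out in closed form. It assumes a Beta-Bernoulli model over document-level term presence and uses the posterior probability that a term is discriminating as its weight. The function names, the Beta(1, 1) prior, and the 0.5 prior on the indicator are illustrative assumptions, not the paper's actual formulation.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of one sequence of n documents, k of which
    contain the term, with the Bernoulli rate integrated out under Beta(a, b)."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def indicator_posterior(doc_freqs, class_sizes, prior=0.5, a=1.0, b=1.0):
    """Posterior probability that a term is discriminating (assumed model).

    doc_freqs[c]   = number of class-c training documents containing the term
    class_sizes[c] = number of training documents in class c
    Indicator = 1: each class has its own presence rate.
    Indicator = 0: all classes share a single presence rate.
    """
    log_m1 = sum(log_marginal(k, n, a, b) for k, n in zip(doc_freqs, class_sizes))
    log_m0 = log_marginal(sum(doc_freqs), sum(class_sizes), a, b)
    log_odds = log(prior) - log(1.0 - prior) + log_m1 - log_m0
    # Numerically stable logistic function of the log posterior odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + exp(-log_odds))
    z = exp(log_odds)
    return z / (1.0 + z)


# Hypothetical counts: two classes of 50 training documents each.
w_skewed = indicator_posterior([45, 5], [50, 50])   # term concentrated in one class
w_even = indicator_posterior([25, 25], [50, 50])    # term spread evenly across classes
print(w_skewed, w_even)
```

Under these assumptions, a term spread evenly across classes receives a much smaller weight than one concentrated in a single class, and its weight shrinks further as the training set grows. In practice such a weight would multiply a term-frequency factor before the vectors are passed to a kNN or SVM classifier.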
Pages: 23-29
Page count: 7