SLDA-TC: A Novel Text Categorization Approach Based on Supervised Topic Model

Cited by: 0
Authors
Tang H.-L. [1, 2, 3]
Dou Q.-S. [1, 2, 3]
Yu L.-P. [1, 2, 3]
Song Y.-J. [1, 2, 3]
Lu M.-Y. [4]
Affiliations
[1] School of Computer Science and Technology, Shandong Technology and Business University, Yantai, 264005, Shandong
[2] Co-innovation Center of Shandong Colleges and Universities: Future Intelligent Computing, Yantai, 264005, Shandong
[3] Key Laboratory of Intelligent Information Processing in Universities of Shandong (Shandong Technology and Business University), Yantai, 264005, Shandong
[4] Information Science and Technology College, Dalian Maritime University, Dalian, 116026, Liaoning
Keywords
Gibbs sampling; Latent Dirichlet allocation; Text categorization; Topic model
DOI
10.3969/j.issn.0372-2112.2019.06.017
Abstract
This paper proposes SLDA-TC, a novel text categorization model based on a supervised topic model. A new parameter representing the topic-category probability distribution is introduced, and the SLDA-TC-Gibbs sampling algorithm is presented. At each iteration, sampling a word's latent topic uses only the other training documents that share the category of the document in which the word occurs, and a theoretical proof is given. In the SLDA-TC model, the number of topics is only slightly larger than the number of categories. Experimental results show that SLDA-TC improves both the accuracy and the speed of text classification compared with the LDA-TC and SVM algorithms. © 2019, Chinese Institute of Electronics. All rights reserved.
Pages: 1300-1308
Page count: 8
Related papers
24 in total
  • [1] Salton G., Wong A.K., Yang C.S., Et al., A vector space model for automatic indexing, Communications of the ACM, 18, 11, pp. 613-620, (1975)
  • [2] Yu C.L., Ming-Yu L., Fan L., Analysis and construction of word weighting function in VSM, Journal of Computer Research & Development, 39, 10, pp. 1205-1210, (2002)
  • [3] Tang H.L., Lin Z.K., Lu M.Y., An improved co-training text categorization algorithm based on diversity measures, Acta Electronica Sinica, 36, b12, pp. 138-143, (2008)
  • [4] Zhai Y.-D., Wang K.-P., Zhang D.-N., Et al., An algorithm for semantic similarity of short text based on wordnet, Acta Electronica Sinica, 40, 3, pp. 617-620, (2012)
  • [5] He Y.F., Jiang M.H., Information bottleneck based feature selection in web text categorization, Journal of Tsinghua University (Sci & Tech), 50, 1, (2010)
  • [6] Guo M.S., Zhang Y., Liu T., Research advances and prospect of recognizing textual entailment and knowledge acquisition, Chinese Journal of Computers, 40, 4, pp. 889-910, (2017)
  • [7] Turney P.D., Pantel P., From frequency to meaning: vector space models of semantics, Journal of Artificial Intelligence Research Archive, AI Access Foundation, 37, 1, pp. 141-188, (2010)
  • [8] Mikolov T., Chen K., Corrado G., Et al., Efficient estimation of word representations in vector space, Computer Science, (2013)
  • [9] Deerwester S., Dumais S.T., Furnas G.W., Et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41, 6, pp. 391-407, (1990)
  • [10] Hofmann T., Probabilistic latent semantic indexing, Proceedings of the 22nd ACM-SIGIR International Conference on Research and Development in Information Retrieval, pp. 50-57, (1999)