A Weighted Topical Document Embedding based Clustering Method for News Text

Cited by: 0
Authors
Zhu Dechao [1]
Song Hui [1]
Affiliation
[1] Donghua Univ, Sch Comp Sci, Shanghai, Peoples R China
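Source
2016 IEEE INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC) | 2016
Keywords
Text Clustering; Skip-Gram; LDA; TF-IDF
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract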
As an unsupervised machine learning method, clustering can preliminarily group text without manual labeling, which effectively accelerates the organization, abstraction, and navigation of large news collections. News articles are long and contain many homonyms and polysemous words, which is one reason that traditional text clustering methods perform poorly on news text. This paper presents a novel text representation method based on topical document embedding (TDE) to capture the semantic features of different topics. In the TDE representation, the document embedding of a news text is obtained by summing the Skip-Gram word vectors of all the keywords in the text, each weighted by its TF-IDF score. The topical document embedding is then formed by joining the topic vector obtained from an LDA model with the document vector from the document embedding. Using the topical document embedding for clustering, we implement a novel text clustering method (TDE-TC). The experimental results show that news clustering based on the TDE representation performs better than clustering based on the bag-of-words model and the LDA model.
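The construction described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The function names, the toy word vectors, and the toy topic vectors are hypothetical stand-ins for the outputs of a trained Skip-Gram model and an LDA model. The TF-IDF variant used here (relative term frequency times the natural log of inverse document frequency) and the reading of "joining" as concatenation are both assumptions, since the abstract specifies neither.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    Assumption: tf is the relative frequency in the document and
    idf is ln(N / df); the abstract does not state the exact variant.
    """
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


def document_embedding(weights, word_vectors, dim):
    """Sum the word vectors of one document, each scaled by its TF-IDF weight."""
    vec = [0.0] * dim
    for word, weight in weights.items():
        if word in word_vectors:  # skip words without a vector
            for i, x in enumerate(word_vectors[word]):
                vec[i] += weight * x
    return vec


def topical_document_embedding(doc_vec, topic_vec):
    """Join the document vector and the topic vector.

    Assumption: "joining" is taken to mean concatenation.
    """
    return list(doc_vec) + list(topic_vec)


# Toy inputs (hypothetical): tokenized keywords per news text.
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["market", "team"],
]
# Stand-in for Skip-Gram word vectors (2 dimensions for readability).
word_vectors = {
    "stock": [1.0, 0.0], "market": [0.8, 0.2], "rises": [0.6, 0.1],
    "team": [0.1, 0.9], "wins": [0.2, 0.7], "match": [0.0, 1.0],
}
# Stand-in for per-document topic distributions from LDA (2 topics).
topic_vectors = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

weights = tfidf_weights(docs)
tde = [
    topical_document_embedding(document_embedding(w, word_vectors, 2), t)
    for w, t in zip(weights, topic_vectors)
]
```

Each entry of `tde` can then be passed to a clustering algorithm of choice; the abstract does not name the one used in TDE-TC.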
Pages: 1060-1065
Page count: 6