A Weighted Topical Document Embedding based Clustering Method for News Text

Citations: 0
Authors
Zhu Dechao [1]
Song Hui [1]
Affiliation
[1] Donghua Univ, Sch Comp Sci, Shanghai, Peoples R China
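Keywords
Text Clustering; Skip-Gram; LDA; TF-IDF
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract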
As an unsupervised machine learning method, clustering can group text in a preliminary way without manual labeling, which effectively accelerates the organization, abstraction, and navigation of large news collections. News articles are long and contain many homonyms and polysemous words, which is one reason traditional text clustering methods perform poorly on news text. This paper presents a novel text representation method based on topical document embedding (TDE) to capture the semantic features of different topics. In the TDE representation, the document embedding of a news text is obtained by summing the Skip-Gram word vectors of all keywords in the text, each weighted by its TF-IDF score. The topical document embedding is then learned by joining the topic vectors obtained from an LDA model with the document vectors from the document embedding. Using the topical document embedding for clustering, we implement a novel text clustering method (TDE-TC). The experimental results show that news clustering based on the TDE representation outperforms clustering based on the bag-of-words model and the LDA model.
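The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors and per-document topic vectors below are hypothetical toy values standing in for the output of a trained Skip-Gram model and an LDA model, every token is treated as a keyword, "joining" is read as plain concatenation, and the TF-IDF formula (term frequency times log inverse document frequency) is one common variant chosen for illustration. The clustering step is omitted because the abstract does not name the algorithm.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def document_embedding(word_weights, word_vectors, dim):
    """Sum the word vectors of one document, each scaled by its TF-IDF score."""
    vec = [0.0] * dim
    for word, weight in word_weights.items():
        if word in word_vectors:  # skip words without a vector
            for i, x in enumerate(word_vectors[word]):
                vec[i] += weight * x
    return vec


def topical_document_embedding(doc_vec, topic_vec):
    """Assumption: 'joining' is interpreted here as concatenation."""
    return list(doc_vec) + list(topic_vec)


# Hypothetical toy inputs (stand-ins for Skip-Gram and LDA outputs).
word_vectors = {
    "goal": [1.0, 0.0],
    "match": [0.8, 0.2],
    "stock": [0.0, 1.0],
    "market": [0.1, 0.9],
}
docs = [["goal", "match", "goal"], ["stock", "market", "stock"]]
topic_vectors = [[0.9, 0.1], [0.2, 0.8]]  # one topic distribution per document

weights = tfidf_weights(docs)
tde = [
    topical_document_embedding(document_embedding(w, word_vectors, 2), t)
    for w, t in zip(weights, topic_vectors)
]
```

Each entry of `tde` is a four-dimensional vector: two dimensions from the weighted word vectors followed by two from the topic vector. Vectors of this form could then be passed to any standard clustering routine.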
Pages: 1060-1065
Page count: 6