A Weighted Topical Document Embedding based Clustering Method for News Text

Cited by: 0
Authors
Zhu Dechao [1]
Song Hui [1]
Affiliation
[1] Donghua Univ, Sch Comp Sci, Shanghai, Peoples R China
Keywords
Text Clustering; Skip-Gram; LDA; TF-IDF;
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
As an unsupervised machine learning method, clustering can preliminarily group text without manual labeling, which effectively accelerates the organization, abstraction, and navigation of large news collections. News articles are long and contain many homonymous and polysemous words, which is one reason traditional text clustering methods perform poorly on news text. This paper presents a novel text representation method based on topical document embedding (TDE) to capture the semantic features of different topics. In the TDE representation, the document embedding of a news text is obtained by summing the Skip-Gram word vectors of all keywords in the text, each weighted by its TF-IDF score. The topical document embedding is then learned by joining the topic vectors obtained from an LDA model with the document vectors from the document embedding. Using the topical document embedding for clustering, we implement a novel text clustering method (TDE-TC). Experimental results show that news clustering based on the TDE representation outperforms clustering based on the bag-of-words model and the LDA model.
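The representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the two-dimensional `word_vectors` (standing in for Skip-Gram output), and the per-document `topic_vectors` (standing in for LDA output) are all hypothetical values, and "joining" is assumed here to mean concatenation, which the abstract does not confirm.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            w: (count / len(doc)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        })
    return scores

def document_embedding(weights, word_vectors, dim):
    """Sum word vectors, each scaled by the word's TF-IDF weight."""
    vec = [0.0] * dim
    for word, weight in weights.items():
        if word in word_vectors:  # skip words without a vector
            for i, x in enumerate(word_vectors[word]):
                vec[i] += weight * x
    return vec

def topical_document_embedding(doc_vec, topic_vec):
    """Join document and topic vectors (concatenation is an assumption)."""
    return doc_vec + topic_vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical toy data: tokenized news snippets, stand-in Skip-Gram
# vectors, and stand-in LDA topic distributions (one per document).
docs = [["stock", "market", "rises"],
        ["team", "wins", "match"],
        ["market", "falls"]]
word_vectors = {"stock": [1.0, 0.0], "market": [0.9, 0.1],
                "rises": [0.5, 0.5], "team": [0.0, 1.0],
                "wins": [0.1, 0.9], "match": [0.0, 0.8],
                "falls": [0.6, 0.4]}
topic_vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

tde = [topical_document_embedding(document_embedding(w, word_vectors, 2), t)
       for w, t in zip(tfidf_scores(docs), topic_vectors)]
```

The resulting `tde` vectors could then be passed to any standard clustering algorithm; the abstract does not say which one TDE-TC uses. In this toy data the two finance snippets end up closer to each other under `cosine` than either is to the sports snippet.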
Pages: 1060-1065
Number of pages: 6