Clustering Documents using the Document to Vector Model for Dimensionality Reduction

Cited by: 18
Authors
Radu, Robert-George [1]
Radulescu, Iulia-Maria [1]
Truica, Ciprian-Octavian [1]
Apostol, Elena-Simona [1]
Mocanu, Mariana [1]
Affiliations
[1] Univ Politehn Bucuresti, Fac Automat Control & Comp, Comp Sci & Engn Dept, Bucharest, Romania
Keywords
text clustering; document embeddings; text preprocessing; clustering evaluation
DOI
10.1109/aqtr49680.2020.9129967
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The TF-IDF model is the most common way of representing documents in the vector space. However, the resulting vectors are high-dimensional, which poses problems for classic clustering algorithms because of the curse of dimensionality. Recent techniques based on word embeddings can reduce the dimensionality of document representations while preserving the semantic relationships between words. In this paper, we analyze the accuracy of four classical clustering algorithms (K-Means, Spherical K-Means, LDA, and DBSCAN) in combination with the Document to Vector model.
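The contrast drawn in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy corpus, the helper names (`tfidf_vectors`, `normalize`, `spherical_kmeans`), and the 2-D "embedding" values are all assumptions made for illustration, and the dense vectors are hand-written stand-ins for Document to Vector output rather than the result of training a model. The sketch shows that a TF-IDF vector has one dimension per vocabulary term, while a dense representation can be much smaller and still be clustered by cosine similarity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF weight per vocabulary term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            [(tf[t] / len(toks)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def spherical_kmeans(vectors, init_centroids, n_iter=10):
    """Cluster unit-normalized vectors by cosine similarity."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment: on unit vectors the dot product equals cosine similarity.
        labels = [
            max(
                range(len(centroids)),
                key=lambda k: sum(a * b for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update: sum each cluster's members and project back onto the unit sphere.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels


docs = [
    "clustering groups similar documents",
    "document clustering with embeddings",
    "football match ended in a draw",
    "the football team won the match",
]
vocab, sparse = tfidf_vectors(docs)

# Hypothetical 2-D embeddings standing in for Document to Vector output
# (made-up values, not produced by a trained model).
dense = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]]
labels = spherical_kmeans(dense, init_centroids=[dense[0], dense[2]])

print("TF-IDF dimensions:", len(vocab))
print("Dense dimensions:", len(dense[0]))
print("Cluster labels:", labels)
```

Even on four short documents the TF-IDF vectors already need one dimension per distinct token, and that number grows with the corpus, whereas the size of a dense embedding is fixed in advance. In practice the dense vectors would come from a trained document embedding model rather than being written by hand.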
Pages: 57-62
Number of pages: 6