Confronting Sparseness and High Dimensionality in Short Text Clustering via Feature Vector Projections

Cited by: 6
Authors
Akritidis, Leonidas [1]
Alamaniotis, Miltiadis [2]
Fevgas, Athanasios [2]
Bozanis, Panayiotis [1]
Affiliations
[1] Intl Hellen Univ, Sch Sci & Technol, Thessaloniki, Greece
[2] Univ Texas San Antonio, Dept Elect & Comp Engn, San Antonio, TX USA
Source
2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) | 2020
Keywords
short text clustering; text mining; machine learning; unsupervised learning; clustering; data mining; concept decompositions
DOI
10.1109/ICTAI50040.2020.00129
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Short text clustering is the unsupervised grouping of similar short text documents, or entitled entities. Because short texts are now used in a vast number of applications, the problem has become increasingly significant in the past few years. High cluster homogeneity and completeness are two of the most important goals of any data clustering algorithm. In the context of short texts, however, they are particularly difficult to achieve, because this type of data is typically represented by sparse vectors that collectively span a very high-dimensional space. In this article we introduce VEPHC, a two-stage clustering algorithm designed to confront the sparseness and high dimensionality of short texts. In the first stage (the VEP part), the initial feature vectors are projected onto a lower-dimensional space by constructing and scoring variable-sized combinations of features (that is, terms). In the second stage (the HC part), VEPHC improves the homogeneity and completeness of the generated clusters through split and merge operations based on the similarities of all inter-cluster elements. The experimental evaluation of VEPHC on two real-world datasets demonstrates its superior performance over numerous state-of-the-art clustering algorithms in terms of F1 scores and Normalized Mutual Information.
Pages: 813 / 820
Number of pages: 8
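
The abstract names homogeneity, completeness, and Normalized Mutual Information as quality measures without defining them. The sketch below shows one common entropy-based way to compute all three from ground-truth classes and predicted cluster labels. It is an illustration only, not the evaluation code used for VEPHC: the function names are hypothetical, and the arithmetic-mean normalization of NMI is an assumption, since the record does not say which normalization the paper uses.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from the joint label counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def clustering_scores(classes, clusters):
    """Return (homogeneity, completeness, nmi) for two labelings.

    Assumption: NMI is normalized by the arithmetic mean of the two
    entropies; other normalizations (min, max, geometric mean) exist.
    """
    h_class, h_cluster = entropy(classes), entropy(clusters)
    h_class_given_cluster = conditional_entropy(classes, clusters)
    h_cluster_given_class = conditional_entropy(clusters, classes)

    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    # Completeness: all members of a class land in the same cluster.
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster

    mutual_info = h_class - h_class_given_cluster
    denom = (h_class + h_cluster) / 2.0
    nmi = 1.0 if denom == 0 else mutual_info / denom
    return homogeneity, completeness, nmi


if __name__ == "__main__":
    true_classes = ["phone", "phone", "laptop", "laptop"]
    predicted = [0, 0, 1, 1]
    print(clustering_scores(true_classes, predicted))
```

A clustering that matches the classes exactly scores 1.0 on all three measures. Placing every document in a single cluster keeps completeness at 1.0 but drives homogeneity and NMI to 0, which is why the two measures are reported together.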