Confronting Sparseness and High Dimensionality in Short Text Clustering via Feature Vector Projections

Cited by: 6

Authors:

Akritidis, Leonidas [1]

Alamaniotis, Miltiadis [2]

Fevgas, Athanasios [2]

Bozanis, Panayiotis [1]

Affiliations:

[1] Intl Hellen Univ, Sch Sci & Technol, Thessaloniki, Greece

[2] Univ Texas San Antonio, Dept Elect & Comp Engn, San Antonio, TX USA

Source:

2020 IEEE 32ND INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI) | 2020

Keywords:

short text clustering; text mining; machine learning; unsupervised learning; clustering; data mining; CONCEPT DECOMPOSITIONS;

DOI:

10.1109/ICTAI50040.2020.00129

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Short text clustering is a popular problem that focuses on the unsupervised grouping of similar short text documents, or entitled entities. Since the short texts are currently being utilized in a vast number of applications, the problem in question has been rendered increasingly significant in the past few years. The high cluster homogeneity and completeness are two among the most important goals of all data clustering algorithms. However, in the context of short texts, their fulfilment is particularly difficult, because this type of data is typically represented by sparse vectors that collectively comprise a very high dimensional space. In this article we introduce VEPHC, a two-stage clustering algorithm designed to confront the sparseness and high dimensionality traits of short texts. During the first stage (or else, the VEP part), the initial feature vectors are projected onto a lower dimensional space by constructing and scoring variable-sized combinations of features (that is, terms). In the second stage (or else, the HC part), VEPHC improves the homogeneity and completeness of the generated clusters through split and merge operations that are based on the similarities of all inter-cluster elements. The experimental evaluation of VEPHC on two real-world datasets demonstrates its superior performance over numerous state-of-the-art clustering algorithms in terms of F1 scores and Normalized Mutual Information.
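
The first stage described in the abstract (projecting sparse term vectors onto a lower dimensional space by constructing and scoring variable-sized combinations of terms) can be illustrated with a toy sketch. This is not the authors' VEPHC implementation: the function names `tokenize` and `project`, the `max_k` parameter, and the scoring rule (document frequency times squared combination length) are all assumptions chosen only to make the idea concrete, and the split and merge stage is not covered.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Return the sorted set of lowercase whitespace-separated terms."""
    return sorted(set(text.lower().split()))


def project(docs, max_k=3):
    """Group documents by their best-scoring term combination.

    Hypothetical illustration only: every combination of 1..max_k terms
    is a candidate, and its score is an assumed heuristic, not the
    scoring function defined in the paper.
    """
    token_lists = [tokenize(d) for d in docs]

    # Count in how many documents each term combination occurs.
    freq = Counter()
    for tokens in token_lists:
        for k in range(1, max_k + 1):
            freq.update(combinations(tokens, k))

    def score(combo):
        # Assumed rule: favor combinations that are frequent and long.
        return freq[combo] * len(combo) ** 2

    clusters = {}
    for doc, tokens in zip(docs, token_lists):
        candidates = [
            c for k in range(1, max_k + 1) for c in combinations(tokens, k)
        ]
        # Ties are broken lexicographically so the result is deterministic;
        # a document with no terms falls into the empty-tuple group.
        best = max(candidates, key=lambda c: (score(c), c), default=())
        clusters.setdefault(best, []).append(doc)
    return clusters


if __name__ == "__main__":
    titles = [
        "apple iphone 11 64gb",
        "iphone 11 apple black",
        "samsung galaxy s10",
        "galaxy s10 samsung 128gb",
    ]
    for combo, members in project(titles).items():
        print(combo, members)
```

Each document ends up represented by a single combination of at most `max_k` terms instead of a sparse vector over the whole vocabulary, which is the kind of dimensionality reduction the abstract refers to. Enumerating all combinations grows quickly with text length, so this sketch is only practical for very short texts.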


Pages: 813 - 820

Number of pages: 8

40 items in total

[1] Feature Word Vector Based on Short Text Clustering
Liu, Xin
Wang, Bo
Xi, Yao-yi
Mao, Er-song
Ke, Sheng-cai
Tang, Yong-wang
COMPUTER SCIENCE AND TECHNOLOGY (CST2016), 2017, : 533 - 545
[2] A practical algorithm for solving the sparseness problem of short text clustering
Qiang, Jipeng
Li, Yun
Yuan, Yunhao
Liu, Wei
Wu, Xindong
INTELLIGENT DATA ANALYSIS, 2019, 23 (03) : 701 - 716
[3] Method of Feature Reduction in Short Text Classification Based on Feature Clustering
Li, Fangfang
Yin, Yao
Shi, Jinjing
Mao, Xingliang
Shi, Ronghua
APPLIED SCIENCES-BASEL, 2019, 9 (08):
[4] Asymmetric Short-Text Clustering via Prompt
Wang, Zhi
Zhu, Yi
Li, Yun
Qiang, Jipeng
Yuan, Yunhao
Zhang, Chaowei
NEW GENERATION COMPUTING, 2024, 42 (04) : 599 - 615
[5] Structural Feature-based Event Clustering for Short Text Streams
Sun, Zhengya
Han, Jiuqi
Hao, Hong-Wei
2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 3252 - 3257
[6] Improving Hierarchical Short Text Clustering through Dominant Feature Learning
Akritidis, Leonidas
Alamaniotis, Miltiadis
Fevgas, Athanasios
Tsompanopoulou, Panagiota
Bozanis, Panayiotis
INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2022, 31 (05)
[7] High-Dimensional Clustering via Random Projections
Laura Anderlucci
Francesca Fortunato
Angela Montanari
Journal of Classification, 2022, 39 : 191 - 216
[8] High-Dimensional Clustering via Random Projections
Anderlucci, Laura
Fortunato, Francesca
Montanari, Angela
JOURNAL OF CLASSIFICATION, 2022, 39 (01) : 191 - 216
[9] A novel text clustering algorithm via integrating feature set construction and text partition
Chen, L. (chenlei3656@163.com), 1600, Binary Information Press, P.O. Box 162, Bethel, CT 06801-0162, United States (09):
[10] A K-means Text Clustering Algorithm Based on Subject Feature Vector
Duo, Ji
Zhang, Peng
Hao, Liu
JOURNAL OF WEB ENGINEERING, 2021, 20 (06) : 1935 - 1946
