Text clustering based on pre-trained models and autoencoders

Citations: 3
|
Authors
Xu, Qiang [1]
Gu, Hao [1]
Ji, ShengWei [1]
Affiliations
[1] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei, Anhui, Peoples R China
Keywords
text clustering; medical; deep learning; pre-trained models; autoencoder; deep embedded clustering model; neural networks
DOI
10.3389/fncom.2023.1334436
Chinese Library Classification
Q [Biological Sciences];
Discipline Classification Codes
07; 0710; 09;
Abstract
Text clustering is the task of grouping text data by similarity, and it is particularly important in the medical field. In healthcare, medical data clustering is a highly active research area: it supports sound medical decision-making from medical datasets and also aids patient record management and medical information retrieval. As the healthcare industry grows, large volumes of medical data are generated, and traditional medical data clustering faces significant challenges. Many existing text clustering algorithms are based on the bag-of-words model, which suffers from high dimensionality, sparsity, and the neglect of word position and context. Pre-trained models are a deep learning-based approach that treats text as a sequence and therefore captures word position and context accurately. Moreover, compared with traditional K-means and fuzzy C-means clustering, deep learning-based clustering algorithms handle high-dimensional, complex, and nonlinear data better. In particular, autoencoder-based clustering algorithms jointly learn data representations and clustering information, effectively reducing noise interference and errors during clustering. This paper combines pre-trained language models with a deep embedded clustering model. Experimental results show that the model performs well on four public datasets, outperforming most existing text clustering algorithms, and can be applied to medical data clustering.
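To illustrate the general approach described in the abstract (not the authors' actual implementation), the following minimal Python sketch encodes texts with a pre-trained sentence-embedding model, compresses the embeddings with a small autoencoder, and clusters in the latent space. The checkpoint name, sample texts, network sizes, and hyperparameters are assumptions for illustration; a full deep embedded clustering model would additionally refine the encoder with a KL-divergence clustering loss.

```python
# Sketch only: pre-trained embeddings + autoencoder + latent-space clustering.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

texts = [
    "patient shows elevated blood pressure",
    "MRI reveals a small lesion",
    "routine follow-up visit scheduled",
    "blood test indicates anemia",
]

# 1) Dense contextual embeddings from a pre-trained model
#    (replacing a sparse bag-of-words representation). Checkpoint is assumed.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = torch.tensor(encoder.encode(texts), dtype=torch.float32)

# 2) Autoencoder that compresses embeddings into a low-dimensional latent space.
class AutoEncoder(nn.Module):
    def __init__(self, dim_in, dim_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                 nn.Linear(128, dim_latent))
        self.dec = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                 nn.Linear(128, dim_in))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

ae = AutoEncoder(X.shape[1])
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):  # reconstruction pre-training
    z, x_hat = ae(X)
    loss = nn.functional.mse_loss(x_hat, X)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) Cluster in the learned latent space; a DEC-style model would go on to
#    fine-tune ae.enc with a clustering (KL-divergence) objective.
with torch.no_grad():
    z, _ = ae(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(z.numpy())
print(labels)
```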
Pages: 13