Text clustering based on pre-trained models and autoencoders

Cited by: 3
Authors
Xu, Qiang [1 ]
Gu, Hao [1 ]
Ji, ShengWei [1 ]
Affiliations
[1] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei, Anhui, Peoples R China
Keywords
text clustering; medical; deep learning; pre-trained models; autoencoder; deep embedded clustering model; neural networks
DOI
10.3389/fncom.2023.1334436
CLC Classification Number
Q [Biological Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
Text clustering is the task of grouping text data based on similarity, and it holds particular importance in the medical field. In healthcare, clustering of medical data is an active and productive research area: it not only supports sound medical decision-making from medical datasets but also aids patient record management and medical information retrieval. As the healthcare industry grows, large volumes of medical data are being generated, and traditional medical data clustering faces significant challenges. Many existing text clustering algorithms are based on the bag-of-words model, which suffers from high dimensionality, sparsity, and the neglect of word position and context. Pre-trained models are a deep learning-based approach that treats text as a sequence and can therefore capture word position and context information accurately. Moreover, compared with traditional K-means and fuzzy C-means clustering, deep learning-based clustering algorithms handle high-dimensional, complex, and nonlinear data better. In particular, autoencoder-based clustering algorithms can jointly learn data representations and clustering information, effectively reducing noise interference and errors during clustering. This paper combines pre-trained language models with a deep embedded clustering model. Experimental results demonstrate that the proposed model performs well on four public datasets, outperforming most existing text clustering algorithms, and can be applied to medical data clustering.
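A minimal sketch of the pipeline the abstract describes, not the authors' implementation: texts are embedded with a pre-trained sentence encoder, a small autoencoder compresses the embeddings, and K-means clusters the latent codes (the usual initialization step of deep embedded clustering). The model name all-MiniLM-L6-v2, the example texts, and all hyperparameters are illustrative assumptions; the paper does not specify these details.

```python
# Sketch only: pre-trained embeddings -> autoencoder -> K-means on latent codes.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "Patient presents with elevated blood glucose and fatigue.",
    "MRI shows no evidence of acute intracranial hemorrhage.",
    "Follow-up visit for hypertension medication adjustment.",
    "Chest X-ray consistent with community-acquired pneumonia.",
]

# 1) Pre-trained model: contextual sentence embeddings instead of bag-of-words.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
X = torch.tensor(encoder.encode(texts), dtype=torch.float32)  # shape (n, 384)

# 2) Autoencoder: learn a low-dimensional representation of the embeddings.
class AutoEncoder(nn.Module):
    def __init__(self, dim_in: int, dim_latent: int = 10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                 nn.Linear(128, dim_latent))
        self.dec = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                 nn.Linear(128, dim_in))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

ae = AutoEncoder(X.shape[1])
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):  # reconstruction pre-training
    opt.zero_grad()
    z, x_hat = ae(X)
    loss = nn.functional.mse_loss(x_hat, X)
    loss.backward()
    opt.step()

# 3) Cluster the latent codes to obtain initial assignments and centroids.
with torch.no_grad():
    z, _ = ae(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(z.numpy())
print(labels)
```

In full deep embedded clustering, the K-means centroids would then be refined by minimizing the KL divergence between the soft cluster assignments and a sharpened target distribution; this sketch stops at the initialization stage.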
Pages: 13
Related Papers
50 records in total
  • [1] Pre-Trained Language Models for Text Generation: A Survey
    Li, Junyi
    Tang, Tianyi
    Zhao, Wayne Xin
    Nie, Jian-Yun
    Wen, Ji-Rong
    [J]. ACM COMPUTING SURVEYS, 2024, 56 (09)
  • [2] On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining
    Meng, Yu
    Huang, Jiaxin
    Zhang, Yu
    Han, Jiawei
    [J]. KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4052 - 4053
  • [3] Text Detoxification using Large Pre-trained Neural Models
    Dale, David
    Voronov, Anton
    Dementieva, Daryna
    Logacheva, Varvara
    Kozlova, Olga
    Semenov, Nikita
    Panchenko, Alexander
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 7979 - 7996
  • [4] EFFICIENT TEXT ANALYSIS WITH PRE-TRAINED NEURAL NETWORK MODELS
    Cui, Jia
    Lu, Heng
    Wang, Wenjie
    Kang, Shiyin
    He, Liqiang
    Li, Guangzhi
    Yu, Dong
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 671 - 676
  • [5] Uncertainty Estimation and Reduction of Pre-trained Models for Text Regression
    Wang, Yuxia
    Beck, Daniel
    Baldwin, Timothy
    Verspoor, Karin
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10 : 680 - 696
  • [6] Non-Autoregressive Text Generation with Pre-trained Language Models
    Su, Yixuan
    Cai, Deng
    Wang, Yan
    Vandyke, David
    Baker, Simon
    Li, Piji
    Collier, Nigel
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 234 - 243
  • [7] ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining
    Minh Phuc Nguyen
    Vu Hoang Tran
    Vu Hoang
    Ta Duc Huy
    Bui, Trung H.
    Truong, Steven Q. H.
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 328 - 337
  • [8] Radical-vectors with pre-trained models for Chinese Text Classification
    Yin, Guoqing
    Wu, Junmin
    Zhao, Guochao
    [J]. 2022 EURO-ASIA CONFERENCE ON FRONTIERS OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY, FCSIT, 2022, : 12 - 15
  • [9] Towards unifying pre-trained language models for semantic text exchange
    Miao, Jingyuan
    Zhang, Yuqi
    Jiang, Nan
    Wen, Jie
    Pei, Kanglu
    Wan, Yue
    Wan, Tao
    Chen, Honglong
    [J]. WIRELESS NETWORKS, 2023,
  • [10] Short-Text Classification Method with Text Features from Pre-trained Models
    Chen, Jie
    Ma, Jing
    Li, Xiaofeng
    [J]. Data Analysis and Knowledge Discovery, 2021, 5 (09): : 21 - 30