Text clustering based on pre-trained models and autoencoders

被引：3

作者：

Xu, Qiang ^{[1
]}

Gu, Hao ^{[1
]}

Ji, ShengWei ^{[1
]}

机构：

[1] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei, Anhui, Peoples R China

来源：

FRONTIERS IN COMPUTATIONAL NEUROSCIENCE | 2024年 / 17卷

关键词：

text clustering; medical; deep learning; pre-trained models; autoencoder; deep embedded clustering model; NEURAL-NETWORKS;

D O I：

10.3389/fncom.2023.1334436

中图分类号：

Q [生物科学];

学科分类号：

07 ; 0710 ; 09 ;

摘要：

Text clustering is the task of grouping text data based on similarity, and it holds particular importance in the medical field. sIn healthcare, medical data clustering is a highly active and effective research area. It not only provides strong support for making correct medical decisions from medical datasets but also aids in patient record management and medical information retrieval. With the development of the healthcare industry, a large amount of medical data is being generated, and traditional medical data clustering faces significant challenges. Many existing text clustering algorithms are primarily based on the bag-of-words model, which has issues such as high dimensionality, sparsity, and the neglect of word positions and context. Pre-trained models are a deep learning-based approach that treats text as a sequence to accurately capture word positions and context information. Moreover, compared to traditional K-means and fuzzy C-means clustering models, deep learning-based clustering algorithms are better at handling high-dimensional, complex, and nonlinear data. In particular, clustering algorithms based on autoencoders can learn data representations and clustering information, effectively reducing noise interference and errors during the clustering process. This paper combines pre-trained language models with deep embedding clustering models. Experimental results demonstrate that our model performs exceptionally well on four public datasets, outperforming most existing text clustering algorithms, and can be applied to medical data clustering.

引用

页数：13

共 50 条

[31] From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Models to Pre-trained Machine Reader
Xu, Weiwen
Li, Xin
Zhang, Wenxuan
Zhou, Meng
Lam, Wai
Si, Luo
Bing, Lidong
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[32] Annotating Columns with Pre-trained Language Models
Suhara, Yoshihiko
Li, Jinfeng
Li, Yuliang
Zhang, Dan
Demiralp, Cagatay
Chen, Chen
Tan, Wang-Chiew
[J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22), 2022, : 1493 - 1503
[33] Effects of Pre-trained Word Embeddings on Text-based Deception Detection
Nam, David
Yasmin, Jerin
Zulkernine, Farhana
[J]. 2020 IEEE INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, INTL CONF ON CLOUD AND BIG DATA COMPUTING, INTL CONF ON CYBER SCIENCE AND TECHNOLOGY CONGRESS (DASC/PICOM/CBDCOM/CYBERSCITECH), 2020, : 437 - 443
[34] Lottery Jackpots Exist in Pre-Trained Models
Zhang, Yuxin
Lin, Mingbao
Zhong, Yunshan
Chao, Fei
Ji, Rongrong
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (12) : 14990 - 15004
[35] LaoPLM: Pre-trained Language Models for Lao
Lin, Nankai
Fu, Yingwen
Yang, Ziyu
Chen, Chuwei
Jiang, Shengyi
[J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6506 - 6512
[36] Interpreting Art by Leveraging Pre-Trained Models
Penzel, Niklas
Denzler, Joachim
[J]. 2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA, 2023,
[37] Natural Attack for Pre-trained Models of Code
Yang, Zhou
Shi, Jieke
He, Junda
Lo, David
[J]. 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), 2022, : 1482 - 1493
[38] Generalization of vision pre-trained models for histopathology
Milad Sikaroudi
Maryam Hosseini
Ricardo Gonzalez
Shahryar Rahnamayan
H. R. Tizhoosh
[J]. Scientific Reports, 13
[39] Pre-trained models: Past, present and future
Han, Xu
Zhang, Zhengyan
Ding, Ning
Gu, Yuxian
Liu, Xiao
Huo, Yuqi
Qiu, Jiezhong
Yao, Yuan
Zhang, Ao
Zhang, Liang
Han, Wentao
Huang, Minlie
Jin, Qin
Lan, Yanyan
Liu, Yang
Liu, Zhiyuan
Lu, Zhiwu
Qiu, Xipeng
Song, Ruihua
Tang, Jie
Wen, Ji-Rong
Yuan, Jinhui
Zhao, Wayne Xin
Zhu, Jun
[J]. AI OPEN, 2021, 2 : 225 - 250
[40] Learning to Modulate pre-trained Models in RL
Schmied, Thomas
Hofmarcher, Markus
Paischer, Fabian
Pascanu, Razvan
Hochreiter, Sepp
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,

← 1 2 3 4 5 →