Text clustering based on pre-trained models and autoencoders

Citations: 3
|
Authors
Xu, Qiang [1]
Gu, Hao [1]
Ji, ShengWei [1]
Affiliations
[1] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei, Anhui, Peoples R China
Keywords
text clustering; medical; deep learning; pre-trained models; autoencoder; deep embedded clustering model; neural networks
DOI
10.3389/fncom.2023.1334436
Chinese Library Classification
Q [Biological Sciences];
Discipline Classification Codes
07; 0710; 09;
Abstract
Text clustering is the task of grouping text data by similarity, and it is particularly important in the medical field. In healthcare, medical data clustering is a highly active research area: it supports sound medical decision-making from medical datasets and also aids patient record management and medical information retrieval. As the healthcare industry grows, large volumes of medical data are generated, and traditional medical data clustering faces significant challenges. Many existing text clustering algorithms are based on the bag-of-words model, which suffers from high dimensionality, sparsity, and the neglect of word position and context. Pre-trained models are a deep learning-based approach that treats text as a sequence and therefore captures word position and context accurately. Moreover, compared with traditional K-means and fuzzy C-means clustering, deep learning-based clustering algorithms handle high-dimensional, complex, and nonlinear data better. In particular, autoencoder-based clustering algorithms jointly learn data representations and clustering information, effectively reducing noise interference and errors during clustering. This paper combines pre-trained language models with a deep embedded clustering model. Experimental results show that the model performs well on four public datasets, outperforming most existing text clustering algorithms, and can be applied to medical data clustering.
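To illustrate the general approach described in the abstract (not the authors' actual implementation), the following minimal Python sketch encodes texts with a pre-trained sentence-embedding model, compresses the embeddings with a small autoencoder, and clusters in the latent space. The checkpoint name, sample texts, network sizes, and hyperparameters are assumptions for illustration; a full deep embedded clustering model would additionally refine the encoder with a KL-divergence clustering loss.

```python
# Sketch only: pre-trained embeddings + autoencoder + latent-space clustering.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

texts = [
    "patient shows elevated blood pressure",
    "MRI reveals a small lesion",
    "routine follow-up visit scheduled",
    "blood test indicates anemia",
]

# 1) Dense contextual embeddings from a pre-trained model
#    (replacing a sparse bag-of-words representation). Checkpoint is assumed.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = torch.tensor(encoder.encode(texts), dtype=torch.float32)

# 2) Autoencoder that compresses embeddings into a low-dimensional latent space.
class AutoEncoder(nn.Module):
    def __init__(self, dim_in, dim_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                 nn.Linear(128, dim_latent))
        self.dec = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                 nn.Linear(128, dim_in))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

ae = AutoEncoder(X.shape[1])
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):  # reconstruction pre-training
    z, x_hat = ae(X)
    loss = nn.functional.mse_loss(x_hat, X)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) Cluster in the learned latent space; a DEC-style model would go on to
#    fine-tune ae.enc with a clustering (KL-divergence) objective.
with torch.no_grad():
    z, _ = ae(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(z.numpy())
print(labels)
```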
Pages: 13