Text clustering based on pre-trained models and autoencoders

Cited by: 3
Authors
Xu, Qiang [1 ]
Gu, Hao [1 ]
Ji, ShengWei [1 ]
Affiliations
[1] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei, Anhui, Peoples R China
Keywords
text clustering; medical; deep learning; pre-trained models; autoencoder; deep embedded clustering model; neural networks
DOI
10.3389/fncom.2023.1334436
CLC Classification Number
Q [Biological Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
Text clustering is the task of grouping text data based on similarity, and it holds particular importance in the medical field. In healthcare, clustering of medical data is an active and productive research area: it not only supports sound medical decision-making from medical datasets but also aids patient record management and medical information retrieval. As the healthcare industry grows, large volumes of medical data are being generated, and traditional medical data clustering faces significant challenges. Many existing text clustering algorithms are based on the bag-of-words model, which suffers from high dimensionality, sparsity, and the neglect of word position and context. Pre-trained models are a deep learning-based approach that treats text as a sequence and can therefore capture word position and context information accurately. Moreover, compared with traditional K-means and fuzzy C-means clustering, deep learning-based clustering algorithms handle high-dimensional, complex, and nonlinear data better. In particular, autoencoder-based clustering algorithms can jointly learn data representations and clustering information, effectively reducing noise interference and errors during clustering. This paper combines pre-trained language models with a deep embedded clustering model. Experimental results demonstrate that the proposed model performs well on four public datasets, outperforming most existing text clustering algorithms, and can be applied to medical data clustering.
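A minimal sketch of the pipeline the abstract describes, not the authors' implementation: texts are embedded with a pre-trained sentence encoder, a small autoencoder compresses the embeddings, and K-means clusters the latent codes (the usual initialization step of deep embedded clustering). The model name all-MiniLM-L6-v2, the example texts, and all hyperparameters are illustrative assumptions; the paper does not specify these details.

```python
# Sketch only: pre-trained embeddings -> autoencoder -> K-means on latent codes.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "Patient presents with elevated blood glucose and fatigue.",
    "MRI shows no evidence of acute intracranial hemorrhage.",
    "Follow-up visit for hypertension medication adjustment.",
    "Chest X-ray consistent with community-acquired pneumonia.",
]

# 1) Pre-trained model: contextual sentence embeddings instead of bag-of-words.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
X = torch.tensor(encoder.encode(texts), dtype=torch.float32)  # shape (n, 384)

# 2) Autoencoder: learn a low-dimensional representation of the embeddings.
class AutoEncoder(nn.Module):
    def __init__(self, dim_in: int, dim_latent: int = 10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                 nn.Linear(128, dim_latent))
        self.dec = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                 nn.Linear(128, dim_in))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

ae = AutoEncoder(X.shape[1])
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):  # reconstruction pre-training
    opt.zero_grad()
    z, x_hat = ae(X)
    loss = nn.functional.mse_loss(x_hat, X)
    loss.backward()
    opt.step()

# 3) Cluster the latent codes to obtain initial assignments and centroids.
with torch.no_grad():
    z, _ = ae(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(z.numpy())
print(labels)
```

In full deep embedded clustering, the K-means centroids would then be refined by minimizing the KL divergence between the soft cluster assignments and a sharpened target distribution; this sketch stops at the initialization stage.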
Pages: 13
Related Papers
50 records in total
  • [1] Pre-Trained Language Models for Text Generation: A Survey
    Li, Junyi
    Tang, Tianyi
    Zhao, Wayne Xin
    Nie, Jian-Yun
    Wen, Ji-Rong
    [J]. ACM COMPUTING SURVEYS, 2024, 56 (09)
  • [2] On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining
    Meng, Yu
    Huang, Jiaxin
    Zhang, Yu
    Han, Jiawei
    [J]. KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4052 - 4053
  • [3] Text Detoxification using Large Pre-trained Neural Models
    Dale, David
    Voronov, Anton
    Dementieva, Daryna
    Logacheva, Varvara
    Kozlova, Olga
    Semenov, Nikita
    Panchenko, Alexander
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 7979 - 7996
  • [4] EFFICIENT TEXT ANALYSIS WITH PRE-TRAINED NEURAL NETWORK MODELS
    Cui, Jia
    Lu, Heng
    Wang, Wenjie
    Kang, Shiyin
    He, Liqiang
    Li, Guangzhi
    Yu, Dong
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 671 - 676
  • [5] Uncertainty Estimation and Reduction of Pre-trained Models for Text Regression
    Wang, Yuxia
    Beck, Daniel
    Baldwin, Timothy
    Verspoor, Karin
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10 : 680 - 696
  • [6] Non-Autoregressive Text Generation with Pre-trained Language Models
    Su, Yixuan
    Cai, Deng
    Wang, Yan
    Vandyke, David
    Baker, Simon
    Li, Piji
    Collier, Nigel
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 234 - 243
  • [7] ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining
    Minh Phuc Nguyen
    Vu Hoang Tran
    Vu Hoang
    Ta Duc Huy
    Bui, Trung H.
    Truong, Steven Q. H.
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 328 - 337
  • [8] Radical-vectors with pre-trained models for Chinese Text Classification
    Yin, Guoqing
    Wu, Junmin
    Zhao, Guochao
    [J]. 2022 EURO-ASIA CONFERENCE ON FRONTIERS OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY, FCSIT, 2022, : 12 - 15
  • [9] Towards unifying pre-trained language models for semantic text exchange
    Miao, Jingyuan
    Zhang, Yuqi
    Jiang, Nan
    Wen, Jie
    Pei, Kanglu
    Wan, Yue
    Wan, Tao
    Chen, Honglong
    [J]. WIRELESS NETWORKS, 2023,
  • [10] Short-Text Classification Method with Text Features from Pre-trained Models
    Chen, Jie
    Ma, Jing
    Li, Xiaofeng
    [J]. Data Analysis and Knowledge Discovery, 2021, 5 (09): : 21 - 30