A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Cited by: 13
Authors
Yang, Xi [1, 2]
Yang, Kaiwen [1]
Cui, Tianxu [1]
Chen, Min [1]
He, Liyan [1]
Affiliations
[1] Beijing Wuzi Univ, Sch Informat, Beijing 101149, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
text vectorization; topic model; pretrained model; transfer learning; SELF-ATTENTION; LATENT; CLASSIFICATION; NEWS;
DOI
10.3390/pr10020350
Chinese Library Classification number
TQ [Chemical Industry]
Discipline classification code
0817
Abstract
As Internet and cloud technologies develop, the scale of data keeps expanding, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data and to solve optimization problems in complex systems. Data are stored in many forms; among them, text is an important data type that directly reflects semantic information. Text vectorization is a key step in natural language processing tasks: because raw text cannot be used directly to train model parameters, it must first be converted into numerical vectors before feature extraction can be carried out. Traditional text vectorization is often realized by constructing a bag of words, but the vectors generated this way cannot reflect the semantic relationships between words and easily lead to data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information of the text. Then, transfer learning is carried out with the pretrained bidirectional encoder representations from transformers (BERT) model to generate vectors, which are applied to calculating the similarity between texts. The method is compared with traditional vectorization methods in a comparative experiment. The experimental results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) yield better results when calculating the similarity between texts on the same topic, which means that the method can more accurately judge whether two given texts belong to the same topic.
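
The bag-of-words limitation described in the abstract can be illustrated with a minimal sketch. This is not the paper's TTTV method or its experimental setup: the three example sentences and the helper names `bag_of_words` and `cosine_similarity` are hypothetical and chosen only for illustration. The sketch shows that two sentences with similar meaning but no shared words get a similarity of zero, and that most vector entries are zero even for a tiny vocabulary.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example sentences (not taken from the paper's data set).
docs = [
    "the car drives on the road",
    "an automobile travels along a highway",  # similar meaning, no shared words
    "the car stops on the road",              # different action, many shared words
]

# The vocabulary grows with every new distinct word in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

sim_related = cosine_similarity(vectors[0], vectors[1])
sim_overlap = cosine_similarity(vectors[0], vectors[2])
zero_fraction = sum(x == 0 for vec in vectors for x in vec) / (
    len(vectors) * len(vocabulary)
)

print(f"vocabulary size: {len(vocabulary)}")
print(f"similar meaning, no shared words: {sim_related:.3f}")
print(f"shared words: {sim_overlap:.3f}")
print(f"fraction of zero entries: {zero_fraction:.3f}")
```

Because the similarity depends only on word overlap, the first pair scores lower than the second even though its meaning is closer. This is the gap that embedding-based vectors, such as those produced by a pretrained model, are meant to narrow.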
Pages: 16