A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Cited by: 13
Authors
Yang, Xi [1, 2]
Yang, Kaiwen [1 ]
Cui, Tianxu [1 ]
Chen, Min [1 ]
He, Liyan [1 ]
Affiliations
[1] Beijing Wuzi Univ, Sch Informat, Beijing 101149, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
text vectorization; topic model; pretrained model; transfer learning; SELF-ATTENTION; LATENT; CLASSIFICATION; NEWS
DOI
10.3390/pr10020350
Chinese Library Classification number
TQ [Chemical Industry]
Subject classification code
0817
Abstract
With the development of Internet cloud technology, the scale of data keeps expanding, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data and to solve optimization problems in complex systems. Data is stored in many forms; among them, text is an important type that directly reflects semantic information. Text vectorization is a key step in natural language processing tasks: because raw text cannot be used directly to train model parameters, it must first be converted into numerical vectors before feature extraction can be carried out. Traditional text vectorization is often realized by constructing a bag of words, but the resulting vectors cannot reflect the semantic relationships between words, and they easily cause data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information of the text. Then, transfer learning is carried out with the pretrained bidirectional encoder representations from transformers (BERT) model to generate vectors, which are applied to calculating the similarity between texts. The method is compared with traditional vectorization methods in a comparative experiment. The experimental results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) give better results when calculating the similarity between texts on the same topic, which means that they can more accurately judge whether two given texts belong to the same topic.
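The bag-of-words limitation that motivates the method can be illustrated with a minimal sketch. It is not the paper's code: the toy sentences, the whitespace tokenizer, and the helper names `bag_of_words` and `cosine_similarity` are assumptions made for illustration, and the TTTV pipeline itself (topic model plus BERT) is not reproduced here.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, invented for illustration (not taken from the paper).
docs = [
    "the film was great",
    "the movie was great",
    "a superb motion picture",
]

# The vector dimension equals the vocabulary size, so it grows with
# every new distinct word that appears in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Documents 0 and 1 share three of their four words.
sim_shared_words = cosine_similarity(vectors[0], vectors[1])
# Documents 0 and 2 are about the same subject but share no words.
sim_no_shared_words = cosine_similarity(vectors[0], vectors[2])

print(len(vocabulary), sim_shared_words, sim_no_shared_words)
```

In this sketch the similarity depends only on literal word overlap, so two texts on the same subject with no words in common score zero. This is the kind of gap that a semantic representation, such as the BERT-based vectors described in the abstract, is meant to close.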
Pages: 16