A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Cited by: 13
Authors
Yang, Xi [1, 2]
Yang, Kaiwen [1]
Cui, Tianxu [1]
Chen, Min [1]
He, Liyan [1]
Affiliations
[1] Beijing Wuzi Univ, Sch Informat, Beijing 101149, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
text vectorization; topic model; pretrained model; transfer learning; SELF-ATTENTION; LATENT; CLASSIFICATION; NEWS;
DOI
10.3390/pr10020350
Chinese Library Classification
TQ [Chemical Industry]
Discipline classification code
0817
Abstract
With the development of Internet cloud technology, the scale of data keeps expanding, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data and to solve optimization problems in complex systems. Data is stored in many forms; among them, text is an important type that directly reflects semantic information. Text vectorization is an important concept in natural language processing: because raw text cannot be used directly to train model parameters, it must first be converted into numerical vectors before feature extraction can be carried out. Traditional text vectorization is often implemented by constructing a bag of words, but the resulting vectors cannot reflect the semantic relationships between words and are prone to data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information of the text. Then, the pretrained bidirectional encoder representations from transformers (BERT) model is used for model transfer learning to generate vectors, which are applied to computing the similarity between texts. The method is compared with traditional vectorization methods in a comparative experiment. The experimental results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) yield better results when computing the similarity between texts on the same topic, which means that the method can more accurately judge whether two given texts belong to the same topic.
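The limitation of the bag-of-words baseline described in the abstract can be illustrated with a minimal sketch. It is not the paper's code: the three toy sentences, the whitespace tokenization, and the helper names `bag_of_words` and `cosine_similarity` are assumptions made purely for illustration. It builds count vectors over a shared vocabulary and compares texts with cosine similarity.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Map each text to a count vector over the shared, sorted vocabulary."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: the first two texts share a topic, the third does not.
texts = [
    "the car is fast",
    "the automobile is quick",
    "stock prices fell sharply",
]
vocab, vectors = bag_of_words(texts)

sim_same_topic = cosine_similarity(vectors[0], vectors[1])
sim_other_topic = cosine_similarity(vectors[0], vectors[2])

# Share of zero entries across all vectors, a rough measure of sparsity.
zero_fraction = sum(x == 0 for vec in vectors for x in vec) / (
    len(vocab) * len(vectors)
)

print(len(vocab), sim_same_topic, sim_other_topic, zero_fraction)
```

In this toy case the only overlap between the first two texts comes from the function words "the" and "is"; the synonym pairs "car"/"automobile" and "fast"/"quick" occupy separate dimensions and contribute nothing to the similarity. The vocabulary also grows with every new word, so most entries in each vector are zero. Vectors produced by a pretrained encoder such as BERT are dense and fixed-length, which is the motivation the abstract gives for the proposed method.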
Pages: 16