Microblog Short Text Semantic Modeling Method for Search

Authors
Kou F.-F. [1]
Du J.-P. [1]
Shi Y.-S. [1]
Yang C.-X. [1]
Cui W.-Q. [1]
Liang M.-Y. [1]
Shi L. [1]
Affiliation
[1] Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing
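Source
Jisuanji Xuebao/Chinese Journal of Computers | 2020, Vol. 43, No. 05
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; China Postdoctoral Science Foundation
Keywords
Microblogs; Search; Semantic modeling; Short text; Social network
DOI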
10.11897/SP.J.1016.2020.00781
Abstract
Microblogs contain large amounts of short text accompanied by time and user information, and mining their semantics to achieve accurate search has received widespread attention. Applying traditional topic models to microblog short text semantic modeling raises two issues. First, traditional topic modeling methods cannot handle the semantic sparsity caused by the shortness of microblogs. Second, because topic models acquire semantics only at the document level, they cannot mine the local semantics present in word contexts. The resulting coarse semantic representation leads to inaccurate search results. To obtain a high-quality semantic representation and realize precise search, we propose a Microblog Short text Semantic Modeling method for Search (MSSMS), which contains three components: a short text expansion algorithm based on embedding vectors, a microblog topic model based on expansion, and microblog search. The short text expansion algorithm expands a short text into a long text by using embedding vectors to construct a similar-word set for each word in the short text. Because the embedding vectors carry local semantics, using the expanded long text as the input of the topic model combines the local semantics contained in the embedding vectors with the global semantics acquired through the topic model. Moreover, because short texts have been turned into long texts, their semantic sparsity is reduced. To further alleviate the semantic sparsity of short text, the proposed microblog topic model introduces the bi-term pattern, in which the two words of a word-word pair share the same topic. The proposed microblog topic model also models multiple characteristics (text, time, and user information) simultaneously.
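The expansion step and the bi-term pattern described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy embedding table, the use of cosine similarity to pick similar words, the parameter `k`, and the function names `similar_words`, `expand`, and `biterms` are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy embeddings; a real system would load trained word vectors.
EMBEDDINGS = {
    "rain":    [0.9, 0.1, 0.0],
    "storm":   [0.8, 0.2, 0.1],
    "flood":   [0.7, 0.3, 0.0],
    "concert": [0.0, 0.9, 0.2],
    "music":   [0.1, 0.8, 0.3],
    "ticket":  [0.2, 0.6, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(word, k=2):
    """Return the k vocabulary words closest to `word` (assumed: cosine)."""
    if word not in EMBEDDINGS:
        return []
    scored = [
        (cosine(EMBEDDINGS[word], vec), other)
        for other, vec in EMBEDDINGS.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def expand(tokens, k=2):
    """Expand a short text by appending each word's similar-word set."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(similar_words(token, k))
    return expanded


def biterms(tokens):
    """Unordered pairs of distinct words; each pair would share one topic."""
    return [tuple(sorted(p)) for p in combinations(tokens, 2) if p[0] != p[1]]


long_text = expand(["rain", "concert"], k=1)
pairs = biterms(long_text)
```

In this sketch the two-word post grows to four words and yields six bi-terms instead of one, which is the sense in which expansion gives the topic model more co-occurrence evidence to work with.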
Jointly modeling text, time, and user information further improves the quality of the generated semantic representation, because multiple characteristics constrain the generation of topics. The model then yields document-topic distributions, topic-word distributions, topic-time Beta distributions, and topic-user distributions. Because text, time, and user information are all mapped into the topic semantic space, these distributions can be regarded as a unified semantic representation. Based on this unified representation, we calculate the similarities between short texts and rank them to realize precise microblog search. Finally, to verify the effectiveness of the proposed MSSMS, we conduct extensive experiments on real-world Sina Weibo datasets. The experiments fall into two categories: one evaluates the semantic modeling ability of MSSMS, and the other applies MSSMS to microblog search. To evaluate the semantic modeling ability comprehensively, we use an objective evaluation metric to measure topic coherence and subjective evaluation methods to assess the quality of the generated semantic representation. The experimental results show that, compared with the baseline algorithms, MSSMS generates the highest-quality semantic representation and has the best semantic modeling ability. The microblog search experiments also verify that MSSMS can achieve accurate microblog search. © 2020, Science Press. All rights reserved.
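The search step, ranking short texts by similarity in the topic space, can be sketched as follows. The abstract does not state which similarity measure is used, so cosine similarity over document-topic distributions is an assumption here, as are the function name `search`, the parameter `top_n`, and the example distributions.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def search(query_topics, doc_topics, top_n=2):
    """Rank documents by similarity to the query in topic space.

    query_topics: topic distribution of the query (hypothetical input).
    doc_topics:   mapping of document id to its document-topic distribution.
    """
    scored = [
        (cosine(query_topics, theta), doc_id)
        for doc_id, theta in doc_topics.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


# Hypothetical document-topic distributions over three topics.
docs = {
    "d1": [0.8, 0.1, 0.1],
    "d2": [0.1, 0.8, 0.1],
    "d3": [0.6, 0.3, 0.1],
}
ranking = search([0.7, 0.2, 0.1], docs, top_n=2)
```

A fuller version would also fold the topic-time and topic-user distributions into the score; this sketch uses only the text side of the representation.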
Pages: 781-795
Page count: 14