A Hybrid Model for Documents Representation

Cited by: 1
Authors
Mohamed, Dina [1]
El-Kilany, Ayman [1]
Mokhtar, Hoda M. O. [1]
Affiliations
[1] Cairo University, Faculty of Computers and Artificial Intelligence, Giza, Egypt
Keywords
Document representation; Latent Dirichlet Allocation; hierarchical Latent Dirichlet Allocation; Word2vec; Isolation Forest
DOI
10.14569/IJACSA.2021.0120339
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Text representation is critical for extracting insights from text. Many models represent text in defined forms such as numeric vectors, which make it easy to compute the similarity between documents using well-known distance measures. In this paper, we build a model that represents text semantically, for either a single document or multiple documents, by combining hierarchical Latent Dirichlet Allocation (hLDA), Word2vec, and Isolation Forest models. The proposed model learns a vector for each document from the relationship between its words' vectors and the hierarchy of topics generated by hLDA. The Isolation Forest model is then used to represent multiple documents as a single profile, which makes it easier to find documents similar to that profile. The proposed model outperforms traditional text representation models when used to represent scientific papers for content-based recommendation of scientific papers to researchers.
Pages: 317-324
Page count: 8
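
The abstract's starting point, that documents held as numeric vectors can be compared with standard distance measures, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional word vectors are hand-written toy values rather than Word2vec output, and plain averaging stands in for the paper's hLDA-based combination of word vectors, which this record does not specify. The Isolation Forest profiling step is not shown.

```python
import math

# Hypothetical toy word vectors; a real pipeline would learn these with Word2vec.
TOY_WORD_VECTORS = {
    "topic": [0.9, 0.1, 0.0],
    "model": [0.8, 0.2, 0.1],
    "vector": [0.7, 0.3, 0.0],
    "protein": [0.0, 0.1, 0.9],
    "cell": [0.1, 0.0, 0.8],
}


def document_vector(tokens, word_vectors):
    """Average the vectors of known tokens.

    Plain averaging is a stand-in chosen for this sketch; it is not the
    hLDA-based weighting described in the abstract.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three toy documents; "unknown" has no vector and is skipped.
doc_nlp = document_vector(["topic", "model", "unknown"], TOY_WORD_VECTORS)
doc_nlp2 = document_vector(["vector", "model"], TOY_WORD_VECTORS)
doc_bio = document_vector(["protein", "cell"], TOY_WORD_VECTORS)

same_field = cosine_similarity(doc_nlp, doc_nlp2)
other_field = cosine_similarity(doc_nlp, doc_bio)
```

With these toy values, `same_field` comes out well above `other_field`, which is the property a content-based recommender relies on when ranking candidate papers against a researcher's documents.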