TEII: Topic enhanced inverted index for top-k document retrieval

Cited by: 10
Authors
Jiang, Di [1]
Leung, Kenneth Wai-Ting [2]
Yang, Lingxiao [3]
Ng, Wilfred [2]
Affiliations
[1] Baidu Inc, Beijing, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Hong Kong, Peoples R China
[3] Univ London London Sch Econ & Polit Sci, London WC2A 2AE, England
Keywords
Topic model; Search engine; Information retrieval
DOI
10.1016/j.knosys.2015.07.014
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, topic modeling has gained significant momentum in information retrieval (IR). Researchers have found that combining the topic information generated through topic modeling with traditional TF-IDF information yields superior results in document retrieval. However, applying this idea to real-life IR systems requires solving two critical problems: how to store the topic information, and how to use it together with the TF-IDF information for efficient document retrieval. In this paper, we propose the Topic Enhanced Inverted Index (TEII), which incorporates the topic information into the inverted index for efficient top-k document retrieval. Specifically, we explore two types of TEII. We first propose the incremental TEII, which adds the topic information to the traditional inverted index in the form of topic-based inverted lists. The incremental TEII is beneficial for legacy IR systems, since it does not change the existing TF-IDF-based inverted lists. As a more flexible alternative, we propose the hybrid TEII, which incorporates the topic information into each posting of the inverted index. In the hybrid TEII, two relaxation methods are proposed to support dynamic estimation of the upper-bound impact of each posting. The hybrid TEII is highly extensible to different ranking factors, and we show an extension that takes into account the static quality of the documents in the corpus. Based on the incremental and hybrid TEIIs, we develop several query processing algorithms to support efficient top-k document retrieval. Empirical evaluation on the TREC dataset verifies the effectiveness and efficiency of the proposed index structures and query processing algorithms. (C) 2015 Elsevier B.V. All rights reserved.
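As a rough illustration of the incremental TEII idea described in the abstract, the Python sketch below keeps the TF-IDF term lists untouched and adds separate topic-based inverted lists beside them. It is not the paper's implementation: the class name `IncrementalTEII`, the linear mixing weight `alpha`, and the exhaustive scoring loop are all assumptions made for illustration, and the paper's own query processing algorithms and upper-bound estimation are not reproduced here.

```python
import heapq
from collections import defaultdict


class IncrementalTEII:
    """Toy index: term-based inverted lists plus separate topic-based lists.

    Hypothetical structure for illustration only; the scoring formula is an
    assumption, not the one used in the paper.
    """

    def __init__(self):
        # term -> list of (doc_id, tf-idf weight); the "legacy" lists
        self.term_lists = defaultdict(list)
        # topic id -> list of (doc_id, topic probability); the added lists
        self.topic_lists = defaultdict(list)

    def add_document(self, doc_id, term_weights, topic_probs):
        # Term postings are written exactly as in a plain inverted index.
        for term, weight in term_weights.items():
            self.term_lists[term].append((doc_id, weight))
        # Topic postings go into their own lists, leaving term lists unchanged.
        for topic, prob in topic_probs.items():
            self.topic_lists[topic].append((doc_id, prob))

    def top_k(self, query_terms, query_topics, k, alpha=0.5):
        """Return the k highest-scoring (doc_id, score) pairs.

        Assumed score: (1 - alpha) * sum of tf-idf weights of matched terms
        plus alpha * sum over query topics of query_prob * doc_prob.
        """
        scores = defaultdict(float)
        for term in query_terms:
            for doc_id, weight in self.term_lists.get(term, []):
                scores[doc_id] += (1 - alpha) * weight
        for topic, query_prob in query_topics.items():
            for doc_id, doc_prob in self.topic_lists.get(topic, []):
                scores[doc_id] += alpha * query_prob * doc_prob
        # Exhaustive accumulation followed by a heap-based selection.
        return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

A call such as `index.top_k(["index"], {0: 1.0}, k=2)` traverses one term list and one topic list and merges their contributions per document. A production system would instead stop early using per-list score bounds, which is the kind of efficiency concern the abstract's query processing algorithms address.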
Pages: 346-358
Number of pages: 13
Related papers
  • [1] Approximating Document Frequency for Self-Index based Top-k Document Retrieval
    Suzuki, Tokinori
    Fujii, Atsushi
    [J]. 2015 IEEE 29TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS WORKSHOPS WAINA 2015, 2015, : 541 - 546
  • [2] Top-k document retrieval in optimal space
    Tsur, Dekel
    [J]. INFORMATION PROCESSING LETTERS, 2013, 113 (12) : 440 - 443
  • [3] Faster Compact Top-k Document Retrieval
    Konow, Roberto
    Navarro, Gonzalo
    [J]. 2013 DATA COMPRESSION CONFERENCE (DCC), 2013, : 351 - 360
  • [4] Top-K Color Queries for Document Retrieval
    Karpinski, Marek
    Nekrich, Yakov
    [J]. PROCEEDINGS OF THE TWENTY-SECOND ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2011, : 401 - 411
  • [5] Faster Compressed Top-k Document Retrieval
    Hon, Wing-Kai
    Shah, Rahul
    Thankachan, Sharma V.
    Vitter, Jeffrey Scott
    [J]. 2013 DATA COMPRESSION CONFERENCE (DCC), 2013, : 341 - 350
  • [6] Top-k Document Retrieval in External Memory
    Shah, Rahul
    Sheng, Cheng
    Thankachan, Sharma V.
    Vitter, Jeffrey Scott
    [J]. ALGORITHMS - ESA 2013, 2013, 8125 : 803 - 814
  • [7] Efficient In-Memory Top-k Document Retrieval
    Culpepper, J. Shane
    Petri, Matthias
    Scholer, Falk
    [J]. SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2012, : 225 - 234
  • [8] TIME-OPTIMAL TOP-k DOCUMENT RETRIEVAL
    Navarro, Gonzalo
    Nekrich, Yakov
    [J]. SIAM JOURNAL ON COMPUTING, 2017, 46 (01) : 80 - 113
  • [9] Faster Top-k Document Retrieval in Optimal Space
    Navarro, Gonzalo
    Thankachan, Sharma V.
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL (SPIRE 2013), 2013, 8214 : 255 - 262
  • [10] New space/time tradeoffs for top-k document retrieval on sequences
    Navarro, Gonzalo
    Thankachan, Sharma V.
    [J]. THEORETICAL COMPUTER SCIENCE, 2014, 542 : 83 - 97