Space-Efficient Framework for Top-k String Retrieval Problems

Cited by: 57
Authors
Hon, Wing-Kai [1]
Shah, Rahul [2]
Vitter, Jeffrey Scott [3]
Affiliations
[1] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu, Taiwan
[2] Louisiana State Univ, Dept Comp Sci, Baton Rouge, LA 70803 USA
[3] Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA
Keywords
document retrieval; text indexing; succinct data structures; top-k queries; suffix arrays
DOI
10.1109/FOCS.2009.19
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Given a set D = {d_1, d_2, ..., d_D} of D strings of total length n, our task is to report the "most relevant" strings for a given query pattern P. This requires more advanced query functionality than ordinary pattern matching, because some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search goes through all the occurrences of the pattern over the entire string collection, which may be far more than the number of relevant documents required. The first formal framework for studying this kind of retrieval problem was given by Muthukrishnan [25]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [25] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
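To make the problem statement concrete, the following is a minimal brute-force sketch of top-k retrieval with frequency as the relevance metric. It is not the data structure from the paper: it scans every document on every query, which is exactly the cost the abstract says an index should avoid. The names `count_occurrences` and `naive_top_k`, the tie-breaking rule (lower document index first), and the sample documents are all assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_top_k(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_index, frequency) pairs in decreasing frequency.

    Documents that do not contain the pattern are left out.
    Ties are broken by the lower document index (an arbitrary choice).
    """
    scored = []
    for index, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, index))
    best = heapq.nsmallest(k, scored, key=lambda fi: (-fi[0], fi[1]))
    return [(index, freq) for freq, index in best]


if __name__ == "__main__":
    docs = ["abracadabra", "cadabra", "xyz", "abab"]
    print(naive_top_k(docs, "ab", 2))
```

Each query here costs time proportional to the total length n of the collection, regardless of k. The abstract's claim is that a linear-space index can instead answer such queries in optimal time.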
Pages: 713-722
Page count: 10
Related papers
50 in total
  • [1] Efficient Compressed Indexing for Approximate Top-k String Retrieval
    Ferrada, Hector
    Navarro, Gonzalo
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2014, 2014, 8799 : 18 - 30
  • [2] A Framework for Space-Efficient String Kernels
    Belazzougui, Djamal
    Cunial, Fabio
    [J]. ALGORITHMICA, 2017, 79 (03) : 857 - 883
  • [3] A Framework for Space-Efficient String Kernels
    Djamal Belazzougui
    Fabio Cunial
    [J]. Algorithmica, 2017, 79 : 857 - 883
  • [4] Time- and Space-Efficient Sliding Window Top-k Query Processing
    Pripuzic, Kresimir
    Zarko, Ivana Podnar
    Aberer, Karl
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 2015, 40 (01):
  • [5] Efficient Top-K Retrieval with Signatures
    Chappell, Timothy
    Geva, Shlomo
    Nguyen, Anthony
    Zuccon, Guido
    [J]. PROCEEDINGS OF THE 18TH AUSTRALASIAN DOCUMENT COMPUTING SYMPOSIUM (ADCS 2013), 2013, : 10 - 17
  • [6] Top-k document retrieval in optimal space
    Tsur, Dekel
    [J]. INFORMATION PROCESSING LETTERS, 2013, 113 (12) : 440 - 443
  • [7] Space-Efficient Frameworks for Top-k String Retrieval
    Hon, Wing-Kai
    Shah, Rahul
    Thankachan, Sharma V.
    Vitter, Jeffrey Scott
    [J]. JOURNAL OF THE ACM, 2014, 61 (02)
  • [8] Efficient Top-k Retrieval on Massive Data
    Han, Xixian
    Li, Jianzhong
    Gao, Hong
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (10) : 2687 - 2699
  • [9] Efficient skyline and top-k retrieval in subspaces
    Tao, Yufei
    Xiao, Xiaokui
    Pei, Jian
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2007, 19 (08) : 1072 - 1088
  • [10] Efficient Top-k Retrieval on Massive Data
    Han, Xixian
    Li, Jianzhong
    Gao, Hong
    [J]. 2016 32ND IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2016, : 1496 - 1497