Space-Efficient Framework for Top-k String Retrieval Problems

Cited by: 57
Authors
Hon, Wing-Kai [1]
Shah, Rahul [2]
Vitter, Jeffrey Scott [3]
Affiliations
[1] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu, Taiwan
[2] Louisiana State Univ, Dept Comp Sci, Baton Rouge, LA 70803 USA
[3] Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA
Keywords
document retrieval; text indexing; succinct data structures; top-k queries; suffix arrays
DOI
10.1109/FOCS.2009.19
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Given a set D = {d_1, d_2, ..., d_D} of D strings of total length n, our task is to report the "most relevant" strings for a given query pattern P. This requires more advanced query functionality than ordinary pattern matching, because some notion of relevance is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for a predefined set of patterns. In the pattern matching community, the most popular data structures are suffix trees and suffix arrays. However, a typical suffix tree search goes through all occurrences of the pattern over the entire string collection, which may be far more numerous than the relevant documents required. The first formal framework for studying this kind of retrieval problem was given by Muthukrishnan [25], who considered two relevance metrics: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework: reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [25] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
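To make the query in the abstract concrete, the following is a minimal brute-force sketch of top-k retrieval under the frequency metric. It is not the paper's data structure: it scans every document on each query, which is exactly the cost the proposed framework is meant to avoid. The function names `count_occurrences` and `top_k_by_frequency`, and the rule of breaking ties by document index, are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, most frequent first.

    Naive baseline: every document is scanned for every query.
    Ties are broken by smaller document index (an assumption of this sketch).
    """
    scored = []
    for index, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:  # documents without the pattern are not reported
            scored.append((freq, index))
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(index, freq) for freq, index in best]


if __name__ == "__main__":
    docs = ["abracadabra", "cadabra", "banana", "xyz"]
    print(top_k_by_frequency(docs, "a", 2))
```

Swapping `count_occurrences` for another scoring function (for example, one based on proximity of occurrences) changes the relevance metric without changing the reporting logic, which mirrors the abstract's point that the framework handles arbitrary score functions.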
Pages: 713-722
Page count: 10