Document Retrieval on Repetitive Collections

Cited by: 0
Authors
Navarro, Gonzalo [1]
Puglisi, Simon J. [2]
Siren, Jouni [1]
Affiliations
[1] Univ Chile, Dept Comp Sci, Ctr Biotechnol & Bioengn, Santiago, Chile
[2] Univ Helsinki, Helsinki, Finland
Source
ALGORITHMS - ESA 2014 | 2014 / Vol. 8737
Funding
Academy of Finland;
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space tradeoffs, particularly on repetitive collections.
Pages: 725-736
Page count: 12
Related papers
50 in total
  • [1] Document retrieval on repetitive string collections
    Gagie, Travis
    Hartikainen, Aleksi
    Karhu, Kalle
    Karkkainen, Juha
    Navarro, Gonzalo
    Puglisi, Simon J.
    Siren, Jouni
    INFORMATION RETRIEVAL JOURNAL, 2017, 20 (03): : 253 - 291
  • [2] Document retrieval on repetitive string collections
    Travis Gagie
    Aleksi Hartikainen
    Kalle Karhu
    Juha Kärkkäinen
    Gonzalo Navarro
    Simon J. Puglisi
    Jouni Sirén
    Information Retrieval Journal, 2017, 20 : 253 - 291
  • [3] Document Listing on Repetitive Collections
    Gagie, Travis
    Karhu, Kalle
    Navarro, Gonzalo
    Puglisi, Simon J.
    Siren, Jouni
    COMBINATORIAL PATTERN MATCHING, 2013, 7922 : 107 - 119
  • [4] Retrieval from document image collections
    Balasubramanian, A
    Meshesha, M
    Jawahar, C
    DOCUMENT ANALYSIS SYSTEMS VII, PROCEEDINGS, 2006, 3872 : 1 - 12
  • [5] Universal indexes for highly repetitive document collections
    Claude, Francisco
    Farina, Antonio
    Martinez-Prieto, Miguel A.
    Navarro, Gonzalo
    INFORMATION SYSTEMS, 2016, 61 : 1 - 23
  • [6] Document listing on repetitive collections with guaranteed performance
    Navarro, Gonzalo
    THEORETICAL COMPUTER SCIENCE, 2019, 772 : 58 - 72
  • [7] On the reproducibility of experiments of indexing repetitive document collections
    Farina, Antonio
    Martinez-Prieto, Miguel A.
    Claude, Francisco
    Navarro, Gonzalo
    Lastra-Diaz, Juan J.
    Prezza, Nicola
    Seco, Diego
    INFORMATION SYSTEMS, 2019, 83 : 181 - 194
  • [8] Storage and Retrieval of Highly Repetitive Sequence Collections
    Makinen, Veli
    Navarro, Gonzalo
    Siren, Jouni
    Valimaki, Niko
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2010, 17 (03) : 281 - 308
  • [9] Semantic Retrieval and Navigation in Clinical Document Collections
    Kreuzthaler, Markus
    Daumke, Philipp
    Schulz, Stefan
    EHEALTH2015 - HEALTH INFORMATICS MEETS EHEALTH: INNOVATIVE HEALTH PERSPECTIVES: PERSONALIZED HEALTH, 2015, 212 : 9 - 14
  • [10] Content-based document image retrieval in complex document collections
    Agam, G.
    Argamon, S.
    Frieder, O.
    Grossman, D.
    Lewis, D.
    DOCUMENT RECOGNITION AND RETRIEVAL XIV, 2007, 6500