The suffix-signature method for searching for phrases in text

Cited by: 2
Authors
Zhou, M
Tompa, FW
Affiliations
[1] Open Text Corp, Waterloo, ON N2L 5Z5, Canada
[2] Univ Waterloo, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
Keywords
text indexing; phrase search; suffix arrays; PAT arrays; signature arrays
DOI
10.1016/S0306-4379(98)00029-5
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We present a new algorithm to find all occurrences of a given phrase, based on the data structure known as a suffix array and using a corresponding array of signatures. With this algorithm, matches to phrases of moderate length can be found with an expected search time of one disk access to the text and one disk access to its index. Achieving this performance for phrases of up to five words in length requires an index with a total size of approximately 120% of the size of the text. The algorithm guarantees a worst-case search performance of two disk accesses to the text per phrase search. Experiments with actual data ranging in size from 6 MB to 550 MB, and with actual query patterns derived from logs of searches on the World Wide Web, show that the approach is applicable in practice to a variety of texts and realistic phrase searches. © 1998 Elsevier Science Ltd. All rights reserved.
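As a rough illustration of the suffix-array part of the abstract, the sketch below performs phrase search over a word-level suffix array held in memory. It is not the authors' algorithm: the abstract does not say how signatures are computed or laid out on disk, so the signature array and all disk-access behavior are omitted. The function names (`tokenize`, `build_suffix_array`, `find_phrase`) and the whitespace tokenization are assumptions made for this example only.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase, whitespace-separated words are the index terms.
    return text.lower().split()


def build_suffix_array(words: List[str]) -> List[int]:
    # Naive construction: sort every word position by the word sequence
    # that starts there. Fine for a demo, far too slow for large texts.
    return sorted(range(len(words)), key=lambda i: words[i:])


def find_phrase(words: List[str], sa: List[int], phrase: List[str]) -> List[int]:
    # Return the word offsets where `phrase` occurs, using two binary
    # searches over the suffix array to delimit the matching range.
    m = len(phrase)

    # Lower bound: first suffix whose m-word prefix is >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if words[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-word prefix is > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if words[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid

    # Every entry in sa[start:lo] begins with the phrase.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    words = tokenize("to be or not to be that is the question")
    sa = build_suffix_array(words)
    print(find_phrase(words, sa, ["to", "be"]))
```

In this in-memory sketch each probe of the binary search compares against the text itself. On disk, each such comparison would cost a text access, which is the cost the signature array in the abstract is meant to reduce.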
Pages: 567-588
Page count: 22