The suffix-signature method for searching for phrases in text

Cited by: 2
Authors
Zhou, M
Tompa, FW
Affiliations
[1] Open Text Corp, Waterloo, ON N2L 5Z5, Canada
[2] Univ Waterloo, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
Keywords
text indexing; phrase search; suffix arrays; PAT arrays; signature arrays
DOI
10.1016/S0306-4379(98)00029-5
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We present a new algorithm to find all occurrences of a given phrase based on the data structure known as a suffix array and using a corresponding array of signatures. With this algorithm, matches to phrases of moderate length can be found with expected search time of one disk access to the text and one disk access to its index. To achieve this performance for phrases of up to five words in length requires an index having total size of approximately 120% of the size of the text. The algorithm guarantees a worst case search performance of two disk accesses to the text per phrase search. Experiments with actual data ranging in size from 6Mb to 550Mb and with actual query patterns derived from logs of searches on the World Wide Web show that the approach is applicable in practice to a variety of texts and realistic phrase searches. © 1998 Elsevier Science Ltd. All rights reserved.
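The general idea in the abstract, a suffix array paired with a parallel array of signatures, can be illustrated with a small in-memory sketch. This is not the authors' algorithm or signature scheme, which the abstract does not specify. It assumes, purely for illustration, that a signature is the first `sig_len` characters of each word-aligned suffix, that the text never contains the sentinel character U+10FFFF, and that the names `build_index` and `find_phrase` are hypothetical. Because truncated prefixes keep the suffix order, the binary search runs on the signatures alone, and the text is consulted only to confirm candidates when the phrase is longer than a signature.

```python
import re
from bisect import bisect_left

def build_index(text, sig_len=8):
    """Build a word-aligned suffix array and a parallel signature array.

    Assumption for illustration: a signature is simply the first
    `sig_len` characters of the suffix, which preserves suffix order.
    """
    starts = [m.start() for m in re.finditer(r"\w+", text)]
    # Naive construction: sort word-start positions by the suffix they begin.
    sa = sorted(starts, key=lambda p: text[p:])
    sigs = [text[p:p + sig_len] for p in sa]
    return sa, sigs

def find_phrase(text, sa, sigs, phrase, sig_len=8):
    """Return sorted start positions of word-aligned suffixes that begin with `phrase`."""
    key = phrase[:sig_len]
    # Binary search on the signatures only (the "index"); no text access here.
    lo = bisect_left(sigs, key)
    hi = bisect_left(sigs, key + "\U0010FFFF")  # sentinel assumed absent from text
    candidates = sa[lo:hi]
    if len(phrase) <= sig_len:
        # The signature already contains the whole phrase.
        return sorted(candidates)
    # Longer phrases: confirm each candidate against the text itself.
    return sorted(p for p in candidates if text.startswith(phrase, p))

text = "the quick brown fox jumps over the quick brown dog and the quick red fox"
sa, sigs = build_index(text)
print(find_phrase(text, sa, sigs, "the quick brown"))
print(find_phrase(text, sa, sigs, "quick red"))
```

In this sketch the candidate check touches the text once per candidate. The expected and worst-case disk-access bounds stated in the abstract depend on the paper's own signature design and are not reproduced here.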
Pages: 567-588
Page count: 22
Related papers
50 in total
  • [41] Dotted suffix trees - A structure for approximate text indexing
    Coelho, Luis Pedro
    Oliveira, Arlindo L.
    STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2006, 4209 : 329 - 336
  • [42] Text generation by probabilistic suffix tree language model
    Marukatat, Sanparith
    16TH INTERNATIONAL JOINT SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE PROCESSING (ISAI-NLP 2021), 2021,
  • [43] New text indexing functionalities of the compressed suffix arrays
    Sadakane, K
    JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 2003, 48 (02): : 294 - 313
  • [44] Hierarchical clustering of text corpora using suffix trees
    Maslowska, I
    Slowinski, R
    INTELLIGENT INFORMATION PROCESSING AND WEB MINING, 2003, : 179 - 188
  • [45] Text clustering using a suffix tree similarity measure
    Huang C.
    Yin J.
    Hou F.
    Journal of Computers, 2011, 6 (10) : 2180 - 2186
  • [46] THE EFFECTIVENESS OF A SEARCHING THESAURUS IN FREE-TEXT SEARCHING IN A FULL-TEXT DATABASE
    KRISTENSEN, J
    JARVELIN, K
    INTERNATIONAL CLASSIFICATION, 1990, 17 (02): : 77 - 84
  • [47] COMMON PHRASES AND MINIMUM-SPACE TEXT STORAGE
    WAGNER, RA
    COMMUNICATIONS OF THE ACM, 1973, 16 (03) : 148 - 152
  • [48] Stylistic Effect of indefinite Noun phrases in a literary Text or Implicature by Avoiding definite Noun phrases
    Sumidai, Yasunori
    SPRACHWISSENSCHAFT, 2012, 37 (02): : 213 - 241
  • [49] Using Negation and Phrases in Inducing Rules for Text Classification
    Chua, Stephanie
    Coenen, Frans
    Malcolm, Grant
    Garcia-Constantino, Matias Fernando
    RESEARCH AND DEVELOPMENT IN INTELLIGENT SYSTEMS XXVIII: INCORPORATING APPLICATIONS AND INNOVATIONS IN INTELLIGENT SYSTEMS XIX, 2011, : 153 - 166
  • [50] Mining Quality Phrases from Massive Text Corpora
    Liu, Jialu
    Shang, Jingbo
    Wang, Chi
    Ren, Xiang
    Han, Jiawei
    SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015, : 1729 - 1744