The suffix-signature method for searching for phrases in text

Cited by: 2
Authors
Zhou, M
Tompa, FW
Affiliations
[1] Open Text Corp, Waterloo, ON N2L 5Z5, Canada
[2] Univ Waterloo, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
Keywords
text indexing; phrase search; suffix arrays; PAT arrays; signature arrays
DOI
10.1016/S0306-4379(98)00029-5
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We present a new algorithm to find all occurrences of a given phrase based on the data structure known as a suffix array and using a corresponding array of signatures. With this algorithm, matches to phrases of moderate length can be found with expected search time of one disk access to the text and one disk access to its index. To achieve this performance for phrases of up to five words in length requires an index having total size of approximately 120% of the size of the text. The algorithm guarantees a worst-case search performance of two disk accesses to the text per phrase search. Experiments with actual data ranging in size from 6 MB to 550 MB and with actual query patterns derived from logs of searches on the World Wide Web show that the approach is applicable in practice to a variety of texts and realistic phrase searches. © 1998 Elsevier Science Ltd. All rights reserved.
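The abstract names the two structures involved (a suffix array and a parallel array of signatures) but not how the signatures are computed or how the disk-access bounds are achieved. The following is only a minimal in-memory sketch of the general idea, not the authors' algorithm. It assumes the text is lower-cased with single spaces between words, and it assumes a signature is simply the first `sig_len` characters of each suffix. The names `build_index`, `find_phrase`, and `sig_len` are hypothetical.

```python
from bisect import bisect_left


def build_index(text, sig_len=8):
    """Return (suffix_array, signatures) for the word-start suffixes of text.

    Assumption: text is normalised to single spaces between words.
    """
    # Index only suffixes that begin at a word boundary.
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == " "]
    # Suffix array: word-start positions ordered by the suffix beginning there.
    suffix_array = sorted(starts, key=lambda p: text[p:])
    # Assumed signature: a short fixed-length prefix of each suffix. Truncating
    # sorted strings keeps them in non-decreasing order, so this array can be
    # binary-searched without reading the text.
    signatures = [text[p:p + sig_len] for p in suffix_array]
    return suffix_array, signatures


def find_phrase(text, suffix_array, signatures, phrase, sig_len=8):
    """Return sorted start offsets of whole-word occurrences of phrase."""
    probe = phrase[:sig_len]
    # Locate the first signature that could match, using the index only.
    i = bisect_left(signatures, probe)
    hits = []
    while i < len(signatures) and signatures[i].startswith(probe):
        pos = suffix_array[i]
        end = pos + len(phrase)
        # Verification step: this is the only place the text itself is read.
        if text.startswith(phrase, pos) and (end == len(text) or text[end] == " "):
            hits.append(pos)
        i += 1
    return sorted(hits)


text = "the cat sat on the mat and the cat ran"
suffix_array, signatures = build_index(text)
print(find_phrase(text, suffix_array, signatures, "the cat"))
```

In this sketch the signature array narrows the candidates before any comparison against the text, which mirrors the split between index accesses and text accesses described in the abstract. It does not reproduce the reported bounds of one expected and two worst-case text accesses, since every candidate sharing a signature is verified individually.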
Pages: 567-588
Page count: 22
Related papers
50 in total
  • [31] Statistical recognition of noun phrases in unrestricted text
    Serrano, JI
    Araujo, L
    ADVANCES IN INTELLIGENT DATA ANALYSIS VI, PROCEEDINGS, 2005, 3646 : 397 - 408
  • [32] An Approach for Text Mining Based on Noun Phrases
    Pinheiro, Marcello Sandi
    do Prado, Hercules Antonio
    Ferneda, Edilson
    Ladeira, Marcelo
    INTELLIGENT DECISION TECHNOLOGIES, 2015, 39 : 525 - 535
  • [33] Detection of Trends of Technical Phrases in Text Mining
    Abe, Hidenao
    Tsumoto, Shusaku
    2009 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING (GRC 2009), 2009, : 7 - 12
  • [34] Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree
    Jiang, Hongshan
    Zhao, Ying
    Chen, Wenguang
    Zheng, Weimin
    ADVANCES IN COMPUTATIONAL BIOLOGY, 2010, 680 : 19 - 26
  • [35] ANNOTATED SUFFIX TREE AS A WAY OF TEXT REPRESENTATION FOR INFORMATION RETRIEVAL IN TEXT COLLECTIONS
    Frolov, Dmitry S.
    BIZNES INFORMATIKA-BUSINESS INFORMATICS, 2015, 34 (04): : 63 - 70
  • [36] Using Suffix Tray and Longest Previous Factor for Pattern Searching
    Kongsen, Jongsuk
    Chairungsee, Supaporn
    PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY (ICIT 2017), 2017, : 7 - 11
  • [37] On the cost of searching signature trees
    Chen, YJ
    INFORMATION PROCESSING LETTERS, 2006, 99 (01) : 19 - 26
  • [38] Faster Compressed Suffix Trees for Repetitive Text Collections
    Navarro, Gonzalo
    Ordonez, Alberto
    EXPERIMENTAL ALGORITHMS, SEA 2014, 2014, 8504 : 424 - 435
  • [39] A Fast Searching for Similar Text using Genomic Read Mapping Method
    Ock, Chang Seok
    Kim, Sung-Hwan
    Tak, Haesung
    Cho, Hwan Gue
    2013 IEEE 16TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND ENGINEERING (CSE 2013), 2013, : 219 - 226
  • [40] Assessment of Graduate Students' Resumes Using Short Text Searching Method
    Nasr, Sara
    German, Oleg Vitoldovicz
    2019 IEEE SECOND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE), 2019, : 306 - 308