An efficient algorithm for finding similar short substrings from large scale string data

Cited by: 0
Authors
Uno, Takeaki [1]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS | 2008 / Vol. 5012
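Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract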
Finding similar substrings/substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, and log data. From the viewpoint of complexity theory, the existence of polynomial-time algorithms for such problems is usually trivial, since the number of substrings is bounded by the square of the string length. However, straightforward algorithms do not work for huge practical databases because their computation time is a high-degree polynomial. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings. By solving the problem for all substrings of a fixed length, we can efficiently find candidates for similar longer substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the database size. We prove that the computation time of a variant is linear in the database size when the length of the short strings to be found is constant. With slight modifications, the algorithm adapts to edit distance and mismatch-tolerance computation. Computational experiments on genome sequences show the efficiency of the algorithm. An implementation is available at the author's homepage.
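To make the problem statement in the abstract concrete, the following is a minimal Python sketch of finding all pairs of equal-length strings within a given Hamming distance. It is not taken from the paper and should not be read as the author's algorithm: it uses a generic pigeonhole block filter (if two strings differ in at most d positions, at least one of d + 1 disjoint blocks must match exactly), and the names `hamming`, `fixed_length_substrings`, and `similar_pairs` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fixed_length_substrings(text: str, length: int) -> list[str]:
    """All substrings of the given fixed length, in order of start position."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def similar_pairs(strings: list[str], d: int) -> set[tuple[int, int]]:
    """Index pairs (i, j), i < j, whose strings have Hamming distance <= d.

    Assumption (illustrative only): a pigeonhole block filter, not
    necessarily the method proposed in the paper.
    """
    if not strings:
        return set()
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")

    k = d + 1  # with d mismatches, at least one of k blocks is untouched
    bounds = [length * b // k for b in range(k + 1)]
    result = set()
    for b in range(k):
        lo, hi = bounds[b], bounds[b + 1]
        # Group string indices by the exact content of block b.
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets[s[lo:hi]].append(idx)
        # Only strings sharing a block are candidates; verify each pair.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if hamming(strings[i], strings[j]) <= d:
                    result.add((i, j))
    return result
```

As a usage example, `similar_pairs(fixed_length_substrings(text, 8), 1)` would report pairs of start positions in `text` whose length-8 windows differ in at most one character. The verification step can still be quadratic within a large bucket, so this sketch illustrates the problem rather than the near-linear running time claimed in the abstract.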
Pages: 345 - 356
Page count: 12
Related papers
50 in total
  • [31] An efficient data gathering algorithm for large-scale wireless sensor networks with mobile sinks
    Zhao, Jumin
    Tang, Qingming
    Li, Deng-ao
    Zhu, Biaokai
    Li, Yikun
    INTERNATIONAL JOURNAL OF AD HOC AND UBIQUITOUS COMPUTING, 2018, 28 (01) : 35 - 44
  • [32] An Efficient Route Planning Algorithm for Special Vehicles with Large-Scale Road Network Data
    Tian, Ting
    Wu, Huijing
    Wei, Haitao
    Wu, Fang
    Xu, Mingliang
    ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2025, 14 (02)
  • [33] A time and space efficient data structure for string searching on large texts
    Dipto. di Matemat. Pura ed Applicata, Università di Padova, Via Belzoni 7, I-35131 Padova, Italy
    Inf. Process. Lett., 5 (217-222):
  • [34] A time and space efficient data structure for string searching on large texts
    Colussi, L
    DeCol, A
    INFORMATION PROCESSING LETTERS, 1996, 58 (05) : 217 - 222
  • [35] AN ALGORITHM OF LARGE-SCALE APPROXIMATE MULTIPLE STRING MATCHING FOR NETWORK SECURITY
    Song, Tian
    Xue, Yibo
    Wang, Dongsheng
    2006 FIRST INTERNATIONAL CONFERENCE ON COMMUNICATIONS AND NETWORKING IN CHINA, 2006,
  • [36] Time and Space Efficient Large Scale Link Discovery using String Similarities
    Karampelas, Andreas
    Vouros, George A.
    FUNDAMENTA INFORMATICAE, 2020, 172 (03) : 299 - 325
  • [37] Memory-Efficient Pipelined Architecture for Large-Scale String Matching
    Yang, Yi-Hua E.
    Prasanna, Viktor K.
    PROCEEDINGS OF THE 2009 17TH IEEE SYMPOSIUM ON FIELD PROGRAMMABLE CUSTOM COMPUTING MACHINES, 2009, : 104 - 111
  • [38] Efficient updates in dynamic XML data: from binary string to quaternary string
    Changqing Li
    Tok Wang Ling
    Min Hu
    The VLDB Journal, 2008, 17 : 573 - 601
  • [39] Efficient updates in dynamic XML data: from binary string to quaternary string
    Li, Changqing
    Ling, Tok Wang
    Hu, Min
    VLDB JOURNAL, 2008, 17 (03) : 573 - 601
  • [40] Efficient Methods for Sampling Responses from Large-Scale Qualitative Data
    Singh, Surendra N.
    Hillmer, Steve
    Wang, Ze
    MARKETING SCIENCE, 2011, 30 (03) : 532 - 549