An efficient algorithm for finding similar short substrings from large scale string data

Citations: 0
Author
Uno, Takeaki [1]
Affiliation
[1] Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Finding similar substrings/substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, and log data. From the viewpoint of complexity theory, the existence of polynomial-time algorithms for such problems is usually trivial, since the number of substrings is bounded by the square of the string length. However, straightforward algorithms do not work on huge practical databases because their computation time is a high-degree polynomial. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings. By solving the problem for all substrings of a fixed length, we can efficiently find candidates for similar non-short substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the database size. We prove that the computation time of a variant is bounded by a linear function of the database size when the length of the short strings to be found is constant. Slight modifications of the algorithm adapt it to edit distance and to mismatch-tolerance computation. Computational experiments on genome sequences show the efficiency of the algorithm. An implementation is available at the author's homepage.
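The core problem in the abstract, finding all pairs of equal-length short strings within a small Hamming distance, can be illustrated with a minimal sketch. The code below is not taken from the paper and is not claimed to be the author's algorithm; it uses the standard pigeonhole idea that two strings differing in at most d positions must agree exactly on at least one of d + 1 disjoint blocks. The function names `hamming` and `similar_pairs` are hypothetical, and the sketch assumes all input strings have the same length.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(strings, d):
    """Return the set of index pairs (i, j), i < j, whose strings are
    within Hamming distance d.

    Illustrative sketch only (assumption: all strings share one length).
    """
    if not strings:
        return set()
    length = len(strings[0])
    blocks = d + 1
    # Block b covers positions [bounds[b], bounds[b + 1]).
    bounds = [length * b // blocks for b in range(blocks + 1)]
    found = set()
    for b in range(blocks):
        lo, hi = bounds[b], bounds[b + 1]
        # Group string indices by the content of the current block.
        buckets = defaultdict(list)
        for i, s in enumerate(strings):
            buckets[s[lo:hi]].append(i)
        # Only strings sharing a block can be within distance d,
        # so verify candidate pairs inside each bucket.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if (i, j) not in found and hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


reads = ["ACGTAC", "ACGTAA", "TTTTTT", "ACCTAA"]
print(sorted(similar_pairs(reads, 1)))  # [(0, 1), (1, 3)]
```

Bucketing by block content avoids comparing every pair directly when most strings are dissimilar, which is the practical concern the abstract raises about straightforward quadratic approaches. In the worst case (many strings sharing a block) the sketch still degrades toward all-pairs comparison.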
Pages: 345 - 356
Page count: 12
Related papers
50 in total
  • [1] Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data
    Takeaki Uno
    Knowledge and Information Systems, 2010, 25 : 229 - 251
  • [2] Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data
    Uno, Takeaki
    KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 25 (02) : 229 - 251
  • [3] An Efficient Motif Finding Algorithm for Large DNA Data Sets
    Yu, Qiang
    Huo, Hongwei
    Chen, Xiaoyang
    Guo, Haitao
    Vitter, Jeffrey Scott
    Huan, Jun
    2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2014,
  • [4] An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings
    Thankachan, Sharma V.
    Chockalingam, Sriram P.
    Aluru, Srinivas
    BIOINFORMATICS RESEARCH AND APPLICATIONS, ISBRA 2016, 2016, 9683 : 3 - 14
  • [5] An Efficient Large-Scale Volume Data Compression Algorithm
    Xiao, Degui
    Zhao, Liping
    Yang, Lei
    Li, Zhiyong
    Li, Kenli
    ADVANCES IN NEURAL NETWORKS - ISNN 2009, PT 3, PROCEEDINGS, 2009, 5553 : 567 - 575
  • [6] An efficient algorithm for dense regions discovery from large-scale data streams
    Yip, AM
    Wu, EH
    Ng, MK
    Chan, TF
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2004, 3056 : 116 - 120
  • [7] An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data
    Madeira, Sara C.
    Oliveira, Arlindo L.
    PROCEEDINGS OF THE 5TH ASIA-PACIFIC BIOINFORMATICS CONFERENCE 2007, 2007, 5 : 67 - +
  • [8] A Short Text Similarity Algorithm for Finding Similar Police 110 Incidents
    Duan, Lei
    Xu, Tongge
    2016 7TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA (CCBD), 2016, : 260 - 264
  • [9] A SPACE-EFFICIENT SHORT-FINDING ALGORITHM
    SU, SL
    BARRY, CH
    LO, CY
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 1994, 13 (08) : 1065 - 1068
  • [10] SCAN++: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Large-scale Graphs
    Shiokawa, Hiroaki
    Fujiwara, Yasuhiro
    Onizuka, Makoto
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 8 (11) : 1178 - 1189