Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Author
Takeaki Uno
Affiliation
[1] National Institute of Informatics,
Source
Keywords
Neighbor search; Neighbor graph construction; Similarity analysis; Data analysis; Large scale data; Homology search;
DOI
Not available
Abstract
Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, feature vectors of pictures, photos, videos, etc. Although the existence of polynomial time algorithms for such problems is trivial since the number of substrings is bounded by the square of their lengths, straightforward algorithms do not work for huge databases because of their high degree order of the computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length. Comparison of long strings can be solved by inputting all their substrings of fixed length so that we can find candidates of similar non-short substrings. We focus on the practical efficiency of algorithms, and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of its variant is linear in the database size when the length of the short strings is constant, and computational experiments for genome sequences and Web texts show its practical efficiency. Slight modifications adapt to the edit distance and mismatch tolerance computation. An implementation is available at the author’s homepage.
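The abstract does not spell out the algorithm, so the following is only a minimal sketch of one well-known way to attack the stated problem, not the author's implementation. It assumes the block-partitioning (pigeonhole) idea: cut each fixed-length string into k blocks with k > d; two strings within Hamming distance d must then agree exactly on at least k - d blocks, so only strings that share such a block combination need to be compared. Dictionary buckets stand in for sorting here, and the names `windows`, `similar_pairs`, `d`, and `k` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def windows(text: str, length: int) -> list[str]:
    """All fixed-length substrings of a long string (sliding window)."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def similar_pairs(strings: list[str], d: int, k: int) -> set[tuple[int, int]]:
    """Index pairs (i, j), i < j, with Hamming distance at most d.

    Assumption: all strings have the same length and d < k <= length.
    """
    if not strings:
        return set()
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    if not 0 <= d < k <= length:
        raise ValueError("need 0 <= d < k <= string length")

    # Boundaries of k nearly equal, non-empty blocks.
    bounds = [length * b // k for b in range(k + 1)]
    found: set[tuple[int, int]] = set()

    # Try every choice of k - d blocks that a similar pair could share.
    for chosen in combinations(range(k), k - d):
        buckets: dict[tuple[str, ...], list[int]] = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in chosen)
            buckets[key].append(idx)
        # Only strings in the same bucket are candidates; verify each one.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGT"]
    print(sorted(similar_pairs(data, d=1, k=2)))  # [(0, 1), (0, 3)]
```

To compare long strings as the abstract suggests, one would first pass them through `windows` and feed the resulting fixed-length substrings to `similar_pairs`. A larger k makes buckets more selective but multiplies the number of block combinations, so the choice is a trade-off rather than a fixed rule.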
Pages: 229-251
Page count: 22
Uno, Takeaki. Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowledge and Information Systems, 2010, 25 (02): 229-251.