Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Author:

Takeaki Uno

Affiliation:

[1] National Institute of Informatics

Source:

Knowledge and Information Systems | 2010 / Vol. 25, No. 2

Keywords:

Neighbor search; Neighbor graph construction; Similarity analysis; Data analysis; Large scale data; Homology search;

Abstract:

Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, feature vectors of pictures, photos, videos, etc. Although the existence of polynomial time algorithms for such problems is trivial since the number of substrings is bounded by the square of their lengths, straightforward algorithms do not work for huge databases because of their high degree order of the computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length. Comparison of long strings can be solved by inputting all their substrings of fixed length so that we can find candidates of similar non-short substrings. We focus on the practical efficiency of algorithms, and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of its variant is linear in the database size when the length of the short strings is constant, and computational experiments for genome sequences and Web texts show its practical efficiency. Slight modifications adapt to the edit distance and mismatch tolerance computation. An implementation is available at the author’s homepage.
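
The problem stated in the abstract (reporting all pairs of fixed-length strings within a small Hamming distance) can be illustrated with a minimal Python sketch. This is not the author's implementation. It assumes a standard pigeonhole filter: if each string is cut into d + 1 blocks, two strings with at most d mismatches must agree exactly on at least one block, so sorting by each block in turn brings every candidate pair into the same group. The names `hamming`, `similar_pairs`, and `windows` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations, groupby


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(strings, d):
    """Return the set of index pairs (i, j), i < j, whose strings are
    within Hamming distance d. All strings must share one length.

    Assumption: a simple pigeonhole filter over d + 1 blocks, not the
    paper's exact multi-sorting procedure.
    """
    if not strings:
        return set()
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")

    k = d + 1  # with k blocks and at most d mismatches, one block matches
    bounds = [length * b // k for b in range(k + 1)]
    pairs = set()
    for b in range(k):
        lo, hi = bounds[b], bounds[b + 1]

        def block(i, lo=lo, hi=hi):
            return strings[i][lo:hi]

        # Sort indices by the current block; equal blocks become adjacent.
        order = sorted(range(len(strings)), key=block)
        for _, group in groupby(order, key=block):
            # The sort is stable, so indices in a group are ascending.
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:
                    pairs.add((i, j))
    return pairs


def windows(text, length):
    """All substrings of a fixed length, used to reduce a long string
    to a database of short strings."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]
```

For example, `similar_pairs(windows(genome, 20), 2)` would report pairs of positions in a hypothetical string `genome` whose 20-character windows differ in at most two places. The verification step still costs time quadratic in the size of each group, so this sketch makes no claim to the near-linear running time described in the abstract.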

Pages: 229-251

Page count: 22
