An efficient algorithm for finding similar short substrings from large scale string data

Cited by: 0

Authors:

Uno, Takeaki [1]

Affiliation:

[1] Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan

Source:

ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS | 2008 / Vol. 5012

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Finding similar substrings/substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, and log data. From the viewpoint of complexity theory, the existence of polynomial-time algorithms for such problems is usually trivial, since the number of substrings is bounded by the square of the string length. However, straightforward algorithms do not work on huge practical databases because their computation time is a polynomial of high degree. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings. By solving the problem for all substrings of a fixed length, we can efficiently find candidates for similar non-short substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the database size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings to be found is constant. With slight modifications, the algorithm adapts to edit distance and to mismatch-tolerance computation. Computational experiments on genome sequences show the efficiency of the algorithm. An implementation is available at the author's homepage.
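
The abstract states the problem but not the algorithm itself. To make the problem concrete, here is a minimal Python sketch that finds all pairs of equal-length strings within Hamming distance d. It uses a standard pigeonhole argument (two strings that differ in at most d positions must agree exactly on at least one of d + 1 disjoint blocks), which is an assumption chosen for illustration and should not be read as the author's method. The names `hamming`, `similar_pairs`, and `windows` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(strings: list[str], d: int) -> set[tuple[int, int]]:
    """Return index pairs (i, j), i < j, with Hamming distance <= d.

    Illustrative assumption: all strings have the same length.
    """
    if not strings:
        return set()
    length = len(strings[0])
    k = d + 1  # with d + 1 blocks, at most d mismatches leave one block intact
    # Split positions 0..length-1 into k nearly equal contiguous blocks.
    bounds = [(b * length // k, (b + 1) * length // k) for b in range(k)]
    found: set[tuple[int, int]] = set()
    for lo, hi in bounds:
        # Group strings that agree exactly on the current block.
        buckets: dict[str, list[int]] = defaultdict(list)
        for i, s in enumerate(strings):
            buckets[s[lo:hi]].append(i)
        # Only strings sharing a bucket can be candidates; verify each pair.
        for ids in buckets.values():
            for i, j in combinations(ids, 2):
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


def windows(text: str, length: int) -> list[str]:
    """All substrings of a fixed length, as in the fixed-length setting."""
    return [text[p:p + length] for p in range(len(text) - length + 1)]
```

For example, `similar_pairs(windows(genome, 20), 2)` would report pairs of start positions whose length-20 substrings differ in at most two positions, where `genome` is any input string. The sketch still compares every pair inside a bucket, so it carries no running-time guarantee of the kind claimed in the abstract.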

Pages: 345-356

Page count: 12
