Efficient tools for comparative substring analysis

Cited by: 16

Authors:

Apostolico, Alberto [2,3]

Denas, Olgert [1]

Dress, Andreas [4,5]

Affiliations:

[1] Emory Univ, Dept Math & Comp Sci, Atlanta, GA 30322 USA

[2] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA

[3] Univ Padua, Accademia Nazl Lincei & DEI, I-35100 Padua, Italy

[4] Chinese Acad Sci, CAS MPG Partner Inst Computat Biol, Shanghai, Peoples R China

[5] Max Planck Inst Math Sci, D-04103 Leipzig, Germany

Source:

JOURNAL OF BIOTECHNOLOGY | 2010 / Volume 149 / Issue 03

Keywords:

Suffix tree; Phylogeny; Maximal words; SEQUENCE ALIGNMENT; EVOLUTIONARY;

DOI:

10.1016/j.jbiotec.2010.05.006

Chinese Library Classification (CLC) number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];

Discipline classification codes:

071005 ; 0836 ; 090102 ; 100705 ;

Abstract:

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve all k-mers for any k up to any preset maximum length K (including K = infinity). Remarkably, although there may be Θ(L^2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k <= K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K). A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use.
This paper describes the details of this implementation, which allows the user to compute composition-based distances for a wide range of instances on data sets of unprecedented size, and which may be useful in assessing the validity of the approach and in fine-tuning the identification of those values of k (or K) that yield the best separators and descriptors for different inputs. It also indicates how the proposed algorithm can be used for other tasks related to the identification and comparative analysis of highly over- or under-represented (sub)strings in given genomes, meta-genomes, or any other sequence families of interest (e.g., all proteins encoded by a given genome, all strings of non-coding or regulatory RNA, all introns, etc.), and thus conforms with the increasing need for radically new, fast, and massive techniques for comparative genome analysis. © 2010 Elsevier B.V. All rights reserved.

Pages: 120-126

Page count: 7
