Efficient tools for comparative substring analysis

Cited by: 16
Authors
Apostolico, Alberto [2, 3]
Denas, Olgert [1]
Dress, Andreas [4, 5]
Affiliations
[1] Emory Univ, Dept Math & Comp Sci, Atlanta, GA 30322 USA
[2] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
[3] Univ Padua, Accademia Nazl Lincei & DEI, I-35100 Padua, Italy
[4] Chinese Acad Sci, CAS MPG Partner Inst Computat Biol, Shanghai, Peoples R China
[5] Max Planck Inst Math Sci, D-04103 Leipzig, Germany
Keywords
Suffix tree; Phylogeny; Maximal words; Sequence alignment; Evolutionary
DOI
10.1016/j.jbiotec.2010.05.006
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve all k-mers for any k up to any preset maximum length K (including K = infinity). Remarkably, although there may be Θ(L^2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k <= K), it is shown that composition-based distances, as well as many other details of interest in comparative genome analysis, can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K). A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use.
This paper describes the details of this implementation (an implementation that allows the user to compute composition-based distances for a wide range of instances on data sets of unprecedented size, which may be useful in assessing the validity of the approach and in fine-tuning the identification of those values of k (or K) yielding the best separators and descriptors in correspondence with different inputs), indicates how the proposed algorithm can also be used for other tasks related to the identification and comparative analysis of highly over- or under-represented (sub)strings in given genomes, meta-genomes, or any other sequence families of interest (e.g., all proteins encoded by a given genome, all strings of non-coding or regulatory RNA, all introns, etc.), and thus conforms with the increasing need for radically new, fast, and massive techniques for comparative genome analysis. (C) 2010 Elsevier B.V. All rights reserved.
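To make the notion of a composition-based distance concrete, the following is a minimal sketch of the direct (non-suffix-tree) approach that the abstract contrasts against. It is not the authors' linear-time algorithm. The abstract does not state the exact distance formula, so the Euclidean distance over relative k-mer frequencies used here is an assumption, and the function names `kmer_counts`, `composition_vector`, and `composition_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_vector(seq, max_k):
    """Relative frequency of each k-mer of seq, for all 1 <= k <= max_k.

    Frequencies are normalized separately for each k, so the values
    for a fixed k sum to 1 (an assumed normalization).
    """
    freqs = {}
    for k in range(1, max_k + 1):
        counts = kmer_counts(seq, k)
        total = sum(counts.values())
        for word, count in counts.items():
            freqs[word] = count / total
    return freqs


def composition_distance(a, b, max_k):
    """Euclidean distance between the composition vectors of a and b.

    Assumed definition, for illustration only. This direct method
    takes time proportional to max_k * (len(a) + len(b)), unlike the
    O(L) suffix-tree-based computation described in the abstract.
    """
    fa = composition_vector(a, max_k)
    fb = composition_vector(b, max_k)
    words = fa.keys() | fb.keys()
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))
```

Calling `composition_distance(seq_a, seq_b, 6)` on two sequences would compare their 1- to 6-mer compositions. The cost of this sketch grows with the chosen maximum length, which is the limitation the paper's implementation is reported to remove.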
Pages: 120-126
Number of pages: 7