Simultaneous identification of long similar substrings in large sets of sequences

Cited by: 3
Authors
Kleffe, Juergen [1 ]
Moeller, Friedrich [1 ]
Wittig, Burghardt [1 ]
Affiliations
[1] Univ Med Berlin, Charite, Inst Mol & Bioinformat, D-14195 Berlin, Germany
DOI
10.1186/1471-2105-8-S5-S7
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Sequence comparison faces new challenges now that many complete genomes and large transcript libraries are available. Gene annotation pipelines match these sequences to identify genes and their alternative splice forms. However, currently available software cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered.
Results: We therefore present a new algorithm for identifying almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with at most a given number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1.
Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contamination. Such inserts are systematically overlooked in long alignments controlled only by tuning penalties for mismatches and gaps.
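The seed-and-window idea in the abstract can be illustrated with a minimal Python sketch. It is not ClustDB's implementation: it assumes a simple hash-based k-mer index for the exact seeds and an ungapped (mismatch-only) extension to the right, and the names `find_seeds` and `extend_right` and their parameters are hypothetical.

```python
from collections import defaultdict


def find_seeds(seqs, k):
    """Return pairs ((i, p), (j, q)) with seqs[i][p:p+k] == seqs[j][q:q+k], i != j.

    Assumption: a plain dictionary index over all k-mers; the paper's
    own data structure is not described in the abstract.
    """
    index = defaultdict(list)
    for i, s in enumerate(seqs):
        for p in range(len(s) - k + 1):
            index[s[p:p + k]].append((i, p))
    seeds = []
    for hits in index.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                if hits[a][0] != hits[b][0]:  # only pair different sequences
                    seeds.append((hits[a], hits[b]))
    return seeds


def extend_right(s, t, p, q, k, window, max_err):
    """Extend the exact seed s[p:p+k] == t[q:q+k] to the right without gaps.

    The extension stops before any run of `window` consecutive aligned
    columns would contain more than `max_err` mismatches. Returns the
    aligned length, with trailing mismatches trimmed.
    """
    flags = [0] * k  # per-column mismatch flags; the seed itself is exact
    errs = 0         # mismatches within the last `window` columns
    length = k
    while p + length < len(s) and q + length < len(t):
        mism = int(s[p + length] != t[q + length])
        new_errs = errs + mism
        if len(flags) >= window:
            # the oldest column slides out of the window
            new_errs -= flags[len(flags) - window]
        if new_errs > max_err:
            break
        flags.append(mism)
        errs = new_errs
        length += 1
    # do not let the alignment end on a mismatch
    while length > k and flags[length - 1]:
        length -= 1
    return length
```

With `window=4` and `max_err=1`, a single isolated mismatch is tolerated and the extension continues, whereas two mismatches inside the same window end it. This local control is what lets a window alignment stop at an inserted contaminating segment that a globally scored alignment might run through.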
Pages: 11
Related papers
50 in total
  • [1] Simultaneous identification of long similar substrings in large sets of sequences
    Jürgen Kleffe
    Friedrich Möller
    Burghardt Wittig
    [J]. BMC Bioinformatics, 8
  • [2] Suffix tree searcher: Exploration of common substrings in large DNA sequence sets
    Minkley D.
    Whitney M.J.
    Lin S.-H.
    Barsky M.G.
    Kelly C.
    Upton C.
    [J]. BMC Research Notes, 7 (1)
  • [3] An efficient algorithm for finding similar short substrings from large scale string data
    Uno, Takeaki
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2008, 5012 : 345 - 356
  • [4] The density of sets containing large similar copies of finite sets
    Kenneth Falconer
    Vjekoslav Kovač
    Alexia Yavicoli
    [J]. Journal d'Analyse Mathématique, 2022, 148 : 339 - 359
  • [5] The density of sets containing large similar copies of finite sets
    Falconer, Kenneth
    Kovac, Vjekoslav
    Yavicoli, Alexia
    [J]. JOURNAL D ANALYSE MATHEMATIQUE, 2022, 148 (01) : 339 - 359
  • [6] Long DNA sequences and large data sets: investigating the Quaternary via ancient DNA
    Hofreiter, Michael
    [J]. QUATERNARY SCIENCE REVIEWS, 2008, 27 (27-28) : 2586 - 2592
  • [7] Representation and Identification of Approximately Similar Event Sequences
    Martin, T. P.
    Azvine, B.
    [J]. FLEXIBLE QUERY ANSWERING SYSTEMS 2015, 2016, 400 : 87 - 99
  • [8] Indexing of sequences of sets for efficient exact and similar subsequence matching
    Andrzejewski, W
    Morzy, T
    Morzy, M
    [J]. COMPUTER AND INFORMATION SCIENCES - ISCIS 2005, PROCEEDINGS, 2005, 3733 : 864 - 873
  • [9] Generation of Checking Sequences Using Identification Sets
    Porto, Faimison Rodrigues
    Endo, Andre Takeshi
    Simao, Adenilso
    [J]. FORMAL METHODS AND SOFTWARE ENGINEERING, 2013, 8144 : 115 - 130
  • [10] Equivalence of gap sequences and Hausdorff dimensions of self-similar sets
    Yang, Yamin
    [J]. PROCEEDING OF THE SEVENTH INTERNATIONAL CONFERENCE ON INFORMATION AND MANAGEMENT SCIENCES, 2008, 7 : 442 - 443