Searching DNA databases for similarities to DNA sequences: when is a match significant?

Cited by: 73
Authors
Anderson, I [1]
Brass, A [1]
Affiliations
[1] Univ Manchester, Sch Biol Sci, Manchester M13 9PT, Lancs, England
DOI
10.1093/bioinformatics/14.4.349
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Searching DNA sequences against a DNA database is an essential element of sequence analysis. However, few systematic studies have been carried out to determine when a match between two DNA sequences has biological significance, and this limits the use that can be made of DNA searching algorithms.
Results: A test set of DNA sequences has been constructed, consisting of artificially evolved and real sequences. This set has been used to test various database searching algorithms (BLAST, BLAST2, FASTA and Smith-Waterman) on a subset of the EMBL database. The results of this analysis have been used to determine the sensitivity and coverage of all of the algorithms. Guidelines have been produced which can be used to assess the significance of DNA database search results. The Smith-Waterman algorithm was shown to have the best coverage, but the worst sensitivity, whereas the default BLASTN algorithm (word length set to 11) was shown to have good sensitivity, but poor coverage. A sensible compromise between speed, sensitivity and coverage can be obtained using either the FASTA or BLAST (word length set to 6) algorithms. However, analysis of the results also showed that no algorithm works well when the length of the probe sequence is <200 bases. In general, matches can accurately be identified between coding regions of DNA sequences when there is >35% sequence identity between the corresponding proteins. Searching a DNA sequence against a DNA sequence database can, therefore, be a useful tool in sequence analysis.
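The two numeric guidelines in the abstract (probe length of at least 200 bases, and more than 35% identity between the corresponding proteins) can be illustrated with a minimal Python sketch. The function names, the gap character, and the choice to compute identity only over columns where neither sequence has a gap are assumptions made for illustration; they are not taken from the paper, and other identity conventions exist.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned protein sequences.

    Assumption: identity is counted over columns where neither sequence
    has a gap. Other denominators (e.g. full alignment length) are also
    in common use and give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def passes_guidelines(probe_length: int, protein_identity: float) -> bool:
    """Hypothetical helper applying the thresholds quoted in the abstract:
    probes shorter than 200 bases search poorly, and coding-region matches
    are reported as reliable above 35% protein sequence identity.
    """
    return probe_length >= 200 and protein_identity > 35.0


if __name__ == "__main__":
    identity = percent_identity("MKT-AY", "MKSQAY")
    print(identity, passes_guidelines(250, identity))
```

This only encodes the rule of thumb stated in the abstract; it does not reproduce the paper's sensitivity or coverage measurements.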
Pages: 349-356
Number of pages: 8