Searching DNA databases for similarities to DNA sequences: when is a match significant?

Cited by: 73
Authors
Anderson, I [1]
Brass, A [1]
Affiliations
[1] Univ Manchester, Sch Biol Sci, Manchester M13 9PT, Lancs, England
Keywords
DOI
10.1093/bioinformatics/14.4.349
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Searching DNA sequences against a DNA database is an essential element of sequence analysis. However, few systematic studies have been carried out to determine when a match between two DNA sequences has biological significance, and this is limiting the use that can be made of DNA searching algorithms.
Results: A test set of DNA sequences has been constructed consisting of artificially evolved and real sequences. This set has been used to test various database searching algorithms (BLAST, BLAST2, FASTA and Smith-Waterman) on a subset of the EMBL database. The results of this analysis have been used to determine the sensitivity and coverage of all of the algorithms. Guidelines have been produced which can be used to assess the significance of DNA database search results. The Smith-Waterman algorithm was shown to have the best coverage, but the worst sensitivity, whereas the default BLASTN algorithm (word length set to 11) was shown to have good sensitivity, but poor coverage. A sensible compromise between speed, sensitivity and coverage can be obtained using either the FASTA or BLAST (word length set to 6) algorithms. However, analysis of the results also showed that no algorithm works well when the length of the probe sequence is <200 bases. In general, matches can accurately be identified between coding regions of DNA sequences when there is >35% sequence identity between the corresponding proteins. Searching a DNA sequence against a DNA sequence database can, therefore, be a useful tool in sequence analysis.
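The link between word length and coverage reported in the abstract can be illustrated with a toy example. The sketch below is not the authors' test procedure. It only assumes the general idea that a BLAST-style search starts from short exact word matches, so a diverged pair of sequences that shares no exact word of the chosen length gives the search nothing to extend. The function name `shared_words` and both sequences are hypothetical, invented for illustration.

```python
def shared_words(query, subject, word_length):
    """Return every exact word of the given length that occurs in both sequences."""
    query, subject = query.upper(), subject.upper()
    subject_words = {
        subject[i:i + word_length]
        for i in range(len(subject) - word_length + 1)
    }
    return {
        query[i:i + word_length]
        for i in range(len(query) - word_length + 1)
        if query[i:i + word_length] in subject_words
    }


# Hypothetical toy data: the query contains no T, and the subject is the query
# with every 8th base replaced by T (28 of 32 positions identical).
query = "ACGGACAGCAAGCGACGAGCCAGGACAGAACG"
subject = "".join("T" if i % 8 == 7 else base for i, base in enumerate(query))

for word_length in (11, 6):
    seeds = shared_words(query, subject, word_length)
    print(f"word length {word_length}: {len(seeds)} shared exact word(s)")
```

In this toy pair every 11-base window of the subject contains a substituted base, so no exact 11-base word is shared, while unchanged 6-base words remain between the substitutions. Real search tools add scoring, extension and statistics on top of this seeding step, so the sketch shows only why a shorter word length can recover more distant matches.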
Pages: 349-356
Number of pages: 8
Related papers
50 in total
  • [41] FASTPAT - A FAST AND EFFICIENT ALGORITHM FOR STRING SEARCHING IN DNA-SEQUENCES
    PRUNELLA, N
    LIUNI, S
    ATTIMONELLI, M
    PESOLE, G
    COMPUTER APPLICATIONS IN THE BIOSCIENCES, 1993, 9 (05): : 541 - 545
  • [42] Efficient Searching for Motifs in DNA Sequences Using Position Weight Matrices
    Stojanovic, Nikola
    BIOMEDICAL ENGINEERING SYSTEMS AND TECHNOLOGIES, 2011, 127 : 394 - 405
  • [43] LEPSCAN-a web server for searching latent periodicity in DNA sequences
    Shelenkov, Andrew
    Korotkov, Eugene
    BRIEFINGS IN BIOINFORMATICS, 2012, 13 (02) : 143 - 149
  • [44] A Guided Dynamic Programming Approach for Searching a Set of Similar DNA Sequences
    Nordin, A. R. M.
    Osman, M. T. A.
    Yazid, M. S. M.
    Aziz, A.
    2009 SECOND INTERNATIONAL CONFERENCE ON THE APPLICATIONS OF DIGITAL INFORMATION AND WEB TECHNOLOGIES (ICADIWT 2009), 2009, : 512 - +
  • [45] SEARCHING FOR PROTEINS AND SEQUENCES OF DNA-REPLICATION IN MAMMALIAN-CELLS
    FALASCHI, A
    RIVA, S
    DELLAVALLE, G
    COBIANCHI, F
    BIAMONTI, G
    VALENTINI, O
    ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY, 1984, 179 : 497 - 506
  • [46] Searching sequences in protein databases generated by overlapping translation
    Benyó, B
    Biro, J
    Fördös, G
    Benyó, Z
    FEBS JOURNAL, 2005, 272 : 106 - 106
  • [47] Sample collection strategies when building mitochondrial DNA forensic databases
    Simao, Filipa
    Castillo, Adriana
    Burgos, German
    Gusma, Leonor
    FORENSIC SCIENCE INTERNATIONAL GENETICS SUPPLEMENT SERIES, 2022, 8 : 91 - 96
  • [48] Identifying DNA and protein patterns with statistically significant alignments of multiple sequences
    Hertz, GZ
    Stormo, GD
    BIOINFORMATICS, 1999, 15 (7-8) : 563 - 577
  • [49] An efficient method for significant motifs discovery from multiple DNA sequences
    Al-Ssulami, Abdulrakeeb M.
    Azmi, Aqil M.
    Mathkour, Hassan
    JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2017, 15 (04)
  • [50] RScan: fast searching structural similarities for structured RNAs in large databases
    Xue, Chenghai
    Liu, Guo-Ping
    BMC GENOMICS, 2007, 8 (1)