Statistics of large-scale sequence searching

Cited by: 14
Authors
Spang, R [1]
Vingron, M [1]
Affiliations
[1] Deutsch Krebsforschungszentrum, Immunochem Abt, D-69120 Heidelberg, Germany
DOI
10.1093/bioinformatics/14.3.279
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Database search programs such as FASTA, BLAST or a rigorous Smith-Waterman algorithm produce lists of database entries, which are assumed to be related to the query. The computation of statistical significance of similarity scores is well established for single pairs of sequences and using purely random models. However, the multi-trial context of a database search poses new problems. The credibility of a certain score obtained in a database search decreases with the amount of data that is compared. To improve p-value computation for database search experiments, statistical properties of the databases, such as the distribution of sequence length and effects induced by frequently repeated sequence patterns, need to be taken into account.
Results: We investigated the SWISS-PROT protein database Release 31.0 running extensive simulations of database searches. A discrepancy is observed between the theoretical predictions and the empirical distribution. To correct for this, we evaluate the statistical significance of scores in the context of a database search by a contrasting semi-random model. This model enhances purely random models by one additional parameter reflecting individual statistical properties of real databases. We call this parameter the effective size of the database.
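Illustrative note: the multi-trial effect described in the abstract can be sketched numerically. The Python sketch below is not the authors' semi-random model; it assumes a generic extreme-value (Karlin-Altschul style) tail for a single random pair and treats the comparisons in a search as independent trials. The function names, the parameters `lam` and `K`, their values, and the use of `n_trials` as a stand-in for an "effective size" are all assumptions made for illustration only.

```python
import math


def pairwise_p_value(score, m, n, lam, K):
    """Tail probability of reaching `score` for one random sequence pair.

    Assumes an extreme-value form with hypothetical parameters lam and K;
    m and n are the lengths of the two sequences.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected_hits)  # 1 - exp(-expected_hits)


def database_p_value(p_single, n_trials):
    """Probability that at least one of n_trials independent comparisons
    reaches the score, i.e. 1 - (1 - p_single) ** n_trials.

    n_trials plays the role of a database size here (an assumption).
    """
    if p_single >= 1.0:
        return 1.0
    return -math.expm1(n_trials * math.log1p(-p_single))


if __name__ == "__main__":
    # Placeholder parameter values, chosen only to make the example run.
    p = pairwise_p_value(score=80, m=300, n=350, lam=0.27, K=0.04)
    for size in (1, 1_000, 50_000):
        print(size, database_p_value(p, size))
```

The same pairwise score becomes less convincing as the number of comparisons grows, which is the effect the abstract describes. Replacing the raw number of entries with a fitted value is one way a single extra parameter could enter such a calculation, though the paper's actual estimation procedure is not given in this record.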
Pages: 279 - 284
Page count: 6
Related papers
50 in total
  • [1] Thoroughly searching sequence space: Large-scale protein design of structural ensembles
    Larson, SM
    England, JL
    Desjarlais, JR
    Pande, V
    [J]. BIOPHYSICAL JOURNAL, 2002, 82 (01) : 460A - 460A
  • [2] Perspectives sequence data base searching in the era of large-scale genomic sequencing
    Smith, RF
    [J]. GENOME RESEARCH, 1996, 6 (08) : 653 - 660
  • [3] THE LARGE-SCALE SEARCHING OF HANDWRITING SAMPLES
    BAXENDALE, D
    RENSHAW, ID
    [J]. JOURNAL OF THE FORENSIC SCIENCE SOCIETY, 1979, 19 (04) : 245 - 251
  • [4] The statistics of the large-scale velocity field
    Bernardeau, F
    [J]. MAPPING, MEASURING, AND MODELLING THE UNIVERSE, 1996, 94 : 253 - 258
  • [5] LARGE-SCALE INHOMOGENEITIES AND GALAXY STATISTICS
    SCHAEFFER, R
    SILK, J
    [J]. ASTRONOMY & ASTROPHYSICS, 1984, 130 (01) : 131 - 142
  • [6] Large-Scale Statistics for Cu Electromigration
    Hauschildt, M.
    Gall, M.
    Hernandez, R.
    [J]. STRESS-INDUCED PHENOMENA IN METALLIZATION, 2009, 1143 : 31 - +
  • [7] Large-Scale Pairwise Sequence Alignments on a Large-Scale GPU Cluster
    Savran, Ibrahim
    Gao, Yang
    Bakos, Jason D.
    [J]. IEEE DESIGN & TEST, 2014, 31 (01) : 51 - 61
  • [8] Searching for light relics with large-scale structure
    Baumann, Daniel
    Green, Daniel
    Wallisch, Benjamin
    [J]. JOURNAL OF COSMOLOGY AND ASTROPARTICLE PHYSICS, 2018, (08):
  • [9] Fast Large-Scale Multimedia Indexing and Searching
    Mohamed, Hisham
    Osipyan, Hasmik
    Marchand-Maillet, Stephane
    [J]. 2014 12TH INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING (CBMI), 2014,
  • [10] Recent advances in large-scale structure statistics
    Martinez, VJ
    Stein, ML
    [J]. STATISTICAL CHALLENGES IN MODERN ASTRONOMY II, 1997, : 153 - 171