Statistics of large-scale sequence searching

Cited by: 14
Authors
Spang, R [1]
Vingron, M [1]
Affiliation
[1] Deutsch Krebsforschungszentrum, Immunochem Abt, D-69120 Heidelberg, Germany
DOI
10.1093/bioinformatics/14.3.279
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Database search programs such as FASTA, BLAST or a rigorous Smith-Waterman algorithm produce lists of database entries, which are assumed to be related to the query. The computation of statistical significance of similarity scores is well established for single pairs of sequences and using purely random models. However, the multi-trial context of a database search poses new problems. The credibility of a certain score obtained in a database search decreases with the amount of data that is compared. To improve p-value computation for database search experiments, statistical properties of the databases, such as the distribution of sequence length and effects induced by frequently repeated sequence patterns, need to be taken into account.
Results: We investigated the SWISS-PROT protein database Release 31.0 running extensive simulations of database searches. A discrepancy is observed between the theoretical predictions and the empirical distribution. To correct for this, we evaluate the statistical significance of scores in the context of a database search by a contrasting semi-random model. This model enhances purely random models by one additional parameter reflecting individual statistical properties of real databases. We call this parameter the effective size of the database.
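The abstract does not give formulas, so the following is only a minimal sketch of the multi-trial idea it describes, not the authors' method. It assumes the standard extreme-value (Gumbel) approximation for the p-value of a single pairwise comparison, and it assumes that a database search can be treated as a number of independent comparisons equal to an "effective size". The function names and the values of `lam`, `K`, the sequence lengths and the effective sizes are illustrative placeholders, not figures from the paper.

```python
import math


def pairwise_p_value(score, lam, K, m, n):
    """P(S >= score) for one pair of random sequences of lengths m and n,
    under the extreme-value approximation (assumed here, not taken from the abstract)."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # 1 - exp(-x), computed stably for small x
    return -math.expm1(-expected_hits)


def database_p_value(p_single, effective_size):
    """Probability that at least one of `effective_size` independent comparisons
    reaches the score: 1 - (1 - p)^N. Treating the search as N independent
    trials is an assumption of this sketch."""
    if p_single >= 1.0:
        return 1.0
    # 1 - exp(N * log(1 - p)), computed stably for small p
    return -math.expm1(effective_size * math.log1p(-p_single))


if __name__ == "__main__":
    # Illustrative numbers only
    p = pairwise_p_value(score=60, lam=0.267, K=0.041, m=300, n=300)
    for n_eff in (1, 1_000, 40_000):
        print(n_eff, database_p_value(p, n_eff))
```

The same pairwise score becomes less convincing as the effective size grows, which is the multi-trial effect the abstract points to: the larger the number of comparisons, the more likely a given score arises by chance.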
Pages: 279-284
Page count: 6