The Distribution of Word Matches Between Markovian Sequences with Periodic Boundary Conditions

Cited by: 4
Authors
Burden, Conrad J. [1]
Leopardi, Paul [1]
Foret, Sylvain [2]
Affiliations
[1] Australian Natl Univ, Inst Math Sci, GPO Box 4, Canberra, ACT 0200, Australia
[2] Australian Natl Univ, Res Sch Biol, Canberra, ACT 0200, Australia
Keywords
Markov chains; sequence analysis; statistical models; ASYMPTOTIC-BEHAVIOR; SIMILARITY;
DOI
10.1089/cmb.2012.0277
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of independent and identically distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
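As a rough illustration of the statistic described in the abstract, the sketch below counts exact k-word matches between two sequences, with an optional wrap-around so that each sequence is treated as circular. The function names `kmer_counts` and `d2`, the `periodic` flag, and the toy sequences are assumptions made for this example; they are not taken from the paper, and the paper's exact definition of the periodic boundary conditions may differ in detail.

```python
from collections import Counter


def kmer_counts(seq, k, periodic=False):
    """Count the k-letter words in seq (hypothetical helper).

    With periodic=True the sequence is treated as circular, so words
    that wrap from the end back to the start are counted as well and
    the number of words equals len(seq). Assumes len(seq) >= k.
    """
    ext = seq + seq[:k - 1] if periodic else seq
    return Counter(ext[i:i + k] for i in range(len(ext) - k + 1))


def d2(a, b, k, periodic=False):
    """Number of exact k-word matches between a and b.

    Every pair (position i in a, position j in b) whose words are
    identical contributes one match, so the total is the sum over
    words of the product of the two occurrence counts.
    """
    counts_a = kmer_counts(a, k, periodic)
    counts_b = kmer_counts(b, k, periodic)
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    a, b = "ACGTA", "TAACG"
    print(d2(a, b, 2))                 # 3 (shared words: AC, CG, TA)
    print(d2(a, b, 2, periodic=True))  # 5 (wrap-around adds AA and GT)
```

Counting words once per sequence and multiplying the counts avoids the explicit double loop over all position pairs, which matters for sequences of genomic length.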
Pages: 41-63
Number of pages: 23