Exact distribution of word counts in shuffled sequences

Cited by: 2
Author
Rodland, EA [1]
Affiliation
[1] Univ Oslo, Rikshosp, Radiumhosp HF, Ctr Mol Biol & Neurosci, Inst Med Microbiol, N-0027 Oslo, Norway
Keywords
sequence shuffling; Markov chain; word count; exact distribution; hypergeometric distribution; generalised hypergeometric series; moment generating function; genome sequence analysis; directed graph; Euler path;
DOI
10.1239/aap/1143936143
Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification code
020208 ; 070103 ; 0714 ;
Abstract
In DNA sequences, specific words may take on biological functions as marker or signalling sequences. These may often be identified by frequent-word analyses as being particularly abundant. Accurate statistics is needed to assess the statistical significance of these word frequencies. The set of shuffled sequences - letter sequences having the same k-word composition, for some choice of k, as the sequence being analysed - is considered the most appropriate sample space for analysing word counts. However, little is known about these word counts. Here we present exact formulae for word counts in shuffled sequences.
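To make the sample space in the abstract concrete, the following is a minimal brute-force sketch, not the paper's exact formulae. It enumerates every letter sequence of the same length whose overlapping k-word composition matches the input, then tabulates how often a chosen word occurs across that set. The function names are hypothetical, and two choices are assumptions of this sketch rather than statements from the source: occurrences are counted with overlap, and the first and last letters are not additionally fixed. The enumeration is exponential in sequence length, so it is only usable on very short sequences.

```python
from collections import Counter
from itertools import product


def kword_counts(seq, k):
    """Count overlapping k-letter words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffled_sequences(seq, k):
    """All sequences over seq's alphabet, of the same length, that share
    seq's k-word composition (brute force, tiny inputs only)."""
    target = kword_counts(seq, k)
    alphabet = sorted(set(seq))
    candidates = ("".join(p) for p in product(alphabet, repeat=len(seq)))
    return [s for s in candidates if kword_counts(s, k) == target]


def word_count_distribution(seq, k, word):
    """Map each observed count of `word` to the number of shuffled
    sequences in which `word` occurs that many times."""
    shuffled = shuffled_sequences(seq, k)
    tally = Counter(kword_counts(s, len(word))[word] for s in shuffled)
    return dict(tally)


if __name__ == "__main__":
    # k = 1 preserves only letter counts; count the word "ab" in shuffles of "aabb".
    print(word_count_distribution("aabb", 1, "ab"))
    # k = 2 preserves dinucleotide-style counts.
    print(shuffled_sequences("aba", 2))
```

Dividing each tally by the size of the shuffled set gives the exact probabilities under a uniform distribution on that set, which is the quantity a closed-form result would replace for sequences of realistic length.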
Pages: 116 - 133
Number of pages: 18
Related papers
50 in total
  • [2] The Distribution of Short Word Match Counts between Markovian Sequences
    Burden, Conrad J.
    Leopardi, Paul
    Foret, Sylvain
    [J]. BIOINFORMATICS 2013: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON BIOINFORMATICS MODELS, METHODS AND ALGORITHMS, 2013, : 25 - 33
  • [3] MODERATE DEVIATIONS FOR WORD COUNTS IN BIOLOGICAL SEQUENCES
    Behrens, Sarah
    Loewe, Matthias
    [J]. JOURNAL OF APPLIED PROBABILITY, 2009, 46 (04) : 1020 - 1037
  • [4] Omnibus Sequences, Coupon Collection, and Missing Word Counts
    Abraham, Sunil
    Brockman, Greg
    Sapp, Stephanie
    Godbole, Anant P.
    [J]. METHODOLOGY AND COMPUTING IN APPLIED PROBABILITY, 2013, 15 (02) : 363 - 378
  • [5] Word Match Counts Between Markovian Biological Sequences
    Burden, Conrad
    Leopardi, Paul
    Foret, Sylvain
    [J]. BIOMEDICAL ENGINEERING SYSTEMS AND TECHNOLOGIES (BIOSTEC 2013), 2014, 452 : 147 - 161
  • [6] Omnibus Sequences, Coupon Collection, and Missing Word Counts
    Sunil Abraham
    Greg Brockman
    Stephanie Sapp
    Anant P. Godbole
    [J]. Methodology and Computing in Applied Probability, 2013, 15 : 363 - 378
  • [7] An overview on the distribution of word counts in Markov chains
    Schbath, S
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2000, 7 (1-2) : 193 - 201
  • [8] EXACT FORMULA FOR DISTRIBUTION OF SEQUENCES {ωn}
    Shahverdian, Ashot
    Kilicman, Adem
    [J]. INTERNATIONAL JOURNAL OF NUMBER THEORY, 2013, 9 (01) : 179 - 187
  • [9] SORTING SHUFFLED MONOTONE SEQUENCES
    LEVCOPOULOS, C
    PETERSSON, O
    [J]. LECTURE NOTES IN COMPUTER SCIENCE, 1990, 447 : 181 - 191
  • [10] COUNTS OF LONG ALIGNED WORD MATCHES AMONG RANDOM LETTER SEQUENCES
    KARLIN, S
    OST, F
    [J]. ADVANCES IN APPLIED PROBABILITY, 1987, 19 (02) : 293 - 351