Discovery of conserved sequence patterns using a stochastic dictionary model

Cited by: 40
Authors
Gupta, M [1]
Liu, JS [1]
Affiliations
[1] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Keywords
data augmentation; gene regulation; missing data; transcription factor binding site
DOI
10.1198/016214503388619094
Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missing-data approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.
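The recursion mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy dictionary in which every word (a single nucleotide or a motif) is described by a position weight matrix and a usage probability, and it computes the likelihood of a sequence by summing over all ways of segmenting it into words. The names `word_prob`, `sequence_likelihood`, `words`, and `usage` are hypothetical and chosen for illustration only.

```python
import numpy as np

ALPHABET = "ACGT"


def word_prob(segment, pwm):
    """Probability that a stochastic word with this PWM emits `segment`.

    `pwm` has one row per position and one column per letter of ALPHABET.
    """
    return float(np.prod([pwm[j][ALPHABET.index(ch)]
                          for j, ch in enumerate(segment)]))


def sequence_likelihood(seq, words, usage):
    """Sum over all segmentations of `seq` into dictionary words.

    Forward recursion (an assumed, simplified form):
        phi[0] = 1
        phi[i] = sum_k usage[k] * P(seq[i-w_k:i] | word k) * phi[i-w_k]
    where w_k is the width of word k.
    """
    n = len(seq)
    phi = np.zeros(n + 1)
    phi[0] = 1.0
    for i in range(1, n + 1):
        for pwm, rho in zip(words, usage):
            w = len(pwm)
            if w <= i:
                phi[i] += rho * word_prob(seq[i - w:i], pwm) * phi[i - w]
    return float(phi[n])


# Toy dictionary: four single-letter words plus one width-2 motif "AC".
# One-hot PWMs are used here for simplicity; real motifs would be stochastic.
letters = [np.eye(4)[[i]] for i in range(4)]
motif_ac = np.eye(4)[[0, 1]]
words = letters + [motif_ac]
usage = [0.2, 0.2, 0.2, 0.2, 0.2]

likelihood = sequence_likelihood("AC", words, usage)
```

In this toy case the sequence "AC" can be read either as two single letters or as one motif word, and the recursion adds both contributions in time linear in the sequence length rather than enumerating segmentations explicitly.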
Pages: 55-66
Page count: 12