An Efficient Motif Finding Algorithm for Large DNA Data Sets

Citations: 0
Authors
Yu, Qiang [1]
Huo, Hongwei [1]
Chen, Xiaoyang [1]
Guo, Haitao [1]
Vitter, Jeffrey Scott [2]
Huan, Jun [2]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Xian 710071, Peoples R China
[2] Univ Kansas, Informat & Telecommun Technol Ctr, Lawrence, KS 66047 USA
Keywords
Motif discovery; ChIP-seq; emerging substrings; MapReduce; DISCOVERY; SEARCH;
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
Over the past decade, planted (l, d) motif discovery has been used successfully to locate transcription factor binding sites in sets of dozens of promoter sequences. However, little work has addressed identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore pose a new challenge: achieving good identification in reasonable time. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES identifies (l, d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than state-of-the-art (l, d) motif discovery algorithms such as F-motif and TraverStringsR; ii) MCES can identify motifs whose lengths are not known in advance, and achieves better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets.
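To make the problem statement in the abstract concrete, here is a minimal Python sketch of the standard planted (l, d) motif definition: a string of length l is a motif if it occurs, with at most d mismatches, in the input sequences. This is a brute-force illustration only, not the MCES algorithm, whose details the abstract does not give. The function names and the quorum parameter `q` (the fraction of sequences that must contain an occurrence) are assumptions introduced for this example.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, seq: str, d: int) -> bool:
    """True if some l-mer of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def brute_force_motifs(seqs: list[str], l: int, d: int, q: float = 1.0) -> list[str]:
    """Enumerate all 4^l DNA strings of length l and keep those that occur
    (within d mismatches) in at least a fraction q of the sequences.

    q is a hypothetical quorum parameter; q = 1.0 requires every sequence
    to contain an occurrence.
    """
    need = q * len(seqs)
    hits = []
    for cand in map("".join, product("ACGT", repeat=l)):
        if sum(occurs_within(cand, s, d) for s in seqs) >= need:
            hits.append(cand)
    return hits


if __name__ == "__main__":
    # Toy input: each sequence holds a variant of ACGT with at most 1 mismatch.
    toy = ["TTACGTAA", "GGACCTCC", "ACGAGGGG"]
    print("ACGT" in brute_force_motifs(toy, l=4, d=1))
```

The enumeration costs on the order of 4^l candidates times the total input length, which is why exhaustive search does not scale to the thousands of sequences found in ChIP-seq data and why faster strategies are needed.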
Pages: 6