An Efficient Motif Finding Algorithm for Large DNA Data Sets

Cited by: 0
Authors
Yu, Qiang [1]
Huo, Hongwei [1]
Chen, Xiaoyang [1]
Guo, Haitao [1]
Vitter, Jeffrey Scott [2]
Huan, Jun [2]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Xian 710071, Peoples R China
[2] Univ Kansas, Informat & Telecommun Technol Ctr, Lawrence, KS 66047 USA
Source
2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM) | 2014
Keywords
Motif discovery; ChIP-seq; emerging substrings; MapReduce; DISCOVERY; SEARCH;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
Planted (l, d) motif discovery has been used successfully over the past decade to locate transcription factor binding sites in sets of dozens of promoter sequences. However, little work has addressed identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore pose a new challenge: achieving accurate identification in reasonable time. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy that mines emerging substrings in a distributed fashion. Experimental results on simulated data show that i) MCES identifies (l, d) motifs efficiently and effectively in thousands to millions of input sequences and runs faster than state-of-the-art (l, d) motif discovery algorithms such as F-motif and TraverStringsR; and ii) MCES can identify motifs whose lengths are not known in advance, with better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets.
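As a rough illustration of the two ideas named in the abstract, the Python sketch below checks the standard planted (l, d) condition (a length-l string occurs in every input sequence with at most d mismatches) and counts substrings in a map/reduce style. It is not the MCES algorithm. The abstract does not define "emerging substrings", so the sketch assumes they are k-mers whose per-sequence support in a foreground set exceeds their support in a background set by a ratio threshold. All function names, the `min_ratio` parameter, and the add-one smoothing are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import chain


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, seq, d):
    """True if some window of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def is_planted_motif(motif, seqs, d):
    """Standard (l, d) condition: motif occurs in every sequence with <= d mismatches."""
    return all(occurs(motif, s, d) for s in seqs)


def map_kmers(seq, k):
    """'Map' step (assumed): emit each distinct k-mer of one sequence once."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reduce_counts(mapped):
    """'Reduce' step (assumed): count how many sequences contain each k-mer."""
    return Counter(chain.from_iterable(mapped))


def emerging_substrings(fg, bg, k, min_ratio=2.0):
    """Assumed definition: k-mers whose foreground support is at least
    min_ratio times their add-one-smoothed background support."""
    fg_counts = reduce_counts(map_kmers(s, k) for s in fg)
    bg_counts = reduce_counts(map_kmers(s, k) for s in bg)
    result = {}
    for kmer, count in fg_counts.items():
        fg_support = count / len(fg)
        bg_support = (bg_counts[kmer] + 1) / (len(bg) + 1)  # smoothing avoids division by zero
        ratio = fg_support / bg_support
        if ratio >= min_ratio:
            result[kmer] = ratio
    return result
```

For example, `is_planted_motif("ACGT", ["TTACGTAA", "GGACCTCC", "ACGAGGGG"], 1)` holds because each sequence contains a window within one mismatch of `ACGT`. In a real MapReduce deployment the map and reduce functions would run on separate workers; here they are ordinary function calls in one process.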
Pages: 6