An Efficient Motif Finding Algorithm for Large DNA Data Sets

Cited by: 0

Authors:

Yu, Qiang [1]

Huo, Hongwei [1]

Chen, Xiaoyang [1]

Guo, Haitao [1]

Vitter, Jeffrey Scott [2]

Huan, Jun [2]

Affiliations:

[1] Xidian Univ, Sch Comp Sci & Technol, Xian 710071, Peoples R China

[2] Univ Kansas, Informat & Telecommun Technol Ctr, Lawrence, KS 66047 USA

Source:

2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM) | 2014

Keywords:

Motif discovery; ChIP-seq; emerging substrings; MapReduce; DISCOVERY; SEARCH;

DOI:

Not available

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Planted (l, d) motif discovery has been used successfully over the past decade to locate transcription factor binding sites in sets of dozens of promoter sequences. However, not enough work has been done on identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore make accurate identification within a reasonable time a new challenge. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy for mining emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES identifies (l, d) motifs efficiently and effectively in thousands to millions of input sequences and runs faster than state-of-the-art (l, d) motif discovery algorithms such as F-motif and TraverStringsR; and ii) MCES can identify motifs whose lengths are not known in advance, with better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets.
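As a rough illustration of the planted (l, d) motif model named in the abstract, the sketch below checks whether a candidate l-mer occurs, with at most d mismatches, in a required fraction of the input sequences. It follows the standard definition of the problem and is not the MCES algorithm; the function names and the `quorum` parameter are hypothetical, introduced here only for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(seq, motif, d):
    """True if some window of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(
        hamming(seq[i:i + l], motif) <= d
        for i in range(len(seq) - l + 1)
    )


def is_planted_motif(seqs, motif, d, quorum=1.0):
    """True if motif has an instance (<= d mismatches) in at least
    quorum * len(seqs) sequences. quorum is an illustrative assumption:
    1.0 means every sequence must contain an instance."""
    hits = sum(occurs_within(s, motif, d) for s in seqs)
    return hits >= quorum * len(seqs)
```

With three short sequences such as "ACGTACGT", "TTACCTAA", and "GGGACGAA", the candidate "ACGT" is a (4, 1) motif under this check but not a (4, 0) motif, because only the first sequence contains it exactly. A brute-force scan like this is quadratic in practice and is meant only to make the definition concrete, not to suggest how large ChIP-seq inputs would be processed.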


Pages: 6

