An Efficient Motif Finding Algorithm for Large DNA Data Sets

Cited by: 0

Authors:

Yu, Qiang [1]

Huo, Hongwei [1]

Chen, Xiaoyang [1]

Guo, Haitao [1]

Vitter, Jeffrey Scott [2]

Huan, Jun [2]

Affiliations:

[1] Xidian Univ, Sch Comp Sci & Technol, Xian 710071, Peoples R China

[2] Univ Kansas, Informat & Telecommun Technol Ctr, Lawrence, KS 66047 USA

Source:

2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM) | 2014

Keywords:

Motif discovery; ChIP-seq; emerging substrings; MapReduce; DISCOVERY; SEARCH;

DOI:

Not available

Chinese Library Classification:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

Planted (l, d) motif discovery has been used successfully over the past decade to locate transcription factor binding sites in sets of dozens of promoter sequences. However, little work has addressed identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore make accurate identification in reasonable time a new challenge. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES identifies (l, d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than state-of-the-art (l, d) motif discovery algorithms such as F-motif and TraverStringsR; ii) MCES can identify motifs whose lengths are not known in advance, and achieves better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets.
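The two ideas named in the abstract can be illustrated with a small Python sketch. It is not the MCES implementation: the abstract does not give the paper's exact definitions, so the sketch assumes the usual reading of an (l, d) occurrence (a window of length l within Hamming distance d of the motif) and a common notion of an emerging substring (an l-mer whose support in the input sequences is much higher than in a control set). The function names, the control set, and the thresholds `min_support` and `min_growth` are all hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, seq, d):
    """True if some window of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def support(seqs, l):
    """Map each l-mer to the fraction of sequences containing it at least once."""
    counts = Counter()
    for s in seqs:
        # A set, so each sequence contributes at most one count per l-mer.
        counts.update({s[i:i + l] for i in range(len(s) - l + 1)})
    return {kmer: c / len(seqs) for kmer, c in counts.items()}


def emerging_substrings(foreground, background, l, min_support, min_growth):
    """l-mers frequent in foreground and over-represented relative to background.

    Assumed criterion (not taken from the paper): foreground support of at
    least min_support, and a foreground/background support ratio of at least
    min_growth (an l-mer absent from the background always qualifies).
    """
    fg = support(foreground, l)
    bg = support(background, l)
    result = []
    for kmer, f in fg.items():
        if f < min_support:
            continue
        b = bg.get(kmer, 0.0)
        if b == 0.0 or f / b >= min_growth:
            result.append(kmer)
    return sorted(result)


if __name__ == "__main__":
    fg = ["ACGTAC", "TTACGT", "GACGTT"]   # toy input sequences
    bg = ["TTTTTT", "GGGGAC", "CCCCCC"]   # toy control sequences
    print(emerging_substrings(fg, bg, l=4, min_support=0.6, min_growth=2.0))
    print(occurs_within("ACGT", "TTAGGTCC", d=1))
```

In a MapReduce setting, the per-sequence l-mer extraction in `support` would correspond to a map step and the summation of counts to a reduce step; this mapping is an interpretation of the abstract, not a description of the authors' design.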


Pages: 6
