A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

Cited by: 17
Authors
Chang, Yu-Jung [1]
Chen, Chien-Chih [1, 2]
Chen, Chuen-Liang [2]
Ho, Jan-Ming [1]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
[2] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei 10764, Taiwan
Source
BMC GENOMICS | 2012 / Vol. 13
Keywords
Sequencing Error; Coverage Depth; Graph Construction; Position Weight Matrix; MapReduce Framework
DOI
10.1186/1471-2164-13-S7-S28
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate reads longer than 150 bp and up to 600 Gbp of data per run. These reads are longer, and the sequencing depth greater, than those of previous technologies. Two major challenges arise when using this technology for de novo genome assembly. First, physical memory may be insufficient to hold the data structures of the assembly algorithm, even on systems with high-end multicore processors. Second, the graph-theoretical model used to capture intersection relationships among the reads may contain structural defects that existing assembly algorithms do not handle well.
Results: We developed CloudBrush, a distributed genome assembler based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of a given read for sequencing errors and adjusts the edges of the string graph where necessary. We evaluated CloudBrush against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and a low misassembly rate for misjoins and indels > 5 bp. In addition, we introduced two measures, precision and recall, to assess how faithfully contigs align to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm on simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
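For readers unfamiliar with the N50 statistic mentioned in the abstract, the following is a minimal Python sketch of the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the sample lengths are illustrative assumptions, not taken from CloudBrush or GAGE, and benchmark suites may use variants (for example, a corrected N50 or one normalized by genome size).

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    # Walk the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches at least half of the total assembly length.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths in bp, for illustration only.
example = [80, 70, 50, 40, 30, 20]
result = n50(example)
```

Here the total length is 290 bp, and the two longest contigs (80 + 70 = 150 bp) already cover more than half of it, so `result` is 70.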
Pages: 17