Memory Efficient Minimum Substring Partitioning

Cited by: 32
Authors
Li, Yang [1]
Kamousi, Pegah [1]
Han, Fangqiu [1]
Yang, Shengqi [1]
Yan, Xifeng [1]
Suri, Subhash [1]
Affiliations
[1] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2013 / Vol. 6 / Issue 03
Keywords
DOI
10.14778/2535569.2448951
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low cost can be assembled to reconstruct whole genomes. Unfortunately, the large memory footprint of existing de novo assembly algorithms makes it challenging to complete the assembly for higher eukaryotes such as mammals. In this work, we investigate the memory issue of constructing the de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes of memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually, and later merged with the others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substrings of length k), MSP achieves an astonishing compression ratio: the total size of the partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short read database and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs on a commodity computer for any large-volume sequence dataset.
Pages: 169-180
Page count: 12
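
The partitioning idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out how partitions are keyed, so the details here are assumptions made for illustration: each k-mer is keyed by its lexicographically smallest length-p substring (the parameter `p`, the function names, and the CRC32-based partition index are all hypothetical), partitions are kept in an in-memory dictionary rather than on disk, and reverse complements are ignored. Consecutive k-mers that share a key are merged into one segment, which is why overlapping k-mers need not be stored separately.

```python
from collections import defaultdict
import zlib


def min_substring(kmer: str, p: int) -> str:
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_read(read: str, k: int, p: int):
    """Yield (key, segment) pairs for one read.

    Consecutive k-mers that share the same minimum p-substring (the key)
    are merged into a single segment, so their overlapping characters
    are stored only once.
    """
    if len(read) < k:
        return
    start = 0
    current = min_substring(read[0:k], p)
    for i in range(1, len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if key != current:
            # k-mers start..i-1 share `current`; together they span
            # read[start : i - 1 + k].
            yield current, read[start:i - 1 + k]
            start, current = i, key
    # The final run of k-mers extends to the end of the read.
    yield current, read[start:]


def partition_reads(reads, k: int, p: int, num_partitions: int):
    """Distribute merged segments into partitions by a hash of their key.

    Assumption: a real implementation would append to one file per
    partition; a dict of lists stands in for that here.
    """
    partitions = defaultdict(list)
    for read in reads:
        for key, segment in partition_read(read, k, p):
            index = zlib.crc32(key.encode()) % num_partitions
            partitions[index].append(segment)
    return partitions
```

Because every k-mer with a given key lands in the same partition, each partition can in principle be loaded and processed on its own. Calling `partition_reads(["ACGTTGCATGCAAC"], k=5, p=2, num_partitions=4)` returns a mapping from partition index to the merged segments assigned to it.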