Memory Efficient Minimum Substring Partitioning

Cited by: 32
Authors
Li, Yang [1]
Kamousi, Pegah [1]
Han, Fangqiu [1]
Yang, Shengqi [1]
Yan, Xifeng [1]
Suri, Subhash [1]
Affiliations
[1] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2013, Vol. 6, No. 3
Keywords
DOI
10.14778/2535569.2448951
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low cost can be assembled to reconstruct whole genomes. Unfortunately, the large memory footprint of existing de novo assembly algorithms makes it challenging to complete the assembly for higher eukaryotes such as mammals. In this work, we investigate the memory issue of constructing the de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes of memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually, and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substrings of length k), MSP achieves an astonishing compression ratio: the total size of partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short read database and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.
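The abstract does not spell out the partitioning rule, so the following is only a minimal sketch of one way to realize it. It assumes that each k-mer is keyed by its lexicographically smallest length-p substring, and that consecutive k-mers of a read sharing that key are stored together as one longer substring (called a "super k-mer" here). The parameter `p`, the function names, the CRC32-based partition choice, and the in-memory dictionary standing in for disk files are all illustrative assumptions. Reverse complements and the later merge into a de Bruijn graph are left out.

```python
import zlib
from collections import defaultdict


def min_substring(kmer: str, p: int) -> str:
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_read(read: str, k: int, p: int):
    """Yield (key, super_kmer) pairs for one read.

    Consecutive k-mers with the same minimum p-substring (the key) are
    merged into a single substring of the read instead of being stored
    one by one.
    """
    n_kmers = len(read) - k + 1
    if n_kmers <= 0:
        return  # read is shorter than k: no k-mers to emit
    start = 0
    current = min_substring(read[0:k], p)
    for i in range(1, n_kmers):
        key = min_substring(read[i:i + k], p)
        if key != current:
            # k-mers start..i-1 together span read[start : i - 1 + k]
            yield current, read[start:i - 1 + k]
            start, current = i, key
    yield current, read[start:n_kmers - 1 + k]


def partition_reads(reads, k: int, p: int, num_partitions: int):
    """Group super k-mers by a deterministic hash of their key.

    A real disk-based version would append to one file per partition;
    a dict of lists is used here to keep the sketch self-contained.
    """
    partitions = defaultdict(list)
    for read in reads:
        for key, super_kmer in partition_read(read, k, p):
            pid = zlib.crc32(key.encode("ascii")) % num_partitions
            partitions[pid].append(super_kmer)
    return partitions
```

For the read `ACGTACGTTG` with k = 5 and p = 2, the six k-mers collapse into two super k-mers, `ACGTACGTT` (key `AC`) and `CGTTG` (key `CG`): 14 characters in total instead of the 30 needed to store each k-mer separately. This is the kind of saving behind the reduction from Θ(kn) to Θ(n) that the abstract reports.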
Pages: 169-180
Page count: 12