Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

Cited by: 1
Authors
Tan, Nigel [1]
Luettgau, Jakob [1]
Marquez, Jack [1]
Teranishi, Keita [2]
Morales, Nicolas [3]
Bhowmick, Sanjukta [4]
Cappello, Franck [5]
Taufer, Michela [1]
Nicolae, Bogdan [5]
Affiliations
[1] Univ Tennessee Knoxville, Knoxville, TN 37996 USA
[2] Oak Ridge Natl Lab, Oak Ridge, TN USA
[3] Sandia Natl Labs, POB 5800, Albuquerque, NM 87185 USA
[4] Univ North Texas, Denton, TX USA
[5] Argonne Natl Lab, Lemont, IL USA
Funding
U.S. National Science Foundation
Keywords
Checkpointing; data versioning; incremental storage; deduplication; GPU parallelization;
DOI
10.1145/3605573.3605639
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and results in increased storage space utilization, especially for workflows that need to capture the evolution of data structures with high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to another can dramatically reduce the checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and assemble the differences in a compact fashion that does not result in excessive metadata. State-of-the-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: they are slow at detecting the differences, generate a large amount of metadata to keep track of the differences, and ignore crucial spatiotemporal checkpoint data redundancy. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method that exploits GPUs' high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques.
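The general idea of Merkle tree-based change detection mentioned in the abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' GPU implementation and does not model the spatiotemporal de-duplication the paper describes. It assumes fixed-size chunks, SHA-256 hashes, and two checkpoints of equal size; the names CHUNK, build_tree, changed_chunks, incremental_checkpoint, and restore are hypothetical and chosen only for illustration.

```python
import hashlib

CHUNK = 4  # hypothetical chunk size in bytes, kept tiny for illustration


def build_tree(data: bytes, chunk: int = CHUNK):
    """Return a list of levels: levels[0] holds leaf hashes, levels[-1] the root."""
    leaves = [hashlib.sha256(data[i:i + chunk]).digest()
              for i in range(0, len(data), chunk)]
    if not leaves:  # empty buffer: keep a single leaf so a root exists
        leaves = [hashlib.sha256(b"").digest()]
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Hash pairs of children; an unpaired last node is hashed alone.
        levels.append([hashlib.sha256(b"".join(prev[i:i + 2])).digest()
                       for i in range(0, len(prev), 2)])
    return levels


def changed_chunks(old, new):
    """Descend from the root, visiting only subtrees whose hashes differ."""
    if len(old[0]) != len(new[0]):
        raise ValueError("sketch assumes both checkpoints have the same size")
    stack, changed = [(len(new) - 1, 0)], []
    while stack:
        level, idx = stack.pop()
        if old[level][idx] == new[level][idx]:
            continue  # identical subtree: skip every chunk beneath it
        if level == 0:
            changed.append(idx)
        else:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(new[level - 1]):
                    stack.append((level - 1, child))
    return sorted(changed)


def incremental_checkpoint(prev: bytes, curr: bytes, chunk: int = CHUNK):
    """Keep only the chunks of curr that differ from prev, keyed by chunk index."""
    idxs = changed_chunks(build_tree(prev, chunk), build_tree(curr, chunk))
    return {i: curr[i * chunk:(i + 1) * chunk] for i in idxs}


def restore(prev: bytes, diff: dict, chunk: int = CHUNK) -> bytes:
    """Rebuild the newer checkpoint by patching the stored chunks onto prev."""
    out = bytearray(prev)
    for i, payload in diff.items():
        out[i * chunk:i * chunk + len(payload)] = payload
    return bytes(out)
```

With sparse updates, most subtrees compare equal near the root, so the traversal touches only a small part of the tree and the stored difference is limited to the modified chunks plus their indices.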
Pages: 665-674
Page count: 10