Scalable I/O aggregation for asynchronous multi-level checkpointing

Cited by: 0
Authors
Gossman M.J. [1 ]
Nicolae B. [2 ]
Calhoun J.C. [1 ]
Affiliations
[1] Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, 29631, SC
[2] Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, 60439, IL
Funding
National Science Foundation (USA);
Keywords
Asynchronous I/O; Checkpoint-restart; Distributed I/O aggregation;
DOI
10.1016/j.future.2024.06.003
Abstract
Checkpointing distributed HPC applications is a common I/O pattern with many use cases: resilience, job management, reproducibility, revisiting previous intermediate results, etc. This pattern is difficult for a large number of processes that need to capture massive data sizes and write them persistently to shared storage (e.g., a parallel file system), which is subject to I/O bottlenecks due to limited I/O bandwidth under concurrency. In addition to I/O performance and scalability considerations, users often impose limits on the number of files or objects that can be used to capture the checkpoints. For example, users need to move checkpoints between HPC systems or parallel file systems, which is inefficient for a large number of files, or need to use the checkpoints in workflows that expect related objects to be grouped together. As a consequence, I/O aggregation is often used to reduce the number of files and objects persisted to shared storage such that it is much lower than the number of processes. However, I/O aggregation is challenging for two reasons: (1) if more than one process writes checkpointing data to the same file, this causes additional I/O contention that amplifies the I/O bottlenecks; (2) scalable state-of-the-art checkpointing techniques are asynchronous and rely on multi-level techniques to capture the data structures to local storage or memory, then flush them from there to shared storage in the background, which competes for resources (I/O, memory, network bandwidth) with the application that is running in the foreground. State-of-the-art approaches have addressed the problem of I/O aggregation for synchronous checkpointing but are insufficient for asynchronous checkpointing. To fill this gap, we contribute a novel I/O aggregation strategy that operates efficiently in the background to complement asynchronous checkpoint-restart (C/R). 
Specifically, we explore how to (1) develop a network of efficient, thread-safe I/O proxies that persist data via limited-sized write buffers, (2) prioritize remote (from non-proxy processes) and local data on I/O proxies to minimize write overhead, and (3) load-balance flushing on I/O proxies. We analyze trade-offs of developing such strategies and discuss the performance impact on large-scale micro-benchmarks, as well as a real HPC application (HACC). © 2024 Elsevier B.V.
Pages: 420-432
Page count: 12