Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems

Cited by: 7
Authors
Paul, Arnab K. [1]
Karimi, Ahmad Maroof [1]
Wang, Feiyi [1]
Affiliations
[1] Oak Ridge Natl Lab, POB 2009, Oak Ridge, TN 37831 USA
Keywords
Burst Buffer; Darshan; High Performance Computing; HPC Storage; IBM Spectrum Scale; I/O Characterization; Machine Learning; Parallel File System
DOI
10.1109/MASCOTS53633.2021.9614303
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
High performance computing (HPC) is no longer limited to traditional workloads such as simulation and modeling. With the growing popularity of machine learning (ML) and deep learning (DL) technologies, an increasing number of HPC users are incorporating ML methods into their workflows and scientific discovery processes across a wide spectrum of science domains such as biology, earth science, and physics. This gives rise to a more diverse set of I/O patterns than the traditional checkpoint/restart-based HPC I/O behavior. The I/O characteristics of such ML workloads have not been studied extensively on large-scale leadership HPC systems. This paper aims to fill that gap with an in-depth analysis of the I/O behavior of ML workloads using Darshan, an I/O characterization tool designed for lightweight tracing and profiling. We study the Darshan logs of more than 23,000 HPC ML I/O jobs run over a period of one year on Summit, the second-fastest supercomputer in the world at the time of the study. This paper provides a systematic I/O characterization of ML I/O jobs running on a leadership-scale supercomputer to understand how I/O behavior differs across science domains and workload scales, and to analyze how ML I/O workloads use the parallel file system and burst buffer.
Pages: 198-205
Page count: 8