Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications

Cited by: 26
Authors
Patel, Tirthak [1]
Liu, Zhengchun [2]
Kettimuthu, Raj [2]
Rich, Paul [2]
Allcock, William [2]
Tiwari, Devesh [1]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Argonne Natl Lab, Argonne, IL 60439 USA
Keywords
High Performance Computing; Large-Scale Systems; Monitoring; Queueing Analysis; Statistical Analysis; HPC; BEHAVIOR; CLOUD;
DOI
10.1109/SC41405.2020.00088
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
HPC workload analysis and resource consumption characteristics are key to driving better operational practices, informing system procurement decisions, and designing effective resource management techniques. Unfortunately, the HPC community does not have easy access to long-term, introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. It is one of the largest studies of its kind, covering trends and characteristics for over three billion compute hours and 750 thousand jobs, and spanning a decade. We confirm several pieces of long-held conventional wisdom and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirements of future jobs with high accuracy, using features available prior to job submission and without requiring any application-specific tracing or application-intrusive instrumentation.
Pages: 17