Analysis and prediction of performance variability in large-scale computing systems

Authors
Beni, Majid Salimi [1 ]
Hunold, Sascha [2 ]
Cosenza, Biagio [1 ]
Affiliations
[1] Univ Salerno, Dept Comp Sci, Salerno, Italy
[2] TU Wien, Fac Informat, Vienna, Austria
Source
JOURNAL OF SUPERCOMPUTING | 2024 / Vol. 80 / No. 10
Keywords
High performance interconnects; Performance variability; MPI; Dragonfly plus topology; Performance predictability
DOI
10.1007/s11227-024-06040-w
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still face performance variability, caused mainly by the network, that affects both system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with a Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating the network background traffic generated by other users, which can be used to accurately estimate the performance of one's own job. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.
Pages: 14978 - 15005
Page count: 28