Analysis and prediction of performance variability in large-scale computing systems

Cited by: 0

Authors:

Beni, Majid Salimi [1]

Hunold, Sascha [2]

Cosenza, Biagio [1]

Affiliations:

[1] Univ Salerno, Dept Comp Sci, Salerno, Italy

[2] TU Wien, Fac Informat, Vienna, Austria

Source:

JOURNAL OF SUPERCOMPUTING | 2024, Vol. 80, No. 10

Keywords:

High performance interconnects; Performance variability; MPI; Dragonfly plus topology; Performance predictability;

DOI:

10.1007/s11227-024-06040-w

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Subject classification code:

0812 ;

Abstract:

The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still face performance variability mainly caused by the network that affects system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating network background traffic generated by other users, which can be used to estimate the performance of our job accurately. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.


Pages: 14978-15005

Page count: 28
