Analysis and prediction of performance variability in large-scale computing systems

Cited by: 0
Authors
Beni, Majid Salimi [1 ]
Hunold, Sascha [2 ]
Cosenza, Biagio [1 ]
Affiliations
[1] Univ Salerno, Dept Comp Sci, Salerno, Italy
[2] TU Wien, Fac Informat, Vienna, Austria
Source
JOURNAL OF SUPERCOMPUTING | 2024 / Vol. 80 / Issue 10
Keywords
High performance interconnects; Performance variability; MPI; Dragonfly plus topology; Performance predictability
DOI
10.1007/s11227-024-06040-w
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still face performance variability, caused mainly by the network, that affects both system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with a Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating the network background traffic generated by other users, which can be used to accurately estimate the performance of a user's own job. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.
Pages: 14978 - 15005
Number of pages: 28
Related papers
50 in total
  • [1] Hybrid performance modeling and prediction of large-scale computing systems
    Pllana, Sabri
    Benkner, Siegfried
    Xhafa, Fatos
    Barolli, Leonard
    CISIS 2008: THE SECOND INTERNATIONAL CONFERENCE ON COMPLEX, INTELLIGENT AND SOFTWARE INTENSIVE SYSTEMS, PROCEEDINGS, 2008, : 132 - +
  • [2] A novel approach for hybrid performance modelling and prediction of large-scale computing systems
    Pllana, Sabri
    Benkner, Siegfried
    Xhafa, Fatos
    Barolli, Leonard
    INTERNATIONAL JOURNAL OF GRID AND UTILITY COMPUTING, 2009, 1 (04) : 316 - 327
  • [3] Performance measurement and analysis of large-scale parallel applications on leadership computing systems
    Wylie, Brian J. N.
    Geimer, Markus
    Wolf, Felix
    SCIENTIFIC PROGRAMMING, 2008, 16 (2-3) : 167 - 181
  • [4] Performance Visualization for Large-Scale Computing Systems: A Literature Review
    Gao, Qin
    Zhang, Xuhui
    Rau, Pei-Luen Patrick
    Maciejewski, Anthony A.
    Siegel, Howard Jay
    HUMAN-COMPUTER INTERACTION: DESIGN AND DEVELOPMENT APPROACHES, PT I, 2011, 6761 : 450 - 460
  • [5] Large-Scale Optical Reservoir Computing for Spatiotemporal Chaotic Systems Prediction
    Rafayelyan, Mushegh
    Dong, Jonathan
    Tan, Yongqi
    Krzakala, Florent
    Gigan, Sylvain
    PHYSICAL REVIEW X, 2020, 10 (04)
  • [6] SYSTEMS FOR VERY LARGE-SCALE COMPUTING
    Jerger, Natalie Enright
    Lipasti, Mikko
    IEEE MICRO, 2011, 31 (03) : 4 - 6
  • [7] Large-scale neuromorphic computing systems
    Furber, Steve
    JOURNAL OF NEURAL ENGINEERING, 2016, 13 (05)
  • [8] Intelligent computing in large-scale systems
    Kolodziej, Joanna
    Gonzalez-Velez, Horacio
    Xhafa, Fatos
    Barolli, Leonard
    KNOWLEDGE ENGINEERING REVIEW, 2015, 30 (02) : 137 - 139
  • [9] Performance Analysis of Work Stealing in Large-scale Multithreaded Computing
    Sonenberg, Nikki
    Kielanski, Grzegorz
    Van Houdt, Benny
    ACM TRANSACTIONS ON MODELING AND PERFORMANCE EVALUATION OF COMPUTING SYSTEMS, 2021, 6 (02)
  • [10] Integrated Analysis of Performance and Resources in Large-Scale Quantum Computing
    Hwang, Yongsoo
    Kim, Taewan
    Baek, Chungheon
    Choi, Byung-Soo
    PHYSICAL REVIEW APPLIED, 2020, 13 (05)