Automating Workload Analysis of Large-Scale Supercomputer Systems

Cited by: 1

Authors:

Shvets, P. A. [1, 2]

Voevodin, V. V. [1, 2]

Zhumatiy, S. A. [1]

Affiliations:

[1] Lomonosov Moscow State Univ, Moscow 119991, Russia

[2] Moscow Ctr Fundamental & Appl Math, Moscow 119991, Russia

Source:

LOBACHEVSKII JOURNAL OF MATHEMATICS | 2021, Vol. 42, No. 7

Funding:

Russian Foundation for Basic Research;

Keywords:

supercomputing; high-performance computing; workload analysis; efficiency; data analysis; monitoring data; system software;

DOI:

10.1134/S1995080221070210

Chinese Library Classification:

O1 [Mathematics];

Subject classification code:

0701 ; 070101 ;

Abstract:

The architecture of modern supercomputers is extremely complex, so it is exceedingly difficult to monitor and maintain the efficiency of their functioning. Even when the necessary data on the operation of all important supercomputer components can be collected, how does one avoid drowning in this "sea of information" and missing the onset of a critical situation? This requires automating the workload analysis process. One possible solution is to create a set of rules that automatically detect certain critical situations, or cases of a significant decrease in the efficiency of supercomputer functioning, and notify supercomputer administrators about them. Such an approach makes it possible to quickly identify the situations that are most interesting and important for the administrator, and to correctly prioritize the workload analysis process as a whole. This article describes the process of developing a set of 19 rules, each of which determines a way to detect the onset of a certain critical situation, describes the possible causes of its occurrence, and specifies the criticality of the situation that has arisen. These rules allow monitoring different aspects of supercomputer behavior: the efficiency of using application packages, the operation of the queue system, the load and availability of service servers, the presence of global performance issues in user applications, and the peculiarities of using separate partitions of the supercomputer. The developed rules formed the basis of a software solution that was implemented and evaluated on the petaflop-level Lomonosov-2 supercomputer.
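
The abstract says each rule combines a detection method, a list of possible causes, and a criticality level. A minimal Python sketch of that structure might look as follows. Everything in it is an assumption for illustration: the `Rule` class, the `evaluate` helper, the metric names, and the thresholds are hypothetical and are not taken from the article or from the software deployed on Lomonosov-2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names, metrics, and thresholds below are hypothetical illustrations;
# they are not taken from the article or from the Lomonosov-2 software.


@dataclass
class Rule:
    name: str
    detect: Callable[[Dict[str, float]], bool]  # True when the situation has occurred
    possible_causes: List[str]
    criticality: str  # one of "low", "medium", "high"


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[Rule]:
    """Return the rules triggered by one snapshot of monitoring data, most critical first."""
    order = {"high": 0, "medium": 1, "low": 2}
    triggered = [r for r in rules if r.detect(metrics)]
    return sorted(triggered, key=lambda r: order[r.criticality])


rules = [
    Rule(
        name="queue_wait_too_long",
        detect=lambda m: m.get("avg_queue_wait_hours", 0.0) > 24.0,
        possible_causes=["partition overloaded", "scheduler misconfigured"],
        criticality="medium",
    ),
    Rule(
        name="service_server_unavailable",
        detect=lambda m: m.get("service_servers_down", 0.0) >= 1.0,
        possible_causes=["hardware failure", "network outage"],
        criticality="high",
    ),
]

# One hypothetical snapshot of monitoring data.
snapshot = {"avg_queue_wait_hours": 30.0, "service_servers_down": 1.0}
alerts = evaluate(rules, snapshot)
for rule in alerts:
    print(f"[{rule.criticality}] {rule.name}: {', '.join(rule.possible_causes)}")
```

Sorting the triggered rules by criticality mirrors the abstract's point about prioritizing the analysis: the administrator sees the most critical situations first.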


Pages: 1547-1559

Number of pages: 13


