Automating Workload Analysis of Large-Scale Supercomputer Systems

Authors
Shvets, P. A. [1, 2]
Voevodin, V. V. [1, 2]
Zhumatiy, S. A. [1]
Affiliations
[1] Lomonosov Moscow State Univ, Moscow 119991, Russia
[2] Moscow Ctr Fundamental & Appl Math, Moscow 119991, Russia
Funding
Russian Foundation for Basic Research
Keywords
supercomputing; high-performance computing; workload analysis; efficiency; data analysis; monitoring data; system software
DOI
10.1134/S1995080221070210
Chinese Library Classification
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
The architecture of modern supercomputers is extremely complex, so it is exceedingly difficult to monitor and maintain the efficiency of their functioning. Even when the necessary data on the operation of all important supercomputer components can be collected, how can one avoid drowning in this "sea of information" and missing the onset of a critical situation? This requires automating the workload analysis process. One possible solution is to create a set of rules that automatically detect critical situations, or cases of a significant decrease in the efficiency of supercomputer functioning, and notify supercomputer administrators about them. Such an approach allows the administrator to quickly identify the most interesting and important situations and to correctly prioritize the workload analysis process as a whole. This article describes the development of a set of 19 rules, each of which defines a way to detect the onset of a certain critical situation, describes the possible causes of its occurrence, and specifies the criticality of the situation. These rules allow monitoring different aspects of supercomputer behavior: the efficiency of using application packages, the operation of the queue system, the load and availability of service servers, the presence of global performance issues in user applications, and the peculiarities of using separate partitions of the supercomputer. The developed rules formed the basis of a software solution that was implemented and evaluated on the petaflop-level Lomonosov-2 supercomputer.
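The rule structure described in the abstract (a detection method, a description of possible causes, and a criticality level) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Rule` class, the `evaluate` function, the metric names, the thresholds, and the two example rules are all hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]  # hypothetical: metric name -> latest value


@dataclass
class Rule:
    name: str
    detect: Callable[[Metrics], bool]  # True when the situation is present
    possible_causes: str               # explanation shown to the administrator
    criticality: str                   # "high", "medium", or "low"


def evaluate(rules: List[Rule], metrics: Metrics) -> List[Rule]:
    """Return the triggered rules, most critical first."""
    order = {"high": 0, "medium": 1, "low": 2}
    triggered = [r for r in rules if r.detect(metrics)]
    return sorted(triggered, key=lambda r: order[r.criticality])


# Two illustrative rules; names and thresholds are assumptions.
RULES = [
    Rule(
        name="queue_wait_too_long",
        detect=lambda m: m.get("avg_queue_wait_hours", 0.0) > 24.0,
        possible_causes="Partition overloaded or scheduler misconfigured.",
        criticality="medium",
    ),
    Rule(
        name="service_server_overloaded",
        detect=lambda m: m.get("service_server_load", 0.0) > 0.9,
        possible_causes="Too many concurrent requests to a service node.",
        criticality="high",
    ),
]

sample = {"avg_queue_wait_hours": 30.0, "service_server_load": 0.95}
alerts = evaluate(RULES, sample)
for rule in alerts:
    print(f"[{rule.criticality}] {rule.name}: {rule.possible_causes}")
```

Sorting triggered rules by criticality mirrors the abstract's point about prioritizing the analysis; a real system would presumably draw its metrics from monitoring data rather than a hand-written dictionary.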
Pages: 1547-1559
Number of pages: 13