Automatic Monitoring of Large-Scale Computing Infrastructure

Cited by: 0
Authors
Kim, Bockjoo [1]
Bourilkov, Dimitri [1]
Affiliations
[1] Univ Florida, Dept Phys, Gainesville, FL 32611 USA
DOI
10.1051/epjconf/202429507007
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
Modern distributed computing systems produce large amounts of monitoring data. For these systems to operate smoothly, under-performing or failing components must be identified quickly, and preferably automatically, so that system managers can react accordingly. In this contribution, we analyze job and transfer data collected during the operation of the LHC computing infrastructure. The monitoring data are harvested from the Elasticsearch database and converted to formats suitable for further processing. Using various machine learning and deep learning techniques, we develop automatic tools for continuous monitoring of the health of the underlying systems. Our initial implementation is based on publicly available deep learning tools, the PyTorch or TensorFlow packages, running on state-of-the-art GPU systems.
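As an illustration of the harvest-and-convert step described in the abstract, the following is a minimal sketch and not the authors' implementation. It flattens an Elasticsearch-style search response (the `hits.hits[]._source` layout) into plain rows, computes a per-site job failure rate, and flags outlying sites with a robust z-score based on the median and the median absolute deviation. This simple statistic stands in for the machine learning and deep learning models the abstract mentions. The field names `site` and `status`, the site labels, and the threshold of 3.5 are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def make_response(counts):
    """Build a toy Elasticsearch-style response from {site: (n_jobs, n_failed)}.

    The 'site' and 'status' field names are hypothetical.
    """
    hits = []
    for site, (n_jobs, n_failed) in counts.items():
        for i in range(n_jobs):
            status = "failed" if i < n_failed else "ok"
            hits.append({"_source": {"site": site, "status": status}})
    return {"hits": {"hits": hits}}


def flatten_hits(response):
    """Turn a search response into a list of flat dict rows."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


def failure_rates(rows):
    """Return the fraction of failed jobs per site."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        totals[row["site"]] += 1
        if row["status"] != "ok":
            failures[row["site"]] += 1
    return {site: failures[site] / totals[site] for site in totals}


def flag_outliers(rates, threshold=3.5):
    """Flag sites whose failure rate is unusually high (robust z-score)."""
    values = list(rates.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread among the sites, so the score is undefined.
        return []
    return sorted(
        site for site, v in rates.items()
        if 0.6745 * (v - med) / mad > threshold
    )


# Hypothetical sites: three healthy ones and one with many failed jobs.
response = make_response({
    "SITE_A": (10, 1),
    "SITE_B": (10, 2),
    "SITE_C": (20, 3),
    "SITE_D": (10, 9),
})
flagged = flag_outliers(failure_rates(flatten_hits(response)))
print(flagged)
```

A median-based score is used here because a single badly failing site would inflate a mean and standard deviation and could mask itself; a learned model would replace `flag_outliers` in a fuller pipeline.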
Pages: 7