Automatic Monitoring of Large-Scale Computing Infrastructure

Cited by: 0
Authors
Kim, Bockjoo [1]
Bourilkov, Dimitri [1]
Affiliations
[1] Univ Florida, Dept Phys, Gainesville, FL 32611 USA
DOI
10.1051/epjconf/202429507007
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835
Abstract
Modern distributed computing systems produce large amounts of monitoring data. For these systems to operate smoothly, under-performing or failing components must be identified quickly, and preferably automatically, enabling the system managers to react accordingly. In this contribution, we analyze jobs and transfer data collected in the running of the LHC computing infrastructure. The monitoring data is harvested from the Elasticsearch database and converted to formats suitable for further processing. Based on various machine and deep learning techniques, we develop automatic tools for continuous monitoring of the health of the underlying systems. Our initial implementation is based on publicly available deep learning tools, PyTorch or TensorFlow packages, running on state-of-the-art GPU systems.
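The pipeline outlined in the abstract (harvest monitoring records, convert them to a processable form, flag under-performing components) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a search response shaped like an Elasticsearch result (`hits.hits[]._source`), the field names `site` and `status` and the site labels are hypothetical, and a simple median-absolute-deviation score stands in for the machine and deep learning models the abstract mentions.

```python
from collections import defaultdict
from statistics import median


def flatten_hits(response):
    """Pull the `_source` documents out of an Elasticsearch-style search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


def failure_rates(records):
    """Fraction of jobs per site whose status is not 'finished' (hypothetical fields)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        totals[rec["site"]] += 1
        if rec["status"] != "finished":
            failures[rec["site"]] += 1
    return {site: failures[site] / totals[site] for site in totals}


def flag_outliers(rates, threshold=3.5):
    """Return sites whose failure rate is unusually high.

    Uses the modified z-score 0.6745 * (x - median) / MAD as a simple
    statistical stand-in for a learned anomaly detector.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, so no score can be computed
    return sorted(
        site for site, v in rates.items()
        if 0.6745 * (v - med) / mad > threshold
    )


def make_sample_response(counts):
    """Build a fake search response from {site: (n_finished, n_failed)}."""
    hits = []
    for site, (n_ok, n_failed) in counts.items():
        hits += [{"_source": {"site": site, "status": "finished"}}] * n_ok
        hits += [{"_source": {"site": site, "status": "failed"}}] * n_failed
    return {"hits": {"hits": hits}}


if __name__ == "__main__":
    # Invented job counts for illustration only.
    sample = make_sample_response({
        "site_a": (9, 1),
        "site_b": (8, 2),
        "site_c": (9, 1),
        "site_d": (8, 2),
        "site_e": (1, 9),
    })
    rates = failure_rates(flatten_hits(sample))
    print(flag_outliers(rates))
```

In a real deployment the response would come from a query against the monitoring database, and the scoring step would be replaced by whichever trained model is in use; the flattening step is the part most likely to carry over unchanged.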
Pages: 7