Design and Implementation of a Scalable HPC Monitoring System

Cited by: 7
Authors
Sanchez, S. [1]
Bonnie, A. [1]
Van Heule, G. [1]
Robinson, C. [1]
DeConinck, A. [1]
Kelly, K. [1]
Snead, Q. [1]
Brandt, J. [2]
Affiliations
[1] Los Alamos Natl Lab, Los Alamos, NM 87544 USA
[2] Sandia Natl Labs, Albuquerque, NM 87185 USA
DOI
10.1109/IPDPSW.2016.167
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Over the past decade, platforms at Los Alamos National Laboratory (LANL) have experienced large increases in complexity and scale to reach computational targets. The changes to the compute platforms have presented new challenges to the production monitoring systems, which must not only cope with larger volumes of monitoring data but also provide new capabilities for the management, distribution, and analysis of this data. This schema must support both real-time analysis for alerting on urgent issues and analysis of historical data for understanding performance issues and trends in system behavior. This paper presents the design of our proposed next-generation monitoring system, as well as implementation details for an initial deployment. This design takes the form of a multi-stage data processing pipeline, including a scalable cluster for data aggregation and early analysis; a message broker for distribution of this data to varied consumers; and an initial selection of consumer services for alerting and analysis. We will also present estimates of the capabilities and scale required to monitor two upcoming compute platforms at LANL.
Pages: 1721-1725
Page count: 5