GuardGrid: a high-availability cloud platform for deep learning applications

Authors
Yifan Sui [1]
Meng Cai [2]
Jianxun Li [1]
Affiliations
[1] Shanghai Jiao Tong University, Department of Automation
[2] AVIC, Luoyang Institute of Electro
Keywords
Cloud computing; Machine learning system; High availability; Fault-tolerant
DOI
10.1007/s10586-024-04959-6
Abstract
With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training applications places enormous pressure on nodes, easily causing node failure. Although many works address fast recovery from failure, they fail to achieve the optimal recovery speed because they ignore the unique characteristics of ML training. We observed that existing fault-tolerant solutions might intensify the out-of-memory (OOM) issue. Moreover, they focus only on accelerating node initialization, ignoring the dependency library, dataset, and model loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes in advance for future failures based on both fault rate prediction and servers' memory load. In addition, to speed up recovery without introducing OOM issues through redundant nodes, we propose a reactive fault-tolerant mechanism that takes effect when the cluster's memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods.
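The split between the proactive and reactive mechanisms can be illustrated with a small sketch. This is not the paper's algorithm: the abstract says only that redundant nodes are created from fault rate prediction and memory load, and that the reactive mechanism applies when memory load is high. The threshold, the per-standby memory share, the `ClusterState` fields, and the sizing rule below are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    # All fields are hypothetical stand-ins for quantities named in the abstract.
    predicted_fault_rate: float  # expected fraction of nodes failing in the next window
    memory_load: float           # cluster memory utilisation, in [0, 1]
    active_nodes: int            # nodes currently running training tasks


def choose_mode(state: ClusterState, memory_threshold: float = 0.75) -> str:
    """Pick 'reactive' when memory load is high, otherwise 'proactive'."""
    return "reactive" if state.memory_load >= memory_threshold else "proactive"


def redundant_nodes_to_create(
    state: ClusterState,
    memory_threshold: float = 0.75,
    standby_memory_share: float = 0.125,
) -> int:
    """Assumed sizing rule: cover the predicted failures, capped by memory headroom.

    standby_memory_share is the (assumed) fraction of cluster memory
    that one redundant node consumes.
    """
    if choose_mode(state, memory_threshold) == "reactive":
        # Under high memory load, create no standbys so they cannot add to OOM pressure.
        return 0
    expected_failures = math.ceil(state.predicted_fault_rate * state.active_nodes)
    headroom = memory_threshold - state.memory_load
    affordable = math.floor(headroom / standby_memory_share)
    return max(0, min(expected_failures, affordable))


if __name__ == "__main__":
    state = ClusterState(predicted_fault_rate=0.25, memory_load=0.25, active_nodes=8)
    print(choose_mode(state), redundant_nodes_to_create(state))
```

In this sketch, the memory cap is what keeps the proactive path from worsening OOM pressure; a real system would also need the recovery path itself (dependency, dataset, and model loading), which is out of scope here.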