GuardGrid: a high-availability cloud platform for deep learning applications

Cited by: 0
Authors
Yifan Sui [1]
Meng Cai [2]
Jianxun Li [1]
Affiliations
[1] Shanghai Jiao Tong University, Department of Automation
[2] AVIC, Luoyang Institute of Electro
Keywords
Cloud computing; Machine learning system; High availability; Fault-tolerant
DOI
10.1007/s10586-024-04959-6
Abstract
With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training applications places enormous pressure on nodes and easily causes node failures. Although many works address fast recovery from failure, they fail to achieve the optimal recovery speed because they ignore the unique characteristics of ML training. We observed that existing fault-tolerant solutions might intensify the out-of-memory (OOM) issue. Moreover, they focus only on accelerating node initialization and ignore the dependency library, dataset, and model loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes in advance of future failures, based on both fault rate prediction and servers’ memory load. In addition, to speed up recovery without introducing OOM issues through redundant nodes, we propose a reactive fault-tolerant mechanism that takes effect when the cluster’s memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods.
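The interplay between the two mechanisms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the decision rule, so the 80% load threshold, the per-node memory size, the sizing rule, and all names (`ClusterState`, `choose_mode`, `plan_redundant_nodes`) are hypothetical assumptions chosen only to show how fault rate prediction and memory load might jointly limit the number of redundant nodes.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    # All fields are hypothetical; the abstract does not define them.
    predicted_failures: float  # predicted node failures in the next window
    used_memory_gb: int        # memory currently in use across the cluster
    total_memory_gb: int       # total cluster memory


def choose_mode(state, high_load_pct=80):
    """Return 'reactive' when memory load is high, otherwise 'proactive'.

    The 80% threshold is an assumed value, not one taken from the paper.
    """
    is_high = state.used_memory_gb * 100 >= state.total_memory_gb * high_load_pct
    return "reactive" if is_high else "proactive"


def plan_redundant_nodes(state, node_memory_gb=64, high_load_pct=80):
    """Number of redundant nodes to create in advance (assumed sizing rule).

    Creates enough nodes to cover the predicted failures, but never so many
    that their memory footprint pushes the cluster past the load threshold.
    """
    if choose_mode(state, high_load_pct) == "reactive":
        # High memory load: create nothing in advance, recover on failure.
        return 0
    wanted = math.ceil(state.predicted_failures)
    limit_gb = state.total_memory_gb * high_load_pct // 100
    affordable = (limit_gb - state.used_memory_gb) // node_memory_gb
    return max(0, min(wanted, affordable))
```

Under these assumptions, a cluster using 500 of 1000 GB with 2.3 predicted failures would get three redundant nodes, while the same prediction at 700 GB used would be capped at one node by the memory headroom.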