GuardGrid: a high-availability cloud platform for deep learning applications

Authors
Yifan Sui [1]
Meng Cai [2]
Jianxun Li [1]
Affiliations
[1] Shanghai Jiao Tong University, Department of Automation
[2] AVIC, Luoyang Institute of Electro
Keywords
Cloud computing; Machine learning system; High availability; Fault-tolerant
DOI
10.1007/s10586-024-04959-6
Abstract
With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training applications places enormous pressure on nodes, easily causing node failures. Although many works address fast recovery from failure, they fail to achieve the optimal recovery speed because they all ignore the unique characteristics of ML training. We observed that existing fault-tolerant solutions might intensify the out-of-memory (OOM) issue. Moreover, they focus only on accelerating node initialization, ignoring the dependency library, dataset, and model loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes in advance for future failures based on both fault rate prediction and servers’ memory load. In addition, to speed up recovery without the redundant nodes themselves introducing OOM issues, we propose a reactive fault-tolerant mechanism that takes effect when the cluster’s memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods.