GuardGrid: a high-availability cloud platform for deep learning applications

Authors
Yifan Sui [1]
Meng Cai [2]
Jianxun Li [1]
Affiliations
[1] Shanghai Jiao Tong University, Department of Automation
[2] AVIC, Luoyang Institute of Electro
Keywords
Cloud computing; Machine learning system; High availability; Fault-tolerant
DOI
10.1007/s10586-024-04959-6
Abstract
With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training applications places enormous pressure on nodes and can easily cause node failure. Although many works address fast recovery from failure, they fall short of the optimal recovery speed because they ignore the unique characteristics of ML training. We observed that existing fault-tolerant solutions might intensify the out-of-memory (OOM) issue. Moreover, they focus only on accelerating node initialization and ignore the dependency library, dataset, and model loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes in advance for future failures, based on both fault rate prediction and the servers' memory load. In addition, to speed up recovery without letting redundant nodes introduce OOM issues, we propose a reactive fault-tolerant mechanism that takes effect when the cluster's memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods.
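The abstract names two mechanisms, a proactive one driven by fault rate prediction and memory load, and a reactive one used when memory load is high. The following is only a minimal sketch of how such a switch might look, not GuardGrid's actual algorithm. The names `ClusterState` and `plan_fault_tolerance`, the 0.8 memory threshold, and the rule of sizing the standby pool as nodes times predicted fault rate are all hypothetical assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of a training cluster (not from the paper)."""
    num_nodes: int
    predicted_fault_rate: float  # assumed: expected fraction of nodes failing soon
    memory_load: float           # assumed: fraction of cluster memory in use, 0..1


def plan_fault_tolerance(state: ClusterState, high_memory_load: float = 0.8):
    """Return (mode, standby_nodes) under the assumptions stated above.

    When memory load is high, pre-created redundant nodes could worsen OOM,
    so the sketch falls back to reactive recovery with no standby nodes.
    Otherwise it pre-creates enough nodes to cover the predicted failures.
    """
    if state.memory_load >= high_memory_load:
        return ("reactive", 0)
    standby = math.ceil(state.num_nodes * state.predicted_fault_rate)
    return ("proactive", standby)


if __name__ == "__main__":
    print(plan_fault_tolerance(ClusterState(16, 0.25, 0.50)))
    print(plan_fault_tolerance(ClusterState(16, 0.25, 0.90)))
```

In this sketch memory load acts as a simple gate. A real policy would likely weigh per-server memory and prediction uncertainty, which the abstract does not detail.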