With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training applications places enormous pressure on nodes and easily causes node failures. Although many works address fast recovery from failure, they fail to achieve optimal recovery speed because they ignore the unique characteristics of ML training. We observe that existing fault-tolerant solutions may intensify the out-of-memory (OOM) issue. Moreover, they focus only on accelerating node initialization and ignore the dependency library, dataset, and model loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes in advance of future failures, based on both fault rate prediction and the servers’ memory load. In addition, to speed up recovery without letting redundant nodes introduce OOM issues, we propose a reactive fault-tolerant mechanism that operates when the cluster’s memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods.