Cloud reliability and efficiency improvement via failure risk based proactive actions

Cited by: 15
Authors
Tian, Yuli [1, 2, 3]
Tian, Jeff [4, 5]
Li, Ning [1, 2, 3]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Minist Ind, Xian, Shaanxi, Peoples R China
[3] Northwestern Polytech Univ, Informat Technol Key Lab Big Data Storage & Manag, Xian, Shaanxi, Peoples R China
[4] Southern Methodist Univ, Dept Comp Sci, Dallas, TX 75205 USA
[5] Northwest Univ, Sch Informat, Xian, Shaanxi, Peoples R China
Keywords
Cloud computing system; Reliability; Efficiency; Risk identification; Failure mitigation and fault tolerance; SOFTWARE-RELIABILITY
DOI
10.1016/j.jss.2020.110524
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Due to the huge magnitude and complexity of cloud computing systems (CCS), failures are inevitable and lead to reliability and efficiency losses. Failure mitigation, fault tolerance, and recovery actions can be performed to improve CCS reliability and efficiency. Using data collected during CCS operation, failure prediction and risk identification techniques can anticipate such failures. In this paper, we develop a framework that combines risk identification with follow-up proactive actions to improve CCS reliability and efficiency. We start by analyzing cloud failures and the related operational data. A tree-based predictive model is then trained to identify high-risk cloud tasks. By proactively terminating these high-risk tasks, both the number of CCS failures and the resource consumption could be significantly reduced. The impact of these proactive actions can be simulated to quantify the improvement in both system reliability and efficiency. The new approach has been applied to the Google cluster dataset, covering approximately 400 GB of operational data over 29 consecutive days, to demonstrate its viability and effectiveness. (C) 2020 Published by Elsevier Inc.
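The simulation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Task` record, the `threshold` and `early_fraction` parameters, and the assumption that a terminated task consumes only a fixed fraction of its logged resources are all hypothetical choices made for illustration. The risk scores stand in for the output of some tree-based model and are simply given as input here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    risk: float       # hypothetical predicted failure risk in [0, 1]
    failed: bool      # outcome recorded in the operational log
    cpu_hours: float  # resources the task consumed over its full lifetime


def simulate_proactive_termination(tasks, threshold=0.8, early_fraction=0.1):
    """Replay logged tasks, terminating those with risk >= threshold.

    Assumption: a terminated task uses only `early_fraction` of its logged
    resources and no longer counts as a failure.
    """
    failures_before = sum(t.failed for t in tasks)
    cpu_before = sum(t.cpu_hours for t in tasks)

    failures_after = 0
    cpu_after = 0.0
    wrongly_terminated = 0  # tasks that would have finished successfully

    for t in tasks:
        if t.risk >= threshold:
            cpu_after += t.cpu_hours * early_fraction
            if not t.failed:
                wrongly_terminated += 1
        else:
            cpu_after += t.cpu_hours
            failures_after += int(t.failed)

    return {
        "failures_before": failures_before,
        "failures_after": failures_after,
        "cpu_before": cpu_before,
        "cpu_after": cpu_after,
        "wrongly_terminated": wrongly_terminated,
    }


if __name__ == "__main__":
    # Toy log, invented for illustration only.
    log = [
        Task(risk=0.90, failed=True, cpu_hours=10.0),
        Task(risk=0.95, failed=False, cpu_hours=4.0),
        Task(risk=0.20, failed=True, cpu_hours=6.0),
        Task(risk=0.10, failed=False, cpu_hours=20.0),
    ]
    print(simulate_proactive_termination(log))
```

The `wrongly_terminated` count makes the trade-off explicit: lowering the threshold avoids more failures and saves more resources, but also kills more tasks that would have completed.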
Pages: 12