Cloud reliability and efficiency improvement via failure risk based proactive actions

Cited by: 15
Authors
Tian, Yuli [1 ,2 ,3 ]
Tian, Jeff [4 ,5 ]
Li, Ning [1 ,2 ,3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Minist Ind, Xian, Shaanxi, Peoples R China
[3] Northwestern Polytech Univ, Informat Technol Key Lab Big Data Storage & Manag, Xian, Shaanxi, Peoples R China
[4] Southern Methodist Univ, Dept Comp Sci, Dallas, TX 75205 USA
[5] Northwest Univ, Sch Informat, Xian, Shaanxi, Peoples R China
Keywords
Cloud computing system; Reliability; Efficiency; Risk identification; Failure mitigation and fault tolerance; SOFTWARE-RELIABILITY;
DOI
10.1016/j.jss.2020.110524
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Due to the huge scale and complexity of cloud computing systems (CCS), failures are inevitable and lead to reliability and efficiency losses. Failure mitigation, fault tolerance, and recovery actions can be performed to improve CCS reliability and efficiency. Using data collected during CCS operation, failure prediction and risk identification techniques can anticipate such failures. In this paper, we develop a framework that combines risk identification with follow-up proactive actions to improve CCS reliability and efficiency. We start by analyzing cloud failures and the related operational data. A tree-based predictive model is then trained to identify high-risk cloud tasks. By proactively terminating these high-risk tasks, both the number of CCS failures and the resource consumption could be significantly reduced. The impact of these proactive actions can be simulated to quantify the improvement in both system reliability and efficiency. The new approach has been applied to the Google cluster dataset, covering approximately 400 GB of operational data over 29 consecutive days, to demonstrate its viability and effectiveness. (C) 2020 Published by Elsevier Inc.
Pages: 12
Related papers
50 in total
  • [31] Risk Assessment for Water Disaster of Karst Tunnel Based on the Weighting of Reliability Measurement and Improved Extension Cloud Model
    Jiang, Yingli
    Cui, Jie
    Liu, Hao
    Zhang, Yanlong
    GEOFLUIDS, 2023, 2023
  • [32] Risk assessment of seepage failure in deep excavations based on fuzzy analytic hierarchy process and cloud model
    Wu, Jian
    Zhou, Zhifang
    ACTA GEOTECHNICA, 2023, 18 (10) : 5635 - 5658
  • [33] Risk assessment of seepage failure in deep excavations based on fuzzy analytic hierarchy process and cloud model
    Jian Wu
    Zhifang Zhou
    Acta Geotechnica, 2023, 18 : 5635 - 5658
  • [34] Cloud-based smart agriculture framework: Optimizing load balancing efficiency via integrated scheduling algorithm
    Sneha
    Singh, Prabh Deep
    Tripathi, Vikas
    JOURNAL OF INFORMATION & OPTIMIZATION SCIENCES, 2024, 45 (02): : 449 - 458
  • [35] Multi-objective reliability based design optimization and risk analysis of motorcycle frame with strength based failure limit
    S. S. Rane
    A. Srividya
    A. K. Verma
    Rane, S.S. (rsurajs@yahoo.com), 1600, Springer (03): : 33 - 39
  • [36] The αδ method for reliability modeling and improvement of NDT-tools for Risk Based Inspection (RBI): Application to corroded structures
    Schoefs, F.
    Boero, J.
    APPLICATIONS OF STATISTICS AND PROBABILITY IN CIVIL ENGINEERING, 2011, : 2442 - 2449
  • [37] Security and storage improvement in distributed cloud data centers by increasing reliability based on particle swarm optimization and artificial immune system algorithms
    Chamkoori, Alireza
    Katebi, Serajdean
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (06): : 1
  • [39] Improvement of Human-Plant Interactivity via Industrial Cloud-Based Supervisory Control and Data Acquisition System
    Lojka, Tomas
    Zolotova, Iveta
    ADVANCES IN PRODUCTION MANAGEMENT SYSTEMS: INNOVATIVE AND KNOWLEDGE-BASED PRODUCTION MANAGEMENT IN A GLOBAL-LOCAL WORLD, APMS 2014, PT III, 2014, 440 : 83 - 90
  • [40] Energy efficiency improvement of machine tools via peripheral devices management: an optimization-based control approach
    Diaz C, Jenny L.
    Ocampo-Martinez, Carlos
    2019 AMERICAN CONTROL CONFERENCE (ACC), 2019, : 3236 - 3242