Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Cloud-Scale Infrastructure

Authors
Li, Ze [1]
Cheng, Qian [1]
Hsieh, Ken [1]
Dang, Yingnong [1]
Huang, Peng [2]
Singh, Pankaj [1]
Yang, Xinsheng [1]
Lin, Qingwei [3]
Wu, Youjiang [1]
Levy, Sebastien [1]
Chintalapati, Murali [1]
Affiliations
[1] Microsoft Azure, Redmond, WA 98052 USA
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] Microsoft Res, Redmond, WA USA
Keywords
DOI
Not available
Chinese Library Classification
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
Modern cloud systems have a vast number of components that continuously undergo updates. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals and correlates each signal against all ongoing rollouts using a spatial and temporal correlation algorithm. The core decision logic of Gandalf includes an ensemble ranking algorithm that determines which rollout may have caused the fault signals, and a binary classifier that assesses the impact of the fault signals. The analysis result decides whether a rollout is safe to proceed or should be stopped. By using a lambda architecture, Gandalf provides both real-time and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. For data-plane rollouts, it achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts). For control-plane rollouts, Gandalf achieves 94.9% precision and 99.8% recall.
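To make the idea of correlating fault signals against ongoing rollouts concrete, the following is a minimal sketch only, not the algorithm from the paper. It assumes a fault "votes" for a rollout when it occurs on a node that the rollout touched (spatial match) and within a fixed window after that node was updated (temporal match). The names Rollout, Fault, rank_rollouts, and the one-hour window are all hypothetical and chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical record: rollout name and per-node deployment times (seconds)."""
    name: str
    deployed_at: dict  # node id -> deployment timestamp


@dataclass
class Fault:
    """Hypothetical fault signal observed on one node at one time (seconds)."""
    node: str
    time: float


def rank_rollouts(faults, rollouts, window=3600.0):
    """Rank rollouts by how many faults they plausibly explain.

    A fault counts for a rollout only if the rollout was deployed on the
    fault's node (spatial) and the fault occurred no more than `window`
    seconds after that deployment (temporal). The window is an assumption.
    """
    votes = Counter()
    for fault in faults:
        for rollout in rollouts:
            deployed = rollout.deployed_at.get(fault.node)
            if deployed is not None and 0 <= fault.time - deployed <= window:
                votes[rollout.name] += 1
    # Highest vote count first; rollouts with no votes are omitted.
    return votes.most_common()


# Made-up example data, for illustration only.
rollouts = [
    Rollout("agent-v2", {"n1": 0.0, "n2": 100.0}),
    Rollout("os-patch", {"n3": 50.0}),
]
faults = [Fault("n1", 200.0), Fault("n2", 300.0), Fault("n3", 9000.0), Fault("n4", 10.0)]
ranking = rank_rollouts(faults, rollouts)
print(ranking)  # [('agent-v2', 2)]
```

In this toy data, the fault on n3 arrives too long after the os-patch deployment to count, and the fault on n4 matches no rollout, so only agent-v2 is ranked. A production system as described in the abstract would combine several such signals in an ensemble and pass the result to a classifier that assesses impact; none of that is modeled here.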
Pages: 389-402
Page count: 14