DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems

Authors
Wang, Ziliang [1, 2, 3, 4]
Zhu, Shiyi [5]
Li, Jianguo [5]
Jiang, Wei [5]
Ramakrishnan, K. K. [6]
Yan, Meng [7]
Zhang, Xiaohong [7]
Liu, Alex X. [5]
Affiliations
[1] Chongqing Univ, Key Lab Dependable Serv Comp Cyber Phys Soc, Minist Educ, Chongqing 400044, Peoples R China
[2] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 400044, Peoples R China
[3] Peking Univ PKU, Key Lab High Confidence Software Technol HCST, Minist Educ MOE, Beijing 100871, Peoples R China
[4] Peking Univ PKU, Sch Comp Sci SCS, Beijing 100871, Peoples R China
[5] Ant Grp, Hangzhou 310063, Peoples R China
[6] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[7] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 401331, Peoples R China
Keywords
Microservices autoscaling; cloud systems; horizontal autoscaling; service quality; RESOURCE-MANAGEMENT; ELASTICITY;
DOI
10.1109/TNET.2024.3400953
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Cloud service providers often over-provision resources to meet their Service Level Objectives (SLOs) by setting low CPU utilization targets. In large-scale cloud deployments, this wastes resources and noticeably increases power consumption. To address this issue, this paper presents DeepScaling, a solution for minimizing resource cost while ensuring that SLO requirements are met in a dynamic, large-scale, microservice-based production system. DeepScaling introduces three components that adaptively refine the target CPU utilization of servers in the data center and keep it at a stable value, so that SLO constraints are met with the minimum amount of system resources. First, DeepScaling forecasts the workload of each service using a Spatio-temporal Graph Neural Network. Second, it estimates CPU utilization with a Deep Neural Network, considering factors such as periodic tasks and traffic. Finally, it uses a modified Deep Q-Network (DQN) to generate an autoscaling policy that controls service resources to maximize service stability while meeting SLOs. An evaluation in Ant Group's large-scale cloud environment shows that DeepScaling outperforms state-of-the-art autoscaling approaches in maintaining stable performance and saving resources. Deployed in a real-world environment of 1900+ microservices, DeepScaling saves the provisioning of over 100,000 CPU cores per day, on average.
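To make the abstract's opening point concrete, the following minimal sketch shows why a lower CPU utilization target translates into more provisioned instances. It uses the generic proportional sizing rule for horizontal autoscaling, not DeepScaling's forecasting or DQN components. The function name `replicas_needed`, its parameters, and all numbers are hypothetical and chosen only for illustration; none of them come from the paper.

```python
def replicas_needed(demand_millicores: int,
                    millicores_per_replica: int,
                    target_util_pct: int) -> int:
    """Smallest replica count that keeps average CPU utilization at or
    below target_util_pct (an integer percentage in 1..100).

    Assumption: load is spread evenly across replicas. Integer arithmetic
    is used so the ceiling is exact.
    """
    if not 1 <= target_util_pct <= 100:
        raise ValueError("target_util_pct must be between 1 and 100")
    if demand_millicores < 0 or millicores_per_replica <= 0:
        raise ValueError("demand must be >= 0 and replica size must be > 0")
    numerator = demand_millicores * 100
    denominator = millicores_per_replica * target_util_pct
    # Ceiling division on integers; always keep at least one replica.
    return max(1, -(-numerator // denominator))


# Hypothetical service: 120 cores of demand, 4-core replicas.
conservative = replicas_needed(120_000, 4_000, 30)  # low utilization target
tighter = replicas_needed(120_000, 4_000, 50)       # higher utilization target
print(conservative, tighter)
```

Under these illustrative numbers, raising the target from 30% to 50% reduces the replica count from 100 to 60, which is the kind of saving that a stable, well-chosen utilization target makes possible, provided the SLOs still hold at the higher target.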
Pages: 3961-3976
Page count: 16
Related papers
50 in total
  • [41] Predicting the Stability of Large-scale Distributed Stream Processing Systems on the Cloud
    Tri Minh Truong
    Harwood, Aaron
    Sinnott, Richard O.
    CLOSER: PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2017, : 575 - 582
  • [42] Harnessing the Cloud for Securely Outsourcing Large-Scale Systems of Linear Equations
    Wang, Cong
    Ren, Kui
    Wang, Jia
    Wang, Qian
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2013, 24 (06) : 1172 - 1181
  • [43] Outsourcing Large-scale Systems of Linear Matrix Equations in Cloud Computing
    Zhang, Jian
    Yang, Yang
    Wang, Zhibo
    2016 IEEE 22ND INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2016, : 438 - 447
  • [44] Performance Analysis of Large-scale Distributed Stream Processing Systems on the Cloud
    Tri Minh Truong
    Harwood, Aaron
    Sinnott, Richard O.
    Chen, Shiping
    PROCEEDINGS 2018 IEEE 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD), 2018, : 754 - 761
  • [45] Harnessing the Cloud for Securely Solving Large-scale Systems of Linear Equations
    Wang, Cong
    Ren, Kui
    Wang, Jia
    Urs, Karthik Mahendra Raje
    31ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2011), 2011, : 549 - 558
  • [46] DCU-CHK: checkpointing for large-scale CPU-DCU heterogeneous computing systems
    Jia, Jie
    Lin, Xinyuan
    Lin, Fang
    Liu, Yi
    CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2024, 6 (05) : 519 - 532
  • [47] Online Collection and Forecasting of Resource Utilization in Large-Scale Distributed Systems
    Tuor, Tiffany
    Wang, Shiqiang
    Leung, Kin K.
    Ko, Bong Jun
    2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019), 2019, : 133 - 143
  • [48] Utilization of Large-scale Charging Devices Integration Into Power Systems With Microgrids
    Zhang Xiaobo
    Zhang Baohui
    Chao Guang
    2014 14TH INTERNATIONAL CONFERENCE ON ENVIRONMENT AND ELECTRICAL ENGINEERING (EEEIC), 2014, : 245 - 248
  • [49] FINITE STABILITY REGIONS FOR LARGE-SCALE SYSTEMS WITH STABLE AND UNSTABLE SUBSYSTEMS
    MORARI, M
    STEPHANOPOULOS, G
    ARIS, R
    INTERNATIONAL JOURNAL OF CONTROL, 1977, 26 (05) : 805 - 815
  • [50] An In-Depth Analysis of Cloud Block Storage Workloads in Large-Scale Production
    Li, Jinhong
    Wang, Qiuping
    Lee, Patrick P. C.
    Shi, Chao
    2020 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2020), 2020, : 37 - 47