DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems

Authors
Wang, Ziliang [1, 2, 3, 4]
Zhu, Shiyi [5 ]
Li, Jianguo [5 ]
Jiang, Wei [5 ]
Ramakrishnan, K. K. [6 ]
Yan, Meng [7 ]
Zhang, Xiaohong [7 ]
Liu, Alex X. [5 ]
Affiliations
[1] Chongqing Univ, Key Lab Dependable Serv Comp Cyber Phys Soc, Minist Educ, Chongqing 400044, Peoples R China
[2] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 400044, Peoples R China
[3] Peking Univ PKU, Key Lab High Confidence Software Technol HCST, Minist Educ MOE, Beijing 100871, Peoples R China
[4] Peking Univ PKU, Sch Comp Sci SCS, Beijing 100871, Peoples R China
[5] Ant Grp, Hangzhou 310063, Peoples R China
[6] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[7] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 401331, Peoples R China
Keywords
Microservices autoscaling; cloud systems; horizontal autoscaling; service quality; resource management; elasticity
DOI
10.1109/TNET.2024.3400953
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cloud service providers often provision excessive resources to meet the desired Service Level Objectives (SLOs) by setting low CPU utilization targets. This wastes resources and noticeably increases power consumption in large-scale cloud deployments. To address this issue, this paper presents DeepScaling, a solution for minimizing resource cost while ensuring that SLO requirements are met in a dynamic, large-scale, production microservice-based system. DeepScaling introduces three innovative components that adaptively refine the target CPU utilization of servers in the data center and maintain it at a stable value, meeting SLO constraints while using the minimum amount of system resources. First, DeepScaling forecasts the workload of each service using a Spatio-temporal Graph Neural Network. Second, it estimates CPU utilization with a Deep Neural Network, considering factors such as periodic tasks and traffic. Finally, it uses a modified Deep Q-Network (DQN) to generate an autoscaling policy that controls service resources to maximize service stability while meeting SLOs. Evaluation of DeepScaling in Ant Group's large-scale cloud environment shows that it outperforms state-of-the-art autoscaling approaches in maintaining stable performance and saving resources. The deployment of DeepScaling in a real-world environment of 1900+ microservices saves the provisioning of over 100,000 CPU cores per day, on average.
Pages: 3961-3976
Page count: 16