DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems

Cited by: 0
Authors
Wang, Ziliang [1 ,2 ,3 ,4 ]
Zhu, Shiyi [5 ]
Li, Jianguo [5 ]
Jiang, Wei [5 ]
Ramakrishnan, K. K. [6 ]
Yan, Meng [7 ]
Zhang, Xiaohong [7 ]
Liu, Alex X. [5 ]
Affiliations
[1] Chongqing Univ, Key Lab Dependable Serv Comp Cyber Phys Soc, Minist Educ, Chongqing 400044, Peoples R China
[2] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 400044, Peoples R China
[3] Peking Univ PKU, Key Lab High Confidence Software Technol HCST, Minist Educ MOE, Beijing 100871, Peoples R China
[4] Peking Univ PKU, Sch Comp Sci SCS, Beijing 100871, Peoples R China
[5] Ant Grp, Hangzhou 310063, Peoples R China
[6] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[7] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 401331, Peoples R China
Keywords
Microservices autoscaling; cloud systems; horizontal autoscaling; service quality; RESOURCE-MANAGEMENT; ELASTICITY;
DOI
10.1109/TNET.2024.3400953
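Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract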
Cloud service providers often provision excessive resources to meet their Service Level Objectives (SLOs) by setting low CPU utilization targets. This wastes resources and noticeably increases power consumption in large-scale cloud deployments. To address this issue, this paper presents DeepScaling, a solution for minimizing resource cost while ensuring that SLO requirements are met in a dynamic, large-scale, production microservice-based system. DeepScaling introduces three components that adaptively refine the target CPU utilization of servers in the data center and maintain it at a stable value, meeting SLO constraints while using the minimum amount of system resources. First, it forecasts the workload of each service using a Spatio-temporal Graph Neural Network. Second, it estimates CPU utilization with a Deep Neural Network, considering factors such as periodic tasks and traffic. Finally, it uses a modified Deep Q-Network (DQN) to generate an autoscaling policy that controls service resources to maximize service stability while meeting SLOs. An evaluation in Ant Group's large-scale cloud environment shows that DeepScaling outperforms state-of-the-art autoscaling approaches in maintaining stable performance and saving resources. Deployed in a real-world environment of more than 1900 microservices, DeepScaling saves the provisioning of over 100,000 CPU cores per day on average.
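To illustrate why the choice of target CPU utilization drives resource cost, here is a minimal sketch of the generic proportional rule used by many horizontal autoscalers. It is not DeepScaling's method: the abstract describes learned forecasting, estimation, and DQN components that this sketch does not attempt to reproduce. The function name `replicas_for_target` and all numbers are hypothetical, and utilization is given as integer percentages to keep the arithmetic exact.

```python
def replicas_for_target(current_replicas: int,
                        observed_util_pct: int,
                        target_util_pct: int) -> int:
    """Return the replica count that would bring average CPU utilization
    to roughly target_util_pct, assuming load spreads evenly across replicas.

    Hypothetical helper for illustration only (generic proportional rule,
    not the DeepScaling policy).
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be >= 1")
    if not 0 < target_util_pct <= 100:
        raise ValueError("target_util_pct must be in (0, 100]")
    if observed_util_pct < 0:
        raise ValueError("observed_util_pct must be >= 0")

    # Total CPU demand, expressed in "replica-percent" units.
    demand = current_replicas * observed_util_pct
    # Ceiling division: round up so the target is not exceeded.
    needed = -(-demand // target_util_pct)
    # Always keep at least one replica running.
    return max(1, needed)


if __name__ == "__main__":
    # Same workload (100 replicas observed at 30% CPU), two different targets.
    conservative = replicas_for_target(100, 30, 40)
    aggressive = replicas_for_target(100, 30, 60)
    print(conservative, aggressive)
```

Under these assumed numbers, a higher utilization target needs fewer replicas for the same workload, which is the saving that a low, conservative target gives up.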
Pages: 3961 - 3976
Page count: 16
Related papers
50 entries in total
  • [31] Effective Utilization of Large-scale Unobserved Data in Recommendation Systems
    Zhang, Feng
    Xu, Yulin
    Chen, Hongjie
    Yuan, Xu
    Liu, QingWen
    Jiang, YuNing
    PROCEEDINGS OF THE 33RD ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2024, 2024, : 5070 - 5077
  • [32] STABILITY ANALYSIS OF LARGE-SCALE SYSTEMS WITH STABLE AND UNSTABLE SUBSYSTEMS
    GRUJIC, LT
    INTERNATIONAL JOURNAL OF CONTROL, 1974, 20 (03) : 453 - 463
  • [33] A decentralized stable Fuzzy Adaptive Controller for large scale nonlinear systems
    Ghasemi, Reza
    Menhaj, M.B.
    Afshar, A.
    Journal of Applied Sciences, 2009, 9 (05) : 892 - 900
  • [34] Effects of culture pH on glucose utilization and lactate production in small and large scale production cultures
    Acbay, JQ
    Hoy, C
    Widrig, R
    Andersen, DC
    Briggs, TR
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 1998, 216 : U287 - U287
  • [36] Generation and Characterization of Stable Recombinant Baculoviruses for large-scale production of rAAV
    van Tongeren, H. M.
    Visser, R. N.
    van Meerendonk, M.
    Brandjes, A.
    Coolen, A. Gal
    Willems, J.
    Puri, P.
    Jaadar, H.
    McLure, C.
    Marsden, S. R.
    Deventer, S. Jh
    Sanders, B. P.
    HUMAN GENE THERAPY, 2022, 33 (23-24) : A104 - A104
  • [37] REPLACEMENT ANALYSIS FOR COMPONENTS OF LARGE-SCALE PRODUCTION SYSTEMS
    LUXHOJ, JT
    INTERNATIONAL JOURNAL OF PRODUCTION ECONOMICS, 1992, 27 (02) : 97 - 110
  • [38] Understanding Exception-Related Bugs in Large-Scale Cloud Systems
    Chen, Haicheng
    Dou, Wensheng
    Jiang, Yanyan
    Qin, Feng
    34TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE 2019), 2019, : 339 - 351
  • [39] On Verifying Stateful Dataflow Processing Services in Large-Scale Cloud Systems
    Du, Juan
    Gu, Xiaohui
    Yu, Ting
    PROCEEDINGS OF THE 17TH ACM CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY (CCS'10), 2010, : 672 - 674
  • [40] A Scalable Approach for Structuring Large-Scale Hierarchical Cloud Management Systems
    Moens, Hendrik
    De Turck, Filip
    2013 9TH INTERNATIONAL CONFERENCE ON NETWORK AND SERVICE MANAGEMENT (CNSM), 2013, : 1 - 8