DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems

Cited by: 0

Authors:

Wang, Ziliang ^{[1,2,3,4]}

Zhu, Shiyi ^{[5]}

Li, Jianguo ^{[5]}

Jiang, Wei ^{[5]}

Ramakrishnan, K. K. ^{[6]}

Yan, Meng ^{[7]}

Zhang, Xiaohong ^{[7]}

Liu, Alex X. ^{[5]}

Affiliations:

[1] Chongqing Univ, Key Lab Dependable Serv Comp Cyber Phys Soc, Minist Educ, Chongqing 400044, Peoples R China

[2] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 400044, Peoples R China

[3] Peking Univ PKU, Key Lab High Confidence Software Technol HCST, Minist Educ MOE, Beijing 100871, Peoples R China

[4] Peking Univ PKU, Sch Comp Sci SCS, Beijing 100871, Peoples R China

[5] Ant Grp, Hangzhou 310063, Peoples R China

[6] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA

[7] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 401331, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON NETWORKING | 2024 / Vol. 32 / Issue 05

Keywords:

Microservices autoscaling; cloud systems; horizontal autoscaling; service quality; RESOURCE-MANAGEMENT; ELASTICITY;

DOI:

10.1109/TNET.2024.3400953

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

Cloud service providers often provision excessive resources to meet the desired Service Level Objectives (SLOs) by setting low CPU utilization targets. This can waste resources and noticeably increase power consumption in large-scale cloud deployments. To address this issue, this paper presents DeepScaling, a solution for minimizing resource cost while ensuring that SLO requirements are met in a dynamic, large-scale, production microservice-based system. DeepScaling introduces three components that adaptively refine the target CPU utilization of servers in the data center and maintain it at a stable value, meeting SLO constraints while using the minimum amount of system resources. First, DeepScaling forecasts the workload of each service using a Spatio-temporal Graph Neural Network. Second, it estimates CPU utilization with a Deep Neural Network, considering factors such as periodic tasks and traffic. Finally, it uses a modified Deep Q-Network (DQN) to generate an autoscaling policy that controls service resources to maximize service stability while meeting SLOs. Evaluation of DeepScaling in Ant Group's large-scale cloud environment shows that it outperforms state-of-the-art autoscaling approaches in maintaining stable performance and in resource savings. The deployment of DeepScaling in a real-world environment of 1900+ microservices saves the provisioning of over 100,000 CPU cores per day, on average.


Pages: 3961-3976

Page count: 16
