HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving

Cited by: 1
Authors
Mo, Hao [1 ]
Zhu, Ligu [1 ,2 ]
Shi, Lei [1 ]
Tan, Songfu [1 ]
Wang, Suping [1 ]
Affiliations
[1] Commun Univ China, State Key Lab Media Convergence & Commun, Beijing 100024, Peoples R China
[2] Beijing Key Lab Big Data Secur & Protect Ind, Beijing 100024, Peoples R China
Keywords
inference serving; autoscaling; cost effectiveness; multi-tenant inference;
DOI
10.3390/electronics12010240
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
To accelerate inference in machine-learning (ML) model serving, clusters of machines rely on expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective manner. Novel autoscaling mechanisms that greedily minimize the number of service instances while ensuring SLO compliance are helpful. However, we find that such mechanisms alone cannot guarantee cost effectiveness across heterogeneous GPU hardware, nor do they maximize resource utilization. In this paper, we propose HetSev, which addresses these challenges by combining heterogeneity-aware autoscaling with resource-efficient scheduling to achieve cost effectiveness. We develop an autoscaling mechanism that accounts for both SLO compliance and GPU heterogeneity, provisioning the appropriate type and number of instances to guarantee cost effectiveness. We leverage multi-tenant inference to improve GPU resource utilization, while alleviating inter-tenant interference by avoiding the co-location of identical ML instances on the same GPU during placement. HetSev is integrated into Kubernetes and deployed onto a heterogeneous GPU cluster. We evaluate its performance using several representative ML models. Compared with default Kubernetes, HetSev reduces resource cost by up to 2.15x while meeting SLO requirements.
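The heterogeneity-aware autoscaling idea summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not HetSev's actual algorithm: the GPU types, throughput, latency, and cost figures are hypothetical, and the policy simply picks the cheapest GPU type and instance count that meet the offered request rate within the latency SLO.

```python
# Hypothetical sketch of heterogeneity-aware provisioning: among GPU types
# that can meet the latency SLO, pick the type and instance count serving the
# given request rate at minimum hourly cost. All profile numbers are made up.
import math

# Per-GPU-type profile: requests/s per instance, per-request latency (ms),
# and hourly cost in dollars. Illustrative values only.
GPU_PROFILES = {
    "V100": {"throughput": 200.0, "latency_ms": 40.0, "cost_per_hour": 3.06},
    "T4":   {"throughput": 80.0,  "latency_ms": 70.0, "cost_per_hour": 0.53},
    "P4":   {"throughput": 50.0,  "latency_ms": 95.0, "cost_per_hour": 0.60},
}

def provision(request_rate, slo_ms):
    """Return (gpu_type, instance_count) minimizing cost while meeting the SLO."""
    best = None
    for gpu, p in GPU_PROFILES.items():
        if p["latency_ms"] > slo_ms:
            continue  # this GPU type cannot satisfy the latency SLO at all
        count = math.ceil(request_rate / p["throughput"])  # instances needed
        cost = count * p["cost_per_hour"]
        if best is None or cost < best[2]:
            best = (gpu, count, cost)
    if best is None:
        raise ValueError("no GPU type satisfies the SLO")
    return best[0], best[1]
```

For example, `provision(300.0, 80.0)` excludes the hypothetical P4 (latency above the SLO) and prefers four T4 instances over two V100s because the aggregate hourly cost is lower; a homogeneity-blind policy that only minimized instance count would have picked the V100s.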
Pages: 18