HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving

Cited by: 1
Authors
Mo, Hao [1 ]
Zhu, Ligu [1 ,2 ]
Shi, Lei [1 ]
Tan, Songfu [1 ]
Wang, Suping [1 ]
Affiliations
[1] Commun Univ China, State Key Lab Media Convergence & Commun, Beijing 100024, Peoples R China
[2] Beijing Key Lab Big Data Secur & Protect Ind, Beijing 100024, Peoples R China
Keywords
inference serving; autoscaling; cost effectiveness; multi-tenant inference;
DOI
10.3390/electronics12010240
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
To accelerate inference in machine-learning (ML) model serving, clusters of machines rely on expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective manner. Novel autoscaling mechanisms that greedily minimize the number of service instances while ensuring SLO compliance are helpful. However, we find that such mechanisms alone cannot guarantee cost effectiveness across heterogeneous GPU hardware, nor do they maximize resource utilization. In this paper, we propose HetSev, which addresses these challenges by combining heterogeneity-aware autoscaling with resource-efficient scheduling to achieve cost effectiveness. We develop an autoscaling mechanism that accounts for both SLO compliance and GPU heterogeneity, provisioning the appropriate type and number of instances to guarantee cost effectiveness. We leverage multi-tenant inference to improve GPU resource utilization, while alleviating inter-tenant interference by avoiding the co-location of identical ML instances on the same GPU during placement. HetSev is integrated into Kubernetes and deployed onto a heterogeneous GPU cluster. We evaluate its performance using several representative ML models. Compared with default Kubernetes, HetSev reduces resource cost by up to 2.15x while meeting SLO requirements.
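The heterogeneity-aware autoscaling idea summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not HetSev's actual algorithm: the GPU types, throughput, latency, and cost figures are hypothetical, and the policy simply picks the cheapest GPU type and instance count that meet the offered request rate within the latency SLO.

```python
# Hypothetical sketch of heterogeneity-aware provisioning: among GPU types
# that can meet the latency SLO, pick the type and instance count serving the
# given request rate at minimum hourly cost. All profile numbers are made up.
import math

# Per-GPU-type profile: requests/s per instance, per-request latency (ms),
# and hourly cost in dollars. Illustrative values only.
GPU_PROFILES = {
    "V100": {"throughput": 200.0, "latency_ms": 40.0, "cost_per_hour": 3.06},
    "T4":   {"throughput": 80.0,  "latency_ms": 70.0, "cost_per_hour": 0.53},
    "P4":   {"throughput": 50.0,  "latency_ms": 95.0, "cost_per_hour": 0.60},
}

def provision(request_rate, slo_ms):
    """Return (gpu_type, instance_count) minimizing cost while meeting the SLO."""
    best = None
    for gpu, p in GPU_PROFILES.items():
        if p["latency_ms"] > slo_ms:
            continue  # this GPU type cannot satisfy the latency SLO at all
        count = math.ceil(request_rate / p["throughput"])  # instances needed
        cost = count * p["cost_per_hour"]
        if best is None or cost < best[2]:
            best = (gpu, count, cost)
    if best is None:
        raise ValueError("no GPU type satisfies the SLO")
    return best[0], best[1]
```

For example, `provision(300.0, 80.0)` excludes the hypothetical P4 (latency above the SLO) and prefers four T4 instances over two V100s because the aggregate hourly cost is lower; a homogeneity-blind policy that only minimized instance count would have picked the V100s.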
Pages: 18