Prediction algorithm for failed batch jobs in co-located cloud

Cited by: 0

Authors:

Lin W. [1,2]

Shi F. [1]

Li Y. [1]

Liu F. [1]

Liu J. [1]

Peng S. [3]

Wang J.Z. [4]

Affiliations:

[1] School of Computer Science and Engineering, South China University of Technology, Guangzhou

[2] Peng Cheng Laboratory, Shenzhen

[3] College of Computer Science and Electronic Engineering, Hunan University, Changsha

[4] School of Computing, Clemson University, Clemson

Source:

Guofang Keji Daxue Xuebao/Journal of National University of Defense Technology | 2022, Vol. 44, No. 5

Keywords:

cloud computing; co-location; failed job prediction; resource utilization

DOI:

10.11887/j.cn.202205008


Abstract:

To reduce the risk of failed batch jobs in co-located clouds, the K-means algorithm was used to divide batch jobs into four categories. Building on this classification, a two-layer nested classification model (TLNM) was proposed and a prediction algorithm based on TLNM was implemented. Experimental results on the Ali Trace 2018 data set show that the receiver operating characteristic (ROC) curve of this algorithm is significantly better than those of other commonly used classifiers, and that the area under the ROC curve (AUC) reaches 0.978, indicating good classification performance. The recall rate reaches 0.951. The confusion matrix shows that the TLNM algorithm accurately predicts failed batch jobs. © 2022 National University of Defense Technology. All rights reserved.
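The abstract reports results in terms of the confusion matrix, the recall rate, and the AUC. As a reminder of how these quantities relate, the following is a minimal sketch in plain Python. It is not the TLNM algorithm and does not reproduce the paper's numbers: the labels, scores, and the 0.5 decision threshold are made-up toy values, and the function names are hypothetical. Label 1 is assumed to mean "failed job".

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as 'failed job' (assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Fraction of truly failed jobs that were predicted as failed."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration, not taken from the Alibaba trace).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(recall(y_true, y_pred))            # 0.75
print(roc_auc(y_true, scores))           # 0.9375
```

Recall depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is presumably why the abstract reports both.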

Citation

Pages: 71-79

Page count: 8

