An Approach for Modeling and Ranking Node-level Stragglers in Cloud Datacenters

Cited by: 7
Authors
Ouyang, Xue [1, 2]
Garraghan, Peter [1 ]
Wang, Changjian [2 ]
Townend, Paul [1 ]
Xu, Jie [1 ]
Affiliations
[1] Univ Leeds, Sch Comp, Leeds, W Yorkshire, England
[2] Natl Univ Def Technol, Parallel & Distributed Lab, Changsha, Hunan, Peoples R China
Keywords
Stragglers; Node Performance; Clusters; Tracelog Data Analysis; Modeling; Ranking; MAPREDUCE
DOI
10.1109/SCC.2016.93
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.
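The abstract's modeling and ranking steps can be illustrated with a minimal sketch. It assumes, purely for illustration, that each node already has a fitted 3-parameter log-logistic model of a slowdown metric (larger means slower) and that nodes are ranked by a high quantile of that model. The class name `LogLogistic3`, the function `rank_nodes`, the node names, and all parameter values are hypothetical and do not come from the paper; the paper's own fitting procedure and its graph-based five-level partitioning are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogLogistic3:
    """3-parameter log-logistic distribution (hypothetical helper).

    alpha: scale (> 0), beta: shape (> 0), gamma: location (threshold).
    """
    alpha: float
    beta: float
    gamma: float

    def cdf(self, x: float) -> float:
        # F(x) = 1 / (1 + ((x - gamma) / alpha) ** (-beta)) for x > gamma
        if x <= self.gamma:
            return 0.0
        return 1.0 / (1.0 + ((x - self.gamma) / self.alpha) ** (-self.beta))

    def quantile(self, p: float) -> float:
        # Inverse CDF: Q(p) = gamma + alpha * (p / (1 - p)) ** (1 / beta)
        if not 0.0 < p < 1.0:
            raise ValueError("p must lie strictly between 0 and 1")
        return self.gamma + self.alpha * (p / (1.0 - p)) ** (1.0 / self.beta)


def rank_nodes(models: dict, p: float = 0.95) -> list:
    """Return node names ordered from slowest to fastest by their p-quantile.

    Assumption: the modeled metric grows as a node gets slower, so a larger
    high quantile marks a more likely straggler.
    """
    return sorted(models, key=lambda name: models[name].quantile(p), reverse=True)


# Made-up parameters for three hypothetical nodes.
example_models = {
    "node-a": LogLogistic3(alpha=1.0, beta=4.0, gamma=0.0),
    "node-b": LogLogistic3(alpha=2.0, beta=4.0, gamma=0.0),
    "node-c": LogLogistic3(alpha=1.0, beta=4.0, gamma=0.5),
}

if __name__ == "__main__":
    for name in rank_nodes(example_models):
        print(name, round(example_models[name].quantile(0.95), 3))
```

Ranking by a tail quantile rather than the mean is one possible reading of the "quantile value" and "extreme case possibility" attributes mentioned in the abstract; other attributes, such as confidence intervals, could be substituted as the sort key.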
Pages: 673-680
Page count: 8
Related papers
50 in total
  • [41] Optimal Throughput Curve for Primary and Secondary Users with Node-level Cooperation
    Yuan, Xu
    Tian, Feng
    Hou, Y. Thomas
    Lou, Wenjing
    Sherali, Hanif D.
    Kompella, Sastry
    Reed, Jeffrey H.
    2015 IEEE INTERNATIONAL SYMPOSIUM ON DYNAMIC SPECTRUM ACCESS NETWORKS (DYSPAN), 2015, : 358 - 364
  • [42] NodeMD: Diagnosing Node-Level Faults in Remote Wireless Sensor Systems
    Krunic, Veljko
    Trumpler, Eric
    Han, Richard
    MOBISYS '07: PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON MOBILE SYSTEMS, APPLICATIONS, AND SERVICES, 2007, : 43 - 56
  • [43] Optimal Control Strategy of a Node-level Dynamical Model with External Computers
    Zhang, Xulong
    Gan, Chenquan
    2017 14TH INTERNATIONAL WORKSHOP ON COMPLEX SYSTEMS AND NETWORKS (IWCSN), 2017, : 21 - 26
  • [44] NCGNN: Node-Level Capsule Graph Neural Network for Semisupervised Classification
    Yang, Rui
    Dai, Wenrui
    Li, Chenglin
    Zou, Junni
    Xiong, Hongkai
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (01) : 1025 - 1039
  • [45] Node-level energy management for sensor networks in the presence of multiple applications
    Boulis, A
    Srivastava, M
    WIRELESS NETWORKS, 2004, 10 (06) : 737 - 746
  • [46] Node-level parallelization for deep neural networks with conditional independent graph
    Zhou, Fugen
    Wu, Fuxiang
    Zhang, Zhengchen
    Dong, Minghui
    NEUROCOMPUTING, 2017, 267 : 261 - 270
  • [47] Adaptive node-level weighted learning for directed graph neural network
    Huang, Jincheng
    Zhu, Xiaofeng
    NEURAL NETWORKS, 2025, 187
  • [48] Virtual Machine Level Temperature Profiling and Prediction in Cloud Datacenters
    Wu, Zhaohui
    Li, Xiang
    Garraghan, Peter
    Jiang, Xiaohong
    Ye, Kejiang
    Zomaya, Albert Y.
    PROCEEDINGS 2016 IEEE 36TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS ICDCS 2016, 2016, : 735 - 736
  • [49] Cancer drug target identification and node-level analysis of the network of MAPK pathways
    Aksam V.K.M.
    Chandrasekaran V.M.
    Pandurangan S.
    Network Modeling Analysis in Health Informatics and Bioinformatics, 2018, 7 (01)
  • [50] Propagation Structure Fusion for Rumor Detection Based on Node-Level Contrastive Learning
    Ma, Jiachen
    Liu, Yong
    Han, Meng
    Hu, Chunqiang
    Ju, Zhaojie
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 35 (12) : 1 - 12