An Approach for Modeling and Ranking Node-level Stragglers in Cloud Datacenters

Cited by: 7
Authors
Ouyang, Xue [1, 2]
Garraghan, Peter [1 ]
Wang, Changjian [2 ]
Townend, Paul [1 ]
Xu, Jie [1 ]
Affiliations
[1] Univ Leeds, Sch Comp, Leeds, W Yorkshire, England
[2] Natl Univ Def Technol, Parallel & Distributed Lab, Changsha, Hunan, Peoples R China
Keywords
Stragglers; Node Performance; Clusters; Tracelog Data Analysis; Modeling; Ranking; MapReduce
DOI
10.1109/SCC.2016.93
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that a node's ability to execute parallel tasks tends to follow a three-parameter log-logistic distribution. Further statistical attribute values such as the confidence interval, quantile value, and probability of extreme cases can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation for enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.
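To make the distribution-based ranking idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each node has already been fitted with a three-parameter log-logistic distribution (scale `alpha`, shape `beta`, location `gamma`) over some slowdown metric where larger values mean slower execution. The metric, the node names, the parameter values, and the helper names are all hypothetical; the record does not describe the paper's fitting procedure or its graph-based partitioning, so neither is reproduced here.

```python
import math

def loglogistic3_cdf(x, alpha, beta, gamma):
    """CDF of the three-parameter log-logistic distribution.

    alpha: scale (> 0), beta: shape (> 0), gamma: location (lower bound).
    """
    if x <= gamma:
        return 0.0
    return 1.0 / (1.0 + ((x - gamma) / alpha) ** (-beta))

def loglogistic3_quantile(p, alpha, beta, gamma):
    """Inverse CDF: the value below which a fraction p of observations fall."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return gamma + alpha * (p / (1.0 - p)) ** (1.0 / beta)

def tail_probability(threshold, alpha, beta, gamma):
    """Probability that the metric exceeds the threshold (an 'extreme case')."""
    return 1.0 - loglogistic3_cdf(threshold, alpha, beta, gamma)

# Hypothetical fitted parameters (alpha, beta, gamma) per node; illustrative only.
NODE_PARAMS = {
    "node-a": (1.0, 8.0, 0.0),
    "node-b": (1.2, 4.0, 0.1),
    "node-c": (1.0, 6.0, 0.0),
}

def rank_nodes(params, p=0.95):
    """Order nodes from slowest to fastest by their p-quantile of the metric."""
    scores = {name: loglogistic3_quantile(p, *abg) for name, abg in params.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_nodes(NODE_PARAMS)
```

Under these assumed parameters, `ranking[0]` names the node with the heaviest upper tail, which is the kind of candidate a scheduler might avoid. Ranking by `tail_probability` at a fixed threshold is an alternative attribute along the same lines.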
Pages: 673-680
Page count: 8
Related papers
50 in total
  • [1] Modeling Ransomware Spreading by a Dynamic Node-Level Method
    Liu, Wanping
    IEEE ACCESS, 2019, 7 : 142224 - 142232
  • [2] A Node-Level Model for Service Grid
    Wang, Yan
    Cai, Jifei
    Mobile Information Systems, 2022, 2022
  • [3] One Node at a Time: Node-Level Network Classification
    Shai, Saray
    Jacobs, Isaac
    Mucha, Peter J.
    20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021), 2021, : 922 - 929
  • [4] A Node-Level Model for Service Grid
    Wang, Yan
    Cai, Jifei
    MOBILE INFORMATION SYSTEMS, 2022, 2022
  • [5] Dynamically controlling node-level parallelism in Hadoop
    Kc, Kamal
    Freeh, Vincent W.
    2015 IEEE 8TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, 2015, : 309 - 316
  • [6] A modeling framework for detecting and leveraging node-level information in Bayesian network inference
    Xi, Xiaoyue
    Ruffieux, Helene
    BIOSTATISTICS, 2024,
  • [7] A modeling framework for detecting and leveraging node-level information in Bayesian network inference
    Xi, Xiaoyue
    Ruffieux, Helene
    BIOSTATISTICS, 2024, 26 (01)
  • [8] Portable Node-Level Parallelism for the PGAS Model
    Jungblut, Pascal
    Fuerlinger, Karl
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2021, 49 (06) : 867 - 885
  • [9] Node-Level Optimization of Wireless Sensor Networks
    Campanoni, Simone
    Fornaciari, William
    2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 4005 - 4008
  • [10] Node-level Performance Optimizations in CFD Codes
    Wauligmann, Peter
    Duerrwaechter, Jakob
    Offenhaeuser, Philipp
    Schlottke, Adrian
    Bernreuther, Martin
    Dick, Bjoern
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION WORKSHOPS (HPC ASIA 2021 WORKSHOPS), 2020, : 7 - 8