Efficient Replication for Fast and Predictable Performance in Distributed Computing

Cited by: 2
Authors
Behrouzi-Far, Amir [1]
Soljanin, Emina [1]
Affiliations
[1] Rutgers State Univ, Dept Elect & Comp Engn, New Brunswick, NJ 08901 USA
Keywords
Task analysis; Redundancy; Computational modeling; Machine learning; Internet; Computer architecture; Training; replication; distributed systems; distributed computing; latency; coefficient of variations;
DOI
10.1109/TNET.2021.3062215
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Master-worker distributed computing systems use task replication to mitigate the effect of slow workers on job compute time. The master node groups tasks into batches and assigns each batch to one or more workers. We first assume that the batches do not overlap. Using majorization theory, we show that a balanced replication of batches minimizes the average job compute time for a general class of service time distributions. We then show that the balanced assignment of non-overlapping batches achieves a lower average job compute time than the overlapping schemes proposed in the literature. Next, we derive the optimum redundancy level as a function of the task service time distribution. We show that the redundancy level that minimizes the average job compute time may not coincide with the redundancy level that maximizes job compute time predictability. Therefore, there is a trade-off in optimizing the two metrics. By running experiments on Google cluster traces, we observe that redundancy can reduce the job compute time by one order of magnitude. The optimum level of redundancy depends on the distribution of task service time.
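The claim that balanced replication of non-overlapping batches minimizes the average job compute time can be illustrated with a small calculation. The sketch below is not the paper's derivation: it assumes, purely for illustration, that each replica of a batch finishes after an independent exponential service time with rate `mu`, so a batch replicated on `r` workers finishes at rate `r * mu`, and the job finishes when the slowest batch does. The function name `expected_job_time` and its parameters are hypothetical.

```python
from itertools import combinations


def expected_job_time(replicas, mu=1.0):
    """Expected job compute time under an assumed exponential model.

    replicas[i] is the number of workers assigned to batch i (each >= 1).
    Assumption: replica service times are i.i.d. Exponential(mu), so batch i
    completes after Exponential(replicas[i] * mu). The job time is the
    maximum over batches, whose mean follows from inclusion-exclusion:
        E[max] = sum over nonempty subsets S of (-1)^(|S|+1) / sum of rates in S
    """
    if any(r < 1 for r in replicas):
        raise ValueError("every batch needs at least one worker")
    rates = [r * mu for r in replicas]
    total = 0.0
    for k in range(1, len(rates) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(rates, k):
            total += sign / sum(subset)
    return total


# Six workers, two batches: compare balanced and skewed assignments.
for assignment in ([3, 3], [2, 4], [1, 5]):
    print(assignment, round(expected_job_time(assignment), 4))
```

Under these assumptions the balanced assignment `[3, 3]` gives the smallest mean, and the mean grows as the assignment becomes more skewed. The abstract states the result for a more general class of service time distributions; the exponential case is used here only because its mean has a simple closed form.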
Pages: 1467-1476
Number of pages: 10