Coding-Based Performance Improvement of Distributed Machine Learning in Large-Scale Clusters

Cited by: 0
Authors
Wang Y. [1 ]
Li N. [1 ]
Wang X. [1 ]
Zhong F. [1 ]
Affiliations
[1] School of Software, East China Jiaotong University, Nanchang
Funding
National Natural Science Foundation of China
Keywords
Coding technology; Distributed computing; Machine learning; Performance improvement; Straggler tolerance
DOI
10.7544/issn1000-1239.2020.20190286
Abstract
As models and data sets grow, running large-scale machine learning algorithms in distributed clusters has become common practice. The algorithm and its training data are divided into tasks, each of which runs on a different worker node; a master node then combines the results of all tasks to obtain the result of the whole algorithm. When a cluster contains many nodes, some worker nodes, called stragglers, inevitably run slower than the others because of resource contention and other causes, so tasks placed on them take significantly longer to finish. Compared with replicating tasks across multiple nodes, coded computing uses computation and storage redundancy more efficiently to alleviate the effect of stragglers and communication bottlenecks in large-scale machine learning clusters. This paper surveys the research progress on using coding technology to tolerate stragglers and improve the performance of large-scale machine learning clusters. First, we introduce the background of coding technology and large-scale machine learning clusters. Second, we divide the related research into categories by application scenario: matrix multiplication, gradient computing, data shuffling, and other applications. Finally, we summarize the difficulties of applying coding technology in large-scale machine learning clusters and discuss future research directions. © 2020, Science Press. All rights reserved.
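To make the matrix-multiplication scenario concrete, here is a minimal sketch of MDS-coded matrix-vector multiplication in the style this line of work builds on: the master encodes k row blocks of A into n coded blocks, each worker multiplies one coded block by x, and any k of the n returned results suffice to recover Ax, so up to n - k stragglers can simply be ignored. The function names and the real-valued Vandermonde generator below are illustrative assumptions, not the construction of the surveyed paper.

```python
import numpy as np

def mds_encode(A, n, k):
    """Encode k row blocks of A into n coded blocks with a Vandermonde
    generator; any k of the n coded results suffice to recover A @ x."""
    blocks = np.split(A, k)                      # k row blocks, each (m/k, d)
    pts = np.arange(1, n + 1, dtype=float)       # distinct evaluation points
    G = np.vander(pts, k, increasing=True)       # (n, k) generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return coded, G

def mds_decode(results, G):
    """Recover A @ x from any k completed worker results.
    results: dict {worker index: coded_block_i @ x}."""
    k = G.shape[1]
    idx = sorted(results)[:k]                    # any k finished workers
    Y = np.stack([results[i] for i in idx])      # (k, m/k), one row per result
    Z = np.linalg.solve(G[idx, :], Y)            # row j of Z equals blocks[j] @ x
    return Z.reshape(-1)                         # concatenate rows -> A @ x

# Demo: n = 5 workers, k = 3 needed, so up to 2 stragglers are tolerated.
rng = np.random.default_rng(0)
A = rng.standard_normal((9, 4))                  # 9 rows split evenly into k = 3 blocks
x = rng.standard_normal(4)
coded, G = mds_encode(A, n=5, k=3)
# Suppose workers 1 and 4 straggle; only workers 0, 2, 3 return results.
results = {i: coded[i] @ x for i in (0, 2, 3)}
assert np.allclose(mds_decode(results, G), A @ x)
```

Gradient coding and coded shuffling follow the same recover-from-any-k pattern; the encoded objects are partial gradients or data batches rather than matrix blocks.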
Pages: 542-561
Page count: 19