A hybrid GPU cluster and volunteer computing platform for scalable deep learning

Cited by: 13
Authors
Kijsipongse, Ekasit [1 ]
Piyatumrong, Apivadee [1 ]
U-ruekolan, Suriya [1 ]
Affiliations
[1] Natl Elect & Comp Technol Ctr NECTEC, Large Scale Simulat Res Lab, 112 Thailand Sci Pk, Pahon Yothin Rd, Klong 1, Klongluang 12120, Pathumthani, Thailand
Source
JOURNAL OF SUPERCOMPUTING | 2018, Vol. 74, No. 7
Keywords
Cluster computing; Volunteer computing; Deep learning;
DOI
10.1007/s11227-018-2375-9
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Deep learning is a computing-intensive and time-consuming task. Training a sophisticated model within a reasonable time requires far more computing resources than a single machine can provide. GPU clusters are typically used to reduce the training time of a deep learning model from days to hours. However, building large dedicated GPU clusters is not always feasible, or even cost-effective, for most organizations, because purchase, operation, and maintenance are expensive while such systems are rarely fully utilized. Volunteer computing can address this problem by providing additional computing resources at little or no cost. This work presents a hybrid cluster and volunteer computing platform that scales a GPU cluster out into volunteer computing for distributed deep learning. Machine owners contribute the unused computing resources of their computers to extend the capability of the GPU cluster. The challenge is to seamlessly reconcile the differences between GPU cluster and volunteer computing systems so as to ensure scalability transparency, while performance is another major concern. We validate the proposed work with two well-known sample cases. The results show efficient use of our hybrid platform with sub-linear speedup.
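To make the hybrid idea concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of one way a coordinator could combine data-parallel gradient updates from dedicated GPU-cluster workers and unreliable volunteer machines: workers pull the current model, compute a gradient on a local mini-batch, and push it back, while updates based on a model snapshot that is too stale are simply dropped. All names, the staleness bound, and the least-squares loss are illustrative assumptions.

# Hypothetical sketch of asynchronous, staleness-bounded gradient aggregation
# across cluster and volunteer workers (illustrative only).

import numpy as np

class Coordinator:
    """Holds the global model and applies (possibly stale) gradient updates."""

    def __init__(self, dim, lr=0.01, max_staleness=4):
        self.w = np.zeros(dim)      # global model parameters
        self.version = 0            # incremented on every accepted update
        self.lr = lr
        self.max_staleness = max_staleness

    def pull(self):
        # A worker fetches a snapshot of the model and its version.
        return self.w.copy(), self.version

    def push(self, grad, base_version):
        # Drop updates computed against a model that is too old
        # (volunteer nodes may be slow or go offline mid-batch).
        if self.version - base_version > self.max_staleness:
            return False
        self.w -= self.lr * grad
        self.version += 1
        return True


def worker_step(coordinator, x_batch, y_batch):
    """One training step on a worker (cluster node or volunteer machine)."""
    w, version = coordinator.pull()
    # Least-squares gradient as a stand-in for a real deep-learning loss.
    grad = 2.0 * x_batch.T @ (x_batch @ w - y_batch) / len(y_batch)
    return coordinator.push(grad, version)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=8)
    coord = Coordinator(dim=8)
    for step in range(200):
        x = rng.normal(size=(32, 8))
        y = x @ true_w
        worker_step(coord, x, y)
    print("parameter error:", np.linalg.norm(coord.w - true_w))

In a scheme like this, the staleness bound lets the coordinator tolerate volunteer nodes that lag or disappear mid-computation without blocking the dedicated cluster workers, which is one plausible way to reconcile the reliability gap between the two resource pools.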
Pages: 3236-3263
Number of pages: 28
Related papers (50 in total)
  • [1] A hybrid GPU cluster and volunteer computing platform for scalable deep learning
    Ekasit Kijsipongse
    Apivadee Piyatumrong
    Suriya U-ruekolan
    The Journal of Supercomputing, 2018, 74 : 3236 - 3263
  • [2] A Hybrid GPU-FPGA-based Computing Platform for Machine Learning
    Liu, Xu
    Ounifi, Hibat Allah
    Gherbi, Abdelouahed
    Lemieux, Yves
    Li, Wubin
    9TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN-2018) / 8TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2018), 2018, 141 : 104 - 111
  • [3] A generic middleware-based platform for scalable cluster computing
    De Turck, F
    Vanhastel, S
    Volckaert, B
    Demeester, P
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2002, 18 (04): 549 - 560
  • [4] Extensible volunteer computing platform
    Jankowski, Grzegorz
    Debski, Roman
    Byrski, Aleksander
    PROCEEDINGS 27TH EUROPEAN CONFERENCE ON MODELLING AND SIMULATION ECMS 2013, 2013, : 532 - +
  • [5] BOINC: A Platform for Volunteer Computing
    David P. Anderson
    Journal of Grid Computing, 2020, 18 : 99 - 122
  • [6] BOINC: A Platform for Volunteer Computing
    Anderson, David P.
    JOURNAL OF GRID COMPUTING, 2020, 18 (01) : 99 - 122
  • [7] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
    Gu, Juncheng
    Chowdhury, Mosharaf
    Shin, Kang G.
    Zhu, Yibo
    Jeon, Myeongjae
    Qian, Junjie
    Liu, Hongqiang
    Guo, Chuanxiong
    PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, 2019, : 485 - 500
  • [8] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
    Zhao, Hanyu
    Han, Zhenhua
    Yang, Zhi
    Zhang, Quanlu
    Yang, Fan
    Zhou, Lidong
    Yang, Mao
    Lau, Francis C. M.
    Wang, Yuqi
    Xiong, Yifan
    Wang, Bin
    PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20), 2020, : 515 - 532
  • [9] The Design and Implementation of a Scalable Deep Learning Benchmarking Platform
    Li, Cheng
    Dakkak, Abdul
    Xiong, Jinjun
    Hwu, Wen-mei
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 414 - 425
  • [10] Acceleration of Large Deep Learning Training with Hybrid GPU Memory Management of Swapping and Re-computing
    Imai, Haruki
    Le, Tung D.
    Negishi, Yasushi
    Kawachiya, Kiyokuni
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 1111 - 1116