A hybrid GPU cluster and volunteer computing platform for scalable deep learning

Cited by: 13
Authors
Kijsipongse, Ekasit [1 ]
Piyatumrong, Apivadee [1 ]
U-ruekolan, Suriya [1 ]
Affiliations
[1] Natl Elect & Comp Technol Ctr NECTEC, Large Scale Simulat Res Lab, 112 Thailand Sci Pk, Pahon Yothin Rd, Klong 1, Klongluang 12120, Pathumthani, Thailand
Source
JOURNAL OF SUPERCOMPUTING | 2018, Vol. 74, No. 7
Keywords
Cluster computing; Volunteer computing; Deep learning;
DOI
10.1007/s11227-018-2375-9
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Deep learning is a computing-intensive and time-consuming task. Training a sophisticated model within a reasonable time requires far more computing resources than a single machine can provide. GPU clusters are typically used to reduce the training time of a deep learning model from days to hours. However, building large dedicated GPU clusters is not always feasible, or even cost-effective, for most organizations, because purchase, operation, and maintenance are expensive while such systems are rarely fully utilized. Volunteer computing can address this problem by providing additional computing resources at little or no cost. This work presents a hybrid cluster and volunteer computing platform that scales a GPU cluster out into volunteer computing for distributed deep learning. Machine owners contribute the unused computing resources of their computers to extend the capability of the GPU cluster. The challenge is to seamlessly reconcile the differences between GPU cluster and volunteer computing systems so as to ensure scalability transparency, while performance is another major concern. We validate the proposed work with two well-known sample cases. The results show efficient use of our hybrid platform with sub-linear speedup.
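To make the hybrid idea concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of one way a coordinator could combine data-parallel gradient updates from dedicated GPU-cluster workers and unreliable volunteer machines: workers pull the current model, compute a gradient on a local mini-batch, and push it back, while updates based on a model snapshot that is too stale are simply dropped. All names, the staleness bound, and the least-squares loss are illustrative assumptions.

# Hypothetical sketch of asynchronous, staleness-bounded gradient aggregation
# across cluster and volunteer workers (illustrative only).

import numpy as np

class Coordinator:
    """Holds the global model and applies (possibly stale) gradient updates."""

    def __init__(self, dim, lr=0.01, max_staleness=4):
        self.w = np.zeros(dim)      # global model parameters
        self.version = 0            # incremented on every accepted update
        self.lr = lr
        self.max_staleness = max_staleness

    def pull(self):
        # A worker fetches a snapshot of the model and its version.
        return self.w.copy(), self.version

    def push(self, grad, base_version):
        # Drop updates computed against a model that is too old
        # (volunteer nodes may be slow or go offline mid-batch).
        if self.version - base_version > self.max_staleness:
            return False
        self.w -= self.lr * grad
        self.version += 1
        return True


def worker_step(coordinator, x_batch, y_batch):
    """One training step on a worker (cluster node or volunteer machine)."""
    w, version = coordinator.pull()
    # Least-squares gradient as a stand-in for a real deep-learning loss.
    grad = 2.0 * x_batch.T @ (x_batch @ w - y_batch) / len(y_batch)
    return coordinator.push(grad, version)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=8)
    coord = Coordinator(dim=8)
    for step in range(200):
        x = rng.normal(size=(32, 8))
        y = x @ true_w
        worker_step(coord, x, y)
    print("parameter error:", np.linalg.norm(coord.w - true_w))

In a scheme like this, the staleness bound lets the coordinator tolerate volunteer nodes that lag or disappear mid-computation without blocking the dedicated cluster workers, which is one plausible way to reconcile the reliability gap between the two resource pools.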
Pages: 3236-3263
Number of pages: 28
Related papers (50 in total)
  • [1] A hybrid GPU cluster and volunteer computing platform for scalable deep learning
    Ekasit Kijsipongse
    Apivadee Piyatumrong
    Suriya U-ruekolan
    The Journal of Supercomputing, 2018, 74 : 3236 - 3263
  • [2] A Hybrid GPU-FPGA-based Computing Platform for Machine Learning
    Liu, Xu
    Ounifi, Hibat Allah
    Gherbi, Abdelouahed
    Lemieux, Yves
    Li, Wubin
    9TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN-2018) / 8TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2018), 2018, 141 : 104 - 111
  • [3] A generic middleware-based platform for scalable cluster computing
    De Turck, F
    Vanhastel, S
    Volckaert, B
    Demeester, P
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2002, 18 (04): 549 - 560
  • [4] Extensible volunteer computing platform
    Jankowski, Grzegorz
    Debski, Roman
    Byrski, Aleksander
    PROCEEDINGS 27TH EUROPEAN CONFERENCE ON MODELLING AND SIMULATION ECMS 2013, 2013, : 532 - +
  • [5] BOINC: A Platform for Volunteer Computing
    David P. Anderson
    Journal of Grid Computing, 2020, 18 : 99 - 122
  • [6] BOINC: A Platform for Volunteer Computing
    Anderson, David P.
    JOURNAL OF GRID COMPUTING, 2020, 18 (01) : 99 - 122
  • [7] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
    Gu, Juncheng
    Chowdhury, Mosharaf
    Shin, Kang G.
    Zhu, Yibo
    Jeon, Myeongjae
    Qian, Junjie
    Liu, Hongqiang
    Guo, Chuanxiong
    PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, 2019, : 485 - 500
  • [8] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
    Zhao, Hanyu
    Han, Zhenhua
    Yang, Zhi
    Zhang, Quanlu
    Yang, Fan
    Zhou, Lidong
    Yang, Mao
    Lau, Francis C. M.
    Wang, Yuqi
    Xiong, Yifan
    Wang, Bin
    PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20), 2020, : 515 - 532
  • [9] The Design and Implementation of a Scalable Deep Learning Benchmarking Platform
    Li, Cheng
    Dakkak, Abdul
    Xiong, Jinjun
    Hwu, Wen-mei
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 414 - 425
  • [10] Acceleration of Large Deep Learning Training with Hybrid GPU Memory Management of Swapping and Re-computing
    Imai, Haruki
    Le, Tung D.
    Negishi, Yasushi
    Kawachiya, Kiyokuni
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 1111 - 1116