A hybrid GPU cluster and volunteer computing platform for scalable deep learning

Cited by: 13
Authors
Kijsipongse, Ekasit [1]
Piyatumrong, Apivadee [1]
U-ruekolan, Suriya [1]
Affiliations
[1] Natl Elect & Comp Technol Ctr NECTEC, Large Scale Simulat Res Lab, 112 Thailand Sci Pk,Pahon Yothin Rd,Klong 1, Klongluang 12120, Pathumthani, Thailand
Source
JOURNAL OF SUPERCOMPUTING | 2018 / Vol. 74 / Issue 07
Keywords
Cluster computing; Volunteer computing; Deep learning
DOI
10.1007/s11227-018-2375-9
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Deep learning is a computing-intensive and time-consuming task. Training a sophisticated model within a reasonable time requires far more computing resources than a single machine can provide. Normally, GPU clusters are required to reduce the training time of a deep learning model from days to hours. However, building large dedicated GPU clusters is not always feasible, and may even be ineffective, for most organizations because of the cost of purchase, operation and maintenance, while such systems are not fully utilized all the time. Volunteer computing can address this problem, as it provides additional computing resources at little or no cost. This work presents a hybrid cluster and volunteer computing platform that scales out GPU clusters into volunteer computing for distributed deep learning. The owners of the machines contribute unused computing resources on their computers to extend the capability of the GPU cluster. The challenge is to seamlessly reconcile the differences between GPU cluster and volunteer computing systems so as to ensure scalability transparency, while performance is another major concern. We validate the proposed work with two well-known sample cases. The results show that our hybrid platform is used efficiently, achieving sub-linear speedup.
Pages: 3236-3263
Number of pages: 28