Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1]
Choi, Hyeonseong [1]
Lee, Jaehwan [1]
Kim, Jik-Soo [2]
Jei, Hyunseung [3]
Roh, Hongchan [3]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Funding
National Research Foundation, Singapore
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that uses both parameter-server and all-reduce schemes to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing, with enhanced collective communication capability based on MPI. We implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by decreasing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
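The hybrid aggregation idea in the abstract can be pictured with a small, framework-free sketch. It is not the authors' implementation: the abstract does not say how the two schemes are combined, so the split used here (a synchronous all-reduce average among the GPUs of one node, followed by an asynchronous push of that average to a parameter server) is an assumption, and the names `all_reduce_mean`, `ParameterServer`, and `node_step` are hypothetical.

```python
from typing import List

Vector = List[float]


def all_reduce_mean(grads: List[Vector]) -> Vector:
    """Element-wise mean of per-GPU gradients.

    Stands in for an all-reduce: afterwards every local worker
    would hold this same averaged vector.
    """
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]


class ParameterServer:
    """Toy parameter server (hypothetical) holding the global weights."""

    def __init__(self, weights: Vector, lr: float) -> None:
        self.weights = list(weights)
        self.lr = lr
        self.updates_applied = 0

    def push(self, grad: Vector) -> None:
        # Asynchronous style: apply each node's gradient as soon as it
        # arrives, without waiting for slower nodes.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]
        self.updates_applied += 1

    def pull(self) -> Vector:
        return list(self.weights)


def node_step(server: ParameterServer, local_grads: List[Vector]) -> Vector:
    """One step for one node: all-reduce locally, then push and pull.

    Assumption: all-reduce is used inside a node and the parameter
    server across nodes; the abstract does not confirm this split.
    """
    node_grad = all_reduce_mean(local_grads)
    server.push(node_grad)
    return server.pull()


if __name__ == "__main__":
    server = ParameterServer(weights=[1.0, 1.0], lr=0.5)
    # A node with two GPUs and a node with one GPU report at different times.
    print(node_step(server, [[1.0, 2.0], [3.0, 4.0]]))
    print(node_step(server, [[2.0, -1.0]]))
```

In this sketch a node with fewer or slower GPUs never blocks the others, since each push is applied on arrival. That is the kind of utilization benefit the abstract attributes to asynchronous training on heterogeneous hardware.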
Pages: 2287-2300
Page count: 14