Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Funding
National Research Foundation, Singapore
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification
TP [Automation and computer technology]
Discipline code
0812
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework improves computing performance by reducing I/O bottlenecks and effectively increases resource utilization in the heterogeneous multi-GPU cluster.
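To make the hybrid aggregation idea in the abstract concrete, here is a minimal toy simulation in NumPy. It is an illustrative sketch only, not the authors' implementation: the class and function names, the learning rate, and the group structure are all assumptions. Workers inside a homogeneous group average their gradients with an all-reduce-style reduction; a parameter server then applies each group's averaged gradient as it arrives, so slower (heterogeneous) groups never block faster ones.

```python
import numpy as np

def allreduce_average(grads):
    """Synchronous all-reduce within one homogeneous group: element-wise mean."""
    return np.mean(grads, axis=0)

class ParameterServer:
    """Aggregates group-level gradients as they arrive (asynchronous style)."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def push(self, group_grad):
        # Each group's averaged gradient is applied immediately on arrival,
        # so a slow group does not stall updates from the fast group.
        self.weights -= self.lr * group_grad

# Two heterogeneous groups: a fast 4-GPU group and a slow 2-GPU group
# (per-worker gradients are just example values).
fast_group = [np.array([1.0, 2.0]), np.array([3.0, 2.0]),
              np.array([2.0, 4.0]), np.array([2.0, 0.0])]
slow_group = [np.array([4.0, 4.0]), np.array([0.0, 0.0])]

ps = ParameterServer(dim=2)
ps.push(allreduce_average(fast_group))   # fast group's mean: [2.0, 2.0]
ps.push(allreduce_average(slow_group))   # slow group's mean: [2.0, 2.0]
print(ps.weights)                        # -> [-0.4 -0.4]
```

In the actual framework the intra-group reduction would use MPI collectives (e.g. an MPI all-reduce) rather than an in-process mean, and pushes to the parameter server would arrive from separate processes; this sketch only shows the division of labor between the two aggregation schemes.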
Pages: 2287-2300
Page count: 14
Related papers (50 total)
  • [1] Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
    Youngrang Kim
    Hyeonseong Choi
    Jaehwan Lee
    Jik-Soo Kim
    Hyunseung Jei
    Hongchan Roh
    Cluster Computing, 2020, 23 : 2287 - 2300
  • [2] Efficient Large-scale Deep Learning Framework for Heterogeneous Multi-GPU Cluster
    Kim, Youngrang
    Choi, Hyeonseong
    Lee, Jaehwan
    Kim, Jik-Soo
    Jei, Hyunseung
    Roh, Hongchan
    2019 IEEE 4TH INTERNATIONAL WORKSHOPS ON FOUNDATIONS AND APPLICATIONS OF SELF* SYSTEMS (FAS*W 2019), 2019, : 176 - 181
  • [3] Performance Analysis of Distributed Deep Learning Frameworks in a Multi-GPU Environment
    Kavarakuntla, Tulasi
    Han, Liangxiu
    Lloyd, Huw
    Latham, Annabel
    Akintoye, Samson B.
    20TH INT CONF ON UBIQUITOUS COMP AND COMMUNICAT (IUCC) / 20TH INT CONF ON COMP AND INFORMATION TECHNOLOGY (CIT) / 4TH INT CONF ON DATA SCIENCE AND COMPUTATIONAL INTELLIGENCE (DSCI) / 11TH INT CONF ON SMART COMPUTING, NETWORKING, AND SERV (SMARTCNS), 2021, : 406 - 413
  • [4] Involving CPUs into Multi-GPU Deep Learning
    Le, Tung D.
    Sekiyama, Taro
    Negishi, Yasushi
    Imai, Haruki
    Kawachiya, Kiyokuni
    PROCEEDINGS OF THE 2018 ACM/SPEC INTERNATIONAL CONFERENCE ON PERFORMANCE ENGINEERING (ICPE '18), 2018, : 56 - 67
  • [5] Efficient Multi-GPU Memory Management for Deep Learning Acceleration
    Kim, Youngrang
    Lee, Jaehwan
    Kim, Jik-Soo
    Jei, Hyunseung
    Roh, Hongchan
    2018 IEEE 3RD INTERNATIONAL WORKSHOPS ON FOUNDATIONS AND APPLICATIONS OF SELF* SYSTEMS (FAS*W), 2018, : 37 - 43
  • [6] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
    Pal, Saptadeep
    Ebrahimi, Eiman
    Zulfiqar, Arslan
    Fu, Yaosheng
    Zhang, Victor
    Migacz, Szymon
    Nellans, David
    Gupta, Puneet
    IEEE MICRO, 2019, 39 (05) : 91 - 101
  • [7] Towards multi-GPU support for visualization
    Owens, John D.
    SCIDAC 2007: SCIENTIFIC DISCOVERY THROUGH ADVANCED COMPUTING, 2007, 78
  • [8] Moim: A Multi-GPU MapReduce Framework
    Xie, Mengjun
    Kang, Kyoung-Don
    Basaran, Can
    2013 IEEE 16TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND ENGINEERING (CSE 2013), 2013, : 1279 - 1286
  • [9] Efficient Multi-Training Framework of Image Deep Learning on GPU Cluster
    Chen, Chun-Fu
    Lee, Gwo Giun
    Xia, Yinglong
    Lin, W. Sabrina
    Suzumura, Toyotaro
    Lin, Ching-Yung
    2015 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2015, : 489 - 494
  • [10] Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment
    Choi, HyeonSeong
    Kim, Youngrang
    Lee, Jaehwan
    Kim, Yoonhee
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2021, 15 (03): : 911 - 931