Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerosp Univ, Goyang Si, South Korea
[2] Myongji Univ, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam Si, South Korea
Funding
National Research Foundation, Singapore
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning
DOI
10.1007/s10586-020-03144-9
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach using parameter-server and all-reduce schemes in order to address potential performance degradation problems in running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism to maintain training accuracy for asynchronous data-parallel deep learning processing with enhanced collective communication capability based on MPI. We successfully implement our proposed framework on TensorFlow and perform extensive experiments in both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by decreasing I/O bottlenecks and effectively increase resource utilization in the heterogeneous multi-GPU cluster.
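The hybrid aggregation idea in the abstract (all-reduce among similar workers, a parameter server across dissimilar ones) can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal illustration using mpi4py and NumPy, assuming rank 0 acts as the parameter server and that a hypothetical GPU_TYPE environment variable labels each worker's device class.

```python
# Hypothetical sketch of a hybrid aggregation scheme (not the authors' code).
# Assumptions: rank 0 is the parameter server, every other rank is a worker,
# and a GPU_TYPE environment variable (invented here) labels each worker's
# device class so that workers of the same class form one homogeneous group.
import os
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()
is_server = rank == 0

# Split the workers into homogeneous sub-communicators by device class;
# the server takes no part in any group (MPI.UNDEFINED -> COMM_NULL).
gpu_type = int(os.environ.get("GPU_TYPE", "0"))
color = MPI.UNDEFINED if is_server else gpu_type
group = world.Split(color, key=rank)

# Let the server learn how many worker groups exist (one leader reports per group).
reported = world.gather(None if is_server else gpu_type, root=0)
n_groups = len({t for t in reported if t is not None}) if is_server else 0

params = np.zeros(4)          # toy "model" parameters
lr = 0.1                      # toy learning rate

if is_server:
    # Parameter server: one exchange per group leader; slow groups do not
    # block fast ones because each exchange is handled independently.
    status = MPI.Status()
    for _ in range(n_groups):
        grad = world.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
        params -= lr * grad                       # SGD step on the group gradient
        world.send(params, dest=status.Get_source(), tag=2)
else:
    # Step 1: synchronous all-reduce inside the homogeneous group.
    local_grad = np.random.rand(4)                # stand-in for a backprop gradient
    group_grad = np.empty_like(local_grad)
    group.Allreduce(local_grad, group_grad, op=MPI.SUM)
    group_grad /= group.Get_size()
    # Step 2: only the group leader talks to the parameter server, then
    # broadcasts the refreshed parameters back to its group.
    if group.Get_rank() == 0:
        world.send(group_grad, dest=0, tag=1)
        params = world.recv(source=0, tag=2)
    params = group.bcast(params, root=0)

print(f"rank {rank}: params = {params}")
```

Grouping workers by device class keeps the synchronous all-reduce inside homogeneous groups, while the per-group exchange with the server lets fast groups proceed without waiting for slow ones, which is the gist of the hybrid scheme described in the abstract.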
Pages: 2287-2300
Page count: 14