Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster

Cited by: 17
Authors
Kim, Youngrang [1 ]
Choi, Hyeonseong [1 ]
Lee, Jaehwan [1 ]
Kim, Jik-Soo [2 ]
Jei, Hyunseung [3 ]
Roh, Hongchan [3 ]
Affiliations
[1] Korea Aerospace University, Goyang-si, South Korea
[2] Myongji University, Yongin, South Korea
[3] SK Telecom ML Infra Lab, Seongnam-si, South Korea
Funding
National Research Foundation of Singapore
Keywords
Data parallel; Distributed deep learning; Heterogeneous cluster; Large-scale deep learning
DOI
10.1007/s10586-020-03144-9
CLC classification
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes to address the performance degradation that can arise when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, supported by enhanced MPI-based collective communication. We implement our proposed framework on top of TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our framework improves computing performance by reducing I/O bottlenecks and effectively increases resource utilization in the heterogeneous multi-GPU cluster.
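To make the hybrid aggregation idea concrete, the sketch below shows one minimal, hypothetical arrangement: homogeneous "fast" workers average gradients with a synchronous MPI all-reduce among themselves, while "slow" workers push gradients to a parameter-server rank. The rank assignment, learning rate, and NumPy stand-in gradients are illustrative assumptions, not the paper's actual TensorFlow implementation.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Hypothetical role assignment (illustrative, not the paper's layout):
# rank 0 acts as a parameter server for "slow" GPUs; odd ranks form a
# homogeneous "fast" group that synchronizes gradients with all-reduce.
FAST = [r for r in range(1, size) if r % 2 == 1]
SLOW = [r for r in range(1, size) if r % 2 == 0]

fast_comm = MPI.COMM_NULL
if rank in FAST:
    fast_comm = comm.Create_group(comm.Get_group().Incl(FAST))

dim, lr = 4, 0.01
params = np.zeros(dim)

for step in range(3):
    grad = np.random.rand(dim)  # stand-in for a locally computed gradient
    if rank in FAST:
        # Synchronous all-reduce among the homogeneous fast workers.
        total = np.empty(dim)
        fast_comm.Allreduce(grad, total, op=MPI.SUM)
        params -= lr * total / fast_comm.Get_size()
    elif rank in SLOW:
        # Slow worker: push its gradient to the parameter server (rank 0).
        comm.Send(grad, dest=0, tag=step)
    else:
        # Rank 0, parameter server: apply each slow-worker gradient; a
        # truly asynchronous server would use non-blocking receives.
        for src in SLOW:
            buf = np.empty(dim)
            comm.Recv(buf, source=src, tag=step)
            params -= lr * buf

if rank == 0:
    print("parameter-server view of the model:", params)
```

Run, for example, with `mpirun -np 5 python hybrid_aggregation.py` (the script name is arbitrary): ranks 1 and 3 form the all-reduce group, while ranks 2 and 4 report to the parameter server at rank 0.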
Pages: 2287-2300
Number of pages: 14
Related papers
50 records in total
  • [41] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
    Choi, Seungbeom
    Lee, Sunho
    Kim, Yeonjae
    Park, Jongse
    Kwon, Youngjin
    Huh, Jaehyuk
    PROCEEDINGS OF THE 2022 USENIX ANNUAL TECHNICAL CONFERENCE, 2022: 199-215
  • [42] Multi-CPU/Multi-GPU Based Framework for Multimedia Processing
    Mahmoudi, Sidi Ahmed
    Manneback, Pierre
    COMPUTER SCIENCE AND ITS APPLICATIONS, CIIA 2015, 2015, 456: 54-65
  • [43] Distributed Multi-GPU Community Detection on Exascale Computing Platforms
    Sattar, Naw Safrin
    Lu, Hao
    Wang, Feiyi
    Halappanavar, Mahantesh
    2024 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW 2024, 2024: 815-824
  • [45] Hierarchical Heterogeneous Cluster Systems for Scalable Distributed Deep Learning
    Wang, Yibo
    Geng, Tongsheng
    Silva, Ericson
    Gaudiot, Jean-Luc
    2024 IEEE 27TH INTERNATIONAL SYMPOSIUM ON REAL-TIME DISTRIBUTED COMPUTING, ISORC 2024, 2024
  • [46] Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster
    Wang, Xian
    Aoki, Takayuki
    PARALLEL COMPUTING, 2011, 37 (09): 521-535
  • [47] PARTANS: An Autotuning Framework for Stencil Computation on Multi-GPU Systems
    Lutz, Thibaut
    Fensch, Christian
    Cole, Murray
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2013, 9 (04)
  • [48] Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures
    Navarro, Angeles
    Vilches, Antonio
    Corbera, Francisco
    Asenjo, Rafael
    JOURNAL OF SUPERCOMPUTING, 2014, 70 (02): 756-771
  • [49] A Multi-GPU Framework for In-Memory Text Data Analytics
    Chong, Poh Kit
    Karuppiah, Ettikan K.
    Yong, Keh Kok
    2013 IEEE 27TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS WORKSHOPS (WAINA), 2013: 1411-1416
  • [50] Fast STA Graph Partitioning Framework for Multi-GPU Acceleration
    Guo, Guannan
    Huang, Tsung-Wei
    Wong, Martin
    2023 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2023