Accelerating Training for Distributed Deep Neural Networks in MapReduce

Cited: 0
|
Authors
Xu, Jie [1 ]
Wang, Jingyu [1 ]
Qi, Qi [1 ]
Sun, Haifeng [1 ]
Liao, Jianxin [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
Source
WEB SERVICES - ICWS 2018 | 2018 / Vol. 10966
Funding
National Natural Science Foundation of China;
Keywords
Deep Neural Networks; Parallel training; MapReduce; Data transmission; Synchronization; DATA LOCALITY; PARALLEL;
DOI
10.1007/978-3-319-94289-6_12
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Parallel training is widely used to reduce the training time of Deep Neural Networks (DNNs): the training data sets and the layer-wise training processes of a DNN are distributed across multiple Graphics Processing Units (GPUs). However, deploying parallel training in GPU cloud services faces obstacles. A DNN has a tightly coupled layered structure in which each layer consumes the output of the preceding layer, so large intermediate outputs must inevitably be transmitted between the separated layer-training processes. Because cloud computing separates storage services from computing services, this data transmission over the network degrades training performance, making parallel training inefficient in a GPU cloud environment. In this paper, we construct a distributed DNN training architecture that implements parallel DNN training in MapReduce and provisions GPU cloud resources as a web service. We further address the data-transmission concern by proposing a distributed DNN scheduler that accelerates training: it assigns GPU resources with a minimum-cost-flow algorithm, factoring data locality and synchronization into the minimization of training time. Experimental results show that, compared with the original schedulers, the distributed DNN scheduler reduces training time by 50% while incurring the least data transmission and keeping parallel training synchronized.
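The abstract's core scheduling idea, assigning layer-training tasks to GPUs via a minimum-cost-flow formulation in which edge costs penalize non-local data placement, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node layout (source, tasks, GPUs, sink), the `TRANSFER` cost, and the per-GPU slot counts are all hypothetical assumptions. The solver is a standard successive-shortest-path min-cost max-flow using SPFA/Bellman-Ford.

```python
from collections import deque

class MinCostFlow:
    """Successive-shortest-path min-cost max-flow (SPFA for shortest paths)."""

    def __init__(self, n):
        self.n = n
        # Each edge is stored as a mutable list: [to, capacity, cost, rev_index],
        # where rev_index locates the paired reverse edge in graph[to].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def flow(self, s, t):
        sent, total_cost = 0, 0
        while True:
            # SPFA: cheapest augmenting path from s to t in the residual graph.
            dist = [float("inf")] * self.n
            prev_node = [-1] * self.n
            prev_edge = [-1] * self.n
            in_queue = [False] * self.n
            dist[s] = 0
            q = deque([s])
            in_queue[s] = True
            while q:
                u = q.popleft()
                in_queue[u] = False
                for i, (v, cap, cost, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev_node[v], prev_edge[v] = u, i
                        if not in_queue[v]:
                            q.append(v)
                            in_queue[v] = True
            if dist[t] == float("inf"):
                return sent, total_cost
            # Bottleneck capacity along the path, then augment.
            d, v = float("inf"), t
            while v != s:
                d = min(d, self.graph[prev_node[v]][prev_edge[v]][1])
                v = prev_node[v]
            v = t
            while v != s:
                e = self.graph[prev_node[v]][prev_edge[v]]
                e[1] -= d
                self.graph[v][e[3]][1] += d
                v = prev_node[v]
            sent += d
            total_cost += d * dist[t]

# Hypothetical instance: 3 layer-training tasks, 2 GPUs with 2 slots each.
S, T = 0, 6
tasks = [1, 2, 3]             # task node ids
gpus = {4: 2, 5: 2}           # GPU node id -> free slots
local = {1: 4, 2: 5, 3: 4}    # task -> GPU already holding its input data
TRANSFER = 10                 # assumed cost of pulling data over the network

mcf = MinCostFlow(7)
for task in tasks:
    mcf.add_edge(S, task, 1, 0)            # each task scheduled once
    for g in gpus:
        # Data-local placement is free; remote placement pays a transfer cost.
        mcf.add_edge(task, g, 1, 0 if local[task] == g else TRANSFER)
for g, slots in gpus.items():
    mcf.add_edge(g, T, slots, 0)           # GPU capacity limits concurrency

assigned, cost = mcf.flow(S, T)
print(assigned, cost)  # 3 tasks placed at total transfer cost 0 (all data-local)
```

Here every task can be placed where its data already resides, so the optimal flow has zero transfer cost; raising contention (fewer slots) would force some tasks onto remote GPUs and the solver would pick the cheapest such assignment, which mirrors the data-locality trade-off the paper's scheduler optimizes.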
Pages: 181-195
Number of pages: 15