Accelerating Training for Distributed Deep Neural Networks in MapReduce

Cited by: 0
Authors
Xu, Jie [1 ]
Wang, Jingyu [1 ]
Qi, Qi [1 ]
Sun, Haifeng [1 ]
Liao, Jianxin [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
Source
WEB SERVICES - ICWS 2018 | 2018 / Vol. 10966
Funding
National Natural Science Foundation of China;
Keywords
Deep Neural Networks; Parallel training; MapReduce; Data transmission; Synchronization; DATA LOCALITY; PARALLEL;
DOI
10.1007/978-3-319-94289-6_12
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Parallel training is prevalent in Deep Neural Networks (DNN) to reduce training time. In parallel training, the training data sets and layer-wise training processes of a DNN are assigned to multiple Graphics Processing Units (GPUs). However, there are obstacles to deploying parallel training in GPU cloud services. A DNN has a tightly dependent layered structure in which each layer feeds on the output of the previous layer, so transmitting large output data between the separated layer-wise training processes is unavoidable. Since cloud computing separates storage services from computing services, this data transmission over the network degrades training time, and parallel training becomes inefficient in a GPU cloud environment. In this paper, we construct a distributed DNN training architecture that implements parallel training for DNN in MapReduce. The architecture provisions GPU cloud resources as a web service. We also address the data-transmission concern by proposing a distributed DNN scheduler to accelerate training. The scheduler uses a minimum-cost-flow algorithm to assign GPU resources, taking data locality and synchronization into account to minimize training time. Compared with the original schedulers, experimental results show that the distributed DNN scheduler decreases training time by 50% while minimizing data transmission and synchronizing parallel training.
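The abstract does not detail how the minimum-cost-flow formulation is built; below is a minimal sketch of one plausible formulation, assuming a bipartite task-to-GPU flow network whose edge costs combine a data-locality penalty with a synchronization penalty. The task names, GPU names, slot counts, and cost constants are hypothetical placeholders, and networkx is used only for illustration; this is not the scheduler implemented in the paper.

# Illustrative sketch only (not the paper's implementation): GPU assignment as
# a minimum-cost-flow problem. All names and cost values are assumptions.
import networkx as nx

# Hypothetical layer-partition tasks mapped to the cluster node holding their
# input data, and GPUs mapped to the node where they reside.
tasks = {"layer1_part0": "nodeA", "layer1_part1": "nodeB", "layer2_part0": "nodeA"}
gpus = {"gpu0": "nodeA", "gpu1": "nodeB"}
GPU_SLOTS = 2      # concurrent tasks a GPU may accept (assumed)
REMOTE_COST = 10   # penalty for fetching input data over the network (assumed)
SYNC_COST = 1      # penalty approximating synchronization delay (assumed)

G = nx.DiGraph()
G.add_node("src", demand=-len(tasks))   # every task must be assigned
G.add_node("sink", demand=len(tasks))

for t in tasks:
    G.add_edge("src", t, capacity=1, weight=0)
for g in gpus:
    G.add_edge(g, "sink", capacity=GPU_SLOTS, weight=0)
for t, data_node in tasks.items():
    for g, gpu_node in gpus.items():
        # A data-local placement pays only the synchronization penalty;
        # a remote placement additionally pays the transfer penalty.
        cost = SYNC_COST + (0 if data_node == gpu_node else REMOTE_COST)
        G.add_edge(t, g, capacity=1, weight=cost)

flow = nx.min_cost_flow(G)
assignment = {t: g for t in tasks for g in gpus if flow[t].get(g, 0) == 1}
print(assignment)   # e.g. {'layer1_part0': 'gpu0', ...}

In this toy formulation, data-local placements are preferred because their edges carry only the synchronization cost, mirroring the abstract's stated goal of minimizing training time with the least data transmission.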
Pages: 181 - 195
Number of pages: 15