Accelerating Training for Distributed Deep Neural Networks in MapReduce

Cited by: 0
Authors
Xu, Jie [1 ]
Wang, Jingyu [1 ]
Qi, Qi [1 ]
Sun, Haifeng [1 ]
Liao, Jianxin [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
Source
WEB SERVICES - ICWS 2018 | 2018 / Vol. 10966
Funding
National Natural Science Foundation of China;
Keywords
Deep Neural Networks; Parallel training; MapReduce; Data transmission; Synchronization; DATA LOCALITY; PARALLEL;
DOI
10.1007/978-3-319-94289-6_12
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Parallel training is prevalent in Deep Neural Networks (DNN) to reduce training time. In parallel training, the training data sets and layer-wise training processes of a DNN are assigned to multiple Graphics Processing Units (GPUs). However, there are obstacles to deploying parallel training in GPU cloud services. A DNN has a tightly dependent layered structure in which each layer feeds on the output of the previous layer, so transmitting large output data between the separated layer-wise training processes is unavoidable. Since cloud computing separates storage services from computing services, this data transmission over the network degrades training time, and parallel training becomes inefficient in a GPU cloud environment. In this paper, we construct a distributed DNN training architecture that implements parallel training for DNN in MapReduce. The architecture provisions GPU cloud resources as a web service. We also address the data-transmission concern by proposing a distributed DNN scheduler to accelerate training. The scheduler uses a minimum-cost-flow algorithm to assign GPU resources, taking data locality and synchronization into account to minimize training time. Compared with the original schedulers, experimental results show that the distributed DNN scheduler decreases training time by 50% while minimizing data transmission and synchronizing parallel training.
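The abstract does not detail how the minimum-cost-flow formulation is built; below is a minimal sketch of one plausible formulation, assuming a bipartite task-to-GPU flow network whose edge costs combine a data-locality penalty with a synchronization penalty. The task names, GPU names, slot counts, and cost constants are hypothetical placeholders, and networkx is used only for illustration; this is not the scheduler implemented in the paper.

# Illustrative sketch only (not the paper's implementation): GPU assignment as
# a minimum-cost-flow problem. All names and cost values are assumptions.
import networkx as nx

# Hypothetical layer-partition tasks mapped to the cluster node holding their
# input data, and GPUs mapped to the node where they reside.
tasks = {"layer1_part0": "nodeA", "layer1_part1": "nodeB", "layer2_part0": "nodeA"}
gpus = {"gpu0": "nodeA", "gpu1": "nodeB"}
GPU_SLOTS = 2      # concurrent tasks a GPU may accept (assumed)
REMOTE_COST = 10   # penalty for fetching input data over the network (assumed)
SYNC_COST = 1      # penalty approximating synchronization delay (assumed)

G = nx.DiGraph()
G.add_node("src", demand=-len(tasks))   # every task must be assigned
G.add_node("sink", demand=len(tasks))

for t in tasks:
    G.add_edge("src", t, capacity=1, weight=0)
for g in gpus:
    G.add_edge(g, "sink", capacity=GPU_SLOTS, weight=0)
for t, data_node in tasks.items():
    for g, gpu_node in gpus.items():
        # A data-local placement pays only the synchronization penalty;
        # a remote placement additionally pays the transfer penalty.
        cost = SYNC_COST + (0 if data_node == gpu_node else REMOTE_COST)
        G.add_edge(t, g, capacity=1, weight=cost)

flow = nx.min_cost_flow(G)
assignment = {t: g for t in tasks for g in gpus if flow[t].get(g, 0) == 1}
print(assignment)   # e.g. {'layer1_part0': 'gpu0', ...}

In this toy formulation, data-local placements are preferred because their edges carry only the synchronization cost, mirroring the abstract's stated goal of minimizing training time with the least data transmission.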
Pages: 181 - 195
Number of pages: 15