Distributed Deep Learning Framework based on Shared Memory for Fast Deep Neural Network Training

Cited by: 0
Authors:
Lim, Eun-Ji [1 ]
Ahn, Shin-Young [1 ]
Park, Yoo-Mi [1 ]
Choi, Wan [1 ]
Affiliations:
[1] ETRI, High Performance Comp Res Grp, Daejeon, South Korea
Keywords:
Deep learning; machine learning; remote shared memory; distributed DNN training; distributed deep learning; TFSM;
DOI:
Not available
Chinese Library Classification (CLC):
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Subject Classification:
0808; 0809;
Abstract
In distributed deep neural network (DNN) training, the communication overhead caused by parameter sharing across multiple deep learning workers can become a performance bottleneck, so efficient parameter sharing is a crucial challenge for a distributed deep learning framework. In this paper, we propose a distributed deep learning framework called TFSM, which uses remote shared memory for efficient parameter sharing to accelerate distributed DNN training. TFSM is built on a remote shared memory framework that provides shared memory accessible from multiple machines at high speed, and it provides a new asynchronous parameter update method based on this remote shared memory. By training well-known deep learning models with 8 GPU workers, we confirmed that TFSM improves DNN training time compared to TensorFlow.
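The abstract above describes TFSM's key idea: workers share model parameters through a memory region visible to all of them and update it asynchronously, rather than exchanging gradients through explicit messages. The paper does not publish TFSM's API, so the following is only a minimal single-node sketch of that update pattern: Python's multiprocessing shared memory stands in for the remote shared memory layer, the gradient is a random placeholder, and every name in the sketch (worker, PARAM_COUNT, NUM_WORKERS, and so on) is hypothetical.

# Minimal single-node sketch of asynchronous parameter sharing through a
# shared memory buffer. TFSM itself uses *remote* shared memory reachable
# from multiple machines and integrates with TensorFlow; here Python's
# multiprocessing.shared_memory merely stands in for that layer, and the
# worker logic only illustrates the barrier-free update pattern.
import numpy as np
from multiprocessing import Process, shared_memory

PARAM_COUNT = 1024        # size of the flattened model parameters (assumed)
LEARNING_RATE = 0.01
STEPS_PER_WORKER = 100
NUM_WORKERS = 8           # mirrors the 8 GPU workers in the paper's experiments

def worker(shm_name: str, worker_id: int) -> None:
    """Pull the latest parameters, compute a local update, and push the
    result back without waiting for any other worker (asynchronous update)."""
    shm = shared_memory.SharedMemory(name=shm_name)
    params = np.ndarray((PARAM_COUNT,), dtype=np.float64, buffer=shm.buf)
    rng = np.random.default_rng(worker_id)
    for _ in range(STEPS_PER_WORKER):
        local = params.copy()                     # pull current global parameters
        grad = rng.standard_normal(PARAM_COUNT)   # placeholder for a real gradient
        params[:] = local - LEARNING_RATE * grad  # push the update asynchronously
    shm.close()

if __name__ == "__main__":
    # The shared buffer plays the role of the remote shared memory region
    # that holds the global parameters.
    shm = shared_memory.SharedMemory(create=True, size=PARAM_COUNT * 8)
    params = np.ndarray((PARAM_COUNT,), dtype=np.float64, buffer=shm.buf)
    params[:] = 0.0

    procs = [Process(target=worker, args=(shm.name, i)) for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print("parameter norm after asynchronous updates:", np.linalg.norm(params))
    shm.close()
    shm.unlink()

In this pattern workers never wait for one another, so a slow worker does not stall the others; the trade-off is that concurrent writes can interleave, which is inherent to asynchronous parameter updates. In the actual framework, a remote shared memory layer accessible from multiple machines would take the place of the local SharedMemory object used here.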
Pages: 1239-1242
Number of pages: 4
Related Papers (50 items in total)
  • [1] Ahn, Shinyoung; Kim, Joongheon; Lim, Eunji; Kang, Sungwon. Soft Memory Box: A Virtual Shared Memory Framework for Fast Deep Neural Network Training in Distributed High Performance Computing. IEEE ACCESS, 2018, 6: 26493-26504.
  • [2] You, Yang; Zhang, Zhao; Hsieh, Cho-Jui; Demmel, James; Keutzer, Kurt. Fast Deep Neural Network Training on Distributed Systems and Cloud TPUs. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30(11): 2449-2462.
  • [3] Yang, Dongxu; Liu, Junhong; Qi, Jiaxing; Lai, Junjie. WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture. SC22: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2022.
  • [4] Gu, Bontak; Kong, Joonho; Munir, Arslan; Kim, Young Geun. A Framework for Distributed Deep Neural Network Training with Heterogeneous Computing Platforms. 2019 IEEE 25TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2019: 430-437.
  • [5] Rasch, Malte J.; Carta, Fabio; Fagbohungbe, Omobayode; Gokmen, Tayfun. Fast and robust analog in-memory deep neural network training. NATURE COMMUNICATIONS, 2024, 15(1).
  • [6] Shilova, Alena. Memory Efficient Deep Neural Network Training. EURO-PAR 2021: PARALLEL PROCESSING WORKSHOPS, 2022, 13098: 515-519.
  • [7] Zhu, Hongrui; Yuan, Guojun; Yao, Chengji; Tan, Guangming; Wang, Zhan; Hu, Zhongzhe; Zhang, Xiaoyang; An, Xuejun. Survey on Network of Distributed Deep Learning Training. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58(1): 98-115.
  • [8] Li, Dongsheng; Li, Shengwei; Lai, Zhiquan; Fu, Yongquan; Ye, Xiangyu; Cai, Lei; Qiao, Linbo. A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35(4): 577-591.
  • [9] Ahn, Shinyoung; Lim, Eunji. SoftMemoryBox II: A Scalable, Shared Memory Buffer Framework for Accelerating Distributed Training of Large-Scale Deep Neural Networks. IEEE ACCESS, 2020, 8: 207097-207111.
  • [10] Lim, Eun-Ji; Ahn, Shin-Young. Deep Learning Framework using Scalable Shared Memory Buffer Framework. 2021 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2021.