Model Parameter Prediction Method for Accelerating Distributed DNN Training

Cited: 0
Authors
Liu, Wai-xi [1]
Chen, Dao-xiao [3]
Tan, Miao-quan [3]
Chen, Kong-yang [4]
Yin, Yue [3]
Shang, Wen-Li [3]
Li, Jin [4]
Cai, Jun [2]
Affiliations
[1] Guangzhou Univ, Dept Elect & Commun Engn, Guangzhou, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangzhou, Peoples R China
[3] Guangzhou Univ, Guangzhou, Peoples R China
[4] Guangzhou Univ, Inst Artificial Intelligence, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Distributed training; Communication optimization; Parameter prediction; COMMUNICATION;
DOI
10.1016/j.comnet.2024.110883
Chinese Library Classification (CLC) number
TP3 [Computing technology, Computer technology];
Discipline classification code
0812;
Abstract
As deep neural network (DNN) models and datasets grow, distributed training has become a popular way to reduce training time. However, a severe communication bottleneck in distributed training limits its scalability. Many methods, such as gradient sparsification and quantization, aim to alleviate this bottleneck by reducing communication traffic, but they either sacrifice model accuracy or introduce substantial computational overhead. We observe that the data distributions of different layers of a neural network model are similar. We therefore propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework: workers push only a subset of model parameters to the PS, and the remaining parameters are predicted locally on the PS by an already-trained deep neural network model. We address several key challenges in this approach. First, we build a hierarchical parameter dataset by randomly sampling subsets of model parameters from ordinary distributed training runs. Second, using a prediction-result-based evaluation method, we design a neural network model with a "convolution + channel attention + max pooling" structure to predict model parameters. For VGGNet, ResNet, and AlexNet on the CIFAR10 and CIFAR100 datasets, compared with the baseline, Top-k, deep gradient compression (DGC), and the weight nowcaster network (WNN), MP2 reduces communication traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
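To make the "convolution + channel attention + max pooling" predictor concrete, the sketch below is a minimal, hypothetical PyTorch illustration of such a parameter-prediction network: it maps the flattened parameter subset a worker pushes to the PS onto a prediction of the remaining parameters. The `ChannelAttention` design, all layer sizes, and the class names are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a "convolution + channel attention + max pooling" parameter
# predictor in the spirit of MP2's description. Layer sizes, the attention
# design, and all names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed design)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); weight each channel by a learned score
        w = self.fc(x.mean(dim=-1))  # global average pool over the length axis
        return x * w.unsqueeze(-1)


class ParamPredictor(nn.Module):
    """Predicts a flattened block of residual parameters from the pushed subset."""

    def __init__(self, in_len: int, out_len: int, channels: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),  # convolution
            nn.ReLU(inplace=True),
            ChannelAttention(channels),                        # channel attention
            nn.MaxPool1d(kernel_size=2),                       # max pooling
        )
        self.head = nn.Linear(channels * (in_len // 2), out_len)

    def forward(self, pushed_subset: torch.Tensor) -> torch.Tensor:
        # pushed_subset: (batch, in_len) flattened parameters sent by a worker
        h = self.features(pushed_subset.unsqueeze(1))
        return self.head(h.flatten(start_dim=1))


if __name__ == "__main__":
    # Toy usage: predict 4096 "residual" parameters from a 1024-parameter subset.
    predictor = ParamPredictor(in_len=1024, out_len=4096)
    subset = torch.randn(8, 1024)
    residual = predictor(subset)
    print(residual.shape)  # torch.Size([8, 4096])
```

In the method as described, the PS would combine the pushed subset with the predicted residual parameters to form the full model update; the toy shapes above are only for demonstration.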
Pages: 15
Related papers (50 in total)
  • [1] Accelerating Distributed DNN Training via Transport Layer Scheduling
    Duan, Qingyang
    Peng, Chao
    Wang, Zeqin
    Xu, Yuedong
    Liu, Shaoteng
    Wu, Jun
    Lui, John C. S.
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (05) : 1650 - 1666
  • [2] Accelerating Training of DNN in Distributed Machine Learning System with Shared Memory
    Lim, Eun-Ji
    Ahn, Shin-Young
    Choi, Wan
    2017 INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY CONVERGENCE (ICTC), 2017, : 1209 - 1212
  • [3] PipePar: A Pipelined Hybrid Parallel Approach for Accelerating Distributed DNN Training
    Li, Jiange
    Wang, Yuchen
    Zhang, Jinghui
    Jin, Jiahui
    Dong, Fang
    Qian, Lei
    PROCEEDINGS OF THE 2021 IEEE 24TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2021, : 470 - 475
  • [4] DAPP: Accelerating Training of DNN
    Sapna
    Sreenivasalu, N. S.
    Paul, Kolin
    IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 867 - 872
  • [5] Fast Performance Prediction for Efficient Distributed DNN Training
    Yun, Yugyoung
    Park, Eunhyeok
    IEEE COMPUTER ARCHITECTURE LETTERS, 2023, 22 (02) : 133 - 136
  • [6] A Survey on Performance Modeling and Prediction for Distributed DNN Training
    Guo, Zhenhua
    Tang, Yinan
    Zhai, Jidong
    Yuan, Tongtong
    Jin, Jian
    Wang, Li
    Zhao, Yaqian
    Li, Rengang
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35 (12) : 2463 - 2478
  • [7] ADA-GP: Accelerating DNN Training By Adaptive Gradient Prediction
    Janfaza, Vahid
    Mandal, Shantanu
    Mahmud, Farabi
    Muzahid, Abdullah
    56TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, MICRO 2023, 2023, : 1092 - 1105
  • [8] EDDIS: Accelerating Distributed Data-Parallel DNN Training for Heterogeneous GPU Cluster
    Ahn, Shinyoung
    Ahn, Hooyoung
    Choi, Hyeonseong
    Lee, Jaehyun
    2024 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW 2024, 2024, : 1167 - 1168
  • [9] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters
    Jiang, Yimin
    Zhu, Yibo
    Lan, Chang
    Yi, Bairen
    Cui, Yong
    Guo, Chuanxiong
    PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20), 2020, : 463 - 479
  • [10] Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training
    Shin, Dongyeob
    Kim, Geonho
    Jo, Joongho
    Park, Jongsun
    PROCEEDINGS OF THE 2020 57TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2020,