Model Parameter Prediction Method for Accelerating Distributed DNN Training

Cited by: 0
Authors
Liu, Wai-xi [1]
Chen, Dao-xiao [3]
Tan, Miao-quan [3]
Chen, Kong-yang [4]
Yin, Yue [3]
Shang, Wen-Li [3]
Li, Jin [4]
Cai, Jun [2]
Affiliations
[1] Guangzhou Univ, Dept Elect & Commun Engn, Guangzhou, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangzhou, Peoples R China
[3] Guangzhou Univ, Guangzhou, Peoples R China
[4] Guangzhou Univ, Inst Artificial Intelligence, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed training; Communication optimization; Parameter prediction; COMMUNICATION;
DOI
10.1016/j.comnet.2024.110883
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Subject classification code
0812
Abstract
As the size of deep neural network (DNN) models and datasets increases, distributed training has become popular as a way to reduce training time. However, a severe communication bottleneck limits its scalability. Many methods aim to address this bottleneck by reducing communication traffic, such as gradient sparsification and quantization, but they either sacrifice model accuracy or introduce substantial computing overhead. We have observed that the data distributions across the layers of a neural network model are similar. Thus, we propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework: workers push only a subset of model parameters to the PS, and the remaining parameters are predicted locally on the PS by an already-trained deep neural network model. We address several key challenges in this approach. First, we build a hierarchical parameter dataset by randomly sampling subsets of model parameters from normal distributed training runs. Second, we design a neural network model with a "convolution + channel attention + max pooling" structure for predicting model parameters, using a prediction-result-based evaluation method. For VGGNet, ResNet, and AlexNet models on the CIFAR10 and CIFAR100 datasets, compared with the baseline, Top-k, deep gradient compression (DGC), and weight nowcaster network (WNN), MP2 reduces communication traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
Pages: 15
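
The abstract above describes the core MP2 mechanism: workers push only a subset of parameters, and a pre-trained predictor on the PS fills in the rest using a "convolution + channel attention + max pooling" network. The following is a minimal, illustrative PyTorch sketch of such a predictor, not the authors' implementation; the layer sizes, the squeeze-and-excitation-style attention variant, and names such as ParameterPredictor are assumptions made for illustration only.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention (assumed variant).
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); squeeze over length, re-weight channels
        w = self.fc(x.mean(dim=-1))        # (batch, channels)
        return x * w.unsqueeze(-1)


class ParameterPredictor(nn.Module):
    # Predicts a block of non-transmitted parameters from a transmitted block.
    def __init__(self, in_len: int = 1024, out_len: int = 1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),  # convolution
            nn.ReLU(inplace=True),
            ChannelAttention(16),                        # channel attention
            nn.MaxPool1d(kernel_size=2),                 # max pooling
        )
        self.head = nn.Linear(16 * (in_len // 2), out_len)

    def forward(self, sent_params: torch.Tensor) -> torch.Tensor:
        # sent_params: (batch, in_len) -- the subset a worker pushed to the PS
        h = self.features(sent_params.unsqueeze(1))
        return self.head(h.flatten(1))     # predicted residual parameters


if __name__ == "__main__":
    predictor = ParameterPredictor()
    pushed = torch.randn(8, 1024)          # parameter chunks received at the PS
    predicted = predictor(pushed)          # locally predicted parameter chunks
    print(predicted.shape)                 # torch.Size([8, 1024])

In this sketch the PS would concatenate the pushed subset with the predicted residual parameters to form the full model update, which is the traffic-reduction idea the abstract attributes to MP2.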