Model Parameter Prediction Method for Accelerating Distributed DNN Training

Cited: 0
Authors
Liu, Wai-xi [1 ]
Chen, Dao-xiao [3 ]
Tan, Miao-quan [3 ]
Chen, Kong-yang [4 ]
Yin, Yue [3 ]
Shang, Wen-Li [3 ]
Li, Jin [4 ]
Cai, Jun [2 ]
Affiliations
[1] Guangzhou Univ, Dept Elect & Commun Engn, Guangzhou, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangzhou, Peoples R China
[3] Guangzhou Univ, Guangzhou, Peoples R China
[4] Guangzhou Univ, Inst Artificial Intelligence, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed training; Communication optimization; Parameter prediction; Communication
DOI
10.1016/j.comnet.2024.110883
CLC Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
As the sizes of deep neural network (DNN) models and datasets grow, distributed training has become a popular way to reduce training time. However, a severe communication bottleneck limits the scalability of distributed training. Many methods try to relieve this bottleneck by reducing communication traffic, for example through gradient sparsification and quantization, but they either sacrifice model accuracy or introduce substantial computational overhead. We observe that the data distributions across layers of a neural network model are similar. Based on this observation, we propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework: workers push only a subset of model parameters to the PS, and the remaining parameters are predicted locally on the PS by an already-trained deep neural network model. We address several key challenges in this approach. First, we build a hierarchical parameter dataset by randomly sampling subsets of model parameters from ordinary distributed training runs. Second, we design a neural network model with a "convolution + channel attention + max pooling" structure to predict model parameters, and evaluate it with a prediction-result-based evaluation method. For VGGNet, ResNet, and AlexNet on the CIFAR10 and CIFAR100 datasets, and compared with the baseline, Top-k, deep gradient compression (DGC), and the weight nowcaster network (WNN), MP2 reduces traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
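As a reading aid only: the following PyTorch sketch illustrates the kind of "convolution + channel attention + max pooling" prediction network the abstract describes, i.e., a model that maps a pushed subset of a layer's parameters to an estimate of the full parameter set on the PS. It is not the authors' implementation; the class names (ChannelAttention, ParamPredictor), layer sizes, and the squeeze-and-excitation style of attention are all assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention (assumed design).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, length)
        w = self.fc(x.mean(dim=-1))        # global average pooling over length
        return x * w.unsqueeze(-1)         # re-weight each channel

class ParamPredictor(nn.Module):
    # Maps a sampled 1-D slice of parameters to a predicted full slice.
    def __init__(self, in_len, out_len, channels=16):
        super().__init__()                 # in_len is assumed to be even
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),  # convolution
            nn.ReLU(inplace=True),
            ChannelAttention(channels),                        # channel attention
            nn.MaxPool1d(kernel_size=2),                       # max pooling
            nn.Flatten(),
            nn.Linear(channels * (in_len // 2), out_len),      # regression head
        )

    def forward(self, sampled_params):     # sampled_params: (batch, 1, in_len)
        return self.net(sampled_params)

# Toy usage: predict 2048 parameters of a layer from 512 pushed ones.
predictor = ParamPredictor(in_len=512, out_len=2048)
predicted = predictor(torch.randn(8, 1, 512))   # -> shape (8, 2048)

In this sketch of the PS workflow, a worker would push only the sampled slice, and the PS would run the predictor to fill in the remaining parameters before updating the global model; the actual sampling scheme and the prediction-result-based evaluation are detailed in the paper itself.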
Pages: 15
Related Papers
50 records in total
  • [21] Wu, Zhonghui; Xu, Changqiao; Wang, Mu; Ma, Yunxiao; Wu, Zhongrui; Xiahou, Zhenyu; Grieco, Luigi Alfredo. Patronus: Countering Model Poisoning Attacks in Edge Distributed DNN Training. 2024 IEEE Wireless Communications and Networking Conference (WCNC 2024), 2024.
  • [22] Duan, Jiangfei; Li, Xiuhong; Xu, Ping; Zhang, Xingcheng; Yan, Shengen; Liang, Yun; Lin, Dahua. Proteus: Simulating the Performance of Distributed DNN Training. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(10): 1867-1878.
  • [23] Lu, Guandong; Chen, Runzhe; Wang, Yakai; Zhou, Yangjie; Zhang, Rui; Hu, Zheng; Miao, Yanming; Cai, Zhifang; Li, Li; Leng, Jingwen; Guo, Minyi. DistSim: A Performance Model of Large-Scale Hybrid Distributed DNN Training. Proceedings of the 20th ACM International Conference on Computing Frontiers (CF 2023), 2023: 112-122.
  • [24] Ren, Jinke; Yu, Guanding; Ding, Guangyao. Accelerating DNN Training in Wireless Federated Edge Learning Systems. IEEE Journal on Selected Areas in Communications, 2021, 39(1): 219-232.
  • [25] Jain, Arpan; Alnaasan, Nawras; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar K. Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters Using BlueField-2 DPUs. 2021 IEEE Symposium on High-Performance Interconnects (HOTI 2021), 2021: 17-24.
  • [26] Wang, Yanhong; Guan, Tianchan; Niu, Dimin; Zou, Qiaosha; Zheng, Hongzhong; Shi, C.-J. Richard; Xie, Yuan. Accelerating Distributed GNN Training by Codes. IEEE Transactions on Parallel and Distributed Systems, 2023, 34(9): 2598-2614.
  • [27] Cho, Eunho; Yoon, Juyeon; Daehyeon, K.; Lee, Dongman; Bae, Doo-Hwan. DNN Model Deployment on Distributed Edges. ICWE 2021 International Workshops, 2022, 1508: 15-26.
  • [28] Zhang, Zhaorui; Ji, Zhuoran; Wang, Choli. Momentum-Driven Adaptive Synchronization Model for Distributed DNN Training on HPC Clusters. Journal of Parallel and Distributed Computing, 2022, 159: 65-84.
  • [29] Li, Pengzhen; Seferoglu, Hulya; Dasarit, Venkat R.; Koyuncu, Erdem. Model-Distributed DNN Training for Memory-Constrained Edge Computing Devices. 2021 27th IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN), 2021.
  • [30] Peng, Yanghua; Zhu, Yibo; Chen, Yangrui; Bao, Yixin; Yi, Bairen; Lan, Chang; Wu, Chuan; Guo, Chuanxiong. A Generic Communication Scheduler for Distributed DNN Training Acceleration. Proceedings of the Twenty-Seventh ACM Symposium on Operating Systems Principles (SOSP '19), 2019: 16-29.