Model Parameter Prediction Method for Accelerating Distributed DNN Training

Cited: 0
Authors
Liu, Wai-xi [1]
Chen, Dao-xiao [3]
Tan, Miao-quan [3]
Chen, Kong-yang [4]
Yin, Yue [3]
Shang, Wen-Li [3]
Li, Jin [4]
Cai, Jun [2]
Affiliations
[1] Guangzhou Univ, Dept Elect & Commun Engn, Guangzhou, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangzhou, Peoples R China
[3] Guangzhou Univ, Guangzhou, Peoples R China
[4] Guangzhou Univ, Inst Artificial Intelligence, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Distributed training; Communication optimization; Parameter prediction; COMMUNICATION;
DOI
10.1016/j.comnet.2024.110883
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
As the size of deep neural network (DNN) models and datasets increases, distributed training has become a popular way to reduce training time. However, a severe communication bottleneck limits its scalability. Many methods address this bottleneck by reducing communication traffic, such as gradient sparsification and quantization, but they either sacrifice model accuracy or introduce substantial computing overhead. We observe that the data distributions of parameters across the layers of neural network models are similar. Thus, we propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework, in which workers push only a subset of model parameters to the PS and the residual model parameters are predicted locally on the PS by an already-trained deep neural network model. We address several key challenges in this approach. First, we build a hierarchical parameter dataset by randomly sampling subsets of model parameters from normal distributed training runs. Second, we design a neural network model with a "convolution + channel attention + max pooling" structure for predicting model parameters, evaluated with a prediction-result-based method. For the VGGNet, ResNet, and AlexNet models on the CIFAR10 and CIFAR100 datasets, compared with the baseline, Top-k, deep gradient compression (DGC), and weight nowcaster network (WNN), MP2 reduces traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
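The abstract only names the predictor structure at a high level, so the following is a minimal, hypothetical PyTorch sketch of a "convolution + channel attention + max pooling" predictor that maps a pushed parameter subset to the residual parameters on the PS. The class names (ParameterPredictor, ChannelAttention), the squeeze-and-excitation-style attention variant, and all layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the "convolution + channel attention + max pooling"
# predictor described in the abstract; all sizes and names are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed variant)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (batch, channels, length)
        w = self.fc(self.pool(x).squeeze(-1))      # per-channel weights: (batch, channels)
        return x * w.unsqueeze(-1)                 # re-weight channels


class ParameterPredictor(nn.Module):
    """Maps a flattened subset of pushed parameters to the residual parameters."""

    def __init__(self, in_len: int, out_len: int, channels: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            ChannelAttention(channels),
            nn.MaxPool1d(kernel_size=2),
        )
        self.head = nn.Linear(channels * (in_len // 2), out_len)

    def forward(self, subset):                     # subset: (batch, in_len)
        h = self.features(subset.unsqueeze(1))     # (batch, channels, in_len // 2)
        return self.head(h.flatten(1))             # predicted residual parameters


# PS-side usage sketch: predict the un-pushed parameters from the pushed subset.
predictor = ParameterPredictor(in_len=1024, out_len=4096)
pushed = torch.randn(8, 1024)                      # subset pushed by workers (dummy data)
predicted_residual = predictor(pushed)             # (8, 4096)
```

In the system the abstract describes, a predictor of this kind would presumably let the PS reconstruct the full parameter set from the pushed subset before the updated model is returned to the workers; the exact reconstruction and aggregation logic is given in the paper, not here.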
Pages: 15
Related Papers
50 records in total
  • [31] An Approach Towards Distributed DNN Training on FPGA Clusters
    Kreowsky, Philipp
    Knapheide, Justin
    Stabernack, Benno
    ARCHITECTURE OF COMPUTING SYSTEMS, ARCS 2024, 2024, 14842 : 18 - 32
  • [32] Efficient Pipeline Planning for Expedited Distributed DNN Training
    Luo, Ziyue
    Yi, Xiaodong
    Long, Guoping
    Fan, Shiqing
    Wu, Chuan
    Yang, Jun
    Lin, Wei
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2022), 2022, : 340 - 349
  • [33] NoSync: Particle Swarm Inspired Distributed DNN Training
    Isakov, Mihailo
    Kinsy, Michel A.
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2018, PT II, 2018, 11140 : 607 - 619
  • [34] Distriformer: Research on a Distributed Training Rockburst Prediction Method
    Zhang, Yu
    Fang, Kongyi
    Guo, Zhengjia
    PROCESSES, 2024, 12 (06)
  • [35] Prediction of blast furnace temperature based on the distributed parameter model
    Chen, Ming
    Yin, Yi-Xin
    Zhu, Qiao
    Zhang, Hai-Gang
    Kongzhi Lilun Yu Yingyong/Control Theory and Applications, 2014, 31 (09): : 1232 - 1237
  • [36] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
    Chen, Weijian
    He, Shuibing
    Xu, Yaowen
    Zhang, Xuechen
    Yang, Siling
    Hu, Shuang
    Sun, Xian-He
    Chen, Gang
    2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA, 2023, : 220 - 232
  • [37] Accelerating Distributed MoE Training and Inference with Lina
    Li, Jiamin
    Jiang, Yimin
    Zhu, Yibo
    Wang, Cong
    Xu, Hong
    PROCEEDINGS OF THE 2023 USENIX ANNUAL TECHNICAL CONFERENCE, 2023, : 945 - 959
  • [38] Accelerating Distributed Machine Learning by Smart Parameter Server
    Geng, Jinkun
    Li, Dan
    Wang, Shuai
    PROCEEDINGS OF THE 2019 ASIA-PACIFIC WORKSHOP ON NETWORKING (APNET '19), 2019, : 92 - 98
  • [39] Optimizing Resource Allocation in Pipeline Parallelism for Distributed DNN Training
    Duan, Yubin
    Wu, Jie
    2022 IEEE 28TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, ICPADS, 2022, : 161 - 168
  • [40] Preliminary Performance Analysis of Distributed DNN Training with Relaxed Synchronization
    Shirahata, Koichi
    Haderbache, Amir
    Fukumoto, Naoto
    Nakashima, Kohta
    IEICE TRANSACTIONS ON ELECTRONICS, 2021, E104C (06) : 257 - 260