Model Parameter Prediction Method for Accelerating Distributed DNN Training

Cited: 0
Authors
Liu, Wai-xi [1 ]
Chen, Dao-xiao [3 ]
Tan, Miao-quan [3 ]
Chen, Kong-yang [4 ]
Yin, Yue [3 ]
Shang, Wen-Li [3 ]
Li, Jin [4 ]
Cai, Jun [2 ]
Affiliations
[1] Guangzhou Univ, Dept Elect & Commun Engn, Guangzhou, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangzhou, Peoples R China
[3] Guangzhou Univ, Guangzhou, Peoples R China
[4] Guangzhou Univ, Inst Artificial Intelligence, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed training; Communication optimization; Parameter prediction; Communication
DOI
10.1016/j.comnet.2024.110883
CLC Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
As the sizes of deep neural network (DNN) models and datasets grow, distributed training has become a popular way to reduce training time. However, a severe communication bottleneck limits the scalability of distributed training. Many methods try to relieve this bottleneck by reducing communication traffic, for example through gradient sparsification and quantization, but they either sacrifice model accuracy or introduce substantial computational overhead. We observe that the data distributions across layers of a neural network model are similar. Based on this observation, we propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework: workers push only a subset of model parameters to the PS, and the remaining parameters are predicted locally on the PS by an already-trained deep neural network model. We address several key challenges in this approach. First, we build a hierarchical parameter dataset by randomly sampling subsets of model parameters from ordinary distributed training runs. Second, we design a neural network model with a "convolution + channel attention + max pooling" structure to predict model parameters, and evaluate it with a prediction-result-based evaluation method. For VGGNet, ResNet, and AlexNet on the CIFAR10 and CIFAR100 datasets, and compared with the baseline, Top-k, deep gradient compression (DGC), and the weight nowcaster network (WNN), MP2 reduces traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
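As a reading aid only: the following PyTorch sketch illustrates the kind of "convolution + channel attention + max pooling" prediction network the abstract describes, i.e., a model that maps a pushed subset of a layer's parameters to an estimate of the full parameter set on the PS. It is not the authors' implementation; the class names (ChannelAttention, ParamPredictor), layer sizes, and the squeeze-and-excitation style of attention are all assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention (assumed design).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, length)
        w = self.fc(x.mean(dim=-1))        # global average pooling over length
        return x * w.unsqueeze(-1)         # re-weight each channel

class ParamPredictor(nn.Module):
    # Maps a sampled 1-D slice of parameters to a predicted full slice.
    def __init__(self, in_len, out_len, channels=16):
        super().__init__()                 # in_len is assumed to be even
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),  # convolution
            nn.ReLU(inplace=True),
            ChannelAttention(channels),                        # channel attention
            nn.MaxPool1d(kernel_size=2),                       # max pooling
            nn.Flatten(),
            nn.Linear(channels * (in_len // 2), out_len),      # regression head
        )

    def forward(self, sampled_params):     # sampled_params: (batch, 1, in_len)
        return self.net(sampled_params)

# Toy usage: predict 2048 parameters of a layer from 512 pushed ones.
predictor = ParamPredictor(in_len=512, out_len=2048)
predicted = predictor(torch.randn(8, 1, 512))   # -> shape (8, 2048)

In this sketch of the PS workflow, a worker would push only the sampled slice, and the PS would run the predictor to fill in the remaining parameters before updating the global model; the actual sampling scheme and the prediction-result-based evaluation are detailed in the paper itself.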
Pages: 15
Related Papers
50 records in total
  • [21] Wu, Zhonghui; Xu, Changqiao; Wang, Mu; Ma, Yunxiao; Wu, Zhongrui; Xiahou, Zhenyu; Grieco, Luigi Alfredo. Patronus: Countering Model Poisoning Attacks in Edge Distributed DNN Training. 2024 IEEE Wireless Communications and Networking Conference (WCNC 2024), 2024.
  • [22] Duan, Jiangfei; Li, Xiuhong; Xu, Ping; Zhang, Xingcheng; Yan, Shengen; Liang, Yun; Lin, Dahua. Proteus: Simulating the Performance of Distributed DNN Training. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(10): 1867-1878.
  • [23] Lu, Guandong; Chen, Runzhe; Wang, Yakai; Zhou, Yangjie; Zhang, Rui; Hu, Zheng; Miao, Yanming; Cai, Zhifang; Li, Li; Leng, Jingwen; Guo, Minyi. DistSim: A Performance Model of Large-Scale Hybrid Distributed DNN Training. Proceedings of the 20th ACM International Conference on Computing Frontiers (CF 2023), 2023: 112-122.
  • [24] Ren, Jinke; Yu, Guanding; Ding, Guangyao. Accelerating DNN Training in Wireless Federated Edge Learning Systems. IEEE Journal on Selected Areas in Communications, 2021, 39(1): 219-232.
  • [25] Jain, Arpan; Alnaasan, Nawras; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar K. Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters Using BlueField-2 DPUs. 2021 IEEE Symposium on High-Performance Interconnects (HOTI 2021), 2021: 17-24.
  • [26] Wang, Yanhong; Guan, Tianchan; Niu, Dimin; Zou, Qiaosha; Zheng, Hongzhong; Shi, C.-J. Richard; Xie, Yuan. Accelerating Distributed GNN Training by Codes. IEEE Transactions on Parallel and Distributed Systems, 2023, 34(9): 2598-2614.
  • [27] Cho, Eunho; Yoon, Juyeon; Daehyeon, K.; Lee, Dongman; Bae, Doo-Hwan. DNN Model Deployment on Distributed Edges. ICWE 2021 International Workshops, 2022, 1508: 15-26.
  • [28] Zhang, Zhaorui; Ji, Zhuoran; Wang, Choli. Momentum-Driven Adaptive Synchronization Model for Distributed DNN Training on HPC Clusters. Journal of Parallel and Distributed Computing, 2022, 159: 65-84.
  • [29] Li, Pengzhen; Seferoglu, Hulya; Dasarit, Venkat R.; Koyuncu, Erdem. Model-Distributed DNN Training for Memory-Constrained Edge Computing Devices. 2021 27th IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN), 2021.
  • [30] Peng, Yanghua; Zhu, Yibo; Chen, Yangrui; Bao, Yixin; Yi, Bairen; Lan, Chang; Wu, Chuan; Guo, Chuanxiong. A Generic Communication Scheduler for Distributed DNN Training Acceleration. Proceedings of the Twenty-Seventh ACM Symposium on Operating Systems Principles (SOSP '19), 2019: 16-29.