NBSync: Parallelism of Local Computing and Global Synchronization for Fast Distributed Machine Learning in WANs

Cited by: 1
Authors
Zhou, Huaman [1 ]
Li, Zonghang [1 ]
Yu, Hongfang [1 ,2 ]
Luo, Long [1 ]
Sun, Gang [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Key Lab Opt Fiber Sensing & Commun, Minist Educ, Chengdu 610056, Sichuan, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Guangdong, Peoples R China
[3] Agile & Intelligent Comp Key Lab Sichuan Prov, Chengdu 610036, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed machine learning; federated learning; parameter server system; distributed optimization
DOI
10.1109/TSC.2023.3304312
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recently, due to privacy concerns, distributed machine learning in Wide-Area Networks (DML-WANs) has attracted increasing attention and has been widely deployed to support intelligent services that rely on geographically distributed data. DML-WANs essentially performs collaborative federated learning over a combination of edge and cloud servers on a large spatial scale. However, efficient model training is challenging for DML-WANs because it is hindered by the high overhead of synchronizing model parameters between computing servers over WANs. The root cause is the sequential dependency between local model computing and global model synchronization in traditional DML-WANs training methods such as FedAvg, which intrinsically makes each phase block the other. When computing heterogeneity and low WAN bandwidth coexist, long blocking on global model synchronization prolongs training time and leaves local computing resources underutilized. Despite many efforts to alleviate synchronization overhead with novel communication technologies and synchronization methods, such as FedAsync and ESync, these approaches still follow the traditional training pattern with its sequential dependency and therefore achieve very limited improvement. In this article, we propose NBSync, a novel training algorithm for DML-WANs that greatly speeds up model training by parallelizing local computing and global synchronization. NBSync employs a well-designed pipelining scheme that properly relaxes the sequential dependency between local computing and global synchronization and processes them in parallel, overlapping their overheads in the time dimension. NBSync also realizes flexible, differentiated, and dynamic local computing for workers to maximize the overlap ratio in dynamically heterogeneous training environments. Convergence analysis shows that the convergence rate of the NBSync training process is asymptotically equal to that of SSGD, and that NBSync has better convergence efficiency. We implemented a prototype of NBSync on top of a popular parameter server system, MXNET's PS-LITE library, and evaluated its performance on a DML-WANs testbed. Experimental results show that NBSync speeds up training by about 1.43x-2.79x compared with state-of-the-art distributed training algorithms (DTAs) in DML-WANs scenarios where computing heterogeneity and low WAN bandwidth coexist.
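To make the pipelining idea described in the abstract concrete, the following is a minimal, hypothetical Python sketch of overlapping local computing with global synchronization: a worker keeps taking local SGD steps on its model copy while the push/pull exchange with a parameter server runs in a background thread, so that communication overhead is hidden behind the next round of computation. The ParameterServer class, the worker function, the local_steps setting, the toy gradient, and the local/global mixing rule are all illustrative assumptions and are not taken from the article; NBSync's actual pipelining, staleness handling, and dynamic local-computing control are described only in the full paper.

# Hypothetical sketch of pipelined local computing and global synchronization.
# Not the authors' implementation; names and the mixing rule are illustrative.
import threading
import queue
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for a remote parameter server."""
    def __init__(self, dim):
        self.lock = threading.Lock()
        self.weights = np.zeros(dim)

    def push_pull(self, delta):
        # Apply a worker's accumulated local update and return the latest global model.
        with self.lock:
            self.weights += delta
            return self.weights.copy()

def worker(ps, dim, rounds=20, local_steps=10, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=dim)                 # fake "true" model used to generate labels
    model = ps.push_pull(np.zeros(dim))           # initial pull of the global model
    result_box = queue.Queue(maxsize=1)
    sync_thread = None

    for _ in range(rounds):
        start = model.copy()

        # ---- local computing: keep taking SGD steps on the local model copy ----
        for _ in range(local_steps):
            x = rng.normal(size=dim)              # fake data sample
            grad = (model @ x - w_star @ x) * x   # gradient of 0.5 * (w.x - y)^2
            model -= lr * grad

        delta = model - start                     # accumulated local update for this round

        # ---- global synchronization overlaps with the next round's local steps ----
        if sync_thread is not None:
            sync_thread.join()                    # previous round's sync had a full round to finish
            global_model = result_box.get()
            model = 0.5 * (model + global_model)  # illustrative mixing of local and global models

        sync_thread = threading.Thread(
            target=lambda d=delta: result_box.put(ps.push_pull(d)))
        sync_thread.start()

    sync_thread.join()
    return result_box.get()

if __name__ == "__main__":
    dim = 8
    ps = ParameterServer(dim)
    print("global model after training:", worker(ps, dim))

In a real deployment, the in-process ParameterServer would be replaced by remote push/pull calls (for example, over a parameter server system such as PS-LITE), which is where the overlap pays off: the WAN round-trip time of each synchronization is hidden behind the next round of local steps rather than blocking them.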
Pages: 4115-4127
Page count: 13