NBSync: Parallelism of Local Computing and Global Synchronization for Fast Distributed Machine Learning in WANs

Cited by: 1
Authors
Zhou, Huaman [1 ]
Li, Zonghang [1 ]
Yu, Hongfang [1 ,2 ]
Luo, Long [1 ]
Sun, Gang [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Key Lab Opt Fiber Sensing & Commun, Minist Educ, Chengdu 610056, Sichuan, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Guangdong, Peoples R China
[3] Agile & Intelligent Comp Key Lab Sichuan Prov, Chengdu 610036, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed machine learning; federated learning; parameter server system; distributed optimization
DOI
10.1109/TSC.2023.3304312
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recently, due to privacy concerns, distributed machine learning in Wide-Area Networks (DML-WANs) has attracted increasing attention and has been widely deployed to promote intelligence services that rely on geographically distributed data. DML-WANs essentially performs collaborative federated learning over a combination of edge and cloud servers on a large spatial scale. However, efficient model training is challenging for DML-WANs because it is blocked by the high overhead of model parameter synchronization between computing servers over WANs. The root cause is the sequential dependency between local model computing and global model synchronization in traditional DML-WANs training methods, e.g., FedAvg, which intrinsically produces sequential blocking between the two phases. When computing heterogeneity and low WAN bandwidth coexist, long blocking on global model synchronization prolongs the training time and leads to low utilization of local computing. Despite many efforts to alleviate synchronization overhead with novel communication technologies and synchronization methods, such as FedAsync and ESync, they still follow the traditional training pattern with its sequential dependency and thus achieve only limited improvements. In this article, we propose NBSync, a novel training algorithm for DML-WANs that greatly speeds up model training by parallelizing local computing and global synchronization. NBSync employs a well-designed pipelining scheme that properly relaxes the sequential dependency between local computing and global synchronization and processes them in parallel, so as to overlap their operating overheads in the time dimension. NBSync also realizes flexible, differentiated, and dynamic local computing for workers to maximize the overlap ratio in dynamically heterogeneous training environments. Convergence analysis shows that the convergence rate of the NBSync training process is asymptotically equal to that of SSGD, and that NBSync has better convergence efficiency. We implemented a prototype of NBSync on a popular parameter server system, MXNET's PS-LITE library, and evaluated its performance on a DML-WANs testbed. Experimental results show that NBSync speeds up training by about 1.43x-2.79x over state-of-the-art distributed training algorithms (DTAs) in DML-WANs scenarios where computing heterogeneity and low WAN bandwidth coexist.
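
To make the overlap idea in the abstract concrete, the following is a minimal, single-worker Python sketch of pipelining local computing with an in-flight push/pull to a parameter server. It is an illustration of the general technique only, not the authors' method: MockParameterServer, local_step, wan_latency_s, and local_steps_per_round are hypothetical names invented here, and the real NBSync prototype is built on MXNET's PS-LITE rather than Python threads.

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    class MockParameterServer:
        """Hypothetical stand-in for a parameter server reached over a WAN.
        Only the round-trip latency of a slow WAN push/pull is modelled."""
        def __init__(self, wan_latency_s=0.5):
            self.global_model = 0.0
            self.wan_latency_s = wan_latency_s

        def push_pull(self, local_update):
            time.sleep(self.wan_latency_s)        # WAN round trip
            self.global_model += local_update     # naive aggregation
            return self.global_model

    def local_step(model):
        """One local SGD step (stand-in: a small random perturbation)."""
        time.sleep(0.05)                          # local compute cost
        return model + random.uniform(-0.01, 0.01)

    def worker(ps, rounds=5, local_steps_per_round=8):
        model, pending = 0.0, None
        with ThreadPoolExecutor(max_workers=1) as pool:
            for _ in range(rounds):
                start = model
                # Local computing runs while the previous round's push/pull
                # is still in flight on the WAN -- the overlap idea.
                for _ in range(local_steps_per_round):
                    model = local_step(model)
                update = model - start

                # Block only if the in-flight synchronization has not yet
                # finished, then fold the pulled global model into the local one.
                if pending is not None:
                    model = 0.5 * model + 0.5 * pending.result()

                # Hand the new update to a background thread and keep computing.
                pending = pool.submit(ps.push_pull, update)
        print("final local model:", round(model, 4))

    if __name__ == "__main__":
        worker(MockParameterServer())

In this sketch, tuning local_steps_per_round per worker so that local compute time roughly matches the WAN round trip is one plausible reading of the "flexible, differentiated and dynamic local computing" that the abstract says NBSync uses to maximize the overlap ratio under heterogeneity.
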
Pages: 4115-4127
Number of pages: 13