Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
U.S. National Science Foundation;
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
The surging scale of distributed training (DT) incurs significant communication overhead in datacenters; a promising remedy is in-network aggregation (INA), which leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, yet network dynamics make asynchronous gradient arrival common, rendering existing solutions inefficient (e.g., incurring massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, which enables switches to efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate it on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT speeds up DT by 1.5x compared with state-of-the-art solutions.
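To make the scheduling idea concrete, below is a minimal, hypothetical Python sketch of the knapsack-based randomized-rounding step the abstract alludes to: given a fractional sub-model-to-switch assignment (e.g., from a relaxation of the scheduling program), each sub-model is rounded to a single aggregating switch subject to that switch's on-chip memory budget. All names (round_assignment, frac_assign, switch_capacity) and the greedy repair heuristic are illustrative assumptions, not the authors' actual algorithm.

```python
# Hypothetical sketch of knapsack-based randomized rounding for assigning
# sub-model aggregation to switches under on-chip memory budgets.
# This is NOT GOAT's implementation; names and the repair step are assumptions.

import random

def round_assignment(frac_assign, sub_model_sizes, switch_capacity, seed=0):
    """Round a fractional sub-model->switch assignment to an integral one.

    frac_assign[m][s]  : fraction of sub-model m assigned to switch s (rows sum to 1),
                         e.g. taken from a relaxed solution of the scheduling program.
    sub_model_sizes[m] : on-chip memory (aggregator slots) sub-model m needs.
    switch_capacity[s] : memory budget of switch s.
    """
    rng = random.Random(seed)
    num_switches = len(switch_capacity)
    used = [0.0] * num_switches
    assignment = {}

    for m, probs in enumerate(frac_assign):
        # Randomized-rounding step: sample one switch for sub-model m with
        # probability given by the fractional solution.
        s = rng.choices(range(num_switches), weights=probs, k=1)[0]
        # Assumed greedy repair: if the sampled switch would exceed its memory
        # budget, move the sub-model to a feasible switch with the most headroom.
        if used[s] + sub_model_sizes[m] > switch_capacity[s]:
            feasible = [t for t in range(num_switches)
                        if used[t] + sub_model_sizes[m] <= switch_capacity[t]]
            if feasible:
                s = max(feasible, key=lambda t: switch_capacity[t] - used[t])
        used[s] += sub_model_sizes[m]
        assignment[m] = s
    return assignment


if __name__ == "__main__":
    # Toy instance: 6 sub-models, 3 switches with equal memory budgets.
    sizes = [4, 3, 5, 2, 6, 1]
    capacity = [10, 10, 10]
    # Made-up fractional solution; in GOAT this would come from the relaxation.
    frac = [[0.6, 0.3, 0.1],
            [0.2, 0.5, 0.3],
            [0.1, 0.2, 0.7],
            [0.4, 0.4, 0.2],
            [0.3, 0.3, 0.4],
            [0.5, 0.25, 0.25]]
    print(round_assignment(frac, sizes, capacity))
```

The appeal of randomized rounding in this setting is that each switch's expected memory load matches the fractional solution, which is what typically allows an approximation guarantee to be argued; the greedy repair here is only an assumed way to restore feasibility when a sample overflows a switch.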
Pages: 3437-3452
Number of pages: 16
Related Papers
50 in total
  • [1] GRID: Gradient Routing With In-Network Aggregation for Distributed Training
    Fang, Jin
    Zhao, Gongming
    Xu, Hongli
    Wu, Changbo
    Yu, Zhuolong
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2023, 31 (05) : 2267 - 2280
  • [2] Straggler-Aware In-Network Aggregation for Accelerating Distributed Deep Learning
    Lee, Hochan
    Lee, Jaewook
    Kim, Heewon
    Pack, Sangheon
    IEEE TRANSACTIONS ON SERVICES COMPUTING, 2023, 16 (06) : 4198 - 4204
  • [3] Maximizing Aggregation Throughput for Distributed Training with Constrained In-Network Computing
    Luo, Long
    Yang, Shulin
    Wu, Hao
    Yu, Hongfang
    Lei, Bo
    Gao, Shuai
    ICC 2023 - IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, 2023: 3652 - 3657
  • [4] PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation
    Qiu, Yuhang
    Zhao, Gongming
    Xu, Hongli
    Huang, He
    Qiao, Chunming
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (05) : 4317 - 4332
  • [5] InGo: In-Network Aggregation Routing with Batch Size Adjustment for Distributed Training
    Bao, Jianfeng
    Zhao, Gongming
    Xu, Hongli
    Wang, Haibo
    Yang, Peng
    2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024
  • [6] NetEC: Accelerating Erasure Coding Reconstruction With In-Network Aggregation
    Qiao, Yi
    Zhang, Menghao
    Zhou, Yu
    Kong, Xiao
    Zhang, Han
    Bi, Jun
    Xu, Mingwei
    Wang, Jilong
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (10) : 2571 - 2583
  • [7] Accelerating Distributed Cloud Storage Systems with In-Network Computing
    Jiang, Wei
    Jiang, Hao
    Wu, Jing
    Chen, Qimei
    IEEE NETWORK, 2023, 37 (04): : 64 - 70
  • [8] Scaling Distributed Machine Learning with In-Network Aggregation
    Sapio, Amedeo
    Canini, Marco
    Ho, Chen-Yu
    Nelson, Jacob
    Kalnis, Panos
    Kim, Changhoon
    Krishnamurthy, Arvind
    Moshref, Masoud
    Ports, Dan R. K.
    Richtarik, Peter
    PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, 2021: 785 - 808
  • [9] Accelerating and Securing Federated Learning with Stateless In-network Aggregation at the Edge
    Xia, Junxu
    Wu, Wenfei
    Luo, Lailong
    Cheng, Geyao
    Guo, Deke
    Niar, Qifeng
    2024 IEEE 44TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS 2024: 692 - 702
  • [10] Accelerating LSH-based Distributed Search with In-network Computation
    Zhang, Penghao
    Pan, Heng
    Li, Zhenyu
    He, Peng
    Zhang, Zhibin
    Tyson, Gareth
    Xie, Gaogang
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2021), 2021