Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
US National Science Foundation;
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
The surging scale of distributed training (DT) incurs significant communication overhead in datacenters, and in-network aggregation (INA) is a promising remedy: it leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, yet network dynamics make asynchronous gradient arrivals common, which renders existing solutions inefficient (e.g., they incur massive communication overhead). To address this issue, we propose GOAT, a first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, which lets switches efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate it on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT speeds up DT by 1.5x compared with state-of-the-art solutions.
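The decision at the heart of the abstract, namely which switch exclusively aggregates which sub-model's gradients under tight on-chip memory budgets, can be pictured with a small sketch. The Python code below illustrates only the generic randomized-rounding step over an assumed fractional assignment; the names frac, capacities, and sizes are hypothetical inputs, and the sketch omits the knapsack-based LP formulation, worker-to-switch routing, and approximation analysis that the paper actually provides.

import random

def rounded_assignment(frac, capacities, sizes, seed=0):
    """Randomly round a fractional sub-model-to-switch assignment.

    frac[i][j]  -- fraction of sub-model i assigned to switch j (each row sums to 1)
    capacities  -- on-chip memory budget of each switch (same unit as sizes)
    sizes       -- memory footprint of aggregating each sub-model's gradients
    Returns assignment[i] = index of the switch that aggregates sub-model i,
    or None if the sub-model falls back to server-side aggregation because the
    sampled switch would exceed its remaining capacity.
    """
    rng = random.Random(seed)
    remaining = list(capacities)
    assignment = []
    for i, probs in enumerate(frac):
        # sample a switch with probability equal to its fractional share
        j = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if remaining[j] >= sizes[i]:
            remaining[j] -= sizes[i]
            assignment.append(j)
        else:
            assignment.append(None)  # fall back: aggregate at the server
    return assignment

# Toy example: 4 sub-models, 2 switches (all numbers are made up for illustration).
frac = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]]
sizes = [40, 30, 30, 20]
capacities = [64, 64]
print(rounded_assignment(frac, capacities, sizes))

In this toy run, a sub-model is aggregated in-network only while the chosen switch still has memory left; otherwise it falls back to server-side aggregation, mirroring (in a much simplified form) the memory-constrained assignment problem the abstract describes.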
Pages: 3437-3452
Page count: 16