Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited by: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
U.S. National Science Foundation;
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
CLC number
TP3 [Computing Technology; Computer Technology];
Discipline code
0812
Abstract
The surging scale of distributed training (DT) incurs significant communication overhead in datacenters, and a promising solution is in-network aggregation (INA), which leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, yet network dynamics commonly cause gradients to arrive asynchronously, making existing solutions inefficient (e.g., incurring massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, so that switches can efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate its performance on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT can speed up DT by 1.5x compared to state-of-the-art solutions.
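The abstract describes assigning sub-model gradients to memory-constrained switches via a knapsack-based randomized rounding algorithm. The sketch below is an illustrative simplification of that idea, not the paper's actual algorithm: it greedily packs sub-models onto switches under their memory capacities, with a random choice among feasible switches standing in for the rounding of a fractional solution. All names (`assign_submodels`, `submodel_sizes`, `switch_capacities`) are assumptions introduced here for illustration.

```python
import random

def assign_submodels(submodel_sizes, switch_capacities, seed=0):
    """Greedy knapsack-style assignment of sub-models to switches.

    Each switch exclusively aggregates a disjoint subset of sub-model
    gradients, subject to its on-chip memory capacity; a sub-model that
    fits on no switch falls back to server-side aggregation (None).
    """
    random.seed(seed)
    remaining = dict(switch_capacities)   # switch -> free memory
    assignment = {}                       # sub-model -> switch or None
    # Place larger sub-models first (a classic knapsack heuristic).
    for sm in sorted(submodel_sizes, key=submodel_sizes.get, reverse=True):
        feasible = [s for s, free in remaining.items()
                    if free >= submodel_sizes[sm]]
        if feasible:
            # Random choice among feasible switches stands in for the
            # paper's randomized rounding of a fractional solution.
            sw = random.choice(feasible)
            remaining[sw] -= submodel_sizes[sm]
            assignment[sm] = sw
        else:
            assignment[sm] = None         # aggregate at a server instead
    return assignment

if __name__ == "__main__":
    sizes = {"sm0": 4, "sm1": 3, "sm2": 5}
    caps = {"sw0": 6, "sw1": 6}
    print(assign_submodels(sizes, caps))
```

Every worker would then send the gradients of each sub-model to the switch recorded in `assignment`, so that all copies of a given sub-model's gradients meet at the same aggregation point regardless of when they arrive.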
Pages: 3437-3452
Page count: 16