Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited by: 0

Authors:

Fang, Jin [1]

Xu, Hongli [1]

Zhao, Gongming [1]

Yu, Zhuolong [2]

Shen, Bingchen [1]

Xie, Liguang [3]

Affiliations:

[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China

[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA

[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA

Source:

IEEE-ACM TRANSACTIONS ON NETWORKING | 2024 / Volume 32 / Issue 04

Funding:

U.S. National Science Foundation;

Keywords:

In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;

DOI:

10.1109/TNET.2024.3387948

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The growing scale of distributed training (DT) incurs significant communication overhead in datacenters. In-network aggregation (INA) is a promising solution: it leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. Such mechanisms require gradients to arrive at switches synchronously, but network dynamics commonly cause gradients to arrive asynchronously, which makes existing solutions inefficient (e.g., they incur massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, so that switches can efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate its performance on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT can speed up DT by 1.5× compared to state-of-the-art solutions.
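
The idea of exclusive per-switch aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not GOAT's implementation: the knapsack-based randomized rounding algorithm is not reproduced, and a simple round-robin placement stands in for it. All names (`partition`, `assign_round_robin`, `Switch`, `aggregate`) are hypothetical, and switch memory limits are not modeled.

```python
from collections import defaultdict


def partition(gradient, num_parts):
    """Split a flat gradient list into contiguous sub-model slices."""
    size = -(-len(gradient) // num_parts)  # ceiling division
    return [gradient[i:i + size] for i in range(0, len(gradient), size)]


def assign_round_robin(num_parts, num_switches):
    """Stand-in placement (an assumption, not the paper's algorithm):
    sub-model i is aggregated only by switch i % num_switches."""
    return {i: i % num_switches for i in range(num_parts)}


class Switch:
    """Toy aggregator that keeps one running sum per sub-model it owns."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.sums = {}
        self.seen = defaultdict(set)

    def receive(self, worker, part_id, values):
        """Add one worker's slice; return the sum once every worker has reported."""
        if worker in self.seen[part_id]:
            return None  # ignore a duplicate delivery
        self.seen[part_id].add(worker)
        acc = self.sums.setdefault(part_id, [0.0] * len(values))
        for k, v in enumerate(values):
            acc[k] += v
        if len(self.seen[part_id]) == self.num_workers:
            return acc
        return None


def aggregate(worker_grads, num_parts, num_switches):
    """Send each worker's slices to the owning switch and collect the sums."""
    owner = assign_round_robin(num_parts, num_switches)
    switches = [Switch(len(worker_grads)) for _ in range(num_switches)]
    result = {}
    # Workers are handled one after another, so slices of the same
    # sub-model reach their switch at different times.
    for w, grad in enumerate(worker_grads):
        for part_id, values in enumerate(partition(grad, num_parts)):
            done = switches[owner[part_id]].receive(w, part_id, values)
            if done is not None:
                result[part_id] = done
    return [x for p in sorted(result) for x in result[p]]
```

Because each sub-model has exactly one owning switch in this sketch, a switch can keep accumulating a slice until the last worker's contribution arrives, regardless of arrival order.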

Citation

Pages: 3437-3452

Page count: 16
