Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited by: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
National Science Foundation (USA);
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The growing scale of distributed training (DT) incurs significant communication overhead in datacenters. A promising solution is in-network aggregation (INA), which uses programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. Such mechanisms require gradients to arrive at switches synchronously, but network dynamics often cause gradients to arrive asynchronously, which makes existing solutions inefficient (e.g., they incur massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, which lets switches efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate it on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT speeds up DT by 1.5× compared with state-of-the-art solutions.
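The abstract names the technique (knapsack-based randomized rounding) but does not give the algorithm itself. The following Python sketch is only a generic illustration of the idea, not GOAT's actual method: it assumes a fractional solution `frac` is already available (for example, from a relaxed linear program), and the names `round_assignment`, `frac`, `sizes`, and `capacity` are hypothetical. Each sub-model is assigned to exactly one switch, sampled in proportion to its fractional weight, while a knapsack-style memory budget per switch is respected.

```python
import random


def round_assignment(frac, sizes, capacity, seed=0):
    """Assign each sub-model to one switch by randomized rounding.

    frac[j][s]  : fractional weight of sub-model j on switch s (assumed input)
    sizes[j]    : switch memory needed by sub-model j
    capacity[s] : memory budget of switch s
    Returns {sub_model: switch or None}.
    """
    rng = random.Random(seed)
    free = list(capacity)
    assignment = {}
    # Place larger sub-models first so they are less likely to be squeezed out.
    for j in sorted(range(len(sizes)), key=lambda j: -sizes[j]):
        fits = [s for s in range(len(free)) if free[s] >= sizes[j]]
        preferred = [s for s in fits if frac[j][s] > 0]
        if preferred:
            # Sample a switch with probability proportional to its weight.
            weights = [frac[j][s] for s in preferred]
            chosen = rng.choices(preferred, weights=weights, k=1)[0]
        elif fits:
            # No preferred switch has room: pick any switch that fits.
            chosen = rng.choice(fits)
        else:
            # No switch has room; left unassigned in this sketch.
            assignment[j] = None
            continue
        assignment[j] = chosen
        free[chosen] -= sizes[j]
    return assignment


# Toy instance: 4 sub-models, 3 switches (all numbers are made up).
frac = [[0.7, 0.3, 0.0], [0.2, 0.8, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
sizes = [4, 3, 2, 4]
capacity = [5, 5, 5]
plan = round_assignment(frac, sizes, capacity)
print(plan)
```

Because each sub-model ends up on a single switch, a worker in this sketch would send that sub-model's gradients to the switch recorded in `plan`. How GOAT actually chooses worker-to-switch routes is not described in the abstract.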
Pages: 3437-3452
Page count: 16