Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
U.S. National Science Foundation;
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
The surging scale of distributed training (DT) incurs significant communication overhead in datacenters; a promising remedy is in-network aggregation (INA), which leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, yet network dynamics make asynchronous gradient arrival common, rendering existing solutions inefficient (e.g., incurring massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, which enables switches to efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate it on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT speeds up DT by 1.5x compared with state-of-the-art solutions.
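To make the scheduling idea concrete, below is a minimal, hypothetical Python sketch of the knapsack-based randomized-rounding step the abstract alludes to: given a fractional sub-model-to-switch assignment (e.g., from a relaxation of the scheduling program), each sub-model is rounded to a single aggregating switch subject to that switch's on-chip memory budget. All names (round_assignment, frac_assign, switch_capacity) and the greedy repair heuristic are illustrative assumptions, not the authors' actual algorithm.

```python
# Hypothetical sketch of knapsack-based randomized rounding for assigning
# sub-model aggregation to switches under on-chip memory budgets.
# This is NOT GOAT's implementation; names and the repair step are assumptions.

import random

def round_assignment(frac_assign, sub_model_sizes, switch_capacity, seed=0):
    """Round a fractional sub-model->switch assignment to an integral one.

    frac_assign[m][s]  : fraction of sub-model m assigned to switch s (rows sum to 1),
                         e.g. taken from a relaxed solution of the scheduling program.
    sub_model_sizes[m] : on-chip memory (aggregator slots) sub-model m needs.
    switch_capacity[s] : memory budget of switch s.
    """
    rng = random.Random(seed)
    num_switches = len(switch_capacity)
    used = [0.0] * num_switches
    assignment = {}

    for m, probs in enumerate(frac_assign):
        # Randomized-rounding step: sample one switch for sub-model m with
        # probability given by the fractional solution.
        s = rng.choices(range(num_switches), weights=probs, k=1)[0]
        # Assumed greedy repair: if the sampled switch would exceed its memory
        # budget, move the sub-model to a feasible switch with the most headroom.
        if used[s] + sub_model_sizes[m] > switch_capacity[s]:
            feasible = [t for t in range(num_switches)
                        if used[t] + sub_model_sizes[m] <= switch_capacity[t]]
            if feasible:
                s = max(feasible, key=lambda t: switch_capacity[t] - used[t])
        used[s] += sub_model_sizes[m]
        assignment[m] = s
    return assignment


if __name__ == "__main__":
    # Toy instance: 6 sub-models, 3 switches with equal memory budgets.
    sizes = [4, 3, 5, 2, 6, 1]
    capacity = [10, 10, 10]
    # Made-up fractional solution; in GOAT this would come from the relaxation.
    frac = [[0.6, 0.3, 0.1],
            [0.2, 0.5, 0.3],
            [0.1, 0.2, 0.7],
            [0.4, 0.4, 0.2],
            [0.3, 0.3, 0.4],
            [0.5, 0.25, 0.25]]
    print(round_assignment(frac, sizes, capacity))
```

The appeal of randomized rounding in this setting is that each switch's expected memory load matches the fractional solution, which is what typically allows an approximation guarantee to be argued; the greedy repair here is only an assumed way to restore feasibility when a sample overflows a switch.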
Pages: 3437-3452
Number of pages: 16
Related Papers
50 in total
  • [1] GRID: Gradient Routing With In-Network Aggregation for Distributed Training
    Fang, Jin
    Zhao, Gongming
    Xu, Hongli
    Wu, Changbo
    Yu, Zhuolong
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2023, 31 (05) : 2267 - 2280
  • [2] Straggler-Aware In-Network Aggregation for Accelerating Distributed Deep Learning
    Lee, Hochan
    Lee, Jaewook
    Kim, Heewon
    Pack, Sangheon
    IEEE TRANSACTIONS ON SERVICES COMPUTING, 2023, 16 (06) : 4198 - 4204
  • [3] Maximizing Aggregation Throughput for Distributed Training with Constrained In-Network Computing
    Luo, Long
    Yang, Shulin
    Wu, Hao
    Yu, Hongfang
    Lei, Bo
    Gao, Shuai
    ICC 2023 - IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, 2023: 3652 - 3657
  • [4] PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation
    Qiu, Yuhang
    Zhao, Gongming
    Xu, Hongli
    Huang, He
    Qiao, Chunming
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (05) : 4317 - 4332
  • [5] InGo: In-Network Aggregation Routing with Batch Size Adjustment for Distributed Training
    Bao, Jianfeng
    Zhao, Gongming
    Xu, Hongli
    Wang, Haibo
    Yang, Peng
    2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024
  • [6] NetEC: Accelerating Erasure Coding Reconstruction With In-Network Aggregation
    Qiao, Yi
    Zhang, Menghao
    Zhou, Yu
    Kong, Xiao
    Zhang, Han
    Bi, Jun
    Xu, Mingwei
    Wang, Jilong
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (10) : 2571 - 2583
  • [7] Accelerating Distributed Cloud Storage Systems with In-Network Computing
    Jiang, Wei
    Jiang, Hao
    Wu, Jing
    Chen, Qimei
    IEEE NETWORK, 2023, 37 (04): : 64 - 70
  • [8] Scaling Distributed Machine Learning with In-Network Aggregation
    Sapio, Amedeo
    Canini, Marco
    Ho, Chen-Yu
    Nelson, Jacob
    Kalnis, Panos
    Kim, Changhoon
    Krishnamurthy, Arvind
    Moshref, Masoud
    Ports, Dan R. K.
    Richtarik, Peter
    PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, 2021: 785 - 808
  • [9] Accelerating and Securing Federated Learning with Stateless In-network Aggregation at the Edge
    Xia, Junxu
    Wu, Wenfei
    Luo, Lailong
    Cheng, Geyao
    Guo, Deke
    Niar, Qifeng
    2024 IEEE 44TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS 2024: 692 - 702
  • [10] Accelerating LSH-based Distributed Search with In-network Computation
    Zhang, Penghao
    Pan, Heng
    Li, Zhenyu
    He, Peng
    Zhang, Zhibin
    Tyson, Gareth
    Xie, Gaogang
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2021), 2021