Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
US National Science Foundation;
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
The surging scale of distributed training (DT) incurs significant communication overhead in datacenters, and in-network aggregation (INA) is a promising remedy: it leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, yet network dynamics make asynchronous gradient arrivals common, which renders existing solutions inefficient (e.g., they incur massive communication overhead). To address this issue, we propose GOAT, a first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, which lets switches efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate it on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT speeds up DT by 1.5x compared with state-of-the-art solutions.
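The decision at the heart of the abstract, namely which switch exclusively aggregates which sub-model's gradients under tight on-chip memory budgets, can be pictured with a small sketch. The Python code below illustrates only the generic randomized-rounding step over an assumed fractional assignment; the names frac, capacities, and sizes are hypothetical inputs, and the sketch omits the knapsack-based LP formulation, worker-to-switch routing, and approximation analysis that the paper actually provides.

import random

def rounded_assignment(frac, capacities, sizes, seed=0):
    """Randomly round a fractional sub-model-to-switch assignment.

    frac[i][j]  -- fraction of sub-model i assigned to switch j (each row sums to 1)
    capacities  -- on-chip memory budget of each switch (same unit as sizes)
    sizes       -- memory footprint of aggregating each sub-model's gradients
    Returns assignment[i] = index of the switch that aggregates sub-model i,
    or None if the sub-model falls back to server-side aggregation because the
    sampled switch would exceed its remaining capacity.
    """
    rng = random.Random(seed)
    remaining = list(capacities)
    assignment = []
    for i, probs in enumerate(frac):
        # sample a switch with probability equal to its fractional share
        j = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if remaining[j] >= sizes[i]:
            remaining[j] -= sizes[i]
            assignment.append(j)
        else:
            assignment.append(None)  # fall back: aggregate at the server
    return assignment

# Toy example: 4 sub-models, 2 switches (all numbers are made up for illustration).
frac = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]]
sizes = [40, 30, 30, 20]
capacities = [64, 64]
print(rounded_assignment(frac, capacities, sizes))

In this toy run, a sub-model is aggregated in-network only while the chosen switch still has memory left; otherwise it falls back to server-side aggregation, mirroring (in a much simplified form) the memory-constrained assignment problem the abstract describes.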
Pages: 3437-3452
Page count: 16