Accelerating Distributed Training With Collaborative In-Network Aggregation

Cited by: 0
Authors
Fang, Jin [1 ]
Xu, Hongli [1 ]
Zhao, Gongming [1 ]
Yu, Zhuolong [2 ]
Shen, Bingchen [1 ]
Xie, Liguang [3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Funding
U.S. National Science Foundation;
Keywords
In-network aggregation; gradient scheduling; distributed training; datacenter network; programmable network;
DOI
10.1109/TNET.2024.3387948
CLC number
TP3 [Computing Technology; Computer Technology];
Discipline code
0812
Abstract
The surging scale of distributed training (DT) incurs significant communication overhead in datacenters, and a promising solution is in-network aggregation (INA), which leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, yet network dynamics commonly cause gradients to arrive asynchronously, making existing solutions inefficient (e.g., incurring massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, so that switches can efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate its performance on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT can speed up DT by 1.5x compared to state-of-the-art solutions.
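The abstract describes assigning sub-model gradients to memory-constrained switches via a knapsack-based randomized rounding algorithm. The sketch below is an illustrative simplification of that idea, not the paper's actual algorithm: it greedily packs sub-models onto switches under their memory capacities, with a random choice among feasible switches standing in for the rounding of a fractional solution. All names (`assign_submodels`, `submodel_sizes`, `switch_capacities`) are assumptions introduced here for illustration.

```python
import random

def assign_submodels(submodel_sizes, switch_capacities, seed=0):
    """Greedy knapsack-style assignment of sub-models to switches.

    Each switch exclusively aggregates a disjoint subset of sub-model
    gradients, subject to its on-chip memory capacity; a sub-model that
    fits on no switch falls back to server-side aggregation (None).
    """
    random.seed(seed)
    remaining = dict(switch_capacities)   # switch -> free memory
    assignment = {}                       # sub-model -> switch or None
    # Place larger sub-models first (a classic knapsack heuristic).
    for sm in sorted(submodel_sizes, key=submodel_sizes.get, reverse=True):
        feasible = [s for s, free in remaining.items()
                    if free >= submodel_sizes[sm]]
        if feasible:
            # Random choice among feasible switches stands in for the
            # paper's randomized rounding of a fractional solution.
            sw = random.choice(feasible)
            remaining[sw] -= submodel_sizes[sm]
            assignment[sm] = sw
        else:
            assignment[sm] = None         # aggregate at a server instead
    return assignment

if __name__ == "__main__":
    sizes = {"sm0": 4, "sm1": 3, "sm2": 5}
    caps = {"sw0": 6, "sw1": 6}
    print(assign_submodels(sizes, caps))
```

Every worker would then send the gradients of each sub-model to the switch recorded in `assignment`, so that all copies of a given sub-model's gradients meet at the same aggregation point regardless of when they arrive.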
Pages: 3437-3452
Page count: 16