Preemptive Switch Memory Usage to Accelerate Training Jobs with Shared In-Network Aggregation

Cited by: 4
Authors
Wang, Hao [1 ]
Qin, Yuxuan [1 ]
Lao, ChonLam [2 ]
Le, Yanfang [3 ]
Wu, Wenfei [4 ]
Chen, Kai [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, iSING Lab, Hong Kong, Peoples R China
[2] Harvard Univ, Cambridge, MA USA
[3] Intel, Santa Clara, CA USA
[4] Peking Univ, Beijing, Peoples R China
DOI
10.1109/ICNP59255.2023.10355574
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
Recent works introduce In-Network Aggregation (INA) for distributed training (DT), which offloads gradient summation onto programmable network switches. INA reduces traffic volume and accelerates communication in DT jobs. However, switch memory is a scarce resource that cannot support the massive number of DT jobs in data centers, and existing INA solutions do not use it to its full extent. We propose DSA, an efficient Data-plane switch memory Scheduler for in-network Aggregation. DSA introduces preemption into switch memory management for INA jobs. In the data plane, DSA allows high-priority gradient tensors to preempt switch aggregators (the basic computation units in INA) from low-priority tensors, which prevents aggregators from sitting idle. In the control plane, DSA devises a priority policy that assigns high priority to the gradient tensors that benefit overall job efficiency the most, e.g., those of communication-intensive jobs. We prototype DSA, and experiments show that it can improve the average job completion time (JCT) by up to 1.35x compared with baseline solutions.
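The abstract describes DSA's preemption mechanism only at a high level. Purely as a mental model, the Python sketch below simulates a fixed pool of switch aggregators in which a gradient tensor with higher priority can take over a slot held by a lower-priority tensor, with the preempted tensor falling back to host-side aggregation. All names here (AggregatorPool, acquire, the numeric priorities) are hypothetical illustrations of the idea, not the paper's actual data-plane or control-plane implementation on a programmable switch.

```python
# Minimal sketch of priority-based aggregator preemption (hypothetical names,
# simplified model; the real system runs in a programmable switch data plane).
from dataclasses import dataclass
from typing import Optional


@dataclass
class Aggregator:
    """One switch aggregator slot: accumulates a gradient fragment."""
    owner: Optional[str] = None   # tensor currently holding this slot
    priority: int = -1            # priority of the owning tensor (-1 = free)
    value: float = 0.0            # running gradient sum


class AggregatorPool:
    """A fixed pool of aggregators with priority-based preemption."""

    def __init__(self, num_slots: int):
        self.slots = [Aggregator() for _ in range(num_slots)]

    def acquire(self, tensor_id: str, priority: int) -> Optional[Aggregator]:
        # 1) Prefer a free aggregator.
        for agg in self.slots:
            if agg.owner is None:
                agg.owner, agg.priority, agg.value = tensor_id, priority, 0.0
                return agg
        # 2) Otherwise preempt the lowest-priority owner if we outrank it.
        victim = min(self.slots, key=lambda a: a.priority)
        if victim.priority < priority:
            # The preempted tensor would fall back to host-side aggregation.
            victim.owner, victim.priority, victim.value = tensor_id, priority, 0.0
            return victim
        return None  # no aggregator available; aggregate at the end host

    def aggregate(self, agg: Aggregator, gradient: float) -> None:
        agg.value += gradient


if __name__ == "__main__":
    pool = AggregatorPool(num_slots=2)
    low = pool.acquire("job-A/tensor-0", priority=1)
    high = pool.acquire("job-B/tensor-3", priority=5)
    pool.aggregate(low, 0.5)
    # Pool is full; a higher-priority tensor preempts the low-priority slot.
    preempting = pool.acquire("job-C/tensor-7", priority=9)
    print(preempting.owner)  # -> job-C/tensor-7
```

In this toy model the control-plane policy is reduced to a single integer per tensor; the paper's contribution is deciding those priorities so that communication-intensive jobs gain the most from holding scarce aggregators.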
Pages: 12
Related Papers
14 records in total
  • [1] Accelerating Distributed Training With Collaborative In-Network Aggregation
    Fang, Jin
    Xu, Hongli
    Zhao, Gongming
    Yu, Zhuolong
    Shen, Bingchen
    Xie, Liguang
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (04) : 3437 - 3452
  • [2] Concordia: Distributed Shared Memory with In-Network Cache Coherence
    Wang, Qing
    Lu, Youyou
    Xu, Erci
    Li, Junru
    Chen, Youmin
    Shu, Jiwu
    PROCEEDINGS OF THE 19TH USENIX CONFERENCE ON FILE AND STORAGE TECHNOLOGIES (FAST '21), 2021, : 277 - 292
  • [3] GRID: Gradient Routing With In-Network Aggregation for Distributed Training
    Fang, Jin
    Zhao, Gongming
    Xu, Hongli
    Wu, Changbo
    Yu, Zhuolong
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2023, 31 (05) : 2267 - 2280
  • [4] Training Job Placement in Clusters with Statistical In-Network Aggregation
    Zhao, Bohan
    Xu, Wei
    Liu, Shuo
    Tian, Yang
    Wang, Qiaoling
    Wu, Wenfei
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ASPLOS 2024, VOL 1, 2024, : 420 - 434
  • [5] An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives
    Klenk, Benjamin
    Jiang, Nan
    Thorson, Greg
    Dennison, Larry
    2020 ACM/IEEE 47TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2020), 2020, : 996 - 1009
  • [6] Multi-Switch Cooperative In-Network Aggregation for Distributed Deep Learning
    Su, Ming-Wei
    Li, Yuan-Yu
    Lin, Kate Ching-Ju
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 4767 - 4772
  • [7] Maximizing Aggregation Throughput for Distributed Training with Constrained In-Network Computing
    Luo, Long
    Yang, Shulin
    Wu, Hao
    Yu, Hongfang
    Lei, Bo
    Gao, Shuai
    ICC 2023-IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, 2023, : 3652 - 3657
  • [8] PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation
    Qiu, Yuhang
    Zhao, Gongming
    Xu, Hongli
    Huang, He
    Qiao, Chunming
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2024, 32 (05) : 4317 - 4332
  • [9] InGo: In-Network Aggregation Routing with Batch Size Adjustment for Distributed Training
    Bao, Jianfeng
    Zhao, Gongming
    Xu, Hongli
    Wang, Haibo
    Yang, Peng
    2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024
  • [10] GPU memory usage optimization for backward propagation in deep network training
    Hong, Ding-Yong
    Tsai, Tzu-Hsien
    Wang, Ning
    Liu, Pangfeng
    Wu, Jan-Jan
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2025, 199