PARING: Joint Task Placement and Routing for Distributed Training With In-Network Aggregation

Authors
Qiu, Yuhang [1, 2]
Zhao, Gongming [1, 2]
Xu, Hongli [1, 2]
Huang, He [3 ]
Qiao, Chunming [4 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Univ Sci & Technol China, Suzhou Inst Adv Res, Suzhou 215123, Jiangsu, Peoples R China
[3] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215123, Jiangsu, Peoples R China
[4] Univ Buffalo, Dept Comp Sci & Engn, Buffalo, NY 14260 USA
Funding
National Science Foundation (USA);
Keywords
Task analysis; Servers; Routing; Training; Aggregates; Topology; Switches; In-network aggregation; distributed training; task placement; gradient routing;
DOI
10.1109/TNET.2024.3414853
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With the increase in both the model size and dataset size of distributed training (DT) tasks, communication between the workers and parameter servers (PSs) in a cluster has become a bottleneck. In-network aggregation (INA) enabled by programmable switches has been proposed as a promising solution to alleviate the communication bottleneck. However, existing works focused on in-network aggregation implementation based on simple DT placement and fixed routing policies, which may lead to a large communication overhead and inefficient use of resources (e.g., storage, computing power and bandwidth). In this paper, we propose PARING, the first-of-its-kind INA approach that jointly optimizes DT task placement and routing in order to reduce traffic volume and minimize communication time. We formulate the problem as a nonlinear multi-objective mixed-integer programming problem, and prove its NP-Hardness. Based on the concept of Steiner trees, an algorithm with bounded approximation factors is proposed for this problem. Large-scale simulations show that our algorithm can reduce communication time by up to 81.0% and traffic volume by up to 19.1% compared to the state-of-the-art algorithms.
Pages: 4317-4332
Page count: 16