Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Authors
Liu, Kaiyang [1]
Wang, Jingrong [2]
Huang, Zhiming [3]
Pan, Jianping [3]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF A1B 3X5, Canada
[2] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5S 3G4, Canada
[3] Univ Victoria, Dept Comp Sci, Victoria, BC V8P 5C2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Training; Deep learning; Load management; Processor scheduling; Computational modeling; Throughput; Parallel processing; Distributed deep learning; job placement; job sizing; load balancing; heterogeneity-aware scheduling; fairness;
DOI
10.1109/TPDS.2024.3390109
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler to solve the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and improve scheduling fairness, we propose to sparsify the feasible solution categories through sampling, which incurs negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
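The abstract names proportional training workload assignment as one ingredient of the scheme but does not spell it out. The sketch below is an illustration only, not the article's algorithm: it assumes each worker's throughput (samples per second) is known in advance, and it splits one job's per-iteration samples in proportion to that throughput so that no single slow worker dominates the iteration time. The function names `proportional_split` and `iteration_time` are hypothetical.

```python
from typing import List


def proportional_split(total_samples: int, throughputs: List[float]) -> List[int]:
    """Split total_samples across workers in proportion to throughput.

    Assumption (not from the article): throughputs are known, positive,
    and measured in samples per second.
    """
    if total_samples < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need a non-negative sample count and positive throughputs")
    total_rate = sum(throughputs)
    # Ideal (fractional) share for each worker.
    ideal = [total_samples * t / total_rate for t in throughputs]
    # Round down, then give leftover samples to the largest remainders.
    shares = [int(x) for x in ideal]
    leftover = total_samples - sum(shares)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


def iteration_time(shares: List[int], throughputs: List[float]) -> float:
    """A synchronous iteration finishes when the slowest worker finishes."""
    return max(s / t for s, t in zip(shares, throughputs))


if __name__ == "__main__":
    rates = [100.0, 200.0, 100.0]          # hypothetical heterogeneous workers
    balanced = proportional_split(400, rates)
    uniform = [134, 133, 133]              # near-equal split for comparison
    print(balanced, iteration_time(balanced, rates))
    print(uniform, iteration_time(uniform, rates))
```

With the hypothetical rates above, the proportional split lets all three workers finish together, whereas the near-equal split leaves the faster worker idle while the slower ones catch up.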
Pages: 874-888
Page count: 15