Preemptive Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks

Cited by: 5
Authors
Wang, Ne [1]
Zhou, Ruiting [1, 2]
Jiao, Lei [3]
Zhang, Renli [1, 2]
Li, Bo [4]
Li, Zongpeng [1, 5]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
[2] Wuhan Univ, Sch Cyber Sci & Engn, Minist Educ, Key Lab Aerosp Informat Secur & Trusted Comp, Wuhan 430072, Peoples R China
[3] Univ Oregon, Dept Comp & Informat Sci, Eugene, OR 97403 USA
[4] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[5] Tsinghua Univ, Inst Network Sci & Cyberspace, Beijing 100190, Peoples R China
Funding
US National Science Foundation
Keywords
Distributed machine learning; parameter server architecture; preemptive scheduling; edge-cloud networks
DOI
10.1109/JSAC.2022.3180772
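Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject classification code
0808; 0809
Abstract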
Recent advances in 5G and edge computing enable rapid development and deployment of edge-cloud systems, which are ideal for delay-sensitive machine learning (ML) applications such as autonomous driving and smart cities. Distributed ML jobs often need to train a large model on enormous datasets, which can only be handled by deploying a distributed set of workers in an edge-cloud system. One common approach is to employ a parameter server (PS) architecture, in which training is carried out at multiple workers, while PSs are used for aggregation and model updates. In this architecture, one of the fundamental challenges is how to dispatch ML jobs to workers and PSs such that the average job completion time (JCT) is minimized. In this work, we propose a novel online preemptive scheduling framework to decide the location and the execution time window of concurrent workers and PSs upon each job arrival. Specifically, our proposed scheduling framework consists of: i) a job dispatching and scheduling algorithm that assigns each ML job to workers and decides the schedule to train each data chunk; ii) a PS assignment algorithm that determines the placement of PSs. We prove theoretically that our proposed algorithm is $D_{\max}(1 + 1/\epsilon)$-competitive with $(1 + \epsilon)$-speed augmentation, where $D_{\max}$ is the maximal number of data chunks in any job. Extensive testbed experiments and trace-driven simulations show that our algorithm can reduce the average JCT by up to 30% compared with state-of-the-art baselines.
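As a rough illustration of the guarantee stated in the abstract, the sketch below evaluates the bound D_max * (1 + 1/epsilon) for a few values of epsilon. The function name `competitive_bound` and the sample values (D_max = 10, epsilon in {0.25, 0.5, 1.0}) are assumptions chosen for illustration and do not come from the paper; only the formula itself is taken from the abstract.

```python
def competitive_bound(d_max: int, epsilon: float) -> float:
    """Evaluate D_max * (1 + 1/epsilon), the competitive ratio quoted in the abstract.

    d_max   -- maximal number of data chunks in any job (hypothetical sample value)
    epsilon -- speed-augmentation parameter; the online algorithm is assumed to
               run on resources that are (1 + epsilon) times faster
    """
    if d_max < 1:
        raise ValueError("d_max must be at least 1")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return d_max * (1.0 + 1.0 / epsilon)


if __name__ == "__main__":
    # Illustrative values only: a larger epsilon (more extra speed) gives a smaller bound.
    for eps in (0.25, 0.5, 1.0):
        print(f"epsilon={eps}: bound={competitive_bound(10, eps)}")
```

The trade-off visible here is the usual one for resource augmentation: granting the online scheduler more extra speed tightens the ratio, while the bound grows linearly with the largest number of data chunks in any job.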
Pages: 2411-2425
Page count: 15
Related papers
50 in total
  • [1] DPS: Dynamic Pricing and Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks
    Zhou, Ruiting
    Wang, Ne
    Huang, Yifeng
    Pang, Jinlong
    Chen, Hao
    [J]. IEEE TRANSACTIONS ON MOBILE COMPUTING, 2023, 22 (11) : 6377 - 6393
  • [2] Resource Allocation for Distributed Machine Learning at the Edge-Cloud Continuum
    Sartzetakis, Ippokratis
    Soumplis, Polyzois
    Pantazopoulos, Panagiotis
    Katsaros, Konstantinos V.
    Sourlas, Vasilis
    Varvarigos, Emmanouel
    [J]. IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2022), 2022, : 5017 - 5022
  • [3] Edge-Cloud Solutions for Big Data Analysis and Distributed Machine Learning-1
    Belcastro, Loris
    Carretero, Jesus
    Talia, Domenico
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 159 : 323 - 326
  • [4] Task Offloading and Resource Scheduling in Hybrid Edge-Cloud Networks
    Zhang, Qi
    Gui, Lin
    Zhu, Shichao
    Lang, Xiupu
    [J]. IEEE ACCESS, 2021, 9 : 85350 - 85366
  • [5] Learning to Optimize Workflow Scheduling for an Edge-Cloud Computing Environment
    Zhu, Kaige
    Zhang, Zhenjiang
    Zeadally, Sherali
    Sun, Feng
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2024, 12 (03) : 897 - 912
  • [6] An Efficient Edge-Cloud Partitioning of Random Forests for Distributed Sensor Networks
    Shen, Tianyi
    Mishra, Cyan Subhra
    Sampson, Jack
    Kandemir, Mahmut Taylan
    Narayanan, Vijaykrishnan
    [J]. IEEE EMBEDDED SYSTEMS LETTERS, 2024, 16 (01) : 21 - 24
  • [7] Online Scheduling Algorithm for Heterogeneous Distributed Machine Learning Jobs
    Zhou, Ruiting
    Pang, Jinlong
    Zhang, Qin
    Wu, Chuan
    Jiao, Lei
    Zhong, Yi
    Li, Zongpeng
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2023, 11 (02) : 1514 - 1529
  • [8] Adaptive Pricing and Online Scheduling for Distributed Machine Learning Jobs
    Wang, Yafei
    Su, Lina
    Chen, Junmei
    Wang, Ne
    Li, Zongpeng
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (07) : 12966 - 12983
  • [9] Edge-cloud Collaborative Heterogeneous Task Scheduling in Multilayer Elastic Optical Networks
    Yang, Zeyuan
    Gu, Rentao
    Zhu, Zuqing
    Ji, Yuefeng
    [J]. 2021 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2021,
  • [10] Distributed Dataflow Across the Edge-Cloud Continuum
    Ekaireb, Tyler
    Brand, Lukas
    Avaraddy, Nagarjun
    Mock, Markus
    Krintz, Chandra
    Wolski, Rich
    [J]. 2024 IEEE 17TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, CLOUD 2024, 2024, : 316 - 327