Preemptive Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks

Cited by: 5
Authors
Wang, Ne [1]
Zhou, Ruiting [1,2]
Jiao, Lei [3]
Zhang, Renli [1,2]
Li, Bo [4]
Li, Zongpeng [1,5]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
[2] Wuhan Univ, Sch Cyber Sci & Engn, Minist Educ, Key Lab Aerosp Informat Secur & Trusted Comp, Wuhan 430072, Peoples R China
[3] Univ Oregon, Dept Comp & Informat Sci, Eugene, OR 97403 USA
[4] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[5] Tsinghua Univ, Inst Network Sci & Cyberspace, Beijing 100190, Peoples R China
Funding
U.S. National Science Foundation
Keywords
Distributed machine learning; parameter server architecture; preemptive scheduling; edge-cloud networks;
DOI
10.1109/JSAC.2022.3180772
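Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code
0808; 0809;
Abstract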
Recent advances in 5G and edge computing enable rapid development and deployment of edge-cloud systems, which are ideal for delay-sensitive machine learning (ML) applications such as autonomous driving and smart cities. Distributed ML jobs often need to train a large model with enormous datasets, which can only be handled by deploying a distributed set of workers in an edge-cloud system. One common approach is to employ a parameter server (PS) architecture, in which training is carried out at multiple workers, while PSs are used for aggregation and model updates. In this architecture, one of the fundamental challenges is how to dispatch ML jobs to workers and PSs such that the average job completion time (JCT) is minimized. In this work, we propose a novel online preemptive scheduling framework to decide the location and the execution time window of concurrent workers and PSs upon each job arrival. Specifically, our proposed scheduling framework consists of: i) a job dispatching and scheduling algorithm that assigns each ML job to workers and decides the schedule to train each data chunk; ii) a PS assignment algorithm that determines the placement of PSs. We prove theoretically that our proposed algorithm is D_max(1 + 1/ε)-competitive with (1 + ε)-speed augmentation, where D_max is the maximal number of data chunks in any job. Extensive testbed experiments and trace-driven simulations show that our algorithm can reduce the average JCT by up to 30% compared with state-of-the-art baselines.
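The parameter server pattern mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's scheduling algorithm: it only shows the general division of labor, with workers computing gradients on their data chunks and a PS averaging them and updating the model. The function names, the toy linear model, and the synchronous update rule are all assumptions made for illustration.

```python
# Hypothetical sketch of synchronous parameter-server (PS) training.
# A toy linear regression stands in for the ML model; every name here
# is illustrative and not taken from the paper.

def worker_gradient(weights, chunk):
    """Worker: mean squared-error gradient over one data chunk."""
    grad = [0.0] * len(weights)
    for features, target in chunk:
        pred = sum(w * x for w, x in zip(weights, features))
        err = pred - target
        for i, x in enumerate(features):
            grad[i] += 2.0 * err * x / len(chunk)
    return grad

def ps_update(weights, gradients, lr=0.1):
    """PS: average the workers' gradients, then apply one SGD step."""
    n = len(gradients)
    avg = [sum(g[i] for g in gradients) / n for i in range(len(weights))]
    return [w - lr * a for w, a in zip(weights, avg)]

def train(chunks, dim, rounds=200, lr=0.1):
    """One worker per chunk; each round ends with a PS aggregation."""
    weights = [0.0] * dim
    for _ in range(rounds):
        grads = [worker_gradient(weights, c) for c in chunks]
        weights = ps_update(weights, grads, lr)
    return weights

# Two data chunks sampled from y = 2 * x, one per (simulated) worker.
chunks = [[((1.0,), 2.0), ((2.0,), 4.0)], [((3.0,), 6.0), ((0.5,), 1.0)]]
weights = train(chunks, dim=1)
```

In a real edge-cloud deployment the workers and PSs run on different machines, so where each one is placed and when it runs affect the job completion time; that placement and timing decision is what the paper's framework addresses.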
Pages: 2411-2425
Number of pages: 15