MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Cited by: 0
Authors
Choudhury, Arnab [1]
Wang, Yang [1, 2]
Pelkonen, Tuomas [1]
Srinivasan, Kutta [1]
Jain, Abha [1]
Lin, Shenghao [1]
David, Delia [1]
Soleimanifard, Siavash [1]
Chen, Michael [1]
Yadav, Abhishek [1]
Tijoriwala, Ritesh [1]
Samoylov, Denis [1]
Tang, Chunqiang [1]
Affiliations
[1] Meta Platforms, Menlo Park, CA 94025, USA
[2] Ohio State University, Columbus, OH 43210, USA
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In public clouds, users must manually select a data-center region to upload their ML training data and launch ML training workloads in the same region to ensure data and computation colocation. Unfortunately, isolated decisions by individual users can lead to a mismatch between workload demand and hardware supply across regions, hurting the cloud provider's hardware utilization and profitability. To address this problem in Meta's hyperscale private cloud, we provide a global-scheduling abstraction to all ML training workloads. Users simply submit their training workloads to MAST, our global scheduler, and rely on it to intelligently place both data and training workloads to different regions. We describe three design principles that enable MAST to schedule complex ML training workloads at a global scale: temporal decoupling, scope decoupling, and exhaustive search. MAST successfully balances the load across global regions. Before MAST, the most overloaded region had a GPU demand-to-supply ratio of 2.63 for high-priority workloads. With MAST, this ratio has been reduced to 0.98, effectively eliminating the overload.
Pages: 563-580
Page count: 18