MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Cited by: 0
Authors
Choudhury, Arnab [1]
Wang, Yang [1, 2]
Pelkonen, Tuomas [1]
Srinivasan, Kutta [1]
Jain, Abha [1]
Lin, Shenghao [1]
David, Delia [1]
Soleimanifard, Siavash [1]
Chen, Michael [1]
Yadav, Abhishek [1]
Tijoriwala, Ritesh [1]
Samoylov, Denis [1]
Tang, Chunqiang [1]
Affiliations
[1] Meta Platforms, Menlo Pk, CA 94025 USA
[2] Ohio State Univ, Columbus, OH 43210 USA
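Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract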
In public clouds, users must manually select a data-center region to upload their ML training data and launch ML training workloads in the same region to ensure data and computation colocation. Unfortunately, isolated decisions by individual users can lead to a mismatch between workload demand and hardware supply across regions, hurting the cloud provider's hardware utilization and profitability. To address this problem in Meta's hyperscale private cloud, we provide a global-scheduling abstraction to all ML training workloads. Users simply submit their training workloads to MAST, our global scheduler, and rely on it to intelligently place both data and training workloads to different regions. We describe three design principles that enable MAST to schedule complex ML training workloads at a global scale: temporal decoupling, scope decoupling, and exhaustive search. MAST successfully balances the load across global regions. Before MAST, the most overloaded region had a GPU demand-to-supply ratio of 2.63 for high-priority workloads. With MAST, this ratio has been reduced to 0.98, effectively eliminating the overload.
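The demand-to-supply ratio quoted in the abstract can be illustrated with a minimal sketch. It assumes the ratio is simply requested GPUs divided by available GPUs per region; the function names, region names, and numbers below are hypothetical and are not taken from the paper or from MAST itself.

```python
# Illustrative sketch only: region names and GPU counts are made up,
# and the ratio definition (demand / supply) is an assumption.

def demand_to_supply(demand, supply):
    """Return the GPU demand-to-supply ratio for each region."""
    return {region: demand[region] / supply[region] for region in demand}


def most_overloaded(ratios):
    """Return the (region, ratio) pair with the highest ratio."""
    region = max(ratios, key=ratios.get)
    return region, ratios[region]


# Hypothetical per-region GPU demand and supply.
demand = {"region_a": 300, "region_b": 80, "region_c": 120}
supply = {"region_a": 100, "region_b": 100, "region_c": 100}

ratios = demand_to_supply(demand, supply)
worst_region, worst_ratio = most_overloaded(ratios)
print(worst_region, worst_ratio)
```

A ratio above 1 means a region has more GPU demand than it can serve; a global scheduler aims to keep the worst region's ratio at or below 1 by placing data and workloads elsewhere.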
Pages: 563-580
Page count: 18