MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Cited by: 0
Authors
Choudhury, Arnab [1]
Wang, Yang [1,2]
Pelkonen, Tuomas [1]
Srinivasan, Kutta [1]
Jain, Abha [1]
Lin, Shenghao [1]
David, Delia [1]
Soleimanifard, Siavash [1]
Chen, Michael [1]
Yadav, Abhishek [1]
Tijoriwala, Ritesh [1]
Samoylov, Denis [1]
Tang, Chunqiang [1]
Affiliations
[1] Meta Platforms, Menlo Park, CA 94025 USA
[2] Ohio State University, Columbus, OH 43210 USA
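Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract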
In public clouds, users must manually select a data-center region to upload their ML training data and launch ML training workloads in the same region to ensure data and computation colocation. Unfortunately, isolated decisions by individual users can lead to a mismatch between workload demand and hardware supply across regions, hurting the cloud provider's hardware utilization and profitability. To address this problem in Meta's hyperscale private cloud, we provide a global-scheduling abstraction to all ML training workloads. Users simply submit their training workloads to MAST, our global scheduler, and rely on it to intelligently place both data and training workloads to different regions. We describe three design principles that enable MAST to schedule complex ML training workloads at a global scale: temporal decoupling, scope decoupling, and exhaustive search. MAST successfully balances the load across global regions. Before MAST, the most overloaded region had a GPU demand-to-supply ratio of 2.63 for high-priority workloads. With MAST, this ratio has been reduced to 0.98, effectively eliminating the overload.
Pages: 563-580
Page count: 18