MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Cited by: 0

Authors:

Choudhury, Arnab ^{[1]}

Wang, Yang ^{[1,2]}

Pelkonen, Tuomas ^{[1]}

Srinivasan, Kutta ^{[1]}

Jain, Abha ^{[1]}

Lin, Shenghao ^{[1]}

David, Delia ^{[1]}

Soleimanifard, Siavash ^{[1]}

Chen, Michael ^{[1]}

Yadav, Abhishek ^{[1]}

Tijoriwala, Ritesh ^{[1]}

Samoylov, Denis ^{[1]}

Tang, Chunqiang ^{[1]}

Affiliations:

[1] Meta Platforms, Menlo Park, CA 94025 USA

[2] Ohio State University, Columbus, OH 43210 USA

Source:

PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, OSDI 2024 | 2024

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

In public clouds, users must manually select a data-center region to upload their ML training data and launch ML training workloads in the same region to ensure data and computation colocation. Unfortunately, isolated decisions by individual users can lead to a mismatch between workload demand and hardware supply across regions, hurting the cloud provider's hardware utilization and profitability. To address this problem in Meta's hyperscale private cloud, we provide a global-scheduling abstraction to all ML training workloads. Users simply submit their training workloads to MAST, our global scheduler, and rely on it to intelligently place both data and training workloads to different regions. We describe three design principles that enable MAST to schedule complex ML training workloads at a global scale: temporal decoupling, scope decoupling, and exhaustive search. MAST successfully balances the load across global regions. Before MAST, the most overloaded region had a GPU demand-to-supply ratio of 2.63 for high-priority workloads. With MAST, this ratio has been reduced to 0.98, effectively eliminating the overload.


Pages: 563-580

Page count: 18
