SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Cited by: 4
Authors
Zhao, Hanyu [1]
Han, Zhenhua [2]
Yang, Zhi [1]
Zhang, Quanlu [2]
Li, Mingxia [3]
Yang, Fan [2]
Zhang, Qianxi [2]
Li, Binyang [4]
Yang, Yuqing [2]
Qiu, Lili [2]
Zhang, Lintao [5]
Zhou, Lidong [2]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] USTC, Hefei, Peoples R China
[4] Microsoft, Beijing, Peoples R China
[5] BaseBit Technol, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE EIGHTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS, EUROSYS 2023 | 2023
Keywords
Machine learning systems; cloud computing; cache systems;
DOI
10.1145/3552326.3567499
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Deep learning training on cloud platforms usually follows the tradition of separating storage from computing. The training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster hosting the storage service. To alleviate the potential bottleneck, a training cluster usually leverages its local storage as a cache to reduce the remote IO from the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs. This could degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystems for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to develop a closed-form analytic model that captures the diverse cache / remote IO requirements of different training jobs. Evaluations show that SiloD improves the average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers where they operate independently.
Pages: 883-898
Number of pages: 16