DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment

Cited by: 2
Authors
Qiao, Wei [1]
Li, Ying [1]
Wu, Zhong-Hai [1]
Affiliations
[1] Peking Univ, Sch Software & Microelect, Beijing, Peoples R China
DOI
10.1051/itmconf/20171203030
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Deep neural networks (DNNs) have recently yielded strong results on a range of applications. Because training is time-consuming and compute-intensive, training these DNNs on a cluster of commodity machines is a promising approach. Furthermore, running DNN tasks in containers on such clusters would enable broader and easier deployment of DNN-based algorithms. Toward this end, this paper addresses the problem of scheduling DNN tasks in a containerized cluster environment. Efficiently scheduling data-parallel computation jobs such as DNN training over containerized clusters is critical for job performance, system throughput, and resource utilization, and it becomes even more challenging with complex workloads. We propose a scheduling method called Deep Learning Task Allocation Priority (DLTAP), which makes scheduling decisions in a distributed manner. Each scheduling decision takes into account the aggregation degree of parameter server tasks and worker tasks, in particular to reduce cross-node network traffic and, correspondingly, decrease DNN training time. We evaluate the DLTAP scheduling method using a state-of-the-art distributed DNN training framework on three benchmarks. The results show that the proposed method reduces cross-node network traffic by 12% on average and decreases DNN training time, even on a cluster of low-end servers.
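The abstract does not define how the aggregation degree is computed, so the following is only an illustrative sketch and not the paper's DLTAP algorithm. It assumes, purely for illustration, that a node's aggregation degree for a job is the number of that job's tasks already placed on the node, and it greedily places each parameter server or worker task on the node with the highest such count. All names here (`place_job`, `cross_node_pairs`, the node labels) are hypothetical.

```python
from collections import Counter


def place_job(roles, free_slots):
    """Greedily place one job's tasks, preferring co-location.

    roles: list of task roles, e.g. ["ps", "worker", "worker"].
    free_slots: dict mapping node name -> number of free container slots.
    Returns a list of (role, node) pairs.

    ASSUMPTION: "aggregation degree" is approximated by the number of
    this job's tasks already on a node; the paper may define it differently.
    """
    slots = dict(free_slots)  # work on a copy, leave the input untouched
    placement = []
    for role in roles:
        candidates = sorted(n for n, free in slots.items() if free > 0)
        if not candidates:
            raise RuntimeError("no free slot left for task: " + role)
        colocated = Counter(node for _, node in placement)
        # Prefer nodes already hosting this job's tasks, then nodes with
        # more free slots; ties go to the first name in sorted order.
        best = max(candidates, key=lambda n: (colocated[n], slots[n]))
        placement.append((role, best))
        slots[best] -= 1
    return placement


def cross_node_pairs(placement):
    """Count parameter-server/worker pairs that sit on different nodes."""
    ps_nodes = [node for role, node in placement if role == "ps"]
    worker_nodes = [node for role, node in placement if role == "worker"]
    return sum(1 for p in ps_nodes for w in worker_nodes if p != w)


if __name__ == "__main__":
    plan = place_job(["ps", "worker", "worker"], {"node-a": 2, "node-b": 3})
    print(plan, cross_node_pairs(plan))
```

In this toy model each parameter-server/worker pair on different nodes stands in for cross-node traffic, so a placement that keeps a job's tasks together yields a lower count. A real scheduler would also have to weigh CPU, memory, and competing jobs, which this sketch ignores.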
Pages: 5
Related papers
50 in total
  • [1] Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters
    Gu, Rong
    Chen, Yuquan
    Liu, Shuai
    Dai, Haipeng
    Chen, Guihai
    Zhang, Kai
    Che, Yang
    Huang, Yihua
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (11) : 2808 - 2820
  • [2] Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness
    Li, Qingping
    Xu, Jingwei
    Cao, Chun
    THE 12TH ASIA-PACIFIC SYMPOSIUM ON INTERNETWARE, INTERNETWARE 2020, 2021, : 217 - 228
  • [3] TensorExpress: In-Network Communication Scheduling for Distributed Deep Learning
    Kang, Minkoo
    Yang, Gyeongsik
    Yoo, Yeonho
    Yoo, Chuck
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 25 - 27
  • [4] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation
    Pan, Rui
    Lei, Yiming
    Li, Jialong
    Xie, Zhiqiang
    Yuan, Binhang
    Xia, Yiting
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 93 - 100
  • [5] An Efficient Method for Training Deep Learning Networks Distributed
    Wang, Chenxu
    Lu, Yutong
    Chen, Zhiguang
    Li, Junnan
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2020, E103D (12) : 2444 - 2456
  • [6] A resource scheduling method for reliable and trusted distributed composite services in cloud environment based on deep reinforcement learning
    Yu, Lei
    Yu, Philip S.
    Duan, Yucong
    Qiao, Hongyu
    FRONTIERS IN GENETICS, 2022, 13
  • [7] An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms
    Lee, Sangkwon
    Shah, Syed Asif Raza
    Seok, Woojin
    Moon, Jeonghoon
    Kim, Kihyeon
    Shah, Syed Hasnain Raza
    ELECTRONICS, 2023, 12 (14)
  • [8] Towards Efficient Workflow Scheduling over Yarn Cluster using Deep Reinforcement Learning
    Xue, Jianguo
    Wang, Ting
    Cai, Puyu
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 473 - 478
  • [9] An efficient task scheduling method for improved network delay in distributed sensor networks
    Liu, Haoying
    Yuan, Xiaojing
    Moges, Mequanint
    2007 3RD INTERNATIONAL CONFERENCE ON TESTBEDS AND RESEARCH INFRASTRUCTURE FOR THE DEVELOPMENT OF NETWORKS AND COMMUNITIES, 2007, : 150 - 157
  • [10] Energy efficient distributed cluster head scheduling scheme for two tiered wireless sensor network
    Kannan, G.
    Raja, T. Sree Renga
    EGYPTIAN INFORMATICS JOURNAL, 2015, 16 (02) : 167 - 174