A deep learning container cloud for GPU resources

Cited by: 0

Authors:

Affiliations:

[1] Xiao, Yi

[2] Gao, Pengdong

[3] Qi, Quan

[4] Lu, Yongquan

Source:

Gao, Pengdong (pdgao@cuc.edu.cn) | Year 1600 / Volume Universidad Central de Venezuela / Issue 55

Keywords:

Graphics processing unit - Topology - Deep neural networks - Distributed computer systems - Computer aided instruction

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

With the development of deep learning, deep learning frameworks have become an important tool for developing deep neural networks. A framework greatly shortens network construction and computing time, and its computing power comes from the GPU. However, how to effectively allocate and use GPU resources in a heterogeneous cluster shared by many frameworks remains an important issue. In this paper, we propose a Deep Learning Container Cloud (DLC) architecture designed specifically for GPU resources. Because containers are easy to deploy and migrate, the frameworks can be deployed on a heterogeneous cluster in the form of containers, and the GPU driver can be decoupled from the container by means of an NVIDIA-docker volume. DLC provides its services as a Mesos framework. After obtaining resources through the scheduler, it quickly creates a deep learning framework that meets the requirements. DLC loads the specified GPU resources and the corresponding runtime library to rapidly create a deep learning environment of a specific version. In addition, this paper proposes an allocation algorithm based on GPU topology. DLC constructs a topo-tree by analyzing the GPU topology of the agent node and, on this basis, assigns GPUs that support P2P within the node. Our experiment shows that using P2P data transmission in containers significantly increases bandwidth. This is of great significance for promoting the development of deep learning.
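The abstract does not give the details of the topology-based allocation, so the following is only a minimal sketch of the general idea: model the GPUs of one agent node as a tree and serve a request from the smallest subtree that has enough free GPUs. The class `TopoNode`, the function `allocate`, the example tree layout, and the assumption that GPUs under a deeper common ancestor (for example, the same PCIe switch) are more likely to support P2P are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TopoNode:
    """Hypothetical topology-tree node: a host, CPU socket, PCIe switch, or GPU."""
    name: str
    children: List["TopoNode"] = field(default_factory=list)
    gpu_id: Optional[int] = None  # set only on leaves that represent a GPU

    def gpus(self) -> List[int]:
        """Return the IDs of all GPUs in this subtree, left to right."""
        if self.gpu_id is not None:
            return [self.gpu_id]
        found: List[int] = []
        for child in self.children:
            found.extend(child.gpus())
        return found


def allocate(root: TopoNode, count: int, busy: Set[int]) -> Optional[List[int]]:
    """Pick `count` free GPUs from the smallest subtree that can supply them.

    Assumption (not from the paper): GPUs that share a deeper common
    ancestor are more likely to be able to use P2P transfers.
    """
    free = [g for g in root.gpus() if g not in busy]
    if len(free) < count:
        return None  # this subtree cannot satisfy the request
    # Prefer a deeper, more tightly coupled subtree when one is large enough.
    for child in root.children:
        picked = allocate(child, count, busy)
        if picked is not None:
            return picked
    return free[:count]


def switch(name: str, ids: List[int]) -> TopoNode:
    """Build a hypothetical PCIe switch with one GPU leaf per ID."""
    return TopoNode(name, [TopoNode(f"gpu{i}", gpu_id=i) for i in ids])


# Hypothetical 8-GPU agent node: two sockets, two switches per socket.
root = TopoNode("node", [
    TopoNode("socket0", [switch("sw0", [0, 1]), switch("sw1", [2, 3])]),
    TopoNode("socket1", [switch("sw2", [4, 5]), switch("sw3", [6, 7])]),
])

print(allocate(root, 2, busy={0}))  # [2, 3]: sw0 has only one free GPU
print(allocate(root, 4, busy={0}))  # [4, 5, 6, 7]: socket0 has only three free
```

In this sketch a request that no single switch can satisfy falls back to the enclosing socket, and then to the whole node, which is one simple way to keep the selected GPUs as close together as the current load allows.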

