Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems

Cited by: 4
Authors
Wang, Zixuan [1]
Sim, Joonseop [2]
Lim, Euicheol [2]
Zhao, Jishen [1]
Affiliations
[1] Univ Calif San Diego, San Diego, CA 92103 USA
[2] SK Hynix, Syst Architecture Div, Icheon Si, South Korea
Keywords
deep learning; training; cache coherence
DOI
10.1109/HPCA53966.2022.00018
Chinese Library Classification (CLC) Number
TP3 [Computing technology; computer technology]
Subject Classification Code
0812
Abstract
Modern deep learning (DL) training is memory-intensive, constrained by the memory capacity of each computation component and by cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter communication is becoming a key performance bottleneck in distributed DL training. To improve parameter communication performance, we propose COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, allowing low-latency and parallel access to training data and model parameters shared among worker GPUs. To enable high-bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme that decouples and localizes parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and partitioning to fully utilize the non-uniform serial bus bandwidth that varies across different cloud computing systems. Finally, we design deadlock avoidance and dual synchronization mechanisms to ensure high-performance parameter synchronization. Our evaluation shows that COARSE achieves up to 48.3% faster DL training compared to the state-of-the-art MPI AllReduce communication.
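As an illustrative aside (not taken from the paper), the dynamic tensor routing and partitioning idea mentioned in the abstract can be sketched as splitting a parameter tensor across communication paths in proportion to their available bandwidth, so that no single path becomes the bottleneck. The Python sketch below is a minimal illustration under that assumption; the function name, the NumPy-based splitting, and the example bandwidth figures are hypothetical stand-ins for the paper's GPU and CCI data paths.

import numpy as np

def partition_by_bandwidth(tensor, bandwidths_gbps):
    """Split a flat tensor into chunks sized in proportion to the bandwidth
    of the link each chunk is routed over (hypothetical helper)."""
    flat = tensor.ravel()
    total_bw = float(sum(bandwidths_gbps))
    fractions = [bw / total_bw for bw in bandwidths_gbps]
    # Convert per-link fractions into cut points over the element count.
    cuts = np.cumsum([int(round(f * flat.size)) for f in fractions])[:-1]
    return np.split(flat, cuts)

if __name__ == "__main__":
    grad = np.random.rand(1_000_000).astype(np.float32)  # one gradient tensor
    # Hypothetical bandwidths for two paths, e.g. a direct GPU link vs. the
    # CCI-attached disaggregated memory path.
    chunks = partition_by_bandwidth(grad, bandwidths_gbps=[16.0, 32.0])
    print([c.size for c in chunks])  # -> [333333, 666667], a 1:2 split

In a real system the chunk sizes would be driven by measured link bandwidths and re-balanced as conditions change; this sketch only shows the proportional-split arithmetic.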
Pages: 126-140
Number of pages: 15
Related Papers (50 in total; 10 shown)
  • [1] Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training. Choi, Hyeonseong; Lee, Jaehwan. APPLIED SCIENCES-BASEL, 2021, 11(21).
  • [2] On Efficient Training of Large-Scale Deep Learning Models. Shen, Li; Sun, Yan; Yu, Zhiyuan; Ding, Liang; Tian, Xinmei; Tao, Dacheng. ACM Computing Surveys, 57(03).
  • [3] Enabling Efficient Erasure Coding in Disaggregated Memory Systems. Li, Qiliang; Xu, Liangliang; Li, Yongkun; Lyu, Min; Wang, Wei; Zuo, Pengfei; Xu, Yinlong. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35(01): 154-168.
  • [4] DRackSim: Simulating CXL-enabled Large-Scale Disaggregated Memory Systems. Puri, Amit; Bellamconda, Kartheek; Narreddy, Kailash; Jose, John; Venkatesh, Tamarapalli; Narayanan, Vijaykrishnan. PROCEEDINGS OF THE 38TH ACM SIGSIM INTERNATIONAL CONFERENCE ON PRINCIPLES OF ADVANCED DISCRETE SIMULATION, ACM SIGSIM-PADS 2024, 2024: 3-14.
  • [5] Toward Optimally Efficient Search With Deep Learning for Large-Scale MIMO Systems. He, Le; He, Ke; Fan, Lisheng; Lei, Xianfu; Nallanathan, Arumugam; Karagiannidis, George K. IEEE TRANSACTIONS ON COMMUNICATIONS, 2022, 70(05): 3157-3168.
  • [6] Memory-Efficient Learning for Large-Scale Computational Imaging. Kellman, Michael; Zhang, Kevin; Markley, Eric; Tamir, Jon; Bostan, Emrah; Lustig, Michael; Waller, Laura. IEEE TRANSACTIONS ON COMPUTATIONAL IMAGING, 2020, 6(06): 1403-1414.
  • [7] Large-Scale Deep Learning for Building Intelligent Computer Systems. Dean, Jeff. PROCEEDINGS OF THE NINTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'16), 2016: 1-1.
  • [8] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training. Liang, Peng; Tang, Yu; Zhang, Xiaoda; Bai, Youhui; Su, Teng; Lai, Zhiquan; Qiao, Linbo; Li, Dongsheng. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34(08): 2377-2390.
  • [9] Hermes: Enabling efficient large-scale simulation in MATSim. Graur, Dan; Bruno, Rodrigo; Bischoff, Joschka; Rieser, Marcel; Scherr, Wolfgang; Hoefler, Torsten; Alonso, Gustavo. 12TH INTERNATIONAL CONFERENCE ON AMBIENT SYSTEMS, NETWORKS AND TECHNOLOGIES (ANT) / THE 4TH INTERNATIONAL CONFERENCE ON EMERGING DATA AND INDUSTRY 4.0 (EDI40) / AFFILIATED WORKSHOPS, 2021, 184: 635-641.
  • [10] Pipelined conditional synchronization on large-scale cache-coherent multiprocessors. Takesue, M. PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS, 2003: 132-138.