Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems

Cited by: 4
Authors
Wang, Zixuan [1]
Sim, Joonseop [2]
Lim, Euicheol [2]
Zhao, Jishen [1]
Affiliations
[1] Univ Calif San Diego, San Diego, CA 92103 USA
[2] SK Hynix, Syst Architecture Div, Icheon Si, South Korea
Keywords
deep learning; training; cache coherence
DOI
10.1109/HPCA53966.2022.00018
Chinese Library Classification (CLC) Number
TP3 [Computing technology; computer technology]
Subject Classification Code
0812
Abstract
Modern deep learning (DL) training is memory-intensive, constrained by the memory capacity of each computation component and by cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter communication is becoming a key performance bottleneck in distributed DL training. To improve parameter communication performance, we propose COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, allowing low-latency and parallel access to training data and model parameters shared among worker GPUs. To enable high-bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme that decouples and localizes parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and partitioning to fully utilize the non-uniform serial bus bandwidth that varies across different cloud computing systems. Finally, we design deadlock avoidance and dual synchronization mechanisms to ensure high-performance parameter synchronization. Our evaluation shows that COARSE achieves up to 48.3% faster DL training compared to the state-of-the-art MPI AllReduce communication.
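As an illustrative aside (not taken from the paper), the dynamic tensor routing and partitioning idea mentioned in the abstract can be sketched as splitting a parameter tensor across communication paths in proportion to their available bandwidth, so that no single path becomes the bottleneck. The Python sketch below is a minimal illustration under that assumption; the function name, the NumPy-based splitting, and the example bandwidth figures are hypothetical stand-ins for the paper's GPU and CCI data paths.

import numpy as np

def partition_by_bandwidth(tensor, bandwidths_gbps):
    """Split a flat tensor into chunks sized in proportion to the bandwidth
    of the link each chunk is routed over (hypothetical helper)."""
    flat = tensor.ravel()
    total_bw = float(sum(bandwidths_gbps))
    fractions = [bw / total_bw for bw in bandwidths_gbps]
    # Convert per-link fractions into cut points over the element count.
    cuts = np.cumsum([int(round(f * flat.size)) for f in fractions])[:-1]
    return np.split(flat, cuts)

if __name__ == "__main__":
    grad = np.random.rand(1_000_000).astype(np.float32)  # one gradient tensor
    # Hypothetical bandwidths for two paths, e.g. a direct GPU link vs. the
    # CCI-attached disaggregated memory path.
    chunks = partition_by_bandwidth(grad, bandwidths_gbps=[16.0, 32.0])
    print([c.size for c in chunks])  # -> [333333, 666667], a 1:2 split

In a real system the chunk sizes would be driven by measured link bandwidths and re-balanced as conditions change; this sketch only shows the proportional-split arithmetic.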
Pages: 126-140
Number of pages: 15
Related Papers (50 in total; 10 shown)
  • [1] Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training. Choi, Hyeonseong; Lee, Jaehwan. APPLIED SCIENCES-BASEL, 2021, 11(21).
  • [2] On Efficient Training of Large-Scale Deep Learning Models. Shen, Li; Sun, Yan; Yu, Zhiyuan; Ding, Liang; Tian, Xinmei; Tao, Dacheng. ACM Computing Surveys, 57(03).
  • [3] Enabling Efficient Erasure Coding in Disaggregated Memory Systems. Li, Qiliang; Xu, Liangliang; Li, Yongkun; Lyu, Min; Wang, Wei; Zuo, Pengfei; Xu, Yinlong. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35(01): 154-168.
  • [4] DRackSim: Simulating CXL-enabled Large-Scale Disaggregated Memory Systems. Puri, Amit; Bellamconda, Kartheek; Narreddy, Kailash; Jose, John; Venkatesh, Tamarapalli; Narayanan, Vijaykrishnan. PROCEEDINGS OF THE 38TH ACM SIGSIM INTERNATIONAL CONFERENCE ON PRINCIPLES OF ADVANCED DISCRETE SIMULATION, ACM SIGSIM-PADS 2024, 2024: 3-14.
  • [5] Toward Optimally Efficient Search With Deep Learning for Large-Scale MIMO Systems. He, Le; He, Ke; Fan, Lisheng; Lei, Xianfu; Nallanathan, Arumugam; Karagiannidis, George K. IEEE TRANSACTIONS ON COMMUNICATIONS, 2022, 70(05): 3157-3168.
  • [6] Memory-Efficient Learning for Large-Scale Computational Imaging. Kellman, Michael; Zhang, Kevin; Markley, Eric; Tamir, Jon; Bostan, Emrah; Lustig, Michael; Waller, Laura. IEEE TRANSACTIONS ON COMPUTATIONAL IMAGING, 2020, 6(06): 1403-1414.
  • [7] Large-Scale Deep Learning for Building Intelligent Computer Systems. Dean, Jeff. PROCEEDINGS OF THE NINTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'16), 2016: 1-1.
  • [8] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training. Liang, Peng; Tang, Yu; Zhang, Xiaoda; Bai, Youhui; Su, Teng; Lai, Zhiquan; Qiao, Linbo; Li, Dongsheng. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34(08): 2377-2390.
  • [9] Hermes: Enabling efficient large-scale simulation in MATSim. Graur, Dan; Bruno, Rodrigo; Bischoff, Joschka; Rieser, Marcel; Scherr, Wolfgang; Hoefler, Torsten; Alonso, Gustavo. 12TH INTERNATIONAL CONFERENCE ON AMBIENT SYSTEMS, NETWORKS AND TECHNOLOGIES (ANT) / THE 4TH INTERNATIONAL CONFERENCE ON EMERGING DATA AND INDUSTRY 4.0 (EDI40) / AFFILIATED WORKSHOPS, 2021, 184: 635-641.
  • [10] Pipelined conditional synchronization on large-scale cache-coherent multiprocessors. Takesue, M. PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS, 2003: 132-138.