Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems

Cited by: 4
Authors
Wang, Zixuan [1 ]
Sim, Joonseop [2 ]
Lim, Euicheol [2 ]
Zhao, Jishen [1 ]
Affiliations
[1] University of California San Diego, San Diego, CA 92103, USA
[2] SK Hynix, System Architecture Division, Icheon-si, South Korea
Keywords
deep learning; training; cache coherence
DOI
10.1109/HPCA53966.2022.00018
CLC Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Modern deep learning (DL) training is memory-consuming, constrained by the memory capacity of each computation component and by cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter communication is becoming a key performance bottleneck in distributed DL training. To improve parameter communication performance, we propose COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, allowing low-latency, parallel access to training data and model parameters shared among worker GPUs. To enable high-bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme that decouples and localizes parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and partitioning to fully utilize the non-uniform serial bus bandwidth that varies across different cloud computing systems. Finally, we design deadlock-avoidance and dual-synchronization mechanisms to ensure high-performance parameter synchronization. Our evaluation shows that COARSE achieves up to 48.3% faster DL training than state-of-the-art MPI AllReduce communication.
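To make the communication scheme concrete, below is a minimal Python sketch of the decentralized, reduce-scatter/all-gather style of parameter synchronization the abstract describes: each worker pushes gradient slices to per-shard home locations in a shared memory pool, then reads back the fully reduced result after a synchronization point. This is an illustration only, not COARSE's implementation; all names (SharedPool, accumulate, sync_step) are hypothetical, and Python threads with locks and a barrier stand in for worker GPUs and CCI hardware coherence.

import threading
import numpy as np


class SharedPool:
    """Stand-in for a cache-coherent disaggregated memory region.

    Each worker "owns" one shard of the parameter buffer; gradients for
    that shard are reduced in place at the shard's home location, which
    decentralizes and localizes the synchronization traffic.
    """

    def __init__(self, n_workers, n_params):
        self.n_workers = n_workers
        self.shards = np.array_split(np.zeros(n_params), n_workers)
        self.locks = [threading.Lock() for _ in range(n_workers)]
        self.barrier = threading.Barrier(n_workers)

    def accumulate(self, shard_id, grad_shard):
        # A write into the shared region; a CCI protocol would keep worker
        # caches coherent here without explicit message passing.
        with self.locks[shard_id]:
            self.shards[shard_id] += grad_shard

    def read(self, shard_id):
        with self.locks[shard_id]:
            return self.shards[shard_id].copy()


def sync_step(rank, pool, local_grad, out):
    # Phase 1 (reduce-scatter): push each slice of the local gradient to
    # that slice's home shard. Each call holds at most one lock at a time,
    # so lock-ordering deadlock cannot occur in this sketch.
    for shard_id, g in enumerate(np.array_split(local_grad, pool.n_workers)):
        pool.accumulate(shard_id, g)
    # The barrier plays the role of the synchronization step: no worker
    # reads until every worker has contributed its gradient.
    pool.barrier.wait()
    # Phase 2 (all-gather): read back the fully reduced gradient.
    out[rank] = np.concatenate([pool.read(s) for s in range(pool.n_workers)])


if __name__ == "__main__":
    n_workers, n_params = 4, 16
    pool = SharedPool(n_workers, n_params)
    grads = [np.full(n_params, float(r + 1)) for r in range(n_workers)]
    out = [None] * n_workers
    threads = [
        threading.Thread(target=sync_step, args=(r, pool, grads[r], out))
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every worker observes the same reduced gradient: 1 + 2 + 3 + 4 = 10.
    assert all(np.allclose(g, 10.0) for g in out)
    print(out[0])

The design point this sketch illustrates is that reducing each shard at a single home location confines that shard's synchronization traffic to one place, rather than routing every worker's full parameter set through a central reducer or a ring of peers; per the abstract, this decoupling and localization is what lets COARSE outperform MPI AllReduce.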
Pages: 126-140
Page count: 15
Related Papers
50 items in total
  • [31] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
    Li, Shenggui
    Liu, Hongxin
    Bian, Zhengda
    Fang, Jiarui
    Huang, Haichen
    Liu, Yuliang
    Wang, Boxiang
    You, Yang
    PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023 : 766 - 775
  • [32] Hybrid Electrical/Optical Switch Architectures for Training Distributed Deep Learning in Large-Scale
    Truong, Thao-Nguyen
    Takano, Ryousei
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (08) : 1332 - 1339
  • [33] Large-Scale Semi-Supervised Training in Deep Learning Acoustic Model for ASR
    Long, Yanhua
    Li, Yijie
    Wei, Shuang
    Zhang, Qiaozheng
    Yang, Chunxia
    IEEE ACCESS, 2019, 7 : 133615 - 133627
  • [34] Enabling Parallel Simulation of Large-Scale HPC Network Systems
    Mubarak, Misbah
    Carothers, Christopher D.
    Ross, Robert B.
    Carns, Philip
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (01) : 87 - 100
  • [35] Latch: Enabling large-scale automated testing on constrained systems
    Lauwaerts, T.
    Marr, S.
    Scholliers, C.
    SCIENCE OF COMPUTER PROGRAMMING, 2024, 238
  • [36] Designing Reconfigurable Large-Scale Deep Learning Systems Using Stochastic Computing
    Ren, Ao
    Li, Zhe
    Wang, Yanzhi
    Qiu, Qinru
    Yuan, Bo
    2016 IEEE INTERNATIONAL CONFERENCE ON REBOOTING COMPUTING (ICRC), 2016
  • [37] Enabling large-scale screening of Barrett's esophagus using weakly supervised deep learning in histopathology
    Bouzid, Kenza
    Sharma, Harshita
    Killcoyne, Sarah
    Castro, Daniel C.
    Schwaighofer, Anton
    Ilse, Max
    Salvatelli, Valentina
    Oktay, Ozan
    Murthy, Sumanth
    Bordeaux, Lucas
    Moore, Luiza
    O'Donovan, Maria
    Thieme, Anja
    Nori, Aditya
    Gehrung, Marcel
    Alvarez-Valle, Javier
    NATURE COMMUNICATIONS, 2024, 15 (01)
  • [39] TIM: Enabling Large-Scale White-Box Testing on In-App Deep Learning Models
    Wu, Hao
    Gong, Yuhang
    Ke, Xiaopeng
    Liang, Hanzhong
    Xu, Fengyuan
    Liu, Yunxin
    Zhong, Sheng
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2024, 19 : 8188 - 8203
  • [40] Efficient Objective Functions for Coordinated Learning in Large-Scale Distributed OSA Systems
    NoroozOliaee, MohammadJavad
    Hamdaoui, Bechir
    Tumer, Kagan
    IEEE TRANSACTIONS ON MOBILE COMPUTING, 2013, 12 (05) : 931 - 944