HammingMesh: A Network Topology for Large-Scale Deep Learning

Cited by: 0
Authors
Hoefler, Torsten [1 ]
Bonato, Tommaso [2 ]
De Sensi, Daniele [2 ]
Di Girolamo, Salvatore [2 ]
Li, Shigang [2 ]
Heddes, Marco [3 ]
Goel, Deepak [4 ]
Castro, Miguel [5 ]
Scott, Steve [3 ]
Affiliations
[1] Swiss Fed Inst Technol, Microsoft Corp, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Microsoft, Redmond, WA USA
[4] Microsoft, Sunnyvale, CA USA
[5] Microsoft, Cambridge, MA USA
Keywords
Deep neural networks; Hamming distance
DOI
10.1145/3623490
Chinese Library Classification (CLC): TP3 [Computing technology, computer technology]
Subject classification code: 0812
Abstract
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks, which in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate the data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job-scheduling flexibility. Specifically, HammingMesh can provide full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
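A note on the topology named in the title and keywords: the abstract and the "Hamming distance" keyword point to a Hamming-graph-style connectivity, in which two nodes are linked when their coordinates differ in exactly one dimension. The short Python sketch below only illustrates that generic adjacency rule for boards placed on a rows x cols grid; the function name, parameters, and board-level abstraction are assumptions made for this sketch and are not details taken from the paper.

# Illustrative sketch only: generic 2D Hamming-graph adjacency, assuming boards
# sit on a rows x cols grid and two boards are logically linked when their
# coordinates differ in exactly one position (Hamming distance 1).
# Names and parameters are hypothetical, not taken from the paper.
from itertools import product

def hamming_graph_edges(rows: int, cols: int):
    """Return the set of board-level edges of a rows x cols Hamming graph."""
    boards = list(product(range(rows), range(cols)))
    edges = set()
    for (r1, c1), (r2, c2) in product(boards, boards):
        if (r1 != r2) + (c1 != c2) == 1:  # same row or same column
            edges.add(frozenset(((r1, c1), (r2, c2))))
    return edges

if __name__ == "__main__":
    # On a 4 x 4 grid each board has (4 - 1) + (4 - 1) = 6 neighbors,
    # so the graph has 16 * 6 / 2 = 48 board-level links.
    print(len(hamming_graph_edges(4, 4)))  # -> 48

This toy adjacency only captures the connectivity pattern suggested by the keyword; it does not model how the paper's HammingMesh design realizes these links in hardware.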
Pages: 97 - 105
Number of pages: 9
Related papers
50 items in total
  • [1] HammingMesh: A Network Topology for Large-Scale Deep Learning
    Hoefler, Torsten
    Bonato, Tommaso
    De Sensi, Daniele
    Di Girolamo, Salvatore
    Li, Shigang
    Heddes, Marco
    Belk, Jon
    Goel, Deepak
    Castro, Miguel
    Scott, Steve
    SC22: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2022,
  • [2] Topology-aware Sparse Allreduce for Large-scale Deep Learning
    Thao Nguyen Truong
    Wahib, Mohamed
    Takano, Ryousei
    2019 IEEE 38TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC), 2019,
  • [3] Large-scale Deep Learning at Baidu
    Yu, Kai
    PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 2211 - 2211
  • [4] Deep Large-Scale Multitask Learning Network for Gene Expression Inference
    Dizaji, Kamran Ghasedi
    Chen, Wei
    Huang, Heng
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2021, 28 (05) : 485 - 500
  • [5] Deep Reinforcement Learning for Network Service Recovery in Large-scale Failures
    Akashi, Kazuaki
    Fukuda, Nobukazu
    Kanai, Shunsuke
    Tayama, Kenichi
    2023 19TH INTERNATIONAL CONFERENCE ON NETWORK AND SERVICE MANAGEMENT, CNSM, 2023,
  • [6] NetSentry: A deep learning approach to detecting incipient large-scale network attacks
    Liu, Haoyu
    Patras, Paul
    COMPUTER COMMUNICATIONS, 2022, 191 : 119 - 132
  • [7] A deep learning network for semantic labeling of large-scale urban point clouds
    Yang B.
    Han X.
    Dong Z.
Cehui Xuebao/Acta Geodaetica et Cartographica Sinica, 2021, 50 (08): 1059 - 1067
  • [8] Large-scale transport simulation by deep learning
    Jie Pan
    Nature Computational Science, 2021, 1 : 306 - 306
  • [9] Tractable large-scale deep reinforcement learning
    Sarang, Nima
    Poullis, Charalambos
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 232
  • [10] Large-scale transport simulation by deep learning
    Pan, Jie
NATURE COMPUTATIONAL SCIENCE, 2021, 1 (05): 306 - 306