HammingMesh: A Network Topology for Large-Scale Deep Learning

Cited by: 3
Authors
Hoefler, Torsten [1, 2]
Bonato, Tommaso [1]
De Sensi, Daniele [1]
Di Girolamo, Salvatore [1]
Li, Shigang [1]
Heddes, Marco [2]
Belk, Jon [2]
Goel, Deepak [2]
Castro, Miguel [2]
Scott, Steve [2]
Affiliations
[1] Swiss Fed Inst Technol, Dept Comp Sci, Ramistr 101, CH-8092 Zurich, Switzerland
[2] Microsoft Corp, One Microsoft Way, Redmond, WA 98052 USA
Funding
European Research Council
Keywords
Network architecture; Deep Learning; Clusters; Software defined networking
DOI
10.1109/SC41404.2022.00016
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
Pages: 18