HammingMesh: A Network Topology for Large-Scale Deep Learning

Cited by: 0
Authors
Hoefler, Torsten [1 ]
Bonato, Tommaso [2 ]
De Sensi, Daniele [2 ]
Di Girolamo, Salvatore [2 ]
Li, Shigang [2 ]
Heddes, Marco [3 ]
Goel, Deepak [4 ]
Castro, Miguel [5 ]
Scott, Steve [3 ]
Affiliations
[1] Swiss Fed Inst Technol, Microsoft Corp, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Microsoft, Redmond, WA USA
[4] Microsoft, Sunnyvale, CA USA
[5] Microsoft, Cambridge, MA USA
Keywords
Deep neural networks; Hamming distance
DOI
10.1145/3623490
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job-scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep-learning systems with extreme bandwidth requirements.
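The abstract does not spell out the construction, but the "Hamming distance" keyword suggests a Hamming-graph reading: boards (each hosting a local group of accelerators) sit at coordinates of a 2D grid and are wired along rows and columns, i.e. two boards are directly connected when their coordinates differ in exactly one dimension. The sketch below is only an illustration of that Hamming-graph connectivity over a hypothetical grid of boards; the function name, grid size, and the omitted intra-board mesh are assumptions, not details taken from this record or the paper.

# Illustrative sketch (assumption, not the paper's construction): boards at
# (row, col) coordinates are adjacent when their coordinates differ in exactly
# one dimension (Hamming distance 1), i.e. they share a row or a column.
from itertools import product

def hamming_board_graph(rows: int, cols: int) -> dict:
    """Adjacency lists for a rows x cols grid of boards, connecting
    boards that share a row or a column (Hamming distance 1)."""
    boards = list(product(range(rows), range(cols)))
    adj = {b: [] for b in boards}
    for (r1, c1), (r2, c2) in product(boards, repeat=2):
        if (r1, c1) == (r2, c2):
            continue
        # exactly one coordinate equal -> Hamming distance 1
        if (r1 == r2) != (c1 == c2):
            adj[(r1, c1)].append((r2, c2))
    return adj

if __name__ == "__main__":
    adj = hamming_board_graph(4, 4)
    # Each board reaches (rows - 1) + (cols - 1) neighbors directly.
    print(len(adj[(0, 0)]))  # -> 6 for a 4 x 4 board grid

Under this reading, the two grid dimensions give the "two dimensions of parallelism" mentioned in the abstract: a training job can be mapped onto a rectangular sub-block of boards and keep full row/column bandwidth in isolation. This mapping is an interpretation of the abstract, not a statement of the paper's design.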
Pages: 97 - 105
Page count: 9