HammingMesh: A Network Topology for Large-Scale Deep Learning

Cited by: 0
Authors
Hoefler, Torsten [1 ]
Bonoto, Tommaso [2 ]
De Sensi, Daniele [2 ]
Di Girolamo, Salvatore [2 ]
Li, Shigang [2 ]
Heddes, Marco [3 ]
Goel, Deepak [4 ]
Castro, Miguel [5 ]
Scott, Steve [3 ]
Affiliations
[1] Swiss Fed Inst Technol, Microsoft Corp, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Microsoft, Redmond, WA USA
[4] Microsoft, Sunnyvale, CA USA
[5] Microsoft, Cambridge, MA USA
Keywords
Deep neural networks; Hamming distance
DOI
10.1145/3623490
CLC Number
TP3 [Computing technology, computer technology]
Discipline Code
0812
Abstract
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job-scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep-learning systems with extreme bandwidth requirements.
Pages: 97-105
Page count: 9
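
The abstract describes HammingMesh only at a high level: boards of accelerators whose rows and columns are joined by additional networks, so that two dimensions of parallelism each see full bandwidth. As a rough illustration only (not the paper's exact construction), the sketch below builds a toy HammingMesh-style adjacency in Python. The parameters a and b, the build_hammingmesh helper, and the single "row"/"column" switch abstraction (standing in for the per-row and per-column fat trees discussed in the paper) are all illustrative assumptions.

```python
# Toy HammingMesh-style topology sketch (illustrative only, not the paper's
# exact construction). Assumptions: each board is an a x a 2D mesh of
# accelerators, boards form a b x b grid, and all boards in a board-row or
# board-column attach to one abstract "row" or "column" switch.

from collections import defaultdict


def build_hammingmesh(a: int, b: int):
    """Return an adjacency dict for a toy HammingMesh with a*a accelerators
    per board and b*b boards. Nodes are tuples:
      ("acc", bx, by, x, y)  accelerator (x, y) on board (bx, by)
      ("row", by)            abstract switch joining board-row by
      ("col", bx)            abstract switch joining board-column bx
    """
    adj = defaultdict(set)

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for bx in range(b):
        for by in range(b):
            # Intra-board 2D mesh links between neighboring accelerators.
            for x in range(a):
                for y in range(a):
                    if x + 1 < a:
                        connect(("acc", bx, by, x, y), ("acc", bx, by, x + 1, y))
                    if y + 1 < a:
                        connect(("acc", bx, by, x, y), ("acc", bx, by, x, y + 1))
            # Board-edge accelerators attach to the off-board networks,
            # modeled here as one switch per board-row and one per board-column.
            for x in range(a):
                connect(("acc", bx, by, x, 0), ("col", bx))
                connect(("acc", bx, by, x, a - 1), ("col", bx))
            for y in range(a):
                connect(("acc", bx, by, 0, y), ("row", by))
                connect(("acc", bx, by, a - 1, y), ("row", by))
    return adj


if __name__ == "__main__":
    adj = build_hammingmesh(a=2, b=2)
    accels = [n for n in adj if n[0] == "acc"]
    print(f"{len(accels)} accelerators, {len(adj) - len(accels)} switches")
```

In this simplified model, a training job that maps one parallelism dimension along board-rows and the other along board-columns never shares its row or column switches with a job placed on disjoint rows and columns, which is the scheduling-flexibility and isolation property the abstract refers to.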