HammingMesh: A Network Topology for Large-Scale Deep Learning

Cited by: 0
Authors
Hoefler, Torsten [1 ]
Bonato, Tommaso [2 ]
De Sensi, Daniele [2 ]
Di Girolamo, Salvatore [2 ]
Li, Shigang [2 ]
Heddes, Marco [3 ]
Goel, Deepak [4 ]
Castro, Miguel [5 ]
Scott, Steve [3 ]
Affiliations
[1] Swiss Fed Inst Technol, Microsoft Corp, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Microsoft, Redmond, WA USA
[4] Microsoft, Sunnyvale, CA USA
[5] Microsoft, Cambridge, MA USA
Keywords
Deep neural networks; Hamming distance
DOI
10.1145/3623490
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job-scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep-learning systems with extreme bandwidth requirements.
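The abstract does not spell out the construction, but the "Hamming distance" keyword suggests a Hamming-graph reading: boards (each hosting a local group of accelerators) sit at coordinates of a 2D grid and are wired along rows and columns, i.e. two boards are directly connected when their coordinates differ in exactly one dimension. The sketch below is only an illustration of that Hamming-graph connectivity over a hypothetical grid of boards; the function name, grid size, and the omitted intra-board mesh are assumptions, not details taken from this record or the paper.

# Illustrative sketch (assumption, not the paper's construction): boards at
# (row, col) coordinates are adjacent when their coordinates differ in exactly
# one dimension (Hamming distance 1), i.e. they share a row or a column.
from itertools import product

def hamming_board_graph(rows: int, cols: int) -> dict:
    """Adjacency lists for a rows x cols grid of boards, connecting
    boards that share a row or a column (Hamming distance 1)."""
    boards = list(product(range(rows), range(cols)))
    adj = {b: [] for b in boards}
    for (r1, c1), (r2, c2) in product(boards, repeat=2):
        if (r1, c1) == (r2, c2):
            continue
        # exactly one coordinate equal -> Hamming distance 1
        if (r1 == r2) != (c1 == c2):
            adj[(r1, c1)].append((r2, c2))
    return adj

if __name__ == "__main__":
    adj = hamming_board_graph(4, 4)
    # Each board reaches (rows - 1) + (cols - 1) neighbors directly.
    print(len(adj[(0, 0)]))  # -> 6 for a 4 x 4 board grid

Under this reading, the two grid dimensions give the "two dimensions of parallelism" mentioned in the abstract: a training job can be mapped onto a rectangular sub-block of boards and keep full row/column bandwidth in isolation. This mapping is an interpretation of the abstract, not a statement of the paper's design.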
Pages: 97 - 105
Page count: 9