Lotus: A New Topology for Large-scale Distributed Machine Learning

Cited by: 3
Authors
Lu, Yunfeng [1 ]
Gu, Huaxi [1 ]
Yu, Xiaoshan [1 ]
Chakrabarty, Krishnendu [2 ]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, Taibai South Rd 2, Xian 710071, Shaanxi, Peoples R China
[2] Duke Univ, Dept Elect & Comp Engn, 2080 Duke Univ Rd, Durham, NC 27708 USA
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Optical interconnects; machine learning; topology; routing algorithm; architecture
DOI
10.1145/3415749
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Machine learning is at the heart of many services provided by data centers. To improve its performance, several parameter (gradient) synchronization methods have been proposed in the literature. These synchronization algorithms have different communication characteristics and accordingly place different demands on the network architecture, demands that traditional data-center networks cannot easily meet. We therefore analyze the communication profiles of several common synchronization algorithms and propose a machine learning-oriented network architecture matched to their characteristics. The proposed design, named Lotus because its layout resembles a lotus flower, is a hybrid optical/electrical architecture based on arrayed waveguide grating routers (AWGRs). In Lotus, a complete bipartite graph is used within each group to improve bisection bandwidth and scalability. Each pair of groups is connected by an optical link, and AWGRs between adjacent groups enhance path diversity and network reliability. We also present an efficient routing algorithm that makes full use of the path diversity of Lotus, further increasing network performance. Simulation results show that Lotus outperforms Dragonfly and 3D-Torus under realistic traffic patterns for different synchronization algorithms.
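To make the topology description concrete, below is a minimal, illustrative Python sketch of a Lotus-like wiring under the structure stated in the abstract: each group is internally a complete bipartite graph, and every pair of groups is joined by one optical link. The group count, group size, and inter-group port choice are hypothetical parameters chosen for illustration, not the paper's configuration; the AWGR stages between adjacent groups and the routing algorithm are omitted.

```python
# Illustrative sketch of a Lotus-like topology (not the paper's exact design).
# Nodes are switches identified by (group id, switch index within group).
from itertools import combinations


def build_lotus_like(num_groups: int, half_group: int) -> set[frozenset]:
    """Return the undirected edge set of a simplified Lotus-like network."""
    edges: set[frozenset] = set()
    for g in range(num_groups):
        # Intra-group: complete bipartite graph between the two halves of
        # the group, which raises bisection bandwidth inside the group.
        left = [(g, i) for i in range(half_group)]
        right = [(g, half_group + i) for i in range(half_group)]
        for u in left:
            for v in right:
                edges.add(frozenset({u, v}))
    # Inter-group: one optical link per pair of groups.
    # Attaching it at switch 0 of each group is a hypothetical port choice.
    for ga, gb in combinations(range(num_groups), 2):
        edges.add(frozenset({(ga, 0), (gb, 0)}))
    return edges


if __name__ == "__main__":
    topo = build_lotus_like(num_groups=4, half_group=3)
    # 4 groups * (3 * 3) intra-group links + C(4, 2) inter-group links = 42
    print(f"{len(topo)} links in total")
```

Because every group pair has a direct optical link while each group also has many internal switches, a packet between groups can detour through different intra-group switches before taking the inter-group hop; this is the kind of path diversity the paper's routing algorithm is said to exploit.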
Pages: 21