PS-Hybrid: Hybrid communication framework for large recommendation model training

Cited by: 0
Authors
Miao X. [1]
Zhang M. [1]
Shao Y. [2]
Cui B. [1]
Affiliations
[1] School of Electronics Engineering and Computer Science, Peking University, Beijing
[2] School of Computer Science, Beijing University of Posts and Telecommunications, Beijing
Keywords
AllReduce; distributed deep learning; parameter server; recommendation model
DOI
10.16511/j.cnki.qhdxxb.2021.22.041
Abstract
Most traditional distributed deep learning training systems are based on either parameter servers, whose centralized communication architecture suffers from severe communication bottlenecks when communication volumes are large, or AllReduce frameworks, whose decentralized communication architecture cannot hold the entire model when the number of parameters is very large. This paper presents PS-Hybrid, a hybrid communication framework for training large deep learning recommendation models, which decouples the communication logic for the embedding parameters from that for the other parameters. Tests show that the prototype system outperforms previous parameter-server systems for recommendation model training; it is 48% faster than TensorFlow-PS with 16 computing nodes. © 2022 Press of Tsinghua University. All rights reserved.
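The decoupling described in the abstract suggests a simple pattern: route the sparse embedding rows through parameter-server style push/pull while synchronizing the dense layers with AllReduce. The Python/PyTorch sketch below only illustrates that general idea and is not code from the paper; PSClient, pull, push, and hybrid_step are hypothetical names, and the actual PS-Hybrid implementation may differ.

import torch
import torch.distributed as dist


class PSClient:
    """Hypothetical client for a parameter server holding the embedding table."""

    def pull(self, ids: torch.Tensor) -> torch.Tensor:
        """Fetch only the embedding rows touched by this batch (sparse pull)."""
        raise NotImplementedError

    def push(self, ids: torch.Tensor, grads: torch.Tensor) -> None:
        """Send the sparse gradients of those rows back to the server."""
        raise NotImplementedError


def hybrid_step(batch_ids, labels, dense_model, ps, optimizer, loss_fn):
    # Sparse path (centralized): pull only the needed embedding rows from the PS.
    emb = ps.pull(batch_ids)          # shape: [batch, emb_dim]
    emb.requires_grad_(True)

    # Forward/backward through the dense layers replicated on every worker.
    loss = loss_fn(dense_model(emb), labels)
    loss.backward()

    # Dense path (decentralized): average dense-layer gradients with AllReduce.
    world_size = dist.get_world_size()
    for p in dense_model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad.div_(world_size)
    optimizer.step()
    optimizer.zero_grad()

    # Sparse path again: push only the touched rows' gradients back to the PS.
    ps.push(batch_ids, emb.grad)
    return loss.item()

In this sketch the AllReduce traffic grows only with the dense-layer size, while the per-step parameter-server traffic grows only with the number of embedding rows actually touched by the batch, which is the motivation for handling the two kinds of parameters with different communication mechanisms.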
Pages: 1417-1425
Page count: 8