Petuum: A New Platform for Distributed Machine Learning on Big Data

Cited by: 55
|
Authors
Xing, Eric P. [1 ]
Ho, Qirong [2 ]
Dai, Wei [1 ]
Kim, Jin Kyu [1 ]
Wei, Jinliang [1 ]
Lee, Seunghak [1 ]
Zheng, Xun [1 ]
Xie, Pengtao [1 ]
Kumar, Abhimanu [1 ]
Yu, Yaoliang [1 ]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
[2] ASTAR, Inst Infocomm Res, Singapore, Singapore
Funding
National Science Foundation (USA);
Keywords
Machine Learning; Big Data; Big Model; Distributed Systems; Theory; Data-Parallelism; Model-Parallelism;
DOI
10.1145/2783258.2783323
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence, all of which stem from the optimization-centric nature shared by ML programs' mathematical definitions and the iterative-convergent behavior of their algorithmic solutions. These properties present unique opportunities for an integrative system design, built on bounded-latency network synchronization and dynamic load-balancing scheduling, which is efficient, programmable, and enjoys provable correctness guarantees. We demonstrate how such a design in light of ML first principles leads to significant performance improvements versus well-known implementations of several ML programs, allowing them to run in much less time and at considerably larger model sizes, on modestly sized computer clusters.
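The "bounded-latency network synchronization" the abstract refers to is a stale synchronous parallel (SSP) style of data-parallel execution: each worker keeps its own iteration clock and may run ahead of the slowest worker only by a bounded number of iterations. The sketch below is a minimal, single-machine illustration of that idea only; it is not the Petuum API, and the `SSPTable` class, its methods, and the toy update rule are hypothetical names invented for this example.

```python
# Minimal sketch of bounded-staleness (SSP-style) synchronization.
# Hypothetical illustration only -- not the Petuum API.
import threading
from collections import defaultdict

class SSPTable:
    """Toy shared parameter table with per-worker clocks and a staleness bound."""
    def __init__(self, num_workers, staleness):
        self.params = defaultdict(float)   # shared model parameters
        self.clock = [0] * num_workers     # iteration counter per worker
        self.staleness = staleness         # max allowed clock gap between workers
        self.cv = threading.Condition()

    def inc(self, key, delta):
        # Accumulate an additive update (stand-in for a gradient contribution).
        with self.cv:
            self.params[key] += delta

    def advance_clock(self, worker_id):
        # Mark the end of one local iteration, then block while this worker
        # is more than `staleness` iterations ahead of the slowest worker.
        with self.cv:
            self.clock[worker_id] += 1
            self.cv.notify_all()
            while self.clock[worker_id] > min(self.clock) + self.staleness:
                self.cv.wait()

def worker(table, worker_id, iterations):
    for _ in range(iterations):
        table.inc("w", 0.1 * (worker_id + 1))   # toy update; a real worker would
        table.advance_clock(worker_id)          # compute a gradient step here

if __name__ == "__main__":
    NUM_WORKERS, STALENESS, ITERS = 4, 2, 10
    table = SSPTable(NUM_WORKERS, STALENESS)
    threads = [threading.Thread(target=worker, args=(table, i, ITERS))
               for i in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final value of 'w':", round(table.params["w"], 3))
```

With staleness 2, a fast worker blocks only when it gets more than two iterations ahead of the slowest one; with staleness 0 the sketch degenerates to bulk-synchronous execution. The error-tolerance property noted in the abstract is what lets such bounded staleness preserve convergence.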
Pages: 1335 - 1344
Number of pages: 10
Related Papers
50 records in total
  • [1] Distributed Mixture-of-Experts for Big Data using PETUUM framework
    Peralta, Billy
    Parra, Luis
    Herrera, Oriel
    Caro, Luis
    2017 36TH INTERNATIONAL CONFERENCE OF THE CHILEAN COMPUTER SCIENCE SOCIETY (SCCC), 2017
  • [2] Big Data Platform Configuration Using Machine Learning
    Yeh, Chao-Chun
    Lu, Han-Lin
    Zhou, Jiazheng
    Chang, Sheng-An
    Lin, Xuan-Yi
    Sun, Yi-Chiao
    Huang, Shih-Kun
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2020, 36 (03) : 469 - 493
  • [3] Strategies and Principles of Distributed Machine Learning on Big Data
    Xing, Eric P.
    Ho, Qirong
    Xie, Pengtao
    Dai, Wei
    ENGINEERING, 2016, 2 (02) : 179 - 195
  • [4] MeLoN: Distributed Deep Learning meets the Big Data Platform
    Kang, Dae-Cheol
    Heo, Seoungbeom
    Jang, Hyeounji
    Lee, Hyeock-Jin
    Cho, Minkyoung
    Kim, Jik-Soo
    2021 IEEE INTERNATIONAL CONFERENCE ON AUTONOMIC COMPUTING AND SELF-ORGANIZING SYSTEMS COMPANION (ACSOS-C 2021), 2021: 32 - 37
  • [5] Distributed Weighted Extreme Learning Machine for Big Imbalanced Data Learning
    Wang, Zhiqiong
    Xin, Junchang
    Tian, Shuo
    Yu, Ge
    PROCEEDINGS OF ELM-2015, VOL 1: THEORY, ALGORITHMS AND APPLICATIONS (I), 2016, 6 : 319 - 332
  • [6] Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning
    Wang, Zhiqiong
    Xin, Junchang
    Yang, Hongxu
    Tian, Shuo
    Yu, Ge
    Xu, Chenren
    Yao, Yudong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2017, 22 (02) : 160 - 173
  • [7] SPARK-A Big Data Processing Platform for Machine Learning
    Fu, Jian
    Sun, Junwei
    Wang, Kaiyuan
    2016 2ND INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS - COMPUTING TECHNOLOGY, INTELLIGENT TECHNOLOGY, INDUSTRIAL INFORMATION INTEGRATION (ICIICII), 2016, : 48 - 51
  • [8] Protecting Machine Learning Integrity in Distributed Big Data Networking
    Wei, Yunkai
    Chen, Yijin
    Xiao, Mingyue
    Maharjan, Sabita
    Zhang, Yan
    IEEE NETWORK, 2020, 34 (04): : 84 - 90
  • [9] A Survey of Distributed and Parallel Extreme Learning Machine for Big Data
    Wang, Zhiqiong
    Sui, Ling
    Xin, Junchang
    Qu, Luxuan
    Yao, Yudong
    IEEE ACCESS, 2020, 8 : 201247 - 201258