Petuum: A New Platform for Distributed Machine Learning on Big Data

Cited by: 55
|
Authors
Xing, Eric P. [1 ]
Ho, Qirong [2 ]
Dai, Wei [1 ]
Kim, Jin Kyu [1 ]
Wei, Jinliang [1 ]
Lee, Seunghak [1 ]
Zheng, Xun [1 ]
Xie, Pengtao [1 ]
Kumar, Abhimanu [1 ]
Yu, Yaoliang [1 ]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
[2] ASTAR, Inst Infocomm Res, Singapore, Singapore
Funding
National Science Foundation (USA);
Keywords
Machine Learning; Big Data; Big Model; Distributed Systems; Theory; Data-Parallelism; Model-Parallelism;
DOI
10.1145/2783258.2783323
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence, all of which stem from the optimization-centric nature shared by ML programs' mathematical definitions and the iterative-convergent behavior of their algorithmic solutions. These properties present unique opportunities for an integrative system design, built on bounded-latency network synchronization and dynamic load-balancing scheduling, which is efficient, programmable, and enjoys provable correctness guarantees. We demonstrate how such a design in light of ML first principles leads to significant performance improvements versus well-known implementations of several ML programs, allowing them to run in much less time and at considerably larger model sizes, on modestly sized computer clusters.
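The "bounded-latency network synchronization" the abstract refers to is a stale synchronous parallel (SSP) style of data-parallel execution: each worker keeps its own iteration clock and may run ahead of the slowest worker only by a bounded number of iterations. The sketch below is a minimal, single-machine illustration of that idea only; it is not the Petuum API, and the `SSPTable` class, its methods, and the toy update rule are hypothetical names invented for this example.

```python
# Minimal sketch of bounded-staleness (SSP-style) synchronization.
# Hypothetical illustration only -- not the Petuum API.
import threading
from collections import defaultdict

class SSPTable:
    """Toy shared parameter table with per-worker clocks and a staleness bound."""
    def __init__(self, num_workers, staleness):
        self.params = defaultdict(float)   # shared model parameters
        self.clock = [0] * num_workers     # iteration counter per worker
        self.staleness = staleness         # max allowed clock gap between workers
        self.cv = threading.Condition()

    def inc(self, key, delta):
        # Accumulate an additive update (stand-in for a gradient contribution).
        with self.cv:
            self.params[key] += delta

    def advance_clock(self, worker_id):
        # Mark the end of one local iteration, then block while this worker
        # is more than `staleness` iterations ahead of the slowest worker.
        with self.cv:
            self.clock[worker_id] += 1
            self.cv.notify_all()
            while self.clock[worker_id] > min(self.clock) + self.staleness:
                self.cv.wait()

def worker(table, worker_id, iterations):
    for _ in range(iterations):
        table.inc("w", 0.1 * (worker_id + 1))   # toy update; a real worker would
        table.advance_clock(worker_id)          # compute a gradient step here

if __name__ == "__main__":
    NUM_WORKERS, STALENESS, ITERS = 4, 2, 10
    table = SSPTable(NUM_WORKERS, STALENESS)
    threads = [threading.Thread(target=worker, args=(table, i, ITERS))
               for i in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final value of 'w':", round(table.params["w"], 3))
```

With staleness 2, a fast worker blocks only when it gets more than two iterations ahead of the slowest one; with staleness 0 the sketch degenerates to bulk-synchronous execution. The error-tolerance property noted in the abstract is what lets such bounded staleness preserve convergence.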
Pages: 1335 - 1344
Number of pages: 10
Related Papers
50 records in total
  • [1] Distributed Mixture-of-Experts for Big Data using PETUUM framework
    Peralta, Billy
    Parra, Luis
    Herrera, Oriel
    Caro, Luis
    2017 36TH INTERNATIONAL CONFERENCE OF THE CHILEAN COMPUTER SCIENCE SOCIETY (SCCC), 2017
  • [2] Big Data Platform Configuration Using Machine Learning
    Yeh, Chao-Chun
    Lu, Han-Lin
    Zhou, Jiazheng
    Chang, Sheng-An
    Lin, Xuan-Yi
    Sun, Yi-Chiao
    Huang, Shih-Kun
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2020, 36 (03) : 469 - 493
  • [3] Strategies and Principles of Distributed Machine Learning on Big Data
    Xing, Eric P.
    Ho, Qirong
    Xie, Pengtao
    Dai, Wei
    ENGINEERING, 2016, 2 (02) : 179 - 195
  • [4] MeLoN: Distributed Deep Learning meets the Big Data Platform
    Kang, Dae-Cheol
    Heo, Seoungbeom
    Jang, Hyeounji
    Lee, Hyeock-Jin
    Cho, Minkyoung
    Kim, Jik-Soo
    2021 IEEE INTERNATIONAL CONFERENCE ON AUTONOMIC COMPUTING AND SELF-ORGANIZING SYSTEMS COMPANION (ACSOS-C 2021), 2021: 32 - 37
  • [5] Distributed Weighted Extreme Learning Machine for Big Imbalanced Data Learning
    Wang, Zhiqiong
    Xin, Junchang
    Tian, Shuo
    Yu, Ge
    PROCEEDINGS OF ELM-2015, VOL 1: THEORY, ALGORITHMS AND APPLICATIONS (I), 2016, 6 : 319 - 332
  • [6] Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning
    Wang, Zhiqiong
    Xin, Junchang
    Yang, Hongxu
    Tian, Shuo
    Yu, Ge
    Xu, Chenren
    Yao, Yudong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2017, 22 (02) : 160 - 173
  • [7] SPARK-A Big Data Processing Platform for Machine Learning
    Fu, Jian
    Sun, Junwei
    Wang, Kaiyuan
    2016 2ND INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS - COMPUTING TECHNOLOGY, INTELLIGENT TECHNOLOGY, INDUSTRIAL INFORMATION INTEGRATION (ICIICII), 2016, : 48 - 51
  • [8] Protecting Machine Learning Integrity in Distributed Big Data Networking
    Wei, Yunkai
    Chen, Yijin
    Xiao, Mingyue
    Maharjan, Sabita
    Zhang, Yan
    IEEE NETWORK, 2020, 34 (04): : 84 - 90
  • [9] A Survey of Distributed and Parallel Extreme Learning Machine for Big Data
    Wang, Zhiqiong
    Sui, Ling
    Xin, Junchang
    Qu, Luxuan
    Yao, Yudong
    IEEE ACCESS, 2020, 8 : 201247 - 201258