Petuum: A New Platform for Distributed Machine Learning on Big Data

Citations: 55

Authors:

Xing, Eric P. [1]

Ho, Qirong [2]

Dai, Wei [1]

Kim, Jin Kyu [1]

Wei, Jinliang [1]

Lee, Seunghak [1]

Zheng, Xun [1]

Xie, Pengtao [1]

Kumar, Abhimanu [1]

Yu, Yaoliang [1]

Affiliations:

[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA

[2] ASTAR, Inst Infocomm Res, Singapore, Singapore

Source:

KDD'15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | 2015

Funding:

U.S. National Science Foundation;

Keywords:

Machine Learning; Big Data; Big Model; Distributed Systems; Theory; Data-Parallelism; Model-Parallelism;

DOI:

10.1145/2783258.2783323

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence; all stem from the optimization-centric nature shared in ML programs' mathematical definitions, and the iterative convergent behavior of their algorithmic solutions. These properties present unique opportunities for an integrative system design, built on bounded-latency network synchronization and dynamic load-balancing scheduling, which is efficient, programmable, and enjoys provable correctness guarantees. We demonstrate how such a design in light of ML first principles leads to significant performance improvements versus well-known implementations of several ML programs, allowing them to run in much less time and at considerably larger model sizes, on modestly-sized computer clusters.

Pages: 1335-1344

Page count: 10