KungFu: Making Training in Distributed Machine Learning Adaptive

Cited by: 0
Authors:
Mai, Luo [1 ]
Li, Guo [1 ]
Wagenlander, Marcel [1 ]
Fertakis, Konstantinos [1 ]
Brabete, Andrei-Octavian [1 ]
Pietzuch, Peter [1 ]
Institutions:
[1] Imperial College London, London, England
Keywords: (none listed)
DOI: none available
Chinese Library Classification (CLC): TP31 (Computer Software)
Discipline codes: 081202; 0835
Abstract
When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe KungFu, a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.
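To make the Adaptation Policy abstraction concrete, the following is a minimal Python sketch of the idea described in the abstract: a policy consumes a monitored metric and triggers a control action such as cluster rescaling. This is an illustration under assumptions, not KungFu's actual API; the names AdaptationPolicy, NoiseScaleScalingPolicy, Cluster, on_metric, and resize are all hypothetical.

    # Hypothetical sketch of an Adaptation Policy (AP) in the spirit of the
    # paper. All class and method names below are illustrative assumptions,
    # not KungFu's actual API.

    class AdaptationPolicy:
        """An AP maps real-time monitored metrics to control actions."""

        def on_metric(self, step, metrics, cluster):
            raise NotImplementedError

    class NoiseScaleScalingPolicy(AdaptationPolicy):
        """Grow the cluster when the monitored gradient noise scale rises,
        suggesting a larger effective batch would still train efficiently."""

        def __init__(self, grow_threshold=1000.0, max_workers=32):
            self.grow_threshold = grow_threshold
            self.max_workers = max_workers

        def on_metric(self, step, metrics, cluster):
            noise_scale = metrics.get("gradient_noise_scale")
            if noise_scale is None:
                return
            if noise_scale > self.grow_threshold and cluster.size() < self.max_workers:
                cluster.resize(min(cluster.size() * 2, self.max_workers))

    class Cluster:
        """Stand-in for the runtime's view of the worker set."""

        def __init__(self, size=4):
            self._size = size

        def size(self):
            return self._size

        def resize(self, new_size):
            # A real system would run an elastic rescaling protocol here;
            # this stub only records the requested size.
            print(f"rescaling cluster: {self._size} -> {new_size}")
            self._size = new_size

    if __name__ == "__main__":
        cluster = Cluster(size=4)
        policy = NoiseScaleScalingPolicy(grow_threshold=1000.0)
        # Simulated per-step metrics, as monitoring operators might report them.
        for step, ns in enumerate([200.0, 800.0, 1500.0, 2500.0]):
            policy.on_metric(step, {"gradient_noise_scale": ns}, cluster)

In the paper's design, such a policy would not run as a separate loop: its monitoring and control logic is compiled into operators embedded in the TensorFlow dataflow graph, so metric collection piggybacks on the training iteration itself.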
Pages: 937-954 (18 pages)
Related papers (50 total; first 10 listed)
  • [1] Adaptive synchronous strategy for distributed machine learning
    Tan, Miaoquan
    Liu, Wai-Xi
    Luo, Junming
    Chen, Haosen
    Guo, Zhen-Zheng
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (12) : 11713 - 11741
  • [2] An Adaptive Synchronous Parallel Strategy for Distributed Machine Learning
    Zhang, Jilin
    Tu, Hangdi
    Ren, Yongjian
    Wan, Jian
    Zhou, Li
    Li, Mingwei
    Wang, Jue
    [J]. IEEE ACCESS, 2018, 6 : 19222 - 19230
  • [3] Dependable Distributed Training of Compressed Machine Learning Models
    Malandrino, Francesco
    Di Giacomo, Giuseppe
    Levorato, Marco
    Chiasserini, Carla Fabiana
    [J]. PROCEEDINGS 2024 IEEE 25TH INTERNATIONAL SYMPOSIUM ON A WORLD OF WIRELESS, MOBILE AND MULTIMEDIA NETWORKS, WOWMOM 2024, 2024, : 147 - 156
  • [4] Building simulation in adaptive training of machine learning models
    Amini, Hamed
    Alanne, Kari
    Kosonen, Risto
    [J]. AUTOMATION IN CONSTRUCTION, 2024, 165
  • [5] Adaptive Pricing and Online Scheduling for Distributed Machine Learning Jobs
    Wang, Yafei
    Su, Lina
    Chen, Junmei
    Wang, Ne
    Li, Zongpeng
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (07) : 12966 - 12983
  • [6] A-ELM*: Adaptive Distributed Extreme Learning Machine with MapReduce
    Xin, Junchang
    Wang, Zhiqiong
    Qu, Luxuan
    Yu, Ge
    Kang, Yan
    [J]. NEUROCOMPUTING, 2016, 174 : 368 - 374
  • [7] Adaptive Distributed Beacon Congestion Control with Machine Learning in VANETs
    Mohammadi, Mahboubeh
    Balador, Ali
    Fernandez, Zaloa
    Val, Inaki
    [J]. 2021 17TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING (MSN 2021), 2021, : 766 - 771
  • [8] Boosting the Training Time of Weakly Coordinated Distributed Machine Learning
    Duriakova, Erika
    Tragos, Elias
    Lawlor, Aonghus
    Smyth, Barry
    Hurley, Neil
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 1023 - 1029
  • [9] Efficient Distributed Machine Learning with Trigger Driven Parallel Training
    Li, Shenglong
    Xue, Jilong
    Yang, Zhi
    Dai, Yafei
[J]. 2016 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2016
  • [10] AdaGL: Adaptive Learning for Agile Distributed Training of Gigantic GNNs
    Zhang, Ruisi
    Javaheripi, Mojan
    Ghodsi, Zahra
    Bleiweiss, Amit
    Koushanfar, Farinaz
[J]. 2023 60TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC, 2023