KungFu: Making Training in Distributed Machine Learning Adaptive

Cited by: 0
Authors:
Mai, Luo [1 ]
Li, Guo [1 ]
Wagenlander, Marcel [1 ]
Fertakis, Konstantinos [1 ]
Brabete, Andrei-Octavian [1 ]
Pietzuch, Peter [1 ]
Institutions:
[1] Imperial College London, London, England
Keywords: (none listed)
DOI: none available
Chinese Library Classification (CLC): TP31 (Computer Software)
Discipline codes: 081202; 0835
Abstract
When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe KungFu, a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.
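To make the Adaptation Policy abstraction concrete, the following is a minimal Python sketch of the idea described in the abstract: a policy consumes a monitored metric and triggers a control action such as cluster rescaling. This is an illustration under assumptions, not KungFu's actual API; the names AdaptationPolicy, NoiseScaleScalingPolicy, Cluster, on_metric, and resize are all hypothetical.

    # Hypothetical sketch of an Adaptation Policy (AP) in the spirit of the
    # paper. All class and method names below are illustrative assumptions,
    # not KungFu's actual API.

    class AdaptationPolicy:
        """An AP maps real-time monitored metrics to control actions."""

        def on_metric(self, step, metrics, cluster):
            raise NotImplementedError

    class NoiseScaleScalingPolicy(AdaptationPolicy):
        """Grow the cluster when the monitored gradient noise scale rises,
        suggesting a larger effective batch would still train efficiently."""

        def __init__(self, grow_threshold=1000.0, max_workers=32):
            self.grow_threshold = grow_threshold
            self.max_workers = max_workers

        def on_metric(self, step, metrics, cluster):
            noise_scale = metrics.get("gradient_noise_scale")
            if noise_scale is None:
                return
            if noise_scale > self.grow_threshold and cluster.size() < self.max_workers:
                cluster.resize(min(cluster.size() * 2, self.max_workers))

    class Cluster:
        """Stand-in for the runtime's view of the worker set."""

        def __init__(self, size=4):
            self._size = size

        def size(self):
            return self._size

        def resize(self, new_size):
            # A real system would run an elastic rescaling protocol here;
            # this stub only records the requested size.
            print(f"rescaling cluster: {self._size} -> {new_size}")
            self._size = new_size

    if __name__ == "__main__":
        cluster = Cluster(size=4)
        policy = NoiseScaleScalingPolicy(grow_threshold=1000.0)
        # Simulated per-step metrics, as monitoring operators might report them.
        for step, ns in enumerate([200.0, 800.0, 1500.0, 2500.0]):
            policy.on_metric(step, {"gradient_noise_scale": ns}, cluster)

In the paper's design, such a policy would not run as a separate loop: its monitoring and control logic is compiled into operators embedded in the TensorFlow dataflow graph, so metric collection piggybacks on the training iteration itself.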
Pages: 937-954 (18 pages)
Related papers (50 total; first 10 listed)
  • [1] Adaptive synchronous strategy for distributed machine learning
    Tan, Miaoquan
    Liu, Wai-Xi
    Luo, Junming
    Chen, Haosen
    Guo, Zhen-Zheng
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (12) : 11713 - 11741
  • [2] An Adaptive Synchronous Parallel Strategy for Distributed Machine Learning
    Zhang, Jilin
    Tu, Hangdi
    Ren, Yongjian
    Wan, Jian
    Zhou, Li
    Li, Mingwei
    Wang, Jue
    [J]. IEEE ACCESS, 2018, 6 : 19222 - 19230
  • [3] Dependable Distributed Training of Compressed Machine Learning Models
    Malandrino, Francesco
    Di Giacomo, Giuseppe
    Levorato, Marco
    Chiasserini, Carla Fabiana
    [J]. PROCEEDINGS 2024 IEEE 25TH INTERNATIONAL SYMPOSIUM ON A WORLD OF WIRELESS, MOBILE AND MULTIMEDIA NETWORKS, WOWMOM 2024, 2024, : 147 - 156
  • [4] Building simulation in adaptive training of machine learning models
    Amini, Hamed
    Alanne, Kari
    Kosonen, Risto
    [J]. AUTOMATION IN CONSTRUCTION, 2024, 165
  • [5] Adaptive Pricing and Online Scheduling for Distributed Machine Learning Jobs
    Wang, Yafei
    Su, Lina
    Chen, Junmei
    Wang, Ne
    Li, Zongpeng
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (07) : 12966 - 12983
  • [6] A-ELM*: Adaptive Distributed Extreme Learning Machine with MapReduce
    Xin, Junchang
    Wang, Zhiqiong
    Qu, Luxuan
    Yu, Ge
    Kang, Yan
    [J]. NEUROCOMPUTING, 2016, 174 : 368 - 374
  • [7] Adaptive Distributed Beacon Congestion Control with Machine Learning in VANETs
    Mohammadi, Mahboubeh
    Balador, Ali
    Fernandez, Zaloa
    Val, Inaki
    [J]. 2021 17TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING (MSN 2021), 2021, : 766 - 771
  • [8] Boosting the Training Time of Weakly Coordinated Distributed Machine Learning
    Duriakova, Erika
    Tragos, Elias
    Lawlor, Aonghus
    Smyth, Barry
    Hurley, Neil
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 1023 - 1029
  • [9] Efficient Distributed Machine Learning with Trigger Driven Parallel Training
    Li, Shenglong
    Xue, Jilong
    Yang, Zhi
    Dai, Yafei
[J]. 2016 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2016
  • [10] AdaGL: Adaptive Learning for Agile Distributed Training of Gigantic GNNs
    Zhang, Ruisi
    Javaheripi, Mojan
    Ghodsi, Zahra
    Bleiweiss, Amit
    Koushanfar, Farinaz
[J]. 2023 60TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC, 2023