Orchestra: Adaptively Accelerating Distributed Deep Learning in Heterogeneous Environments

Cited by: 1
|
Authors
Du, Haizhou [1 ]
Huang, Sheng [1 ]
Xiang, Qiao [2 ]
Affiliations
[1] Shanghai Univ Elect Power, Shanghai, Peoples R China
[2] Xiamen Univ, Xiamen, Peoples R China
Keywords
Distributed Deep Learning; Local Update Adaptation; Load-Balance; Heterogeneous Environments;
DOI
10.1145/3528416.3530246
CLC number
TP301 [Theory and Methods];
Discipline code
081202 ;
Abstract
The synchronized Local-SGD (stochastic gradient descent) strategy has become popular in distributed deep learning (DML) because it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly and leads to excessive training time in heterogeneous environments due to differences in workers' performance. In particular, in scenarios with unbalanced data, these differences between workers can aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel, adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load between worker performance and unbalanced data volume. Additionally, one of Orchestra's strongest features is adapting the number of local updates at each epoch, per worker. To achieve this improvement, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch time and resource constraints at each epoch. Our design significantly improves model convergence speed in DML compared with other state-of-the-art approaches.
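The core load-balancing idea in the abstract (adapting each worker's local-update count to its measured speed so stragglers are avoided) can be illustrated with a minimal sketch. This is not the authors' implementation and omits the deep-reinforcement-learning controller entirely; the function name, the proportional-allocation rule, and the worker timings below are all illustrative assumptions.

```python
def assign_local_updates(batch_times, budget):
    """Split a total per-epoch local-update budget across workers so that
    each worker's local-phase wall-clock time (updates * batch_time) is
    roughly equal.

    batch_times: dict mapping worker id -> measured seconds per mini-batch
    budget:      total number of local updates per epoch across all workers
    """
    # A worker's speed is the inverse of its per-batch time.
    speeds = {w: 1.0 / t for w, t in batch_times.items()}
    total_speed = sum(speeds.values())
    # Give each worker a share of the budget proportional to its speed,
    # and at least one update so no worker sits idle.
    return {w: max(1, round(budget * s / total_speed))
            for w, s in speeds.items()}

# Hypothetical heterogeneous cluster: two GPUs and a slower edge device.
batch_times = {"gpu0": 0.05, "gpu1": 0.10, "edge0": 0.20}
updates = assign_local_updates(batch_times, budget=70)
print(updates)  # fast workers get proportionally more local updates
```

With these numbers every worker's local phase lasts about 2 seconds, so synchronization barriers no longer wait on the slowest device. Orchestra's actual algorithm additionally adapts the per-worker training data volume and learns the allocation with distributed deep reinforcement learning rather than this fixed proportional rule.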
Pages: 181 / 184
Page count: 4