Performance and Consistency Analysis for Distributed Deep Learning Applications

Cited by: 0
Authors
Jia, Danlin [1 ]
Saha, Manoj Pravakar [2 ]
Bhimani, Janki [2 ]
Mi, Ningfang [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Florida Int Univ, Miami, FL 33199 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
DOI
10.1109/IPCCC50635.2020.9391566
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Accelerating the training of Deep Neural Network (DNN) models is critical for the successful use of deep learning in fields such as computer vision and speech recognition. Distributed frameworks speed up training for large DNN models and datasets. A substantial body of prior work improves model accuracy and training efficiency through mathematical analysis of the computations in Convolutional Neural Networks (CNNs). However, to run distributed deep learning applications in the real world, users and developers must also account for how system resources are distributed. In this work, we deploy a real distributed deep learning cluster built from multiple virtual machines. We conduct an in-depth analysis of how system configurations, distribution topologies, and application parameters affect the latency and correctness of distributed deep learning applications. By profiling run-time system utilization and tracking application activities, we analyze performance diversity under different model-consistency schemes and degrees of data parallelism. Based on these observations and analysis, we develop design guidelines for accelerating distributed deep learning training in virtualized environments.
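The record does not name the training framework the authors used; as a minimal sketch of the synchronous data-parallel setting the abstract describes (one model replica per worker, each worker training on its own shard of the data, gradients averaged across workers every step), the example below uses PyTorch's DistributedDataParallel. The toy model, dataset shapes, and hyperparameters are hypothetical placeholders, not values from the paper.

```python
# Minimal synchronous data-parallel training sketch (hypothetical model/data).
# Launch with e.g.: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters
    rank = dist.get_rank()

    # Toy dataset standing in for a real training set.
    data = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
    sampler = DistributedSampler(data)            # shards data across workers
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = DDP(nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients are all-reduced across workers here
            opt.step()        # every replica applies the same averaged update
        if rank == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under this synchronous scheme, every replica holds an identical model after each step, so fast workers wait for slow ones; asynchronous parameter-server designs relax that consistency to reduce waiting at the cost of stale updates. That latency-versus-consistency trade-off is the one the abstract's analysis targets.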
Pages: 8
Related Papers
(10 of 50 shown)
  • [1] Performance Analysis of Distributed and Scalable Deep Learning
    Mahon, Sean
    Varrette, Sebastien
    Plugaru, Valentin
    Pinel, Frederic
    Bouvry, Pascal
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 760 - 766
  • [2] Hierarchical Roofline Performance Analysis for Deep Learning Applications
    Yang, Charlene
    Wang, Yunsong
    Kurth, Thorsten
    Farrell, Steven
    Williams, Samuel
    INTELLIGENT COMPUTING, VOL 2, 2021, 284 : 473 - 491
  • [3] MD-Roofline: A Training Performance Analysis Model for Distributed Deep Learning
    Miao, Tianhao
    Wu, Qinghua
    Liu, Ting
    Cui, Penglai
    Ren, Rui
    Li, Zhenyu
    Xie, Gaogang
    2022 27TH IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (IEEE ISCC 2022), 2022,
  • [4] Performance Analysis of Distributed Deep Learning Frameworks in a Multi-GPU Environment
    Kavarakuntla, Tulasi
    Han, Liangxiu
    Lloyd, Huw
    Latham, Annabel
    Akintoye, Samson B.
    20TH INT CONF ON UBIQUITOUS COMP AND COMMUNICAT (IUCC) / 20TH INT CONF ON COMP AND INFORMATION TECHNOLOGY (CIT) / 4TH INT CONF ON DATA SCIENCE AND COMPUTATIONAL INTELLIGENCE (DSCI) / 11TH INT CONF ON SMART COMPUTING, NETWORKING, AND SERV (SMARTCNS), 2021, : 406 - 413
  • [5] File Access Patterns of Distributed Deep Learning Applications
    Parraga, Edixon
    Leon, Betzabeth
    Mendez, Sandra
    Rexachs, Dolores
    Luque, Emilio
    CLOUD COMPUTING, BIG DATA & EMERGING TOPICS, JCC-BD&ET 2022, 2022, 1634 : 3 - 19
  • [6] Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
    Hasheminezhad, Bita
    Shirzad, Shahrzad
    Wu, Nanmiao
    Diehl, Patrick
    Schulz, Hannes
    Kaiser, Hartmut
    PROCEEDINGS OF 2020 IEEE/ACM 5TH WORKSHOP ON DEEP LEARNING ON SUPERCOMPUTERS (DLS 2020), 2020, : 20 - 30
  • [7] Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications
    Carneiro, Tiago
    Medeiros Da Nobrega, Raul Victor
    Nepomuceno, Thiago
    Bian, Gui-Bin
    De Albuquerque, Victor Hugo C.
    Reboucas Filho, Pedro Pedrosa
    IEEE ACCESS, 2018, 6 : 61677 - 61685
  • [8] A Generic Performance Model for Deep Learning in a Distributed Environment
    Kavarakuntla, Tulasi
    Han, Liangxiu
    Lloyd, Huw
    Latham, Annabel
    Kleerekoper, Anthony
    Akintoye, Samson B.
    IEEE ACCESS, 2024, 12 : 8207 - 8219
  • [9] Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms
    Malik, Abid
    Lu, Micheal
    Wang, Nathenial
    Lin, Yeiwei
    Yoo, Shinjae
    2018 NEW YORK SCIENTIFIC DATA SUMMIT (NYSDS), 2018,
  • [10] Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks
    Aach, Marcel
    Inanc, Eray
    Sarma, Rakesh
    Riedel, Morris
    Lintermann, Andreas
    JOURNAL OF BIG DATA, 2023, 10