Performance and Consistency Analysis for Distributed Deep Learning Applications

Cited by: 0
Authors
Jia, Danlin [1 ]
Saha, Manoj Pravakar [2 ]
Bhimani, Janki [2 ]
Mi, Ningfang [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Florida Int Univ, Miami, FL 33199 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
DOI
10.1109/IPCCC50635.2020.9391566
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Accelerating the training of Deep Neural Network (DNN) models is critical for the successful use of deep learning in fields such as computer vision and speech recognition. Distributed frameworks speed up training for large DNN models and datasets. A substantial body of prior work improves model accuracy and training efficiency through mathematical analysis of the computations in Convolutional Neural Networks (CNNs). However, to run distributed deep learning applications in the real world, users and developers must also account for how system resources are distributed. In this work, we deploy a real distributed deep learning cluster built from multiple virtual machines. We conduct an in-depth analysis of how system configurations, distribution topologies, and application parameters affect the latency and correctness of distributed deep learning applications. By profiling run-time system utilization and tracking application activities, we analyze performance diversity under different model-consistency schemes and degrees of data parallelism. Based on these observations and analysis, we develop design guidelines for accelerating distributed deep learning training in virtualized environments.
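The record does not name the training framework the authors used; as a minimal sketch of the synchronous data-parallel setting the abstract describes (one model replica per worker, each worker training on its own shard of the data, gradients averaged across workers every step), the example below uses PyTorch's DistributedDataParallel. The toy model, dataset shapes, and hyperparameters are hypothetical placeholders, not values from the paper.

```python
# Minimal synchronous data-parallel training sketch (hypothetical model/data).
# Launch with e.g.: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters
    rank = dist.get_rank()

    # Toy dataset standing in for a real training set.
    data = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
    sampler = DistributedSampler(data)            # shards data across workers
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = DDP(nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients are all-reduced across workers here
            opt.step()        # every replica applies the same averaged update
        if rank == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under this synchronous scheme, every replica holds an identical model after each step, so fast workers wait for slow ones; asynchronous parameter-server designs relax that consistency to reduce waiting at the cost of stale updates. That latency-versus-consistency trade-off is the one the abstract's analysis targets.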
Pages: 8
Related Papers
(10 of 50 shown)
  • [1] Performance Analysis of Distributed and Scalable Deep Learning
    Mahon, Sean
    Varrette, Sebastien
    Plugaru, Valentin
    Pinel, Frederic
    Bouvry, Pascal
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 760 - 766
  • [2] Hierarchical Roofline Performance Analysis for Deep Learning Applications
    Yang, Charlene
    Wang, Yunsong
    Kurth, Thorsten
    Farrell, Steven
    Williams, Samuel
    INTELLIGENT COMPUTING, VOL 2, 2021, 284 : 473 - 491
  • [3] MD-Roofline: A Training Performance Analysis Model for Distributed Deep Learning
    Miao, Tianhao
    Wu, Qinghua
    Liu, Ting
    Cui, Penglai
    Ren, Rui
    Li, Zhenyu
    Xie, Gaogang
    2022 27TH IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (IEEE ISCC 2022), 2022,
  • [4] Performance Analysis of Distributed Deep Learning Frameworks in a Multi-GPU Environment
    Kavarakuntla, Tulasi
    Han, Liangxiu
    Lloyd, Huw
    Latham, Annabel
    Akintoye, Samson B.
    20TH INT CONF ON UBIQUITOUS COMP AND COMMUNICAT (IUCC) / 20TH INT CONF ON COMP AND INFORMATION TECHNOLOGY (CIT) / 4TH INT CONF ON DATA SCIENCE AND COMPUTATIONAL INTELLIGENCE (DSCI) / 11TH INT CONF ON SMART COMPUTING, NETWORKING, AND SERV (SMARTCNS), 2021, : 406 - 413
  • [5] File Access Patterns of Distributed Deep Learning Applications
    Parraga, Edixon
    Leon, Betzabeth
    Mendez, Sandra
    Rexachs, Dolores
    Luque, Emilio
    CLOUD COMPUTING, BIG DATA & EMERGING TOPICS, JCC-BD&ET 2022, 2022, 1634 : 3 - 19
  • [6] Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
    Hasheminezhad, Bita
    Shirzad, Shahrzad
    Wu, Nanmiao
    Diehl, Patrick
    Schulz, Hannes
    Kaiser, Hartmut
    PROCEEDINGS OF 2020 IEEE/ACM 5TH WORKSHOP ON DEEP LEARNING ON SUPERCOMPUTERS (DLS 2020), 2020, : 20 - 30
  • [7] Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications
    Carneiro, Tiago
    Medeiros Da Nobrega, Raul Victor
    Nepomuceno, Thiago
    Bian, Gui-Bin
    De Albuquerque, Victor Hugo C.
    Reboucas Filho, Pedro Pedrosa
    IEEE ACCESS, 2018, 6 : 61677 - 61685
  • [8] A Generic Performance Model for Deep Learning in a Distributed Environment
    Kavarakuntla, Tulasi
    Han, Liangxiu
    Lloyd, Huw
    Latham, Annabel
    Kleerekoper, Anthony
    Akintoye, Samson B.
    IEEE ACCESS, 2024, 12 : 8207 - 8219
  • [9] Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms
    Malik, Abid
    Lu, Micheal
    Wang, Nathenial
    Lin, Yeiwei
    Yoo, Shinjae
    2018 NEW YORK SCIENTIFIC DATA SUMMIT (NYSDS), 2018,
  • [10] Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks
    Aach, Marcel
    Inanc, Eray
    Sarma, Rakesh
    Riedel, Morris
    Lintermann, Andreas
    JOURNAL OF BIG DATA, 2023, 10