Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models

Cited by: 2
Authors
Wei, Yating [1 ]
Wang, Zhiyong [1 ]
Wang, Zhongwei [1 ]
Dai, Yong [1 ]
Ou, Gongchang [2 ]
Gao, Han [2 ]
Yang, Haitao [2 ]
Wang, Yue [2 ]
Cao, Caleb Chen [2 ]
Weng, Luoxuan [1 ]
Lu, Jiaying [1 ]
Zhu, Rongchen [1 ]
Chen, Wei [1 ]
Affiliations
[1] Zhejiang University, State Key Lab of CAD & CG, Hangzhou 310058, Zhejiang, China
[2] Huawei Technologies Co., Ltd., Distributed Data Lab, Shenzhen 518129, China
Funding
National Natural Science Foundation of China;
Keywords
Training; Data visualization; Computational modeling; Solid modeling; Parallel processing; Performance evaluation; Data models; Deep neural network; model training; parallel performance; visual analysis; Visualization; Flow
DOI
10.1109/TVCG.2023.3243228
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
Diagnosing the cluster-based performance of large-scale deep neural network (DNN) models during training is essential for improving training efficiency and reducing resource consumption. However, it remains challenging due to the incomprehensibility of the parallelization strategy and the sheer volume of complex data generated during training. Prior works visually analyze performance profiles and timeline traces to identify anomalies from the perspective of individual devices in the cluster, an approach that is ill-suited to studying the root causes of anomalies. In this article, we present a visual analytics approach that empowers analysts to visually explore the parallel training process of a DNN model and interactively diagnose the root cause of a performance issue. A set of design requirements was gathered through discussions with domain experts. We propose an enhanced execution flow of model operators that illustrates parallelization strategies within the computational graph layout. We design and implement an enhanced Marey's graph representation, which introduces the concept of time-span and a banded visual metaphor to convey training dynamics and help experts identify inefficient training processes. We also propose a visual aggregation technique to improve visualization efficiency. We evaluate our approach through case studies, a user study, and expert interviews on two large-scale models run on a cluster: the PanGu-alpha 13B model (40 layers) and the ResNet model (50 layers).
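This record contains no code or data schema from the paper. Purely as an illustration of the kind of per-device timeline traces such a tool consumes and of the Marey's-graph idea the abstract mentions, the following is a minimal Python/matplotlib sketch. Every detail here (the trace fields, the sample values, the rendering choices) is an assumption for illustration, not the authors' actual data model or rendering pipeline.

    # Minimal sketch of a Marey's-graph-style view of parallel training traces.
    # All trace fields and values are hypothetical, invented for illustration.
    import matplotlib.pyplot as plt

    # Hypothetical trace: one record per operator execution on a device.
    # Fields: device id, logical training step, start time (ms), end time (ms).
    trace = [
        {"device": 0, "step": 1, "start": 0.0,  "end": 4.0},
        {"device": 1, "step": 1, "start": 1.5,  "end": 6.0},
        {"device": 2, "step": 1, "start": 3.0,  "end": 9.5},
        {"device": 0, "step": 2, "start": 5.0,  "end": 8.0},
        {"device": 1, "step": 2, "start": 7.0,  "end": 10.5},
        {"device": 2, "step": 2, "start": 10.0, "end": 15.0},
    ]

    fig, ax = plt.subplots(figsize=(8, 3))
    for rec in trace:
        # Each operator execution becomes a horizontal band on its device's
        # row; the band's length encodes the time-span of the execution.
        ax.hlines(y=rec["device"], xmin=rec["start"], xmax=rec["end"], linewidth=6)

    # Marey's-graph element: connect the start of the same logical step across
    # devices, so skewed (inefficient) steps show up as steep dashed segments.
    for step in sorted({rec["step"] for rec in trace}):
        pts = sorted((r["device"], r["start"]) for r in trace if r["step"] == step)
        ax.plot([p[1] for p in pts], [p[0] for p in pts], linestyle="--", marker="o")

    ax.set_xlabel("time (ms)")
    ax.set_ylabel("device")
    ax.set_yticks(sorted({r["device"] for r in trace}))
    ax.set_title("Per-device operator spans with cross-device step alignment")
    plt.tight_layout()
    plt.show()

In this toy view, a step whose connecting line is steep or whose bands are much longer on one device than on the others would hint at a straggler; the paper's actual system layers this kind of reading onto the computational graph and adds aggregation for scale.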
Pages: 3915-3929 (15 pages)