Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models

Cited by: 2
Authors
Wei, Yating [1 ]
Wang, Zhiyong [1 ]
Wang, Zhongwei [1 ]
Dai, Yong [1 ]
Ou, Gongchang [2 ]
Gao, Han [2 ]
Yang, Haitao [2 ]
Wang, Yue [2 ]
Cao, Caleb Chen [2 ]
Weng, Luoxuan [1 ]
Lu, Jiaying [1 ]
Zhu, Rongchen [1 ]
Chen, Wei [1 ]
Affiliations
[1] Zhejiang University, State Key Lab of CAD & CG, Hangzhou 310058, Zhejiang, China
[2] Huawei Technologies Co., Ltd., Distributed Data Lab, Shenzhen 518129, China
Funding
National Natural Science Foundation of China;
Keywords
Training; Data visualization; Computational modeling; Solid modeling; Parallel processing; Performance evaluation; Data models; Deep neural network; model training; parallel performance; visual analysis; Visualization; Flow
DOI
10.1109/TVCG.2023.3243228
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
Diagnosing the cluster-based performance of large-scale deep neural network (DNN) models during training is essential for improving training efficiency and reducing resource consumption. However, it remains challenging due to the incomprehensibility of the parallelization strategy and the sheer volume of complex data generated during training. Prior works visually analyze performance profiles and timeline traces to identify anomalies from the perspective of individual devices in the cluster, an approach that is ill-suited to studying the root causes of anomalies. In this article, we present a visual analytics approach that empowers analysts to visually explore the parallel training process of a DNN model and interactively diagnose the root cause of a performance issue. A set of design requirements was gathered through discussions with domain experts. We propose an enhanced execution flow of model operators that illustrates parallelization strategies within the computational graph layout. We design and implement an enhanced Marey's graph representation, which introduces the concept of time-span and a banded visual metaphor to convey training dynamics and help experts identify inefficient training processes. We also propose a visual aggregation technique to improve visualization efficiency. We evaluate our approach through case studies, a user study, and expert interviews on two large-scale models run on a cluster: the PanGu-alpha 13B model (40 layers) and the ResNet model (50 layers).
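This record contains no code or data schema from the paper. Purely as an illustration of the kind of per-device timeline traces such a tool consumes and of the Marey's-graph idea the abstract mentions, the following is a minimal Python/matplotlib sketch. Every detail here (the trace fields, the sample values, the rendering choices) is an assumption for illustration, not the authors' actual data model or rendering pipeline.

    # Minimal sketch of a Marey's-graph-style view of parallel training traces.
    # All trace fields and values are hypothetical, invented for illustration.
    import matplotlib.pyplot as plt

    # Hypothetical trace: one record per operator execution on a device.
    # Fields: device id, logical training step, start time (ms), end time (ms).
    trace = [
        {"device": 0, "step": 1, "start": 0.0,  "end": 4.0},
        {"device": 1, "step": 1, "start": 1.5,  "end": 6.0},
        {"device": 2, "step": 1, "start": 3.0,  "end": 9.5},
        {"device": 0, "step": 2, "start": 5.0,  "end": 8.0},
        {"device": 1, "step": 2, "start": 7.0,  "end": 10.5},
        {"device": 2, "step": 2, "start": 10.0, "end": 15.0},
    ]

    fig, ax = plt.subplots(figsize=(8, 3))
    for rec in trace:
        # Each operator execution becomes a horizontal band on its device's
        # row; the band's length encodes the time-span of the execution.
        ax.hlines(y=rec["device"], xmin=rec["start"], xmax=rec["end"], linewidth=6)

    # Marey's-graph element: connect the start of the same logical step across
    # devices, so skewed (inefficient) steps show up as steep dashed segments.
    for step in sorted({rec["step"] for rec in trace}):
        pts = sorted((r["device"], r["start"]) for r in trace if r["step"] == step)
        ax.plot([p[1] for p in pts], [p[0] for p in pts], linestyle="--", marker="o")

    ax.set_xlabel("time (ms)")
    ax.set_ylabel("device")
    ax.set_yticks(sorted({r["device"] for r in trace}))
    ax.set_title("Per-device operator spans with cross-device step alignment")
    plt.tight_layout()
    plt.show()

In this toy view, a step whose connecting line is steep or whose bands are much longer on one device than on the others would hint at a straggler; the paper's actual system layers this kind of reading onto the computational graph and adds aggregation for scale.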
Pages: 3915-3929 (15 pages)