Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models

Cited by: 2
Authors
Wei, Yating [1]
Wang, Zhiyong [1]
Wang, Zhongwei [1]
Dai, Yong [1]
Ou, Gongchang [2]
Gao, Han [2]
Yang, Haitao [2]
Wang, Yue [2]
Cao, Caleb Chen [2]
Weng, Luoxuan [1]
Lu, Jiaying [1]
Zhu, Rongchen [1]
Chen, Wei [1]
Affiliations
[1] Zhejiang Univ, State Key Lab CAD & CG, Hangzhou 310058, Zhejiang, Peoples R China
[2] Huawei Technol Co Ltd, Distributed Data Lab, Shenzhen 518129, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Training; Data visualization; Computational modeling; Solid modeling; Parallel processing; Performance evaluation; Data models; Deep neural network; model training; parallel performance; visual analysis; VISUALIZATION; FLOW;
DOI
10.1109/TVCG.2023.3243228
Chinese Library Classification number
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
Diagnosing the cluster-based performance of large-scale deep neural network (DNN) models during training is essential for improving training efficiency and reducing resource consumption. However, it remains challenging because the parallelization strategy is hard to comprehend and the training process generates a large volume of complex data. Prior works visually analyze performance profiles and timeline traces to identify anomalies from the perspective of individual devices in the cluster, which is not amenable to studying the root cause of anomalies. In this article, we present a visual analytics approach that empowers analysts to visually explore the parallel training process of a DNN model and interactively diagnose the root cause of a performance issue. A set of design requirements is gathered through discussions with domain experts. We propose an enhanced execution flow of model operators for illustrating parallelization strategies within the computational graph layout. We design and implement an enhanced Marey's graph representation, which introduces the concept of time-span and a banded visual metaphor to convey training dynamics and help experts identify inefficient training processes. We also propose a visual aggregation technique to improve visualization efficiency. We evaluate our approach using case studies, a user study, and expert interviews on two large-scale models run in a cluster, namely the PanGu-alpha 13B model (40 layers) and the ResNet model (50 layers).
Pages: 3915-3929
Page count: 15