High-Performance Dense Tucker Decomposition on GPU Clusters

Cited by: 0
Authors
Choi, Jee [1 ]
Liu, Xing [1 ]
Chakaravarthy, Venkatesan [2 ]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] IBM India Res Lab, New Delhi, India
Keywords
tensor decomposition; Tucker; GPU; MPI; distributed; high-performance computing; HPC; HOSVD; compression; truncation
DOI
Not available
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The dense Tucker decomposition is one of the most popular algorithms for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication, which makes it well suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt a multi-dimensional partitioning that optimizes storage and communication. This, however, leads to smaller matrix dimensions and, in turn, under-utilization of GPU resources. In this paper, we present an optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three key optimizations: a new partitioning strategy that improves performance on GPUs, a new tensor matricization layout that halves the number of communication and matricization steps, and a variant of the randomized SVD algorithm that removes the eigenvalue-calculation bottleneck exposed by the large speedups gained from GPU acceleration. Compared to the state-of-the-art TuckerMPI library, our best GPU implementation, which employs all three optimizations, achieves up to 11.8x speedup on 64 nodes, and our best CPU implementation, which also employs all three optimizations, achieves up to 3.6x speedup over TuckerMPI on 64 nodes. Comparing our best GPU implementation to our best CPU implementation, the speedup ranges from 2.1x to 3.6x on a single node and from 1.8x to 3.3x on 64 nodes, depending on the input data set.
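For intuition, below is a minimal single-node Python/NumPy sketch of two building blocks the abstract refers to: mode-n matricization (tensor unfolding) and a randomized SVD used in place of a full eigendecomposition when computing each factor matrix of a truncated HOSVD. This is an illustrative reconstruction under stated assumptions, not the authors' distributed implementation; the function names, the oversampling parameter, and the sequential (non-MPI, non-GPU) structure are all assumptions made for clarity.

import numpy as np

def matricize(X, n):
    """Mode-n unfolding: move mode n to the front, flatten the rest."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def randomized_factor(Xn, r, oversample=10):
    """Approximate the r leading left singular vectors of Xn using a
    Halko-Martinsson-Tropp-style randomized range finder (an assumed
    stand-in for the paper's randomized SVD variant)."""
    k = Xn.shape[1]
    Omega = np.random.randn(k, r + oversample)   # Gaussian test matrix
    Q, _ = np.linalg.qr(Xn @ Omega)              # orthonormal range basis
    B = Q.T @ Xn                                 # small projected matrix
    Ub, _, _ = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :r]                         # lift back to size m x r

def hosvd(X, ranks):
    """Truncated HOSVD: factor each mode of X, then contract to the core."""
    factors = [randomized_factor(matricize(X, n), r)
               for n, r in enumerate(ranks)]
    G = X
    for n, U in enumerate(factors):
        # Multiply mode n by U^T via unfold / reshape / fold.
        Gn = U.T @ matricize(G, n)
        new_shape = [U.shape[1]] + [s for i, s in enumerate(G.shape) if i != n]
        G = np.moveaxis(Gn.reshape(new_shape), 0, n)
    return G, factors

A call such as G, U = hosvd(X, (10, 10, 10)) on a dense NumPy array X yields a core tensor G and per-mode factor matrices U. Note that the range finder touches the unfolded tensor only through large matrix products, which illustrates why the overall computation is GEMM-dominated and hence, as the abstract argues, well suited to GPU acceleration.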
Pages: 11