High-Performance Dense Tucker Decomposition on GPU Clusters

Cited by: 0
Authors
Choi, Jee [1 ]
Liu, Xing [1 ]
Chakaravarthy, Venkatesan [2 ]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] IBM India Res Lab, New Delhi, India
Keywords
tensor decomposition; Tucker; GPU; MPI; distributed; high-performance computing; HPC; HOSVD; compression; truncation
DOI
Not available
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The dense Tucker decomposition is one of the most popular algorithms for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication, which makes it well suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt a multi-dimensional partitioning that optimizes storage and communication. This, however, leads to smaller matrix dimensions and, in turn, under-utilization of GPU resources. In this paper, we present an optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three key optimizations: a new partitioning strategy that improves performance on GPUs, a new tensor matricization layout that halves the number of communication and matricization steps, and a variant of the randomized SVD algorithm that removes the eigenvalue-calculation bottleneck exposed by the large speedups gained from GPU acceleration. Compared to the state-of-the-art TuckerMPI library, our best GPU implementation, which employs all three optimizations, achieves up to 11.8x speedup on 64 nodes, and our best CPU implementation, which also employs all three optimizations, achieves up to 3.6x speedup over TuckerMPI on 64 nodes. Comparing our best GPU implementation to our best CPU implementation, the speedup ranges from 2.1x to 3.6x on a single node and from 1.8x to 3.3x on 64 nodes, depending on the input data set.
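For intuition, below is a minimal single-node Python/NumPy sketch of two building blocks the abstract refers to: mode-n matricization (tensor unfolding) and a randomized SVD used in place of a full eigendecomposition when computing each factor matrix of a truncated HOSVD. This is an illustrative reconstruction under stated assumptions, not the authors' distributed implementation; the function names, the oversampling parameter, and the sequential (non-MPI, non-GPU) structure are all assumptions made for clarity.

import numpy as np

def matricize(X, n):
    """Mode-n unfolding: move mode n to the front, flatten the rest."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def randomized_factor(Xn, r, oversample=10):
    """Approximate the r leading left singular vectors of Xn using a
    Halko-Martinsson-Tropp-style randomized range finder (an assumed
    stand-in for the paper's randomized SVD variant)."""
    k = Xn.shape[1]
    Omega = np.random.randn(k, r + oversample)   # Gaussian test matrix
    Q, _ = np.linalg.qr(Xn @ Omega)              # orthonormal range basis
    B = Q.T @ Xn                                 # small projected matrix
    Ub, _, _ = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :r]                         # lift back to size m x r

def hosvd(X, ranks):
    """Truncated HOSVD: factor each mode of X, then contract to the core."""
    factors = [randomized_factor(matricize(X, n), r)
               for n, r in enumerate(ranks)]
    G = X
    for n, U in enumerate(factors):
        # Multiply mode n by U^T via unfold / reshape / fold.
        Gn = U.T @ matricize(G, n)
        new_shape = [U.shape[1]] + [s for i, s in enumerate(G.shape) if i != n]
        G = np.moveaxis(Gn.reshape(new_shape), 0, n)
    return G, factors

A call such as G, U = hosvd(X, (10, 10, 10)) on a dense NumPy array X yields a core tensor G and per-mode factor matrices U. Note that the range finder touches the unfolded tensor only through large matrix products, which illustrates why the overall computation is GEMM-dominated and hence, as the abstract argues, well suited to GPU acceleration.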
Pages: 11