High-Performance Dense Tucker Decomposition on GPU Clusters

Cited by: 0

Authors:

Choi, Jee [1]

Liu, Xing [1]

Chakaravarthy, Venkatesan [2]

Affiliations:

[1] IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA

[2] IBM India Research Lab, New Delhi, India

Source:

PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE, AND ANALYSIS (SC'18) | 2018

Keywords:

tensor decomposition; Tucker; GPU; MPI; distributed; high-performance computing; HPC; HOSVD; compression; truncation

DOI:

Not available

Chinese Library Classification (CLC):

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The dense Tucker decomposition method is one of the most popular algorithms for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication operations, which makes it well-suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This partitioning, however, produces smaller matrix dimensions, which under-utilizes GPU resources. In this paper, we present our optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three key optimizations: a new partitioning strategy that improves performance for GPUs, a new tensor matricization layout that halves the number of communication and matricization steps, and a variation of the randomized SVD algorithm to overcome the eigenvalue calculation bottleneck that arises from the high speedup gained from GPU acceleration. When compared to the state-of-the-art TuckerMPI library, our best GPU implementation, which employs all three optimizations described above, achieves up to 11.8x speedup on 64 nodes. Our best CPU implementation, which also employs all three optimizations, achieves up to 3.6x speedup over TuckerMPI on 64 nodes. When we compare our best GPU implementation to our best CPU implementation, the speedup ranges from 2.1x to 3.6x on a single node, and from 1.8x to 3.3x on 64 nodes, depending on the input data set.
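For readers unfamiliar with the method, the following is a minimal single-node NumPy sketch of a sequentially truncated HOSVD, one common way to compute a dense Tucker decomposition. It is not the paper's distributed GPU implementation: the function names (`unfold`, `mode_product`, `st_hosvd`, `reconstruct`) are hypothetical, and the choice of the sequentially truncated variant with an exact eigendecomposition of the Gram matrix is an assumption made for illustration. It does show why dense matrix multiplication (the Gram matrix and the mode products) and the eigenvalue step are the two costs the abstract refers to.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n matricization: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `matrix` (J x I_mode) into `tensor` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)


def st_hosvd(tensor, ranks):
    """Illustrative sequentially truncated HOSVD (assumed variant).

    Returns a core tensor of shape `ranks` and one orthonormal factor
    matrix per mode.
    """
    core = tensor
    factors = []
    for mode, rank in enumerate(ranks):
        unfolded = unfold(core, mode)
        # Dense matrix multiplication: Gram matrix of the unfolding.
        gram = unfolded @ unfolded.T
        # Eigenvalue step; eigh returns eigenvalues in ascending order.
        _, eigvecs = np.linalg.eigh(gram)
        # Keep the eigenvectors of the `rank` largest eigenvalues.
        factor = eigvecs[:, ::-1][:, :rank]
        factors.append(factor)
        # Shrink the current mode before moving on to the next one.
        core = mode_product(core, factor.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Expand a Tucker core back to the full tensor."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out
```

As a usage example, calling `st_hosvd(x, (2, 3, 2))` on a 6 x 7 x 5 array `x` returns a 2 x 3 x 2 core and three factor matrices of shapes 6 x 2, 7 x 3, and 5 x 2; `reconstruct(core, factors)` then gives the compressed approximation of `x`.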

Pages: 11
