High-Performance Dense Tucker Decomposition on GPU Clusters

Authors
Choi, Jee [1]
Liu, Xing [1]
Chakaravarthy, Venkatesan [2]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] IBM India Res Lab, New Delhi, India
Keywords
tensor decomposition; Tucker; GPU; MPI; distributed; high-performance computing; HPC; HOSVD; compression; truncation
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The dense Tucker decomposition method is one of the most popular algorithms for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication operations, which makes it well suited to GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This, however, leads to smaller matrix dimensions, which under-utilize GPU resources. In this paper, we present our optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three key optimizations: a new partitioning strategy that improves performance on GPUs, a new tensor matricization layout that halves the number of communication and matricization steps, and a variation of the randomized SVD algorithm that overcomes the eigenvalue calculation bottleneck arising from the high speedup gained from GPU acceleration. Compared to the state-of-the-art TuckerMPI library, our best GPU implementation, which employs all three optimizations, achieves up to 11.8x speedup on 64 nodes. Our best CPU implementation, which also employs all three optimizations, achieves up to 3.6x speedup over TuckerMPI on 64 nodes. Comparing our best GPU implementation to our best CPU implementation, the speedup ranges from 2.1x to 3.6x on a single node and from 1.8x to 3.3x on 64 nodes, depending on the input data set.
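To make the abstract's terms concrete, the following is a minimal single-process sketch of a sequentially truncated HOSVD in NumPy. It is not the paper's implementation and is not taken from TuckerMPI: the function names (`unfold`, `randomized_basis`, `st_hosvd`, `reconstruct`) and the `oversample` parameter are hypothetical, the randomized step is the generic randomized range finder rather than the paper's variation of it, and no partitioning, MPI communication, or GPU offload is modeled. It only illustrates why the work reduces to dense matrix products on matricized tensors.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n matricization: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def randomized_basis(matrix, rank, oversample=5, seed=0):
    """Approximate the leading `rank` left singular vectors of `matrix`.

    Generic randomized range finder (an assumption of this sketch, not the
    paper's specific variant). `oversample` is a hypothetical tuning knob.
    """
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, matrix.shape[0])
    # One large matrix product replaces an eigen-solve on the Gram matrix.
    sketch = matrix @ rng.standard_normal((matrix.shape[1], k))
    q, _ = np.linalg.qr(sketch)  # orthonormal basis containing the sketch's range
    u_small, _, _ = np.linalg.svd(q.T @ matrix, full_matrices=False)
    return (q @ u_small)[:, :rank]


def st_hosvd(tensor, ranks):
    """Sequentially truncated HOSVD: shrink one mode at a time."""
    core, factors = tensor, []
    for mode, rank in enumerate(ranks):
        u = randomized_basis(unfold(core, mode), rank)
        factors.append(u)
        # Tensor-times-matrix: contract `mode` of the core with u^T.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors


def reconstruct(core, factors):
    """Multiply the core by every factor matrix to recover the full tensor."""
    out = core
    for mode, u in enumerate(factors):
        out = np.moveaxis(np.tensordot(u, out, axes=(1, mode)), 0, mode)
    return out


# Usage: build a tensor with exact multilinear rank (3, 4, 2) and compress it.
rng = np.random.default_rng(1)
true_core = rng.standard_normal((3, 4, 2))
true_factors = [
    np.linalg.qr(rng.standard_normal((n, r)))[0]
    for n, r in zip((10, 12, 8), (3, 4, 2))
]
x = reconstruct(true_core, true_factors)

core, factors = st_hosvd(x, (3, 4, 2))
rel_err = np.linalg.norm(reconstruct(core, factors) - x) / np.linalg.norm(x)
print(core.shape, rel_err)
```

Because the demo tensor has exactly the requested multilinear rank, the relative reconstruction error should be close to machine precision. On real data the error depends on the chosen ranks and on how quickly the singular values of each unfolding decay.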
Pages: 11