tpSpMV: A two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures

Cited: 10
Authors
Chen, Yuedan [1 ,2 ]
Xiao, Guoqing [1 ,2 ]
Wu, Fan [1 ,2 ]
Tang, Zhuo [1 ,2 ]
Li, Keqin [1 ,2 ,3 ]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China
[2] Natl Supercomp Ctr Changsha, Changsha 410082, Hunan, Peoples R China
[3] SUNY Coll New Paltz, Dept Comp Sci, New Paltz, NY 12561 USA
Funding
National Natural Science Foundation of China;
Keywords
CSR; Manycore; Parallelization; Sparse matrix-vector multiplication (SpMV); SW26010; SPMV; OPTIMIZATION; SYSTEMS;
DOI
10.1016/j.ins.2020.03.020
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Sparse matrix-vector multiplication (SpMV) is an important subroutine in numerical linear algebra and is widely used in many large-scale applications. Accelerating SpMV on multicore and manycore architectures via row-wise parallelization of the Compressed Sparse Row (CSR) format is one of the most popular directions. However, there are three main challenges in optimizing parallel CSR-based SpMV: (a) the limited local memory of each computing unit can be overwhelmed when it is assigned long rows of large-scale sparse matrices; (b) irregular accesses to the input vector incur high memory access latency; (c) the sparse data structure leads to low bandwidth usage. This paper proposes a two-phase large-scale SpMV kernel, called tpSpMV, designed around the memory structure and computing architecture of multicore and manycore platforms to alleviate these three difficulties. First, we propose a two-phase parallel execution technique that splits parallel CSR-based SpMV into two separate phases to overcome the computational scale limitation. Second, for each of the two phases we propose adaptive partitioning methods and parallelization designs that use local memory caching, exploiting the architectural advantages of high-performance computing platforms and alleviating the problem of high memory access latency. Third, we design several optimizations, such as data reduction, aligned memory accesses, and pipelining, to improve bandwidth usage and further optimize tpSpMV's performance. Experimental results on the SW26010 CPUs of the Sunway TaihuLight supercomputer show that tpSpMV achieves speedups of up to 28.61 and yields an average performance improvement of 13.16% over the state-of-the-art work. (C) 2020 Elsevier Inc. All rights reserved.
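To make the two-phase decomposition concrete, the following is a minimal serial C sketch assuming a standard CSR layout; it is not the authors' SW26010 kernel, and the adaptive partitioning, local memory caching, and pipelining optimizations described in the abstract are omitted. All names here (csr_matrix, two_phase_spmv, tmp) are illustrative assumptions. Phase 1 multiplies every stored nonzero by the matching entry of x, which is independent across nonzeros and keeps each computing unit's working set small even for very long rows; phase 2 reduces the partial products row by row into the output vector.

#include <stdlib.h>

typedef struct {
    int nrows, nnz;
    int *row_ptr;   /* size nrows + 1 */
    int *col_idx;   /* size nnz      */
    double *val;    /* size nnz      */
} csr_matrix;

/* Serial sketch of a two-phase CSR SpMV: y = A * x. */
void two_phase_spmv(const csr_matrix *A, const double *x, double *y)
{
    double *tmp = malloc((size_t)A->nnz * sizeof(double));

    /* Phase 1: element-wise products, independent across all nonzeros,
       so each chunk could be assigned to a separate compute unit. */
    for (int k = 0; k < A->nnz; ++k)
        tmp[k] = A->val[k] * x[A->col_idx[k]];

    /* Phase 2: per-row reduction of the partial products into y. */
    for (int i = 0; i < A->nrows; ++i) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            sum += tmp[k];
        y[i] = sum;
    }

    free(tmp);
}

In a manycore setting the nonzeros in phase 1 and the rows in phase 2 would each be partitioned across compute units, which is where the paper's adaptive partitioning and local memory caching come into play.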
Pages: 279-295
Number of pages: 17