Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

Cited by: 16
Authors
Gu, Rong [1 ]
Tang, Yun [1 ]
Tian, Chen [1 ]
Zhou, Hucheng [2 ]
Li, Guanru [2 ]
Zheng, Xudong [2 ]
Huang, Yihua [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210000, Jiangsu, Peoples R China
[2] Microsoft Res, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Parallel matrix multiplication; data-parallel algorithms; machine learning library
DOI
10.1109/TPDS.2017.2686384
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline Classification Code
081202
Abstract
Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications. Thus, its performance optimization is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. For existing matrix multiplication execution strategies, when the execution concurrency scales above a threshold, performance deteriorates quickly because the increase in IO cost outweighs the decrease in computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy achieves higher execution concurrency for sub-block matrix multiplication at the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations, including increasing the concurrency of local data exchange by calling the native library in batches, reducing the overhead of block matrix transformation, and reducing heavy disk shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin as a library, along with a set of related matrix operations, on Spark, and have contributed Marlin to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems including Spark MLlib, SystemML, and SciDB, with about 1.29x, 3.53x, and 2.21x speedup on average, respectively. The evaluation on a real-world DNN workload also indicates that Marlin outperforms the above systems by about 12.8x, 5.1x, and 27.2x, respectively.
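The core idea behind a replication-based blocked multiply can be sketched very loosely as a map/reduce-style computation; this is an illustrative single-process sketch, not the authors' actual Spark implementation, and all function names here (`crmm_style_multiply`, `to_blocks`, etc.) are hypothetical. Each sub-block task (i, j, k) computes one partial product A[i][k] * B[k][j] independently; on a data-parallel platform, block replication lets all such tasks run concurrently, and the partial products sharing an (i, j) key are then summed in a reduce phase.

```python
from collections import defaultdict

def matmul(A, B):
    """Plain dense multiply on nested lists (reference implementation)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def to_blocks(M, bs):
    """Partition square matrix M into a (row, col) -> bs x bs block grid."""
    nb = len(M) // bs
    return {(i, j): [row[j * bs:(j + 1) * bs]
                     for row in M[i * bs:(i + 1) * bs]]
            for i in range(nb) for j in range(nb)}

def crmm_style_multiply(A, B, bs):
    """Blocked multiply in the replication-based style: one independent
    sub-block task per (i, j, k) triple, then a reduce-by-key over (i, j)."""
    Ab, Bb = to_blocks(A, bs), to_blocks(B, bs)
    nb = len(A) // bs
    # "Map" phase: every (i, j, k) sub-block product is an independent task;
    # on a cluster, replicated blocks let these all execute concurrently.
    tasks = [((i, j), matmul(Ab[(i, k)], Bb[(k, j)]))
             for i in range(nb) for j in range(nb) for k in range(nb)]
    # "Reduce" phase: sum the partial products that share an (i, j) key.
    out = defaultdict(lambda: None)
    for key, part in tasks:
        out[key] = part if out[key] is None else add(out[key], part)
    # Stitch the result blocks back into a full matrix.
    C = [[0] * len(A) for _ in range(len(A))]
    for (i, j), blk in out.items():
        for r, row in enumerate(blk):
            C[i * bs + r][j * bs:(j + 1) * bs] = row
    return C
```

The trade-off the abstract describes lives in the map phase: replicating blocks raises how many (i, j, k) tasks can run at once, but each replica also has to be shipped over the network, so the IO cost grows with the replication factor.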
Pages: 2539-2552
Page count: 14