Massively Parallel Polar Decomposition on Distributed-memory Systems

Cited by: 3

Authors:

Ltaief, Hatem [1]

Sukkari, Dalal [1]

Esposito, Aniello [2]

Nakatsukasa, Yuji [3]

Keyes, David [1]

Affiliations:

[1] King Abdullah Univ Sci & Technol, Extreme Comp Res Ctr, 4700 King Abdullah Blvd, Jeddah 23955, Saudi Arabia

[2] Cray EMEA Res Lab, Bristol, Avon, England

[3] Univ Oxford, Math Inst, Oxford, England

Source:

ACM Transactions on Parallel Computing | 2019 / Vol. 6 / No. 1

Keywords:

Polar decomposition; Zolotarev functions; parallel algorithms; strong scaling; distributed-memory systems; ITERATION; ALGORITHMS;

DOI:

10.1145/3328723

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building on the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev (ZOLO) in 1877, this new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence rate up to 17, in contrast to the cubic convergence rate for QDWH. This comes at the price of higher arithmetic costs and memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers. We show that, in the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to 2.3x, especially in situations where QDWH runs out of work, for instance, in the strong scaling mode of operation.
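For orientation, the dynamically weighted Halley iteration that QDWH builds on can be sketched in a few lines of NumPy. This is a minimal single-node illustration, not the paper's distributed implementation. It assumes a real, full-column-rank matrix, and it makes two simplifications: it uses the inverse-based update (via a linear solve) rather than the QR-based one, and it obtains the scaling factor and the singular-value lower bound from a full SVD rather than from cheap estimates. The function names `dwh_weights` and `polar_dwh` are hypothetical, and the weight formulas follow the commonly published DWH form.

```python
import numpy as np


def dwh_weights(l):
    """Dynamic Halley weights (a, b, c) given a lower bound l on the singular values."""
    l = min(l, 1.0)  # guard against rounding pushing l slightly above 1
    d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
    a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
        8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d))
    )
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    return a, b, c


def polar_dwh(A, max_iter=20, tol=1e-10):
    """Return (U, H, iterations) with A ~= U @ H, U having orthonormal columns."""
    # Simplification: exact extreme singular values from an SVD.
    s = np.linalg.svd(A, compute_uv=False)
    alpha, l = s[0], s[-1] / s[0]
    X = A / alpha  # singular values of X now lie in [l, 1]
    I = np.eye(A.shape[1])
    for k in range(1, max_iter + 1):
        a, b, c = dwh_weights(l)
        G = X.T @ X
        # X_new = X (aI + bG)(I + cG)^{-1}, computed through a transposed solve
        X_new = np.linalg.solve(I + c * G, (a * I + b * G) @ X.T).T
        # Propagate the lower bound on the singular values
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        done = np.linalg.norm(X_new - X) < tol
        X = X_new
        if done:
            break
    U = X
    H = U.T @ A
    H = 0.5 * (H + H.T)  # symmetrize to remove rounding noise
    return U, H, k
```

When `l` equals 1 the weights reduce to (3, 1, 3), which is the classical Halley iteration. The explicit solve with `I + c * G` can become ill-conditioned for very ill-conditioned inputs, which is the motivation for the QR-based formulation in QDWH.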


Pages: 15
