PARALLEL MATRIX TRANSPOSE ALGORITHMS ON DISTRIBUTED-MEMORY CONCURRENT COMPUTERS

Cited by: 24
Authors
CHOI, JY
DONGARRA, JJ
WALKER, DW
Affiliations
[1] OAK RIDGE NATL LAB,MATH SCI SECT,OAK RIDGE,TN 37831
[2] UNIV TENNESSEE,DEPT COMP SCI,KNOXVILLE,TN 37996
Keywords
LINEAR ALGEBRA; MATRIX TRANSPOSE ALGORITHM; DISTRIBUTED MEMORY MULTIPROCESSORS; POINT-TO-POINT COMMUNICATION; INTEL TOUCHSTONE DELTA;
DOI
10.1016/0167-8191(95)00016-H
CLC Classification
TP301 [Theory, Methods];
Subject Classification
081202 ;
Abstract
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
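The block-cyclic mapping and the LCM/GCD step count in the abstract can be illustrated with a short sketch. This is not code from the paper: the `owner` and `transpose_dest` helpers and the P = 4, Q = 6 template sizes are illustrative assumptions, showing only where a block's data must travel when the matrix is transposed and how GCD and LCM of P and Q determine the schedule.

```python
from math import gcd

def owner(i, j, P, Q):
    """Block-cyclic distribution: block (i, j) lives on processor
    (i mod P, j mod Q) of the P x Q processor template."""
    return (i % P, j % Q)

def transpose_dest(i, j, P, Q):
    """Block (i, j) of A becomes block (j, i) of A^T, so its data
    must travel from owner(i, j) to owner(j, i)."""
    return owner(j, i, P, Q)

# Illustrative template sizes (not taken from the paper's experiments).
P, Q = 4, 6
g = gcd(P, Q)          # processors split into GCD groups
lcm = P * Q // g       # period of the block-cyclic communication pattern
steps = lcm // g       # LCM/GCD communication steps, per the abstract

print(g, lcm, steps)   # prints: 2 12 6

# The communication pattern repeats with period LCM in each block index:
assert owner(3 + lcm, 5 + lcm, P, Q) == owner(3, 5, P, Q)
```

When P and Q are relatively prime, g = 1 and steps = P·Q, i.e. every processor must eventually exchange data with every other processor (the complete-exchange case); a larger common divisor shortens the schedule and lets the GCD groups proceed concurrently.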
Pages: 1387 - 1405
Page count: 19
Related Papers
50 total
  • [31] Scalable parallel matrix multiplication on distributed memory parallel computers
    Li, KQ
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2001, 61 (12) : 1709 - 1731
  • [33] Parallel H-matrix arithmetic on distributed-memory systems
    Izadi, Mohammad
    COMPUTING AND VISUALIZATION IN SCIENCE, 2012, 15 (02) : 87 - 97
  • [34] Cache blocking of distributed-memory parallel matrix power kernels
    Lacey, Dane
    Alappat, Christie
    Lange, Florian
    Hager, Georg
    Fehske, Holger
    Wellein, Gerhard
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2025,
  • [35] Distributed-Memory Parallel JointNMF
    Eswar, Srinivas
    Cobb, Benjamin
    Hayashi, Koby
    Kannan, Ramakrishnan
    Ballard, Grey
    Vuduc, Richard
    Park, Haesun
    PROCEEDINGS OF THE 37TH INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ACM ICS 2023, 2023, : 301 - 312
  • [36] A FULLY PARALLEL CONDENSATION METHOD FOR GENERALIZED EIGENVALUE PROBLEMS ON DISTRIBUTED-MEMORY COMPUTERS
    ROTHE, K
    VOSS, H
    PARALLEL COMPUTING, 1995, 21 (06) : 907 - 921
  • [37] Efficient all-to-all broadcast schemes in distributed-memory parallel computers
    Oh, ES
    Kanj, IA
    16TH ANNUAL INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS, PROCEEDINGS, 2002, : 71 - 76
  • [38] Implementation of multiple-precision parallel division and square root on distributed-memory parallel computers
    Takahashi, D
    2000 INTERNATIONAL WORKSHOPS ON PARALLEL PROCESSING, PROCEEDINGS, 2000, : 229 - 235
  • [39] A framework for generating distributed-memory parallel programs for block recursive algorithms
    Gupta, SKS
    Huang, CH
    Sadayappan, P
    Johnson, RW
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1996, 34 (02) : 137 - 153
  • [40] Comparison of backfilling algorithms for job scheduling in distributed-memory parallel systems
    Department of Computer Science, Bowling Green State University, Bowling Green, OH 43403
    Comput. Educ. J., 2007, (4): 22 - 31