A number of dense matrix-matrix multiplication routines are available. The code is written in ANSI C and runs under LAM/MPI, BSPlib, and the PUB library by recompiling the same source. Matrices are stored in a column-major one-dimensional layout, i.e. element [i,j] of an n x n matrix is stored in location [j*n+i] of an n*n x 1 array. A block distribution of each matrix over the p processors of a parallel machine is assumed, so that each processor is assigned an (n/sqrt(p)) x (n/sqrt(p)) block of A, B, and the result C = A x B.
A variety of algorithms are implemented: some use remote PUT operations for communication, others remote GET operations. The LAM/MPI version also runs using MPI_Send and MPI_Irecv. The underlying algorithm is memory-efficient, using O(n*n/p) memory per processor. Separately, a one-superstep, memory-inefficient algorithm is also implemented.
In the unoptimized variant the standard triple loop of matrix multiplication is used to multiply A and B. In the optimized version, instead of using the standard loop to compute A x B, we first transpose A in place to obtain tr(A) and then modify the multiplication loop to multiply tr(A) and B. The former operation multiplies rows of A with columns of B; the latter multiplies columns of tr(A) (which are rows of A) with columns of B. Since A, B, and tr(A) are stored column-major, the performance improvement comes from the induced locality of reference: both operands are then traversed with unit stride in the inner loop.
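The optimized variant can be sketched as follows; the function names transpose and mult_transposed are illustrative, not the names used in the actual routines. After the in-place transpose, the inner product of column i of tr(A) with column j of B walks both arrays with stride 1 in the column-major layout described above.

```c
#include <assert.h>

#define IDX(i, j, n) ((j) * (n) + (i))

/* In-place transpose of a column-major n x n matrix. */
static void transpose(double *a, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double t = a[IDX(i, j, n)];
            a[IDX(i, j, n)] = a[IDX(j, i, n)];
            a[IDX(j, i, n)] = t;
        }
}

/* Computes C = A x B where "a" already holds tr(A):
 * column i of tr(A) is row i of the original A, so the inner
 * loop reads both a and b at unit stride (good locality). */
static void mult_transposed(const double *a, const double *b,
                            double *c, int n)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[IDX(k, i, n)] * b[IDX(k, j, n)];
            c[IDX(i, j, n)] = s;
        }
}
```

Compare with the standard loop, where the accumulation reads a[IDX(i, k, n)] with stride n; for large n each such access touches a different cache line, which is exactly the cost the transposition avoids.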
Last Update: July 28, 2003