A number of dense matrix-matrix multiplication routines are available. The code is written in ANSI C and runs under LAM/MPI, BSPlib, and the PUB library by recompiling the same source. Matrices are stored in a column-major one-dimensional layout, i.e. element [i,j] of an n x n matrix is stored in location [j*n+i] of an n*n x 1 array. A block distribution of each matrix over the p processors of a parallel machine is assumed, so that each processor is assigned an (n/sqrt(p)) x (n/sqrt(p)) block of A, B, and the result C = A x B.
A variety of algorithms are implemented: some use remote PUT operations for communication, others remote GET operations. The LAM/MPI version also runs using MPI_Send and MPI_Irecv. The underlying algorithm is memory-efficient, using O(n*n/p) memory per processor. Separately, a one-superstep, memory-inefficient algorithm is also implemented.
In the unoptimized variant the standard triple loop of matrix multiplication is used to multiply A and B. In the optimized version, instead of using the standard loop to compute A x B, we first transpose A in place to obtain tr(A) and then modify the multiplication loop to multiply tr(A) and B. The former operation multiplies rows of A with columns of B; the latter multiplies columns of tr(A) (which are rows of A) with columns of B. Since A, B, and tr(A) are stored column-major, the performance improvement comes from the induced locality of reference: both operands are then traversed with unit stride in the inner loop.
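The optimized variant can be sketched as follows; the function names transpose and mult_transposed are illustrative, not the names used in the actual routines. After the in-place transpose, the inner product of column i of tr(A) with column j of B walks both arrays with stride 1 in the column-major layout described above.

```c
#include <assert.h>

#define IDX(i, j, n) ((j) * (n) + (i))

/* In-place transpose of a column-major n x n matrix. */
static void transpose(double *a, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double t = a[IDX(i, j, n)];
            a[IDX(i, j, n)] = a[IDX(j, i, n)];
            a[IDX(j, i, n)] = t;
        }
}

/* Computes C = A x B where "a" already holds tr(A):
 * column i of tr(A) is row i of the original A, so the inner
 * loop reads both a and b at unit stride (good locality). */
static void mult_transposed(const double *a, const double *b,
                            double *c, int n)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[IDX(k, i, n)] * b[IDX(k, j, n)];
            c[IDX(i, j, n)] = s;
        }
}
```

Compare with the standard loop, where the accumulation reads a[IDX(i, k, n)] with stride n; for large n each such access touches a different cache line, which is exactly the cost the transposition avoids.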
Last Update: July 28, 2003