Tuning Strassen's Matrix Multiplication for Memory Efficiency

Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement efficiently on a modern machine with a hierarchical memory system. We report on an implementation that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order, which is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting performance, which the cache behavior of the Morton ordering makes possible. Each technique is critical for performance, and their combination in our code multiplies their effectiveness. Performance comparisons with competing implementations show that our implementation often outperforms the alternatives, by up to 25%. However, we also observe wide variability across platforms and matrix sizes, indicating that, at this time, no single implementation is a clear choice for all platforms or matrix sizes. We also note that converting matrices to and from Morton order accounts for a noticeable share of execution time (5% to 15%). Eliminating this overhead further reduces our execution time.
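
To make the two techniques concrete, the following Python sketch illustrates one common way to compute a Morton (Z-order) index and one plausible way to pick a truncation point that keeps padding small. It is not the code described in the abstract: the function names, the bit-interleaving convention (row bit above column bit), and the tile-size range of 16 to 64 are all assumptions chosen for illustration.

```python
from typing import Tuple


def morton_index(row: int, col: int) -> int:
    """Interleave the bits of (row, col) into a Z-order index.

    Assumed convention: at each bit position the row bit sits just
    above the column bit, so each 2x2 quadrant is visited in the
    order NW, NE, SW, SE.
    """
    z = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        z |= ((col >> bit) & 1) << (2 * bit)
        z |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z


def choose_truncation(n: int, t_min: int = 16, t_max: int = 64) -> Tuple[int, int, int]:
    """Pick a leaf tile size and recursion depth for an n x n matrix.

    Searches tile sizes in [t_min, t_max] (a hypothetical range) and
    returns (tile, depth, padded) such that padded = tile * 2**depth
    is the smallest such value that is >= n.
    """
    best = None
    for tile in range(t_min, t_max + 1):
        depth = 0
        while (tile << depth) < n:
            depth += 1
        padded = tile << depth
        if best is None or padded < best[2]:
            best = (tile, depth, padded)
    return best


if __name__ == "__main__":
    # Z-order positions of the elements of a 4x4 matrix.
    for r in range(4):
        print([morton_index(r, c) for c in range(4)])
    # With a fixed power-of-two leaf, n = 513 would pad to 1024.
    print(choose_truncation(513))
```

Under these assumptions, a 513 x 513 matrix is padded to 528 (leaf tile 33, four levels of recursion) rather than to 1024, which is the kind of saving a flexible truncation point is meant to provide.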
