Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

PHiPAC was an early attempt to improve software performance by searching a large design space of possible implementations to find the best one. At the time, in the early 1990s, the most efficient numerical linear algebra libraries were carefully hand tuned for specific microarchitectures and compilers, and were often written in assembly language. This allowed very precise tuning of an algorithm to the specifics of a given platform and provided great opportunity for high efficiency, and the prevailing thought at the time was that such an approach was necessary to produce near-peak performance. On the other hand, this approach was brittle and required great human effort for each code variant, so only a tiny subset of the possible code design points could be explored. Worse, given the combined complexities of the compiler and microarchitecture, it was difficult to predict which code variants would be worth the implementation effort.

PHiPAC circumvented this effort by using code generators that could easily produce a vast assortment of very different points within a design space, and even across very different design spaces altogether. Because the generated code followed a set of carefully crafted coding guidelines, it was reasonably efficient at any point in the design space.

To search the design space, PHiPAC took a rather naive but effective approach. Given the human-designed and deterministic nature of computing systems, one might reasonably think that smart modeling of the microprocessor and compiler would be sufficient to predict, without performing any timing, the optimal point for a given algorithm. But the combination of an optimizing compiler and a dynamically scheduled microprocessor made such predictions unreliable in practice, so PHiPAC instead generated candidate implementations, timed them on the target machine, and kept the fastest.
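
To make this generate-and-test idea concrete, below is a minimal sketch, not PHiPAC's actual generated output: the function names, the 2x2 blocking factor, and the use of the ANSI C clock() timer are illustrative assumptions. It shows the kind of register-blocked kernel a generator might emit for C = C + A*B, together with the naive search step of timing each candidate variant and keeping the fastest.

#include <time.h>

/* Hypothetical example of one generated code variant: a register-blocked
 * kernel for C = C + A*B, row-major storage, with M and N assumed to be
 * multiples of the (illustrative) 2x2 register block.  Scalar accumulator
 * variables expose register reuse to the compiler, in the spirit of the
 * PHiPAC coding guidelines. */
static void mm_kernel_2x2(int M, int N, int K,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double *C, int ldc)
{
    int i, j, k;
    for (i = 0; i < M; i += 2) {
        for (j = 0; j < N; j += 2) {
            double c00 = C[(i+0)*ldc + (j+0)], c01 = C[(i+0)*ldc + (j+1)];
            double c10 = C[(i+1)*ldc + (j+0)], c11 = C[(i+1)*ldc + (j+1)];
            for (k = 0; k < K; k++) {
                double a0 = A[(i+0)*lda + k], a1 = A[(i+1)*lda + k];
                double b0 = B[k*ldb + (j+0)], b1 = B[k*ldb + (j+1)];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[(i+0)*ldc + (j+0)] = c00;  C[(i+0)*ldc + (j+1)] = c01;
            C[(i+1)*ldc + (j+0)] = c10;  C[(i+1)*ldc + (j+1)] = c11;
        }
    }
}

/* Hypothetical search step: time every generated variant on the target
 * machine and return the index of the fastest.  A real search would run
 * each variant several times and explore blocking factors systematically;
 * this only illustrates the "generate, time, keep the best" idea. */
typedef void (*mm_variant_fn)(int, int, int,
                              const double *, int,
                              const double *, int,
                              double *, int);

static int pick_fastest(mm_variant_fn *variants, int n_variants,
                        int M, int N, int K,
                        const double *A, const double *B, double *C)
{
    int v, best = 0;
    double best_ticks = -1.0;
    for (v = 0; v < n_variants; v++) {
        clock_t t0, t1;
        t0 = clock();
        variants[v](M, N, K, A, K, B, N, C, N);
        t1 = clock();
        if (best_ticks < 0.0 || (double)(t1 - t0) < best_ticks) {
            best_ticks = (double)(t1 - t0);
            best = v;
        }
    }
    return best;
}

Even this toy version shows why generation pays off: changing the register-block shape, the loop order, or the degree of unrolling means regenerating a few lines of C rather than rewriting hand-tuned assembly.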
