Tuning Strassen's Matrix Multiplication for Memory Efficiency
Mithuna Thottethodi | Siddhartha Chatterjee | Alvin R. Lebeck
[1] V. Strassen. Gaussian elimination is not optimal , 1969 .
[2] Patrick C. Fischer,et al. Efficient Procedures for Using Matrix Algorithms , 1974, ICALP.
[3] Antoni Kreczmar. On Memory Requirements of Strassen's Algorithms , 1976, MFCS.
[4] David S. Wise,et al. Experiments with Quadtree Representation of Matrices , 1988, ISSAC.
[5] David H. Bailey,et al. Extra high speed matrix multiplication on the Cray-2 , 1988 .
[6] Alan Jay Smith,et al. Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.
[7] Jack J. Dongarra,et al. A set of level 3 basic linear algebra subprograms , 1990, TOMS.
[8] David S. Wise,et al. Costs of Quadtree Representation of Nondense Matrices , 1990, J. Parallel Distributed Comput.
[9] R. W. Johnson,et al. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction , 1993, Proceedings Seventh International Parallel Processing Symposium.
[10] Nicholas J. Higham,et al. INVERSE PROBLEMS NEWSLETTER , 1991 .
[11] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[12] P. Sadayappan,et al. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction , 1993, Proceedings Seventh International Parallel Processing Symposium.
[13] Michael A. Heroux,et al. GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm , 1994, Journal of Computational Physics.
[14] David A. Wood,et al. Cache profiling and the SPEC benchmarks: a case study , 1994, Computer.
[15] Kathryn S. McKinley,et al. Tile size selection using cache organization and data layout , 1995, PLDI '95.
[16] J. R. Johnson,et al. Implementation of Strassen's Algorithm for Matrix Multiplication , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.
[17] Scott B. Baden,et al. Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves , 1996, IEEE Trans. Parallel Distributed Syst.
[18] Jeremy D. Frens,et al. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code , 1997, PPOPP '97.
[19] Keshav Pingali,et al. Data-centric multi-level blocking , 1997, PLDI '97.
[20] Sharad Malik,et al. Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.
[21] P. Pauca. Architecture-efficient Strassen's Matrix Multiplication: a Case Study of Divide-and-conquer Algorithms , 1997 .
[22] Chau-Wen Tseng,et al. Data transformations for eliminating conflict misses , 1998, PLDI.