Cache-friendly implementations of transitive closure

We show cache-friendly implementations of the Floyd-Warshall algorithm for the all-pairs shortest-path problem. We first compare the best commercial compiler optimizations available with standard cache-friendly optimizations and a simple improvement involving a block layout, which reduces TLB misses. These optimizations yield improvements of approximately 15%. We also develop a general representation, the unidirectional space-time representation, which can be used to generate cache-friendly implementations for a large class of algorithms. We show analytically and experimentally that this representation can be used to minimize level-1 and level-2 cache misses and TLB misses, and that it therefore exhibits the best overall performance. Using this representation, we show a 2× improvement in performance over the compiler-optimized implementation. Experiments were conducted on Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We used the SimpleScalar simulator to demonstrate improved cache performance.
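
As an illustration of the kind of standard cache-friendly optimization the abstract refers to, the following is a minimal sketch of a tiled (blocked) Floyd-Warshall. It is the textbook three-phase tiling, not necessarily the implementation evaluated in the paper, and it does not model the block data layout or the unidirectional space-time representation. The function names and the tile-size parameter `b` are hypothetical, chosen for this example; the sketch assumes a zero diagonal and no negative cycles.

```python
INF = float("inf")


def floyd_warshall(dist):
    """Reference triple-loop Floyd-Warshall on an n x n distance matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        dk = d[k]
        for i in range(n):
            di = d[i]
            dik = di[k]
            for j in range(n):
                t = dik + dk[j]
                if t < di[j]:
                    di[j] = t
    return d


def _relax_tile(d, i0, j0, k0, b, n):
    """Relax tile (i0, j0) through intermediate vertices k0 .. k0+b-1."""
    for k in range(k0, min(k0 + b, n)):
        dk = d[k]
        for i in range(i0, min(i0 + b, n)):
            di = d[i]
            dik = di[k]
            for j in range(j0, min(j0 + b, n)):
                t = dik + dk[j]
                if t < di[j]:
                    di[j] = t


def tiled_floyd_warshall(dist, b):
    """Tiled Floyd-Warshall with b x b tiles (b is an assumed tuning knob)."""
    n = len(dist)
    d = [row[:] for row in dist]
    starts = range(0, n, b)
    for k0 in starts:
        # Phase 1: the diagonal tile depends only on itself.
        _relax_tile(d, k0, k0, k0, b, n)
        # Phase 2: tiles sharing a row or column with the diagonal tile.
        for j0 in starts:
            if j0 != k0:
                _relax_tile(d, k0, j0, k0, b, n)
        for i0 in starts:
            if i0 != k0:
                _relax_tile(d, i0, k0, k0, b, n)
        # Phase 3: all remaining tiles, using the row and column tiles.
        for i0 in starts:
            if i0 == k0:
                continue
            for j0 in starts:
                if j0 != k0:
                    _relax_tile(d, i0, j0, k0, b, n)
    return d
```

Each call to `_relax_tile` touches at most three tiles, so choosing `b` such that three `b` x `b` tiles fit in cache is what gives the tiling its locality benefit in a compiled implementation; in pure Python the sketch only demonstrates the loop structure and its equivalence to the reference version.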
