Cache-Friendly Implementations of Transitive Closure

Cache performance has been well studied in recent years. Compiler optimizations exist, and hand optimizations have been applied to many problems. Much of this work has focused on dense linear algebra. At first glance, the Floyd-Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd-Warshall algorithm and show that they yield only limited performance improvements. We then discuss the unidirectional space-time representation (USTR). We show analytically that, for a large class of algorithms, the USTR can be used to reduce processor-memory traffic by a factor of O(√C), where C is the cache size. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic that narrows the search space for the tile size minimizing total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd-Warshall algorithm. We show experimentally that this implementation minimizes level-1 cache misses, level-2 cache misses, and TLB misses and therefore exhibits the best overall performance. With this implementation, we show a 2x performance improvement over the best compiler-optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd-Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We present experimental results for Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices, and we demonstrate improved cache performance using the SimpleScalar simulator.
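To make the idea of a tiled Floyd-Warshall implementation concrete, the following is a minimal sketch in Python. It is not the authors' implementation and does not reproduce the USTR derivation; it uses the commonly described three-phase blocked scheme (diagonal tile, then the tiles sharing its row and column, then all remaining tiles) as an assumed stand-in. The function names, the tile size parameter `b`, and the list-of-lists matrix layout are all illustrative choices, and the sketch assumes a zero diagonal and no negative cycles.

```python
INF = float("inf")


def floyd_warshall(dist):
    """Baseline triple loop for comparison; updates dist in place."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist


def _update_tile(dist, n, b, ti, tj, tk):
    """Relax tile (ti, tj) through the intermediate vertices of block tk."""
    i_lo, i_hi = ti * b, min((ti + 1) * b, n)
    j_lo, j_hi = tj * b, min((tj + 1) * b, n)
    k_lo, k_hi = tk * b, min((tk + 1) * b, n)
    # k stays outermost so that self-dependent tiles (phases 1 and 2)
    # see the same update order as the baseline algorithm.
    for k in range(k_lo, k_hi):
        row_k = dist[k]
        for i in range(i_lo, i_hi):
            row_i = dist[i]
            dik = row_i[k]
            for j in range(j_lo, j_hi):
                cand = dik + row_k[j]
                if cand < row_i[j]:
                    row_i[j] = cand


def tiled_floyd_warshall(dist, b):
    """Three-phase blocked Floyd-Warshall with b x b tiles (in place)."""
    n = len(dist)
    tiles = (n + b - 1) // b  # the last tile may be smaller than b
    for tk in range(tiles):
        # Phase 1: the diagonal tile depends only on itself.
        _update_tile(dist, n, b, tk, tk, tk)
        # Phase 2: tiles in the same row and column as the diagonal tile.
        for t in range(tiles):
            if t != tk:
                _update_tile(dist, n, b, tk, t, tk)
                _update_tile(dist, n, b, t, tk, tk)
        # Phase 3: every remaining tile, using the phase 2 results.
        for ti in range(tiles):
            if ti == tk:
                continue
            for tj in range(tiles):
                if tj != tk:
                    _update_tile(dist, n, b, ti, tj, tk)
    return dist


if __name__ == "__main__":
    graph = [[0, 3, INF], [INF, 0, 1], [2, INF, 0]]
    print(tiled_floyd_warshall(graph, 2))
```

Each call to `_update_tile` touches at most three b x b tiles, so one plausible starting point for a tile size search is the largest `b` for which three tiles fit in the cache. That rule of thumb is an assumption made here for illustration, not the selection heuristic developed in the paper.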
