Optimizing graph algorithms for improved cache performance

We develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. First, we present a cache-oblivious implementation of the Floyd-Warshall algorithm for the all-pairs shortest paths problem, obtained by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of $\Omega(N^3/\sqrt{C})$, where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation achieves more than a sixfold improvement in real execution time over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to a twofold improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by first having the algorithm work on subproblems to generate a suboptimal solution and then solving the whole problem using that suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.
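The idea of a recursive, cache-oblivious Floyd-Warshall can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a quadrant-recursive formulation in which every update is the usual relaxation d[i][j] = min(d[i][j], d[i][k] + d[k][j]), applied tile by tile, and it assumes the number of vertices is a power of two. The names `fw_iterative`, `fw_recursive`, and `base` are hypothetical. Python will not reproduce the reported speedups; the sketch only shows the order in which tiles are visited.

```python
import math

INF = math.inf


def fw_iterative(d):
    """Standard triple-loop Floyd-Warshall, updating d in place."""
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                cand = d[i][k] + d[k][j]
                if cand < d[i][j]:
                    d[i][j] = cand
    return d


def fw_recursive(d, base=1):
    """Quadrant-recursive Floyd-Warshall (illustrative sketch), in place.

    A call rec(r, c, k, s) relaxes the s-by-s tile A = d[r.., c..]
    using B = d[r.., k..] and C = d[k.., c..].
    Assumption: len(d) is a power of two.
    """
    n = len(d)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes n is a power of two"

    def rec(r, c, k, s):
        if s <= base:
            # Base case: plain iterative relaxation restricted to the tiles.
            for kk in range(k, k + s):
                for i in range(r, r + s):
                    for j in range(c, c + s):
                        cand = d[i][kk] + d[kk][j]
                        if cand < d[i][j]:
                            d[i][j] = cand
            return
        h = s // 2
        # First half of the intermediate vertices (k .. k+h-1).
        rec(r, c, k, h)              # A11 with B11, C11
        rec(r, c + h, k, h)          # A12 with B11, C12
        rec(r + h, c, k, h)          # A21 with B21, C11
        rec(r + h, c + h, k, h)      # A22 with B21, C12
        # Second half of the intermediate vertices (k+h .. k+s-1).
        rec(r + h, c + h, k + h, h)  # A22 with B22, C22
        rec(r + h, c, k + h, h)      # A21 with B22, C21
        rec(r, c + h, k + h, h)      # A12 with B12, C22
        rec(r, c, k + h, h)          # A11 with B12, C21

    rec(0, 0, 0, n)
    return d


# Usage: a directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0 with weights 1, 2, 3, 4.
graph = [
    [0, 1, INF, INF],
    [INF, 0, 2, INF],
    [INF, INF, 0, 3],
    [4, INF, INF, 0],
]
dist = fw_recursive([row[:] for row in graph])
```

The recursion never refers to the cache size, which is what makes the scheme cache-oblivious: at some depth the three tiles fit in cache regardless of its capacity. In practice a larger `base` would be chosen so that the recursion overhead is amortized over a tile of useful size.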
