Transitive closure on the Cell Broadband Engine: A study on self-scheduling in a multicore processor

In this paper, we present a mapping methodology and optimizations for solving transitive closure on the Cell multicore processor. Using our approach, it is possible to achieve near-peak performance for transitive closure on the Cell processor. We first parallelize the standard Floyd-Warshall algorithm and show, through analysis and experimental results, that data communication is a bottleneck for performance and scalability. To remove this memory bottleneck, we parallelize a cache-optimized (blocked) version of the Floyd-Warshall algorithm. As with many scientific computing and industrial applications on multicore processors, synchronization and scheduling of the cores play a crucial role in determining the performance of this algorithm. We define a self-scheduling mechanism for the cores of a multicore processor and design a self-scheduler for the blocked Floyd-Warshall algorithm on the Cell processor to remove the scheduling bottleneck. We also present optimizations to the scheduling order that eliminate synchronization points. Our implementations achieve up to 78 GFLOPS.
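
To make the blocked formulation concrete, the sketch below is a minimal, single-threaded C version of blocked Floyd-Warshall transitive closure over a boolean adjacency matrix. It only illustrates the diagonal / pivot-row-column / remainder phase ordering that a self-scheduler can exploit; the matrix size, block size, and function names are assumptions for illustration, and this is not the paper's Cell/SPE implementation.

#include <stdbool.h>

#define N 1024   /* matrix dimension (assumed for illustration) */
#define B 64     /* block size (assumed for illustration)       */

/* Update block (bi, bj) of the boolean adjacency matrix using pivot
   indices k drawn from block kb:  adj[i][j] |= adj[i][k] && adj[k][j].
   Over the boolean semiring this yields transitive closure. */
static void fw_block(bool *adj, int bi, int bj, int kb)
{
    for (int k = kb * B; k < (kb + 1) * B; k++)
        for (int i = bi * B; i < (bi + 1) * B; i++)
            if (adj[i * N + k])
                for (int j = bj * B; j < (bj + 1) * B; j++)
                    adj[i * N + j] = adj[i * N + j] || adj[k * N + j];
}

/* Blocked Floyd-Warshall: for each pivot block kb, update the diagonal
   block first, then the pivot row and pivot column, then all remaining
   blocks, which depend only on the already-updated row/column blocks. */
void transitive_closure_blocked(bool *adj)
{
    int nb = N / B;
    for (int kb = 0; kb < nb; kb++) {
        fw_block(adj, kb, kb, kb);                    /* phase 1: diagonal  */
        for (int j = 0; j < nb; j++)                  /* phase 2: pivot row */
            if (j != kb) fw_block(adj, kb, j, kb);
        for (int i = 0; i < nb; i++)                  /* phase 2: pivot col */
            if (i != kb) fw_block(adj, i, kb, kb);
        for (int i = 0; i < nb; i++)                  /* phase 3: remainder */
            for (int j = 0; j < nb; j++)
                if (i != kb && j != kb) fw_block(adj, i, j, kb);
    }
}

In this blocked form, the phase-3 block updates for a given pivot block are mutually independent, which is what makes the computation amenable to the kind of per-core self-scheduling described in the paper.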
