Scalable hybrid designs for linear algebra on reconfigurable computing systems

Recently, reconfigurable computing systems have been built that employ field-programmable gate arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance computing. In this paper, we investigate hybrid designs that effectively utilize both the FPGAs and the processors in reconfigurable computing systems. Based on a high-level computational model, we propose designs for floating-point matrix multiplication and block LU decomposition. In our designs, the workload of an application is partitioned between the FPGAs and the processors in a balanced way; the FPGAs and processors work cooperatively without data hazards or memory access conflicts. Experimental results on the Cray XD1 show that with one Xilinx XC2VP50 FPGA (a relatively small device available in the XD1) and a 2.2 GHz AMD processor, our designs achieve up to a 1.4X speedup over a design that employs AMD processors only and up to a 2X speedup over a design that employs FPGAs only. The performance of our designs scales with the number of nodes. Moreover, our designs achieve higher performance when improved floating-point units or larger devices are used.
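The balanced partitioning idea described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's actual scheme: it assumes that for C = A·B the columns of B (and C) are divided between the FPGA and the processor in proportion to their sustained floating-point rates, so that both finish at roughly the same time. The function name and the throughput figures are hypothetical.

```python
# Hypothetical sketch of a throughput-proportional workload split for
# hybrid FPGA/CPU matrix multiplication C = A * B: columns of B (and C)
# are assigned to each device in proportion to its sustained GFLOP/s.
# The rates used in the example are illustrative, not measured values.

def split_columns(n_cols, fpga_gflops, cpu_gflops):
    """Return (fpga_cols, cpu_cols) proportional to sustained throughput."""
    fpga_cols = round(n_cols * fpga_gflops / (fpga_gflops + cpu_gflops))
    return fpga_cols, n_cols - fpga_cols

# Example: 1024 columns split between an FPGA sustaining ~2.5 GFLOP/s
# and a CPU sustaining ~4.4 GFLOP/s (assumed numbers).
fpga_cols, cpu_cols = split_columns(1024, fpga_gflops=2.5, cpu_gflops=4.4)
print(fpga_cols, cpu_cols)
```

With a static split like this, each device works on disjoint column blocks of C, which is one simple way to avoid the data hazards and memory access conflicts mentioned above.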
