Design tradeoffs for BLAS operations on reconfigurable hardware

Numerical linear algebra operations are key primitives in scientific computing. Their performance has been studied extensively, and the basic operations are available as optimized software libraries. With rapid advances in FPGA (field-programmable gate array) technology, hardware acceleration of linear algebra applications has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we analyze the design tradeoffs. In our implementations, the values of these parameters are determined by hardware constraints such as the available area, the size of on-chip memory, the external memory bandwidth, and the number of I/O pins. The proposed designs are implemented on a Xilinx Virtex-II Pro FPGA.
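To make the tradeoff concrete: the three kernels differ in their ratio of arithmetic to data movement, which is what the design parameters (number of processing elements, on-chip buffer size) must be balanced against. Below is a minimal C sketch of the reference computations with that ratio noted per kernel; the function names and row-major layout are illustrative choices, not taken from the paper, and the loops stand in for the FPGA datapaths the paper actually designs.

```c
/* Illustrative software reference versions of the three BLAS kernels
 * discussed above. The comments note the compute-to-I/O ratio that
 * drives each kernel's design tradeoffs on reconfigurable hardware. */
#include <stddef.h>

/* Level-1: vector (dot) product. 2n flops over 2n words read:
 * ratio O(1), so performance is bound by external memory bandwidth. */
double dot(size_t n, const double *x, const double *y) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Level-2: matrix-vector multiply, y = A*x. 2n^2 flops over about
 * n^2 words read: ratio still O(1); buffering x and y on chip
 * helps only modestly. */
void gemv(size_t n, const double *A, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

/* Level-3: matrix multiply, C = A*B. 2n^3 flops over about 3n^2
 * words: ratio O(n), so blocking against on-chip memory can hide
 * the limited external bandwidth. */
void gemm(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}
```

The differing ratios explain why a single parameterization cannot serve all three operations: for level-1 and level-2 kernels the bandwidth and I/O-pin constraints dominate, while for matrix multiply the on-chip memory size governs how large a block can be held, and thus how far the external traffic can be reduced.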
