Scalable and modular algorithms for floating-point matrix multiplication on FPGAs

Summary form only given. The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. We propose two FPGA-based algorithms for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, and analyze the design tradeoffs in implementing this kernel on FPGAs. Our algorithms employ a linear array architecture with a small amount of control logic. This architecture effectively utilizes the hardware resources of the entire FPGA and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular, so floating-point units can be easily embedded into them. In our designs, the floating-point units are optimized to maximize both the number of PEs integrated on the FPGA and the clock speed. Experimental results show that our algorithms achieve high clock speeds and scale well. They also achieve higher sustained floating-point performance than existing FPGA-based implementations and state-of-the-art processors.
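To make the linear array idea concrete, the following is a minimal software sketch of matrix multiplication on a one-dimensional chain of PEs. It is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a generic scheme in which PE j stores column j of B locally, elements of A are streamed in at the left end and forwarded one PE per cycle, and each PE performs one multiply-accumulate per cycle. The function name `linear_array_matmul` and the one-cycle floating-point latency are illustrative assumptions.

```python
def linear_array_matmul(A, B):
    """Simulate C = A * B on a hypothetical linear array of PEs.

    Assumptions (not taken from the paper): one PE per column of B,
    one multiply-accumulate per PE per cycle, and a one-cycle hop
    between neighbouring PEs.
    """
    n = len(A)        # rows of A
    m = len(B)        # inner dimension
    p = len(B[0])     # columns of B, which is also the number of PEs

    # PE j keeps column j of B in local storage and accumulates column j of C.
    local_b = [[B[k][j] for k in range(m)] for j in range(p)]
    acc = [[0.0] * n for _ in range(p)]

    # regs[j] is the item currently held at PE j; None marks a pipeline bubble.
    regs = [None] * p

    # Elements of A enter the array in row-major order, tagged with indices.
    stream = [(i, k, A[i][k]) for i in range(n) for k in range(m)]

    # The last element enters at cycle len(stream) and needs p - 1 more
    # cycles to reach the last PE.
    cycles = len(stream) + p - 1
    for t in range(cycles):
        incoming = stream[t] if t < len(stream) else None
        # Each PE forwards its item to the right neighbour; PE 0 takes new input.
        regs = [incoming] + regs[:-1]
        for j, item in enumerate(regs):
            if item is not None:
                i, k, a = item
                acc[j][i] += a * local_b[j][k]

    C = [[acc[j][i] for j in range(p)] for i in range(n)]
    return C, cycles


if __name__ == "__main__":
    C, cycles = linear_array_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, cycles)
```

In this sketch each PE communicates only with its immediate neighbour, which is the property that keeps routing simple in a linear array. The cycle count grows with the number of streamed elements plus the pipeline depth, so adding PEs for more columns of B adds only one cycle each.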
