Designing scalable FPGA-based reduction circuits using pipelined floating-point cores

The use of pipelined floating-point arithmetic cores to create high-performance FPGA-based computational kernels has introduced a new class of problems that do not exist when using single-cycle arithmetic cores. In particular, the data hazards associated with pipelined floating-point reduction circuits can limit the scalability or severely reduce the performance of an otherwise high-performance computational kernel. The inability to efficiently execute the reduction in hardware coupled with memory bandwidth issues may even negate the performance gains derived from hardware acceleration of the kernel. In this paper we introduce a method for developing scalable floating-point reduction circuits that run in optimal time while requiring only /spl Theta/ (lg (n)) space and a single pipelined floating-point unit. Using a Xilinx Virtex-II Pro as the target device, we implement reference instances of our reduction method and present the FPGA design statistics supporting our scalability claims.

[1]  Miriam Leeser,et al.  Variable Precision Floating Point Division and Square Root , 2005 .

[2]  Karl S. Hemmert,et al.  Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance , 2004, 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[3]  Oliver Diessel,et al.  Resource-aware run-time elaboration of behavioural FPGA specifications , 2002, 2002 IEEE International Conference on Field-Programmable Technology, 2002. (FPT). Proceedings..

[4]  Delon Levi,et al.  Of gates and wires , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[5]  Viktor K. Prasanna,et al.  Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware , 2004, ERSA.

[6]  William S. Fithian Iterative Matrix Equation Solver for a Reconfigurable FPGA-Based Hypercomputer® , 2002 .

[7]  Viktor K. Prasanna,et al.  Design tradeoffs for BLAS operations on reconfigurable hardware , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[8]  Margaret Martonosi,et al.  Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques , 2000, IEEE Trans. Computers.

[9]  Neil W. Bergmann,et al.  The Egret platform for reconfigurable system on chip , 2003, Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798).

[10]  Viktor K. Prasanna,et al.  Sparse Matrix-Vector multiplication on FPGAs , 2005, FPGA '05.

[11]  Viktor K. Prasanna,et al.  Analysis of high-performance floating-point arithmetic on FPGAs , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..