High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs

Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-ll Pro FPGA as the target device, we implemented our designs and present performance and area results.

[1]  Viktor K. Prasanna,et al.  A Library of Parameterizable Floating-Point Cores for FPGAs and Their Application to Scientific Computing , 2005, ERSA.

[2]  David A. Bader,et al.  Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs) (Extended Abstract) , 2002, HiPC.

[3]  Maya Gokhale,et al.  A Preliminary Study of Molecular Dynamics on Reconfigurable Computers , 2003, Engineering of Reconfigurable Systems and Algorithms.

[4]  Duncan A. Buell,et al.  The Advanced Encryption Standard on the HC 36m Reconfigurable Computer , 2003 .

[5]  Karl S. Hemmert,et al.  Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance , 2004, 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[6]  Robert J. Harrison,et al.  Hardware Acceleration of Parallel Lagged-Fibonacci Pseudo Random Number Generation , 2006, ERSA.

[7]  Kai Hwang,et al.  Vector reduction methods for arithmetic pipelines , 1983, 1983 IEEE 6th Symposium on Computer Arithmetic (ARITH).

[8]  Viktor K. Prasanna,et al.  Designing scalable FPGA-based reduction circuits using pipelined floating-point cores , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[9]  Viktor K. Prasanna,et al.  An FPGA-Based Application-Specific Processor for Efficient Reduction of Multiple Variable-Length Floating-Point Data Sets , 2006, IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP'06).

[10]  Thomas H. Cormen,et al.  Introduction to algorithms [2nd ed.] , 2001 .

[11]  Sadaf R. Alam,et al.  Scientific Computing Beyond CPUs: FPGA implementations of common scientific kernels , 2005 .

[12]  Keith D. Underwood,et al.  FPGAs vs. CPUs: trends in peak floating-point performance , 2004, FPGA '04.

[13]  Viktor K. Prasanna,et al.  Sparse Matrix-Vector multiplication on FPGAs , 2005, FPGA '05.

[14]  Viktor K. Prasanna,et al.  Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware , 2004, ERSA.

[15]  Peter M. Kogge,et al.  The Architecture of Pipelined Computers , 1981 .

[16]  Miriam Leeser,et al.  Advanced Components in the Variable Precision Floating-Point Library , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[17]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[18]  Tarek A. El-Ghazawi,et al.  Low latency elliptic curve cryptography accelerators for NIST curves over binary fields , 2005, Proceedings. 2005 IEEE International Conference on Field-Programmable Technology, 2005..

[19]  Viktor K. Prasanna,et al.  A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.