Paper information - High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs

Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations, such as matrix-vector multiplication and dot product, involve the reduction of a sequentially produced stream of values. Unfortunately, because FPGA-based floating-point units are pipelined, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can degrade performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
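
The data hazard mentioned in the abstract arises because a pipelined adder cannot use its own result until several cycles after issue. The following is a minimal software sketch of the general striding idea only, not the circuit proposed in the paper: it assumes an adder latency of `latency` cycles and keeps that many interleaved partial sums, so consecutive inputs never depend on a result still in flight. The function name `strided_sum` and its parameters are hypothetical, and the sketch does not model buffers, multiple input sets, or the cycle-level behavior of the final combining phase.

```python
from typing import Iterable, List


def strided_sum(values: Iterable[float], latency: int) -> float:
    """Accumulate a stream using `latency` interleaved partial sums.

    Assumption (not from the paper): the adder result becomes available
    `latency` cycles after issue, and one input arrives per cycle.
    Input i is added to partial[i % latency], so each partial sum is
    touched only once every `latency` cycles.
    """
    if latency < 1:
        raise ValueError("latency must be >= 1")

    partial: List[float] = [0.0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v

    # Final phase: combine the partial sums into one result.
    # In hardware this phase has its own hazards; here it is a plain loop.
    total = 0.0
    for p in partial:
        total += p
    return total
```

With `latency=1` the function degenerates to ordinary sequential accumulation. Because floating-point addition is not associative, the strided result can differ from the sequential one in the last bits for general inputs.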

Viktor K. Prasanna | Gerald R. Morris | Ling Zhuo

[1] Viktor K. Prasanna, et al. A Library of Parameterizable Floating-Point Cores for FPGAs and Their Application to Scientific Computing, 2005, ERSA.

[2] David A. Bader, et al. Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs) (Extended Abstract), 2002, HiPC.

[3] Maya Gokhale, et al. A Preliminary Study of Molecular Dynamics on Reconfigurable Computers, 2003, Engineering of Reconfigurable Systems and Algorithms.

[4] Duncan A. Buell, et al. The Advanced Encryption Standard on the HC 36m Reconfigurable Computer, 2003.

[5] Karl S. Hemmert, et al. Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance, 2004, 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[6] Robert J. Harrison, et al. Hardware Acceleration of Parallel Lagged-Fibonacci Pseudo Random Number Generation, 2006, ERSA.

[7] Kai Hwang, et al. Vector reduction methods for arithmetic pipelines, 1983, 1983 IEEE 6th Symposium on Computer Arithmetic (ARITH).

[8] Viktor K. Prasanna, et al. Designing scalable FPGA-based reduction circuits using pipelined floating-point cores, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[9] Viktor K. Prasanna, et al. An FPGA-Based Application-Specific Processor for Efficient Reduction of Multiple Variable-Length Floating-Point Data Sets, 2006, IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP'06).

[10] Thomas H. Cormen, et al. Introduction to algorithms [2nd ed.], 2001.

[11] Sadaf R. Alam, et al. Scientific Computing Beyond CPUs: FPGA implementations of common scientific kernels, 2005.

[12] Keith D. Underwood, et al. FPGAs vs. CPUs: trends in peak floating-point performance, 2004, FPGA '04.

[13] Viktor K. Prasanna, et al. Sparse Matrix-Vector multiplication on FPGAs, 2005, FPGA '05.

[14] Viktor K. Prasanna, et al. Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware, 2004, ERSA.

[15] Peter M. Kogge, et al. The Architecture of Pipelined Computers, 1981.

[16] Miriam Leeser, et al. Advanced Components in the Variable Precision Floating-Point Library, 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[17] Clifford Stein, et al. Introduction to Algorithms, 2nd edition, 2001.

[18] Tarek A. El-Ghazawi, et al. Low latency elliptic curve cryptography accelerators for NIST curves over binary fields, 2005, Proceedings. 2005 IEEE International Conference on Field-Programmable Technology, 2005.

[19] Viktor K. Prasanna, et al. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer, 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.