Area-Efficient Evaluation of a Class of Arithmetic Expressions Using Deeply Pipelined Floating-Point Cores

Due to technological advances, it has become possible to implement floating-point cores on FPGAs in an effort to provide hardware acceleration for the myriad applications that require high-performance floating-point arithmetic. However, in order to achieve a high clock rate, these floating-point cores must be deeply pipelined. Due to this deep pipelining and the complexity of floating-point arithmetic, floating-point cores consume a great deal of the FPGA's area. It is thus important to use as few floating-point cores in an architecture as possible. However, the deep pipelining makes it difficult to reuse the same floating-point core for a series of floating-point computations that are dependent upon one another. In this paper, we describe an area-efficient architecture and algorithm for the evaluation of arithmetic expressions. This design effectively hides the pipeline latency of the floating-point cores and uses only one floating-point core for each type of operator in the expression. The design is applicable to a wide variety of fields such as scientific computing, cognition, and graph theory. We analyze the performance of this design when implemented on a Xilinx Virtex-II Pro FPGA.
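The core scheduling problem the abstract describes can be illustrated in software. The sketch below (the function name, data structures, and issue policy are my own illustration, not the paper's actual architecture) simulates summing a list of values on a single adder with a fixed pipeline latency: each cycle at most one add is issued, and its result only becomes available as an operand `latency` cycles later. Issuing an add whenever two operands are ready keeps the pipeline busy and hides its latency, while still using only one adder.

```python
from collections import deque

def pipelined_reduce(values, latency):
    """Simulate reducing a non-empty list of values with `+` on a single
    adder that is pipelined with `latency` stages. Returns the final sum
    and the cycle on which it becomes available."""
    ready = deque(values)   # operands currently available for issue
    in_flight = deque()     # (completion_cycle, result) pairs, in issue order
    cycle = 0
    while True:
        # Retire any add whose result completes this cycle; completion
        # cycles are monotone because we issue at most one add per cycle.
        while in_flight and in_flight[0][0] == cycle:
            ready.append(in_flight.popleft()[1])
        # Done when a single value remains and the pipeline has drained.
        if len(ready) == 1 and not in_flight:
            break
        # Issue one add per cycle whenever two operands are ready.
        if len(ready) >= 2:
            a, b = ready.popleft(), ready.popleft()
            in_flight.append((cycle + latency, a + b))
        cycle += 1
    return ready[0], cycle

result, cycles = pipelined_reduce([1, 2, 3, 4], latency=3)
# result is 10; the reduction finishes on cycle 7 in this trace.
```

Note that when fewer than two operands are ready (e.g. while early results are still in flight), the adder idles; the paper's contribution is an architecture and schedule that minimizes such bubbles for general expression trees, not just sums.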
