Area-Efficient Evaluation of a Class of Arithmetic Expressions Using Deeply Pipelined Floating-Point Cores

Due to technological advances, it has become possible to implement floating-point cores on FPGAs in an effort to provide hardware acceleration for the myriad applications that require high-performance floating-point arithmetic. However, in order to achieve a high clock rate, these floating-point cores must be deeply pipelined. Due to this deep pipelining and the complexity of floating-point arithmetic, floating-point cores consume a great deal of the FPGA's area. It is thus important to use as few floating-point cores in an architecture as possible. However, the deep pipelining makes it difficult to reuse the same floating-point core for a series of floating-point computations that are dependent upon one another. In this paper, we describe an area-efficient architecture and algorithm for the evaluation of arithmetic expressions. This design effectively hides the pipeline latency of the floating-point cores and uses only one floating-point core for each type of operator in the expression. The design is applicable to a wide variety of fields such as scientific computing, cognition, and graph theory. We analyze the performance of this design when implemented on a Xilinx Virtex-II Pro FPGA.
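The core scheduling problem the abstract describes can be illustrated in software. The sketch below (the function name, data structures, and issue policy are my own illustration, not the paper's actual architecture) simulates summing a list of values on a single adder with a fixed pipeline latency: each cycle at most one add is issued, and its result only becomes available as an operand `latency` cycles later. Issuing an add whenever two operands are ready keeps the pipeline busy and hides its latency, while still using only one adder.

```python
from collections import deque

def pipelined_reduce(values, latency):
    """Simulate reducing a non-empty list of values with `+` on a single
    adder that is pipelined with `latency` stages. Returns the final sum
    and the cycle on which it becomes available."""
    ready = deque(values)   # operands currently available for issue
    in_flight = deque()     # (completion_cycle, result) pairs, in issue order
    cycle = 0
    while True:
        # Retire any add whose result completes this cycle; completion
        # cycles are monotone because we issue at most one add per cycle.
        while in_flight and in_flight[0][0] == cycle:
            ready.append(in_flight.popleft()[1])
        # Done when a single value remains and the pipeline has drained.
        if len(ready) == 1 and not in_flight:
            break
        # Issue one add per cycle whenever two operands are ready.
        if len(ready) >= 2:
            a, b = ready.popleft(), ready.popleft()
            in_flight.append((cycle + latency, a + b))
        cycle += 1
    return ready[0], cycle

result, cycles = pipelined_reduce([1, 2, 3, 4], latency=3)
# result is 10; the reduction finishes on cycle 7 in this trace.
```

Note that when fewer than two operands are ready (e.g. while early results are still in flight), the adder idles; the paper's contribution is an architecture and schedule that minimizes such bubbles for general expression trees, not just sums.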
