Accelerating scientific computing applications with reconfigurable hardware

Recent technological advances have made it possible to use reconfigurable hardware to accelerate scientific computing applications. This has led to the development of reconfigurable computers that combine microprocessors, reconfigurable hardware, and a high-performance interconnect. We address several aspects of accelerating scientific computing applications with reconfigurable hardware and reconfigurable computers. Because reconfigurable hardware has no native support for the floating-point arithmetic needed by many scientific computing applications, we introduce a library of double-precision floating-point cores and analyze how performance is affected by the degree of pipelining and by which features of IEEE standard 754 are implemented. Scientific computing applications may spend a large amount of time evaluating arithmetic expressions. Hence, we present area-efficient designs for arithmetic expression evaluation that hide the pipeline latencies of the floating-point cores. These designs use at most two cores for each type of operator in the expression and have better area and throughput properties than designs generated by a state-of-the-art hardware compiler for FPGAs. Experiments show that for 64- and 1024-input expressions, area increases linearly with the number of types of operators. Implementing a design on a reconfigurable computer can be difficult and is not guaranteed to yield a speed-up. We thus formulate hierarchical architectural and performance models for reconfigurable computers that facilitate performance prediction early in the design process. In our work on accelerating molecular dynamics, the performance model has errors of 5% to 13%. We also provide a hierarchical programming model for developing and modeling implementations of scientific computing applications on reconfigurable computers. To demonstrate acceleration of a complete scientific computing application, we study molecular dynamics on reconfigurable computers.
We investigate single-node, shifted-force simulations; single-node, particle-mesh-Ewald simulations; and multi-node, shifted-force simulations. We attain 2x to 3x speed-ups over state-of-the-art microprocessors through a hardware/software approach in which the most computationally intensive task executes on reconfigurable hardware and the remaining tasks execute on the microprocessor. In the particle-mesh-Ewald simulation, we exploit parallelism between the microprocessor and the reconfigurable hardware. For the multi-node, shifted-force simulations, we show that a cluster of accelerated nodes has about the same performance as a cluster of twice as many unaccelerated nodes.
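
For readers unfamiliar with the term, the following is a minimal sketch of what a shifted-force pair interaction looks like. The abstract does not name the underlying potential, so the Lennard-Jones form, the reduced units, the cutoff of 2.5, and all function names here are illustrative assumptions and not the implementation described in this work. The idea is that the force at the cutoff radius is subtracted from the pair force, so that both force and potential go to zero continuously at the cutoff.

```python
# Illustrative sketch only: assumes a Lennard-Jones pair potential in
# reduced units (epsilon = sigma = 1) and a hypothetical cutoff of 2.5.

def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Plain Lennard-Jones potential V(r)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lj_force(r, epsilon=1.0, sigma=1.0):
    """Plain Lennard-Jones force magnitude F(r) = -dV/dr."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r

def shifted_force(r, r_cut=2.5, epsilon=1.0, sigma=1.0):
    """Shifted force: F(r) - F(r_cut) inside the cutoff, zero outside."""
    if r >= r_cut:
        return 0.0
    return lj_force(r, epsilon, sigma) - lj_force(r_cut, epsilon, sigma)

def shifted_force_potential(r, r_cut=2.5, epsilon=1.0, sigma=1.0):
    """Potential whose negative derivative is shifted_force."""
    if r >= r_cut:
        return 0.0
    return (lj_potential(r, epsilon, sigma)
            - lj_potential(r_cut, epsilon, sigma)
            + (r - r_cut) * lj_force(r_cut, epsilon, sigma))
```

Because pairs beyond the cutoff contribute nothing, the per-pair work reduces to a distance test followed by a short chain of floating-point operations, which is the kind of arithmetic expression that pipelined floating-point cores are suited to.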