Mapping sparse matrix scientific applications onto FPGA-augmented reconfigurable supercomputers

The large capacity of field-programmable gate arrays (FPGAs) has prompted researchers to map computational kernels onto FPGAs. In some instances, these kernels achieve significant speedups over their software-only counterparts running on general-purpose processors. The success of these efforts has spurred supercomputer companies to develop reconfigurable computers (RCs) that allow the FPGAs to become, in effect, application-specific coprocessors. In concert with the RCs are high-level language-to-hardware description language (HLL-to-HDL) compilers that facilitate development of FPGA-based kernels using HLL-based programming rather than HDL-based hardware design. In theory, these technologies allow end users to create high-performance custom computing architectures.

In practice, acceleration of floating-point scientific kernels remains problematic. Sequential vector reductions such as accumulation are difficult because the pipelined floating-point units introduce loop-carried dependences that prevent the hardware from being fully pipelined. This has a profound impact on fundamental scientific kernels such as matrix-vector multiply. Without pipelining, the potential performance advantage of FPGA-based kernels is eliminated.

This dissertation develops algorithms and architectures for time- and area-efficient software and hardware implementation of scientific kernels on RCs. In particular, it addresses the problem of mapping IEEE Standard 754 double-precision floating-point sparse matrix computations onto FPGA-augmented RCs using an HLL-to-HDL compiler. The major contributions of this research are, first, a novel algorithm and architecture that facilitate HLL-based reduction of multiple, sequentially delivered floating-point vectors without pipeline stalls or buffer overflows, and, second, a demonstration of how to speed up an important class of scientific applications, namely sparse matrix solvers, by mapping them onto reconfigurable supercomputers.
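The loop-carried dependence at the heart of the reduction problem can be illustrated in software. The following C sketch is purely illustrative: the naive loop shows the dependence, and the interleaved version shows a common workaround of keeping one partial sum per pipeline stage. The latency value L = 8 and all names here are assumptions for illustration; this is not the dissertation's reduction architecture.

```c
#include <assert.h>
#include <math.h>

/* Naive accumulation: each iteration consumes the result of the
 * previous one, so a pipelined floating-point adder with latency L
 * stalls for L cycles per element. */
double reduce_naive(const double *x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];            /* loop-carried dependence on sum */
    return sum;
}

/* Common workaround (hypothetical adder latency L = 8): keep L
 * independent partial sums so consecutive additions never depend on
 * each other, then combine the partials in a short final pass. */
#define L 8
double reduce_interleaved(const double *x, int n) {
    double partial[L] = {0.0};
    for (int i = 0; i < n; i++)
        partial[i % L] += x[i]; /* dependences are now L iterations apart */
    double sum = 0.0;
    for (int k = 0; k < L; k++)
        sum += partial[k];
    return sum;
}
```

Note that interleaving changes the order of the floating-point additions, so the two versions can differ by rounding; for the multiple, sequentially delivered vectors the dissertation targets, a more elaborate scheme is needed to avoid stalls and buffer overflows.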
Optimized software versions of two classic iterative solvers, the Jacobi method and the conjugate gradient method, are used as a baseline for comparison. Using heuristics and techniques presented in the dissertation, these two solvers are accelerated with FPGA-based kernels. To ensure a fair comparison, both versions of each solver are developed from the same software baseline, and both are run on the same platform using the same sets of sparse linear equations. The FPGA-augmented versions have a measured speedup of more than two on a current-generation RC and an estimated speedup of more than six on a next-generation RC.
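To make the baseline concrete, here is a minimal C sketch of one sweep of the Jacobi method over a matrix stored in compressed sparse row (CSR) form. The function and variable names are hypothetical, and this is the plain software formulation, not the dissertation's FPGA mapping; note that the inner loop contains exactly the kind of sequential floating-point reduction identified as the pipelining bottleneck.

```c
#include <assert.h>
#include <math.h>

/* One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j]*x[j]) / A[i][i].
 * The matrix is in CSR form: row i's nonzeros are val[rowptr[i]..rowptr[i+1]-1],
 * with column indices in col[]. Assumes every diagonal entry is present
 * and nonzero. */
void jacobi_sweep(int n, const int *rowptr, const int *col,
                  const double *val, const double *b,
                  const double *x, double *x_new) {
    for (int i = 0; i < n; i++) {
        double diag = 0.0, off = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            if (col[k] == i)
                diag = val[k];
            else
                off += val[k] * x[col[k]];  /* sequential FP reduction */
        }
        x_new[i] = (b[i] - off) / diag;
    }
}
```

Repeating the sweep (swapping x and x_new each time) converges for diagonally dominant systems; the conjugate gradient baseline is built around the same CSR matrix-vector product and therefore faces the same reduction bottleneck on an FPGA.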