High-Performance and Parameterized Matrix Factorization on FPGAs

FPGAs have become an attractive choice for scientific computing. In this paper, we propose a high performance design for LU decomposition, a key kernel in many scientific and engineering applications. Our design achieves the optimal performance for LU decomposition using the available hardware resources. The design is parameterized. Thus, it can be easily adapted to various hardware constraints. Experimental results show that our design achieves high performance and offers good scalability. Our implementation on a Xilinx Virtex-II Pro XC2VP100 achieves superior sustained floating-point performance over existing FPGA-based implementations and optimized libraries on the state-of-the-art processors.