High-performance linear algebra on reconfigurable computing systems

High-end computing systems have recently been introduced that employ reconfigurable hardware as application-specific accelerators for general-purpose processors. These systems provide new opportunities for high-performance implementations of scientific applications, but they also pose new design challenges, including utilization of the available hardware resources, exploitation of multiple levels of memory, and hardware/software co-design. In this work, we investigate high-performance designs for floating-point linear algebra on reconfigurable computing systems. The operations studied are fundamental kernels of scientific computing: dense and sparse matrix-vector multiplication, matrix multiplication, and matrix factorization.

We first study existing systems and propose a high-level design model that captures the architectural details of a system through parameters at both the node level and the system level. We then propose optimized designs on reconfigurable hardware using a parameterized design approach: for each target operation, we identify the design parameters, explore the design space, and analyze the design trade-offs. By tuning the parameters, the proposed designs adapt to various hardware devices and achieve optimal performance under the available hardware resources. We also develop high-throughput, area-efficient designs for reduction, a fundamental primitive in linear algebra operations. These designs are then incorporated into hybrid designs that utilize both the processors and the reconfigurable hardware in the system, following a proposed design methodology for workload partitioning and hardware/software coordination.

Experimental results show that our designs for the vector operations achieve more than 90% of the peak performance attainable under the available memory bandwidth. For the matrix operations, our designs achieve optimal latency and minimize the required memory bandwidth with the available hardware resources; moreover, as newer and faster floating-point cores become available, the performance of our designs scales correspondingly. The proposed hybrid designs have been implemented on a state-of-the-art reconfigurable computing system: with six processors and six reconfigurable devices, they achieve more than 20 GFLOPS for matrix multiplication and matrix factorization, sustaining up to 90% of the total computing power of the system and more than 85% of the performance predicted by the high-level design model.
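The abstract does not spell out the parameters of the high-level design model, so the following is only an illustrative sketch of the kind of node-level performance bound such a model might yield: attainable throughput estimated as the minimum of a compute bound and a memory-bandwidth bound. All names (num_pes, clock_mhz, mem_bw_gbs, flops_per_byte) and sample values are hypothetical stand-ins, not parameters taken from the work itself.

```c
#include <stdio.h>

/* Hypothetical node-level parameters; the actual model's parameters are
 * not given in the abstract. */
typedef struct {
    int    num_pes;    /* floating-point processing elements on the device */
    double clock_mhz;  /* design clock frequency */
    double mem_bw_gbs; /* sustained memory bandwidth, GB/s */
} node_params;

/* Compute bound, assuming each PE produces one floating-point result
 * per cycle: num_pes * clock (MHz) / 1000 gives GFLOPS. */
static double compute_bound_gflops(node_params p) {
    return p.num_pes * p.clock_mhz / 1000.0;
}

/* Bandwidth bound: GFLOPS limited by the bytes memory can deliver,
 * scaled by the kernel's arithmetic intensity (flops per byte). */
static double bandwidth_bound_gflops(node_params p, double flops_per_byte) {
    return p.mem_bw_gbs * flops_per_byte;
}

int main(void) {
    node_params node = { .num_pes = 8, .clock_mhz = 200.0, .mem_bw_gbs = 3.2 };
    /* Dense matrix-vector multiply streams the matrix once: roughly
     * 2 flops per 8-byte double read, i.e. 0.25 flops/byte. */
    double c = compute_bound_gflops(node);
    double b = bandwidth_bound_gflops(node, 0.25);
    printf("attainable GFLOPS ~ min(compute %.2f, bandwidth %.2f) = %.2f\n",
           c, b, c < b ? c : b);
    return 0;
}
```

Taking the minimum of the two bounds mirrors the distinction drawn in the abstract: the vector operations are evaluated against the peak attainable under the available memory bandwidth, while the matrix operations are bounded by the available hardware resources.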