Energy-efficient large-scale matrix multiplication on FPGAs

Energy efficiency has emerged as one of the key performance metrics in computing. In this work, we present an energy-efficient design for large-scale matrix multiplication. As a baseline, we use a highly optimized on-chip matrix multiplication architecture, extended to support large matrices using external memory. Based on the matrix multiplication algorithm and the DRAM model, we present an efficient data layout for storing the input matrices. This data layout reduces the energy consumed by the external memory by minimizing the number of row activations in the DRAM. By exploiting the matrix multiplication algorithm, the modular structure of the DRAM, and the high bandwidth between the on-chip and the external memory, we also propose a memory activation schedule. This schedule is based on a realistic DRAM model and reduces the memory energy, which is the dominant component of the design's energy consumption. Our proposed scheme improves the energy efficiency (defined as the number of operations per Joule) of the baseline architecture by 1.6×, 1.3×, and 1.2× for 32K×32K 16-bit fixed-point, 32K×32K single-precision floating-point, and 16K×16K double-precision floating-point matrix multiplication, respectively.
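
The effect of data layout on row activations can be illustrated with a toy model. The sketch below is not the layout or the DRAM model of this work, whose details the abstract does not give. It assumes a single DRAM bank with an open-page policy, counts one activation whenever an access falls in a different DRAM row than the previous one, and compares a conventional row-major layout with a hypothetical block-major layout when a matrix is streamed block by block. All function names and parameters are illustrative.

```python
def row_activations(addresses, row_size):
    """Count row activations in a single-bank, open-page DRAM model (assumption)."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_size          # DRAM row holding this element
        if row != open_row:             # row-buffer miss: a new row must be activated
            activations += 1
            open_row = row
    return activations


def row_major_addr(i, j, n):
    """Element address when the n x n matrix is stored row by row."""
    return i * n + j


def block_major_addr(i, j, n, b):
    """Element address when each b x b block is stored contiguously (hypothetical layout)."""
    blocks_per_side = n // b
    block_id = (i // b) * blocks_per_side + (j // b)
    return block_id * b * b + (i % b) * b + (j % b)


def block_read_order(n, b):
    """Yield (i, j) in the order a blocked algorithm streams the matrix: one block at a time."""
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            for i in range(bi, bi + b):
                for j in range(bj, bj + b):
                    yield i, j


def compare_layouts(n, b, row_size):
    """Return (row-major activations, block-major activations) for one full pass."""
    order = list(block_read_order(n, b))
    rm = row_activations((row_major_addr(i, j, n) for i, j in order), row_size)
    bm = row_activations((block_major_addr(i, j, n, b) for i, j in order), row_size)
    return rm, bm


if __name__ == "__main__":
    # 64 x 64 matrix, 8 x 8 blocks, DRAM rows of 64 elements (all values are assumptions).
    print(compare_layouts(64, 8, 64))
```

With these assumed parameters, every 8×8 block spans eight DRAM rows under the row-major layout but exactly one under the block-major layout, so the model reports 512 activations versus 64. The real savings depend on the actual DRAM organization, timing, and schedule, which this toy model does not capture.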
