Area and time efficient implementations of matrix multiplication on FPGAs

We develop new algorithms and architectures for matrix multiplication on configurable hardware. These designs significantly reduce the latency as well as the area. Our designs improve on the previous designs in terms of the area/speed metric, where speed denotes the maximum achievable running frequency. The area/speed metrics for the two previous designs and our design are 14.45, 4.93, and 2.35, respectively, for 4 × 4 matrix multiplication. The latency of one of the previous designs is 0.57 μs, while our design takes 0.15 μs using 18% less area. The area of our designs is 11% to 46% smaller than that of the best known systolic designs with the same latency for matrices of sizes 3 × 3 to 12 × 12. The performance improvements tend to grow with the problem size.
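
To put the quoted 4 × 4 figures in relative terms, the short sketch below computes improvement factors from the numbers stated above. It assumes that a lower area/speed value is better, which is consistent with the proposed design having the smallest value. The labels "previous design A" and "previous design B" and the helper `improvement_factor` are hypothetical names introduced here for illustration only.

```python
# Figures quoted in the abstract for 4 x 4 matrix multiplication.
# Assumption: lower area/speed is better (less area per unit of clock frequency).
# The labels "previous design A" / "previous design B" are hypothetical.
area_speed = {
    "previous design A": 14.45,
    "previous design B": 4.93,
    "proposed": 2.35,
}

# Latency in microseconds: one previous design versus the proposed design.
latency_us = {"previous": 0.57, "proposed": 0.15}


def improvement_factor(baseline, proposed):
    """Return how many times smaller the proposed value is than the baseline."""
    return baseline / proposed


# Area/speed improvement of the proposed design over each previous design.
metric_gains = {
    name: improvement_factor(value, area_speed["proposed"])
    for name, value in area_speed.items()
    if name != "proposed"
}

# Latency improvement of the proposed design over the quoted previous design.
latency_gain = improvement_factor(latency_us["previous"], latency_us["proposed"])

for name, gain in metric_gains.items():
    print(f"area/speed vs {name}: {gain:.2f}x lower")
print(f"latency: {latency_gain:.2f}x lower")
```

Under these assumptions the area/speed metric is roughly 6.1 and 2.1 times lower than the two previous designs, and the latency is about 3.8 times lower than the quoted previous design.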