Fast and efficient implementation of Convolutional Neural Networks on FPGA

State-of-the-art CNN models for Image recognition use deep networks with small filters instead of shallow networks with large filters, because the former requires fewer weights. In the light of above trend, we present a fast and efficient FPGA based convolution engine to accelerate CNN models over small filters. The convolution engine implements Winograd minimal filtering algorithm to reduce the number of multiplications by 38% to 55% for state-of-the-art CNNs. We exploit the parallelism of the Winograd convolution engine to scale the overall performance. We show that our overall design sustains the peak throughput of the convolution engines. We propose a novel data layout to reduce the required memory bandwidth of our design by half. One noteworthy feature of our Winograd convolution engine is that it hides the computation latency of the pooling layer. As a case study we implement VGG16 CNN model and compare it with previous approaches. Compared with the state-of-the-art reduced precision VGG16 implementation, our implementation achieves 1.2× improvement in throughput by using 3× less multipliers and 2× less on-chip memory without impacting the classification accuracy. The improvements in throughput per multiplier and throughput per unit on-chip memory are 3.7× and 2.47× respectively, compared with the state-of-the-art design.

[1]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  S. Winograd Arithmetic complexity of computations , 1980 .

[4]  Viktor Prasanna,et al.  Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System , 2017, FPGA.

[5]  Xiaogang Wang,et al.  Deep Learning Face Representation by Joint Identification-Verification , 2014, NIPS.

[6]  Yann LeCun,et al.  Fast Training of Convolutional Networks through FFTs , 2013, ICLR.

[8]  Gerald Penn,et al.  Convolutional Neural Networks for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[10]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[11]  Shawki Areibi,et al.  Caffeinated FPGAs: FPGA framework For Convolutional Neural Networks , 2016, 2016 International Conference on Field-Programmable Technology (FPT).

[12]  Tyler Highlander,et al.  Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add , 2016, BMVC.

[13]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[14]  Viktor K. Prasanna,et al.  Energy optimizations for FPGA-based 2-D FFT architecture , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[15]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[16]  Andrew Lavin,et al.  Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Berin Martini,et al.  Large-Scale FPGA-based Convolutional Networks , 2011 .

[18]  Yu Cao,et al.  Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.