A Framework for Generating High Throughput CNN Implementations on FPGAs

We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic \textttVerilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique has been known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable \textttVerilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on Intel HARP heterogeneous platform. Using our framework, we achieve throughput of $780.6$ $GOPS$, $669.1$ $GOPS$ and $552.1$ $GOPS$ for AlexNet, VGG16 and FCN-16s respectively. These correspond to $6.8\times$ (AlexNet) and $4.9\times$ (VGG16) improvement compared with the state-of-the-art implementations.

[1]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Viktor Prasanna,et al.  Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System , 2017, FPGA.

[3]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[4]  Andrew Lavin,et al.  Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Viktor K. Prasanna,et al.  Fast and efficient implementation of Convolutional Neural Networks on FPGA , 2017, 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[6]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[8]  P. Duhamel,et al.  `Split radix' FFT algorithm , 1984 .

[9]  Viktor K. Prasanna,et al.  Energy efficient parameterized FFT architecture , 2013, 2013 23rd International Conference on Field programmable Logic and Applications.

[10]  Yu Cao,et al.  Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).

[11]  Emanuel Radoi,et al.  Overlap-Save and Overlap-Add Filters: Optimal Design and Comparison , 2010, IEEE Transactions on Signal Processing.

[12]  Vivienne Sze,et al.  Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators , 2017, IEEE Micro.

[13]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[14]  Yu Cao,et al.  Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks , 2017, FPGA.

[15]  Xuegong Zhou,et al.  A high performance FPGA-based accelerator for large-scale convolutional neural networks , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[18]  Viktor K. Prasanna,et al.  Fast generation of high throughput customized deep learning accelerators on FPGAs , 2017, 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig).

[19]  Jing Li,et al.  Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.

[20]  V. Prasanna,et al.  Optimizing Frequency Domain Implementation of CNNs on FPGAs , 2017 .