Fast generation of high throughput customized deep learning accelerators on FPGAs

Accelerating CNNs has been an active area of research. Research on GPUs has led to several mature open-source tools such as Caffe and TensorFlow. However, for FPGA accelerators, such design automation tools are not yet available. We propose an automatic code generation tool that synthesizes high-throughput accelerators for CNN inference, targeting a broad range of CNN models and FPGA devices. The tool takes as input a high-level description of the CNN model and the target FPGA device, and generates fully synthesizable Verilog as output. The tool adopts an algorithm-architecture co-design methodology based on frequency-domain convolution. Our proposed algorithm, called Concatenate and Pad (CaP), together with our efficient design space exploration, ensures design modularity and scalability (in terms of routing complexity and tool execution time). Users can optionally customize various design parameters, such as FFT sizes and the hardware resources to be used. The tool optimizes throughput for the user-specified hardware. To illustrate the tool, we generate optimized designs for AlexNet, VGG16 and variations of them (AlexNet∗ and VGG16∗). Experimental results show that for inference on these models, throughputs of 274.5 GOPS, 660.9 GOPS, 283.2 GOPS and 623.0 GOPS are achieved on the Intel HARP (version 0) platform. The AlexNet and VGG16 designs outperform state-of-the-art FPGA implementations in throughput by 1.85x and 3.53x, respectively. The tool is delivered as a Python3 package and is easily portable onto various computing platforms. Experiments on a variety of CNNs and target FPGA devices show that the tool runs in less than 20 seconds on a commodity desktop.
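The frequency-domain methodology rests on the convolution theorem: a spatial convolution becomes an elementwise product after an FFT, which replaces the sliding-window multiply-accumulates of direct convolution with cheaper pointwise operations. The sketch below illustrates only this underlying principle, not the CaP algorithm or the generated hardware; the function names `fft_conv2d_valid` and `direct_conv2d_valid` are illustrative, and the code assumes single-channel inputs in NumPy.

```python
import numpy as np

def fft_conv2d_valid(image, kernel):
    """2D convolution via the convolution theorem (frequency domain).

    Zero-pads both operands to the full-convolution size, multiplies
    their FFTs elementwise, and crops the 'valid' region of the
    inverse transform.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    fh, fw = ih + kh - 1, iw + kw - 1            # full-convolution size
    F = np.fft.rfft2(image, s=(fh, fw))
    G = np.fft.rfft2(kernel, s=(fh, fw))
    full = np.fft.irfft2(F * G, s=(fh, fw))
    # 'valid' outputs: positions where the kernel lies fully inside.
    return full[kh - 1:ih, kw - 1:iw]

def direct_conv2d_valid(image, kernel):
    """Reference: direct spatial convolution (kernel flipped)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    k = kernel[::-1, ::-1]                       # convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * k)
    return out
```

For an N×N input and k×k kernel, the direct method costs O(N²k²) multiplies while the FFT route costs O(N² log N), which is why frequency-domain designs become attractive for the large feature maps in CNN layers.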
