A Flexible Design Automation Tool for Accelerating Quantized Spectral CNNs

CNNs have proven extremely powerful in a wide range of computer vision applications. To alleviate their computational burden and improve hardware efficiency, low-complexity convolution algorithms (e.g., spectral convolution) and data quantization schemes have been implemented on FPGAs. However, translating the reduced algorithmic complexity into improved hardware performance requires significant manual tuning of mapping parameters specific to the CNN model and the target FPGA device. We propose a flexible tool that automates the generation of high-throughput accelerators for quantized spectral CNNs. The tool takes as input a high-level specification of the CNN model, the data quantization scheme, and the target hardware architecture, and outputs synthesizable Verilog after a fast exploration of the complete design space. Our tool is flexible in three dimensions: 1) data representation, 2) FPGA architecture, and 3) CNN models. To support arbitrary quantization bit widths, we propose a resource-efficient multiplier design that uses the fixed, high bit-width DSPs to implement the various low bit-width complex multiplications needed in spectral CNNs. To support FPGAs with limited on-chip memory, we propose a systolic array-based architecture for spectral convolution that exploits the high computational parallelism of the DSPs without stressing BRAM resources. To support CNNs with diverse layer parameters, we tile and permute data blocks to saturate the communication and computation capacity. Finally, we propose a fast design space exploration algorithm to complete the end-to-end Verilog generation. The entire design space exploration and Verilog generation take less than 1 second on an Intel Core i5 laptop. We evaluate the tool on Stratix 10 and Stratix V FPGAs using AlexNet and VGG16. The generated accelerators achieve 2X to 4X higher throughput than state-of-the-art designs for 8-bit and 16-bit data quantization.
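As a point of reference for the algorithmic idea behind spectral convolution, the sketch below shows a 2-D convolution computed via the convolution theorem: zero-pad, transform with the FFT, multiply pointwise in the frequency domain, and inverse-transform. This is a generic NumPy illustration of the technique, not the paper's hardware design; all function names here are our own.

```python
import numpy as np

def spectral_conv2d(image, kernel):
    """Full linear 2-D convolution computed in the frequency domain.

    Zero-padding both operands to the full output size turns the FFT's
    circular convolution into the desired linear convolution.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih + kh - 1, iw + kw - 1          # full linear-convolution size
    F_img = np.fft.fft2(image, s=(oh, ow))     # zero-pad, then transform
    F_ker = np.fft.fft2(kernel, s=(oh, ow))
    # Pointwise complex multiplication replaces the sliding-window sum
    return np.real(np.fft.ifft2(F_img * F_ker))

# Sanity check against a direct (spatial-domain) convolution
rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))
direct = np.zeros((10, 10))
for i in range(3):
    for j in range(3):
        direct[i:i + 8, j:j + 8] += kernel[i, j] * image
assert np.allclose(spectral_conv2d(image, kernel), direct)
```

For an H×H image and K×K kernel, the direct method costs O(H²K²) multiplications while the spectral method costs O(H² log H); the low bit-width complex multiplications in the pointwise-product step are exactly what the paper's DSP-packing multiplier design targets.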
