BoostGCN: A Framework for Optimizing GCN Inference on FPGA

Graph convolutional networks (GCNs) have revolutionized many big data applications, such as recommendation systems and traffic prediction. However, accelerating GCN inference is challenging due to (1) massive external memory traffic and irregular memory accesses, (2) workload imbalance caused by the skewed degree distribution of real-world graphs, and (3) load imbalance between the two heterogeneous computation phases of the algorithm. To address these challenges, we propose BoostGCN, a framework for optimizing GCN inference on FPGA. First, we develop a novel hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme that combines 3-D partitioning with the vertex-centric computing paradigm. This increases on-chip data reuse and reduces the total volume of data communicated with external memory. Second, we design a novel hardware architecture that enables pipelined execution of the two heterogeneous computation phases, along with a low-overhead task scheduling strategy that reduces the pipeline stalls they cause. Third, we provide a complete GCN acceleration framework on FPGA with optimized RTL templates; it generates hardware designs from a customized configuration and is adaptable to various GCN models. Using our framework, we generate accelerators for various GCN models on a state-of-the-art FPGA platform and evaluate the designs on widely used datasets. Experimental results show that the accelerators produced by our framework achieve significant speedups over state-of-the-art implementations on CPU (≈100×), on GPU (≈30×), and over a prior FPGA accelerator (3–45×).
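To make the idea behind partition-centric feature aggregation concrete, the following is a minimal software sketch, not the paper's actual implementation: vertices are split into fixed-size partitions, edges are bucketed by destination partition, and each partition's partial feature sums are accumulated in a small buffer that stands in for on-chip memory before being written back. All function and variable names here are illustrative assumptions.

```python
# Hypothetical sketch of partition-centric feature aggregation (PCFA).
# Edges destined for one vertex partition are streamed together, so the
# destination-side accumulation buffer (modeling on-chip memory) is
# reused across many edges before being written back to "external" memory.

def pcfa_aggregate(num_vertices, edges, features, part_size):
    """Sum-aggregate neighbor features, one destination partition at a time."""
    feat_len = len(features[0])
    result = [[0.0] * feat_len for _ in range(num_vertices)]
    num_parts = (num_vertices + part_size - 1) // part_size

    # Pre-bucket edges by destination partition so each partition's
    # accumulation buffer is touched by a contiguous stream of edges.
    buckets = [[] for _ in range(num_parts)]
    for src, dst in edges:
        buckets[dst // part_size].append((src, dst))

    for p, bucket in enumerate(buckets):
        base = p * part_size
        buf = [[0.0] * feat_len for _ in range(part_size)]  # "on-chip" buffer
        for src, dst in bucket:
            for k in range(feat_len):
                buf[dst - base][k] += features[src][k]
        # Write the completed partition back to "external" memory once.
        for i, row in enumerate(buf):
            if base + i < num_vertices:
                result[base + i] = row
    return result

# Tiny example: 4 vertices with 1-D features, partition size 2.
edges = [(0, 1), (2, 1), (1, 3)]
features = [[1.0], [2.0], [3.0], [4.0]]
print(pcfa_aggregate(4, edges, features, 2))
# → [[0.0], [4.0], [0.0], [2.0]]  (vertex 1 sums vertices 0 and 2)
```

The hardware scheme additionally tiles along the feature dimension (the third axis of the 3-D partitioning), which this 1-D-feature sketch omits for brevity.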
