Efficient Neighbor-Sampling-based GNN Training on CPU-FPGA Heterogeneous Platform

Graph neural networks (GNNs) have become increasingly important in many real-world applications. However, training GNNs on large-scale real-world graphs remains challenging. Many sampling-based GNN training algorithms have been proposed to facilitate mini-batch training. The well-known Neighbor-Sampling-based (NS) GNN training algorithms, such as GraphSAGE, have shown great advantages in terms of accuracy, generalization, and scalability on large-scale graphs. Nevertheless, efficient hardware acceleration for such algorithms has not been systematically studied. In this paper, we conduct an experimental study to understand the computational characteristics of NS GNN training. The evaluation results show that neighbor sampling and feature aggregation take the majority of the execution time due to irregular memory accesses and extensive memory traffic. We then propose a system design for NS GNN training that exploits a CPU-FPGA heterogeneous platform. We develop an optimized parallel neighbor sampling implementation and an efficient FPGA accelerator to enable high-throughput GNN training, and propose neighbor sharing and task pipelining techniques to further improve training throughput. We implement a prototype system on an FPGA-equipped server. The evaluation results demonstrate that our CPU-FPGA design achieves a 12-21× speedup over a CPU-only platform and a 0.4-3.2× speedup over a CPU-GPU platform. Moreover, our FPGA accelerator is 2.3× more energy efficient than the GPU board.
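
To make the sampling step concrete, below is a minimal sketch of one level of GraphSAGE-style uniform neighbor sampling over a CSR adjacency. The names (csr_indptr, csr_indices, fanout) and the per-node loop are illustrative assumptions, not the paper's optimized parallel implementation; its irregular, per-node gathers into csr_indices are exactly the memory-access pattern the study identifies as a bottleneck.

    # Sketch of one level of GraphSAGE-style uniform neighbor sampling.
    # Assumed inputs (hypothetical names, not the paper's code):
    #   csr_indptr / csr_indices : CSR adjacency of the graph
    #   seed_nodes               : node IDs in the current mini-batch
    #   fanout                   : max neighbors sampled per node
    import numpy as np

    def sample_neighbors(csr_indptr, csr_indices, seed_nodes, fanout, rng=None):
        rng = rng or np.random.default_rng()
        sampled = []
        for v in seed_nodes:
            # Irregular gather: neighbor list of v lives at an
            # arbitrary offset, causing the random memory accesses
            # that dominate NS GNN training time.
            neigh = csr_indices[csr_indptr[v]:csr_indptr[v + 1]]
            if len(neigh) > fanout:
                neigh = rng.choice(neigh, size=fanout, replace=False)
            sampled.append(neigh)
        return sampled

    # An L-layer model repeats this level by level, feeding the union
    # of sampled nodes back in as the next level's seed_nodes.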