论文信息 - HitGraph: High-throughput Graph Processing Framework on FPGA

HitGraph: High-throughput Graph Processing Framework on FPGA

This paper presents, HitGraph, an FPGA framework to accelerate graph processing based on the edge-centric paradigm. HitGraph takes in an edge-centric graph algorithm and hardware resource constraints, determines design parameters and then generates a Register Transfer Level (RTL) FPGA design. This makes accelerator design for various graph analytics transparent and user-friendly by masking internal details of the accelerator design process. HitGraph enables increased data reuse and parallelism through novel algorithmic optimizations, including (1) an optimized data layout that reduces non-sequential external memory accesses, (2) an efficient update merging and filtering scheme to reduce the data communication between the FPGA and external memory, and (3) a partition skipping scheme to reduce redundant edge traversals for non-stationary graph algorithms. Based on our design methodology, we accelerate Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). Experimental results show that HitGraph sustains a high throughput of 2076 Million Traversed Edges Per Second (MTEPS) for SpMV, 2225 MTEPS for PR, 2916 MTEPS for SSSP, and 3493 MTEPS for WCC, respectively. Compared with highly-optimized multi-core implementations, HitGraph achieves up to 37.9× speedup. Compared with state-of-the-art FPGA frameworks, HitGraph achieves up to 50.7× throughput improvement.

[1] Aart J. C. Bik,et al. Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[2] Margaret Martonosi,et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[3] Stephen Neuendorffer,et al. Streaming Systems in FPGAs , 2008, SAMOS.

[4] Marco D. Santambrogio,et al. Architectural optimizations for high performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[5] Wenguang Chen,et al. GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning , 2015, USENIX Annual Technical Conference.

[6] Jure Leskovec,et al. {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .

[7] Karin Strauss,et al. A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication , 2014, FCCM 2014.

[8] Wayne Luk,et al. A framework for FPGA acceleration of large graph problems: Graphlet counting case study , 2011, 2011 International Conference on Field-Programmable Technology.

[9] Zefu Dai. Appliction-driven Memory System Design on FPGAs , 2014 .

[10] Panos Kalnis,et al. Mizan: a system for dynamic load balancing in large-scale graph processing , 2013, EuroSys '13.

[11] Jing Li,et al. Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search , 2017, FPGA.

[12] Karin Strauss,et al. A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.

[13] Bruce Jacob,et al. DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.

[14] Kenneth E. Batcher,et al. Sorting networks and their applications , 1968, AFIPS Spring Joint Computing Conference.

[15] Ozcan Ozturk,et al. Energy Efficient Architecture for Graph Analytics Accelerators , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[16] Yu Wang,et al. ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture , 2017, FPGA.

[17] Willy Zwaenepoel,et al. X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.

[18] Yiran Chen,et al. GraphR: Accelerating Graph Processing Using ReRAM , 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[19] Yong Dou,et al. An FPGA Implementation for Solving the Large Single-Source-Shortest-Path Problem , 2016, IEEE Transactions on Circuits and Systems II: Express Briefs.

[20] Huazhong Yang,et al. HyVE: Hybrid vertex-edge memory hierarchy for energy-efficient graph processing , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[21] Yu Wang,et al. Parallel FPGA-based all pairs shortest paths for sparse networks: A human brain connectome case study , 2012, 22nd International Conference on Field Programmable Logic and Applications (FPL).

[22] James C. Hoe,et al. A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems , 2016, FPGA.

[23] Jinwook Kim,et al. GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs , 2016, SIGMOD Conference.

[24] John D. Owens,et al. Gunrock: a high-performance graph processing library on the GPU , 2015, PPoPP.

[25] Marco D. Santambrogio,et al. Heterogeneous exascale supercomputing: The role of CAD in the exaFPGA project , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[26] Jianlong Zhong,et al. Medusa: Simplified Graph Processing on GPUs , 2014, IEEE Transactions on Parallel and Distributed Systems.

[27] Brandon Lucia,et al. When is Graph Reordering an Optimization? Studying the Effect of Lightweight Graph Reordering Across Applications and Input Graphs , 2018, 2018 IEEE International Symposium on Workload Characterization (IISWC).

[28] Yu Wang,et al. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[29] Pradeep Dubey,et al. GraphMat: High performance graph analytics made productive , 2015, Proc. VLDB Endow..

[30] Yu Wang,et al. NXgraph: An efficient graph processing system on a single machine , 2015, 2016 IEEE 32nd International Conference on Data Engineering (ICDE).

[31] Guy E. Blelloch,et al. GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[32] Zbigniew J. Czech,et al. Introduction to Parallel Computing , 2017 .

[33] Reynold Xin,et al. GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[34] Tianshi Chen,et al. TuNao: A High-Performance and Energy-Efficient Reconfigurable Accelerator for Graph Processing , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[35] James C. Hoe,et al. GraphGen: An FPGA Framework for Vertex-Centric Graph Computation , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.

[36] Viktor K. Prasanna,et al. High-Throughput and Energy-Efficient Graph Processing on FPGA , 2016, 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[37] Viktor K. Prasanna,et al. ReCALL: Reordered Cache Aware Locality Based Graph Processing , 2017, 2017 IEEE 24th International Conference on High Performance Computing (HiPC).

[38] Bo Wu,et al. Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[39] Kunle Olukotun,et al. GraphOps: A Dataflow Library for Graph Analytics Acceleration , 2016, FPGA.

[40] Viktor K. Prasanna,et al. An FPGA framework for edge-centric graph processing , 2018, CF.

[41] Tim Weninger,et al. Thinking Like a Vertex , 2015, ACM Comput. Surv..

[42] Tom Feist,et al. Vivado Design Suite , 2012 .

[43] Keval Vora,et al. CuSha: vertex-centric graph processing on GPUs , 2014, HPDC '14.

[44] Jing Li,et al. Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform , 2018, FPGA.

[45] Jing Li,et al. Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform , 2018, FPGA.

[46] Viktor K. Prasanna,et al. Optimizing memory performance for FPGA implementation of pagerank , 2015, 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig).