论文信息 - High-Throughput and Energy-Efficient Graph Processing on FPGA

High-Throughput and Energy-Efficient Graph Processing on FPGA

In this paper, we propose a novel design for large-scale graph processing on FPGA. Our design uses large external memory for storing massive graph data and FPGA for acceleration, and leverages edge-centric computing principles. We propose a data layout which optimizes the external memory performance and leads to an efficient memory activation schedule to reduce on-chip memory power consumption. Further, we develop a parallel architecture on FPGA which can saturate the external memory bandwidth and concurrently process multiple input data to increase throughput. We use our design to accelerate several classic graph algorithms, including single-source shortest path, weakly connected component, and minimum spanning tree. Experimental results show that for all the considered graph algorithms, our design achieves high throughput of over 600 million traversed edges per second (MTEPS) and high energy-efficiency of over 30 MTEPS/W. Compared with a baseline design, our optimizations result in over 3.6× throughput and 5.8× energy-efficiency improvements, respectively. Our design achieves 32% throughput improvement when compared with state-of-the-art FPGA designs, and up to 7.8× speedup when compared with state-of-the-art multi-core implementation.

[1] T. Lindvall. ON A ROUTING PROBLEM , 2004, Probability in the Engineering and Informational Sciences.

[2] Nachiket Kapre,et al. Spatial hardware implementation for sparse graph algorithms in GraphStep , 2011, TAAS.

[3] Onur Mutlu,et al. Memory Systems , 2014, Computing Handbook, 3rd ed..

[4] Yu Wang,et al. Parallel FPGA-based all pairs shortest paths for sparse networks: A human brain connectome case study , 2012, 22nd International Conference on Field Programmable Logic and Applications (FPL).

[5] Phillip H. Jones,et al. CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.

[6] Jonathan W. Berry,et al. Challenges in Parallel Graph Processing , 2007, Parallel Process. Lett..

[7] Viktor K. Prasanna,et al. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA , 2015, FPGA.

[8] Willy Zwaenepoel,et al. X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.

[9] J. Gregory Steffan,et al. Efficient multi-ported memories for FPGAs , 2010, FPGA '10.

[10] Aart J. C. Bik,et al. Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[11] Markus Püschel,et al. Computer generation of streaming sorting networks , 2012, DAC Design Automation Conference 2012.

[12] Zefu Dai. Appliction-driven Memory System Design on FPGAs , 2014 .

[13] Stephen Neuendorffer,et al. Streaming Systems in FPGAs , 2008, SAMOS.

[14] Viktor K. Prasanna,et al. Energy performance of FPGAs on PERFECT suite kernels , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[15] Franz Franchetti,et al. Understanding the design space of DRAM-optimized hardware FFT accelerators , 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.

[16] Yong Dou,et al. An FPGA Implementation for Solving the Large Single-Source-Shortest-Path Problem , 2016, IEEE Transactions on Circuits and Systems II: Express Briefs.

[17] Viktor K. Prasanna,et al. Accelerating Large-Scale Single-Source Shortest Path on FPGA , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.

[18] Viktor K. Prasanna,et al. Optimizing memory performance for FPGA implementation of pagerank , 2015, 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig).

[19] Bruce Jacob,et al. DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.

[20] Kenneth E. Batcher,et al. Sorting networks and their applications , 1968, AFIPS Spring Joint Computing Conference.

[21] Yu Wang,et al. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[22] Bruce Jacob,et al. Memory Systems: Cache, DRAM, Disk , 2007 .

[23] Lieven Eeckhout,et al. Performance Evaluation and Benchmarking , 2005 .

[24] Ryan Kastner,et al. Energy efficient canonical huffman encoding , 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.

[25] Nachiket Kapre,et al. A Case for Embedded FPGA-based SoCs in Energy-Efficient Acceleration of Graph Problems , 2015, Supercomput. Front. Innov..

[26] Yu Wang,et al. FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search , 2016, FPGA.

[27] James C. Hoe,et al. GraphGen: An FPGA Framework for Vertex-Centric Graph Computation , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.