Optimizing Memory Performance for FPGA Implementation of PageRank

FPGA implementations of graph algorithms, which arise in areas such as social networks, have recently been studied. However, the irregular memory access patterns of graph algorithms make high performance difficult to obtain. In this paper, we present an FPGA implementation of the classic PageRank algorithm. Our goal is to optimize overall system performance, especially the cost of accessing the off-chip DRAM. We optimize the data layout so that most memory accesses to the DRAM are sequential. Post-place-and-route results show that our design on a state-of-the-art FPGA achieves a clock rate of over 200 MHz. Based on a realistic DRAM access model, we build a simulator to estimate the execution time, including memory access overheads. The simulation results show that our design achieves at least 96% of the theoretical best performance of the target platform. Compared with a baseline design, our optimized design dramatically reduces the number of random memory accesses and improves the execution time by at least 70%.
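To make the idea of a layout-dependent access pattern concrete, the following is a minimal Python sketch, not the design described in this paper. It pairs a plain edge-list PageRank iteration with a toy row-buffer model that counts how often consecutive reads fall in a different DRAM row. The function names, the `row_size` parameter, the damping factor of 0.85, and the assumption that every vertex has at least one outgoing edge are all illustrative choices that do not come from the abstract.

```python
def pagerank(num_vertices, edges, damping=0.85, iterations=20):
    """Edge-list PageRank. `edges` is a list of (src, dst) pairs.

    Assumption: every vertex has at least one outgoing edge, so no
    rank mass is lost to dangling vertices.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Start each iteration from the teleportation term.
        nxt = [(1.0 - damping) / num_vertices] * num_vertices
        for src, dst in edges:
            # Each edge reads rank[src] and updates nxt[dst].
            nxt[dst] += damping * rank[src] / out_degree[src]
        rank = nxt
    return rank


def count_row_changes(addresses, row_size=4):
    """Toy row-buffer model (hypothetical): count accesses that open a
    different row than the one opened by the previous access."""
    changes = 0
    open_row = None
    for addr in addresses:
        row = addr // row_size
        if row != open_row:
            changes += 1
            open_row = row
    return changes


if __name__ == "__main__":
    edges = [(5, 0), (0, 1), (9, 2), (1, 5), (2, 9), (0, 9), (5, 2), (9, 0)]
    ranks = pagerank(10, [(0, 1), (1, 2), (2, 0)] + [(v, 0) for v in range(3, 10)])
    print("rank sum:", sum(ranks))

    # Reads of rank[src] in the original edge order vs. edges sorted by source.
    unsorted_reads = [src for src, _ in edges]
    sorted_reads = [src for src, _ in sorted(edges)]
    print("row changes, unsorted:", count_row_changes(unsorted_reads))
    print("row changes, sorted:  ", count_row_changes(sorted_reads))
```

In this sketch, sorting the edge list by source vertex makes the reads of `rank[src]` monotonic, so the toy model opens each row at most once per pass. A real DRAM model also accounts for banks, timing constraints, and the write pattern to the destination ranks, none of which is captured here.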
