Accelerating PageRank using Partition-Centric Processing

PageRank is a fundamental link analysis algorithm and a key representative of the performance of other graph algorithms and Sparse Matrix-Vector (SpMV) multiplication. Calculating PageRank on sparse graphs generates a large number of random memory accesses, resulting in low cache line utilization and poor use of memory bandwidth. In this paper, we present a novel Partition-Centric Processing Methodology (PCPM) that drastically reduces the amount of communication with DRAM and achieves high memory bandwidth. Similar to the state-of-the-art Binning with Vertex-centric Gather-Apply-Scatter (BVGAS) method, PCPM performs partition-wise scatter and gather of updates, with both phases enjoying full cache line utilization. However, BVGAS suffers from random memory accesses and redundant reads and writes of update values from nodes to their neighbors. In contrast, PCPM propagates a single update from a source node to all of its destinations in a partition, thus reducing this redundancy. We make use of this characteristic to develop a novel bipartite Partition-Node Graph (PNG) data layout for PCPM that enables streaming memory accesses with very little generation overhead. We perform a detailed analysis of PCPM and provide theoretical bounds on the amount of communication and the number of random DRAM accesses. We experimentally evaluate our approach using 6 large graph datasets and demonstrate an average 2.7x speedup in execution time and 1.7x reduction in communication compared to the state of the art. We also show that, unlike the BVGAS implementation, PCPM is able to take advantage of intelligent node labeling that enhances locality in graphs, further reducing the amount of communication with DRAM. Although we use PageRank as the target application in this paper, our approach can be applied to generic SpMV computation.
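The difference between per-edge updates and per-partition updates described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_png`, `count_messages`, `pagerank_partition_centric`), the nested-dictionary stand-in for the PNG layout, the use of contiguous node-id ranges as partitions, and the damping factor of 0.85 are all assumptions made for illustration. Nodes without outgoing edges are not handled specially.

```python
from collections import defaultdict


def build_png(edges, partition_size):
    """Group edges by (source partition, destination partition, source node).

    Hypothetical stand-in for a partition-node layout: each inner key is one
    source node whose single update serves every destination listed for it.
    Partitions are assumed to be contiguous ranges of node ids.
    """
    png = defaultdict(lambda: defaultdict(list))
    for src, dst in edges:
        key = (src // partition_size, dst // partition_size)
        png[key][src].append(dst)
    return png


def count_messages(edges, partition_size):
    """Return (updates with one per edge, updates with one per source per partition)."""
    png = build_png(edges, partition_size)
    per_edge = len(edges)
    per_partition = sum(len(sources) for sources in png.values())
    return per_edge, per_partition


def pagerank_partition_centric(edges, num_nodes, partition_size,
                               damping=0.85, iters=20):
    """PageRank where each source emits one value per destination partition."""
    out_deg = [0] * num_nodes
    for src, _ in edges:
        out_deg[src] += 1
    png = build_png(edges, partition_size)
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iters):
        # Scatter: write one update per (source node, destination partition).
        bins = {
            key: [(rank[src] / out_deg[src], dsts) for src, dsts in sources.items()]
            for key, sources in png.items()
        }
        # Gather: apply each update to all of its destinations in that partition.
        acc = [0.0] * num_nodes
        for messages in bins.values():
            for value, dsts in messages:
                for dst in dsts:
                    acc[dst] += value
        rank = [(1.0 - damping) / num_nodes + damping * a for a in acc]
    return rank


if __name__ == "__main__":
    edges = [(0, 2), (0, 3), (1, 2), (2, 0), (3, 0), (3, 1)]
    print(count_messages(edges, partition_size=2))  # (6, 4)
    print(pagerank_partition_centric(edges, num_nodes=4, partition_size=2))
```

In this toy graph, nodes 0 and 3 each have two neighbors in the same destination partition, so six per-edge updates collapse to four per-partition updates. The saving in a real graph depends on how many of a node's neighbors share a partition, which is why node labeling that improves locality would be expected to help.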
