Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA

Matrix Factorization (MF) based on Stochastic Gradient Descent (SGD) is a powerful machine learning technique for deriving the hidden features of objects from observations. In this article, we design a highly parallel architecture based on a Field-Programmable Gate Array (FPGA) to accelerate the training process of the SGD-based MF algorithm. We identify the challenges for the acceleration and propose novel algorithmic optimizations to overcome them. By transforming SGD-based MF into a bipartite graph processing problem, we propose a 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of edges to achieve significant speedup. First, we develop a fast heuristic graph partitioning approach that divides the bipartite graph into induced subgraphs; this enables efficient use of the on-chip memory resources of the FPGA for data reuse and completely hides the data communication between the FPGA and external memory. Second, we partition all the edges of each subgraph into non-overlapping matchings to extract the maximum parallelism. Third, we propose a batching algorithm that schedules the execution of the edges inside each matching to reduce memory access conflicts on the on-chip RAMs of the FPGA. Compared with non-optimized FPGA-based baseline designs, the proposed optimizations achieve up to a 60× reduction in data dependencies, a 4.2× reduction in bank conflicts, and a 15.4× speedup. We evaluate the performance of our design on a state-of-the-art FPGA device. Experimental results show that our FPGA accelerator sustains a computing throughput of up to 217 GFLOPS (billion floating-point operations per second) when training very large real-life sparse matrices. Compared with highly optimized GPU-based accelerators, our FPGA accelerator achieves up to a 12.7× speedup. Based on our optimization methodology, we also implement a software-based design on a multi-core platform, which achieves a 1.3× speedup over the state-of-the-art multi-core implementation.
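To make the bipartite-graph view concrete, below is a minimal Python sketch of the two core ideas the abstract describes: treating each observed rating as an edge whose SGD update touches the latent vectors of its two endpoints, and partitioning edges into vertex-disjoint matchings so that the updates within one matching are conflict-free. The names (P, Q, lr, lam) and the greedy matching heuristic are illustrative assumptions, not the paper's exact FPGA partitioning or scheduling algorithm.

```python
import numpy as np

def sgd_edge_update(P, Q, u, i, r, lr=0.01, lam=0.05):
    """Process one rating: edge (u, i) with weight r in the bipartite graph.
    A gradient step updates the latent vectors of both endpoints."""
    err = r - P[u] @ Q[i]               # prediction error for this edge
    pu = P[u].copy()                    # snapshot so both updates use the old P[u]
    P[u] += lr * (err * Q[i] - lam * P[u])
    Q[i] += lr * (err * pu - lam * Q[i])

def partition_into_matchings(edges):
    """Greedily split edges into matchings: no two edges in one matching
    share a user or an item, so their updates write disjoint rows of
    P and Q and can execute in parallel without conflicts."""
    matchings, remaining = [], list(edges)
    while remaining:
        used_u, used_i, matching, rest = set(), set(), [], []
        for (u, i, r) in remaining:
            if u not in used_u and i not in used_i:
                matching.append((u, i, r))
                used_u.add(u)
                used_i.add(i)
            else:
                rest.append((u, i, r))
        matchings.append(matching)
        remaining = rest
    return matchings

# Toy usage: 3 users x 3 items, a handful of observed ratings (edges).
rng = np.random.default_rng(0)
k = 4                                            # latent feature dimension
P = rng.normal(scale=0.1, size=(3, k))           # user feature matrix
Q = rng.normal(scale=0.1, size=(3, k))           # item feature matrix
edges = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 1.0)]

for epoch in range(10):
    for matching in partition_into_matchings(edges):
        for (u, i, r) in matching:               # conflict-free: parallelizable in hardware
            sgd_edge_update(P, Q, u, i, r)
```

Within one matching, no two edges touch the same row of P or Q, which is what lets the hardware issue them to parallel processing units without read-after-write hazards; the batching step described in the abstract then orders the edges inside each matching to further reduce bank conflicts on the on-chip RAMs.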
