Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA

Matrix Factorization (MF) based on Stochastic Gradient Descent (SGD) is a powerful machine learning technique for deriving the hidden features of objects from observations. In this article, we design a highly parallel architecture based on a Field-Programmable Gate Array (FPGA) to accelerate the training process of the SGD-based MF algorithm. We identify the challenges for the acceleration and propose novel algorithmic optimizations to overcome them. By transforming SGD-based MF into a bipartite graph processing problem, we propose a 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of edges to achieve significant speedup. First, we develop a fast heuristic graph partitioning approach that divides the bipartite graph into induced subgraphs; this enables efficient use of the on-chip memory resources of the FPGA for data reuse and completely hides the data communication between the FPGA and external memory. Second, we partition all the edges of each subgraph into non-overlapping matchings to extract the maximum parallelism. Third, we propose a batching algorithm that schedules the execution of the edges inside each matching to reduce memory access conflicts on the on-chip RAMs of the FPGA. Compared with non-optimized FPGA-based baseline designs, the proposed optimizations achieve up to a 60× reduction in data dependencies, a 4.2× reduction in bank conflicts, and a 15.4× speedup. We evaluate the performance of our design on a state-of-the-art FPGA device. Experimental results show that our FPGA accelerator sustains a computing throughput of up to 217 GFLOPS (billion floating-point operations per second) when training very large real-life sparse matrices. Compared with highly optimized GPU-based accelerators, our FPGA accelerator achieves up to a 12.7× speedup. Based on our optimization methodology, we also implement a software-based design on a multi-core platform, which achieves a 1.3× speedup over the state-of-the-art multi-core implementation.
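To make the bipartite-graph view concrete, below is a minimal Python sketch of the two core ideas the abstract describes: treating each observed rating as an edge whose SGD update touches the latent vectors of its two endpoints, and partitioning edges into vertex-disjoint matchings so that the updates within one matching are conflict-free. The names (P, Q, lr, lam) and the greedy matching heuristic are illustrative assumptions, not the paper's exact FPGA partitioning or scheduling algorithm.

```python
import numpy as np

def sgd_edge_update(P, Q, u, i, r, lr=0.01, lam=0.05):
    """Process one rating: edge (u, i) with weight r in the bipartite graph.
    A gradient step updates the latent vectors of both endpoints."""
    err = r - P[u] @ Q[i]               # prediction error for this edge
    pu = P[u].copy()                    # snapshot so both updates use the old P[u]
    P[u] += lr * (err * Q[i] - lam * P[u])
    Q[i] += lr * (err * pu - lam * Q[i])

def partition_into_matchings(edges):
    """Greedily split edges into matchings: no two edges in one matching
    share a user or an item, so their updates write disjoint rows of
    P and Q and can execute in parallel without conflicts."""
    matchings, remaining = [], list(edges)
    while remaining:
        used_u, used_i, matching, rest = set(), set(), [], []
        for (u, i, r) in remaining:
            if u not in used_u and i not in used_i:
                matching.append((u, i, r))
                used_u.add(u)
                used_i.add(i)
            else:
                rest.append((u, i, r))
        matchings.append(matching)
        remaining = rest
    return matchings

# Toy usage: 3 users x 3 items, a handful of observed ratings (edges).
rng = np.random.default_rng(0)
k = 4                                            # latent feature dimension
P = rng.normal(scale=0.1, size=(3, k))           # user feature matrix
Q = rng.normal(scale=0.1, size=(3, k))           # item feature matrix
edges = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 1.0)]

for epoch in range(10):
    for matching in partition_into_matchings(edges):
        for (u, i, r) in matching:               # conflict-free: parallelizable in hardware
            sgd_edge_update(P, Q, u, i, r)
```

Within one matching, no two edges touch the same row of P or Q, which is what lets the hardware issue them to parallel processing units without read-after-write hazards; the batching step described in the abstract then orders the edges inside each matching to further reduce bank conflicts on the on-chip RAMs.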
