Accelerating low rank matrix completion on FPGA

Low Rank Matrix Completion (LRMC) is widely used in the analysis of incomplete datasets. In this paper, we propose a novel FPGA-based accelerator to speed up a matrix-factorization-based LRMC algorithm that uses stochastic gradient descent (SGD). The accelerator is a multi-pipelined architecture in which parallel pipelines process distinct data from a shared on-chip buffer. We propose two on-chip buffer architectures based on a design-space exploration of the performance tradeoffs between two competing design goals: memory efficiency and concurrent conflict-free access. The first (memory-efficient) design organizes the buffer into banks and maximally utilizes the available on-chip memory for matrix-chunk processing without requiring complex address-translation tables; however, it can incur bank conflicts when concurrent accesses target the same bank. The second (bank-conflict-free) design exploits parallel multi-port memory access and eliminates bank conflicts entirely by duplicating the stored data, at the cost of much higher on-chip RAM consumption. Intuitively, the first design enables slower acceleration of larger chunks of the input matrix, whereas the second enables faster processing of smaller chunks but requires more iterations to cover the complete matrix. We also propose a simple but efficient partitioning approach to support large input matrices that do not fit in the on-chip memory of the FPGA, and we develop matching-based algorithmic optimizations that reduce data dependencies across parallel pipelines. We implement our designs on a state-of-the-art UltraScale+ FPGA device, evaluate them on real-life datasets, and compare the two designs while varying the number of pipelines. The data-dependency optimization reduces data dependencies by at least 21.6× and improves execution time by up to 66.3× compared with non-optimized baseline designs.
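The matching-based dependency reduction can be illustrated with a simple greedy scheduler: observed entries are grouped into batches so that no two entries in a batch share a row or a column index, which means parallel pipelines in a batch never update the same factor vector. This is an illustrative sketch of the idea, not the paper's exact algorithm; all names are ours.

```python
def conflict_free_batches(entries):
    """Greedily group observed entries (row, col, value) into batches in
    which no two entries share a row or a column. Entries within one
    batch can then be dispatched to parallel pipelines without data
    dependencies on the shared factor buffers.

    Illustrative sketch of matching-based scheduling, not the paper's
    actual algorithm.
    """
    batches = []   # batches[b] is a list of entries
    used = []      # used[b] = (rows seen in batch b, cols seen in batch b)
    for i, j, r in entries:
        for b, (rows, cols) in enumerate(used):
            if i not in rows and j not in cols:   # no dependency in batch b
                batches[b].append((i, j, r))
                rows.add(i)
                cols.add(j)
                break
        else:                                     # every batch conflicts: open a new one
            batches.append([(i, j, r)])
            used.append(({i}, {j}))
    return batches
```

Within each returned batch, every row index and every column index appears at most once, so the per-entry SGD updates in that batch are mutually independent.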
The memory-efficient design also proves more scalable than the bank-conflict-free design. Compared with a state-of-the-art multi-core implementation and a GPU implementation, the bank-conflict-free design achieves 5.4× and 5.2× speedup, respectively, while the memory-efficient design achieves 16.7× and 16.2× speedup, respectively.
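For reference, the per-entry SGD update at the heart of matrix-factorization-based LRMC, which each accelerator pipeline executes, can be sketched as follows. This is a minimal NumPy sketch under standard assumptions; the function name, learning rate, and regularization constant are illustrative, not taken from the paper.

```python
import numpy as np

def sgd_mf_epoch(entries, U, V, lr=0.02, reg=0.001):
    """One SGD epoch over the observed entries of a partially known matrix.

    entries: list of (row, col, value) observations.
    U: (m, k) row-factor matrix; V: (n, k) column-factor matrix, so the
    completed matrix is approximated by U @ V.T.

    Illustrative sketch; hyperparameter values are assumptions.
    """
    for i, j, r in entries:
        err = r - U[i] @ V[j]        # prediction error for entry (i, j)
        u_old = U[i].copy()          # keep the old row factor for V's update
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])
    return U, V
```

Each update touches only one row of U and one row of V, which is what makes the conflict-free batching of independent entries across parallel pipelines possible.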
