High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform

Accelerating sorting with FPGAs has recently attracted growing interest in both industry and academia. However, FPGA-only sorting designs usually support only small data sets because on-chip memory is limited. In this paper, we propose a design that speeds up large-scale sorting on a CPU-FPGA heterogeneous platform. We first optimize a fully pipelined, merge-sort-based accelerator and then instantiate several such accelerators that work in parallel on the FPGA. The partial results from the FPGA are then merged on the CPU. On the Intel QuickAssist QPI FPGA Platform, across a range of data set sizes, we improve throughput by 2.9× and 1.9× compared with CPU-only and FPGA-only baselines, respectively. Compared with the state-of-the-art FPGA implementation for sorting, our design achieves a 2.3× throughput improvement.
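The division of labor described above (sorted partial results produced by parallel accelerators, followed by a final merge on the CPU) can be illustrated in software. The following is only a minimal sketch, not the design from the paper: the names `sort_chunk` and `hybrid_sort` and the `chunk_size` parameter are hypothetical, and Python's built-in `sorted` stands in for an FPGA merge-sort accelerator.

```python
import heapq
from typing import List


def sort_chunk(chunk: List[int]) -> List[int]:
    """Hypothetical stand-in for one FPGA merge-sort accelerator.

    Assumption: each accelerator returns a fully sorted run for the
    chunk it receives. Here the built-in sort plays that role.
    """
    return sorted(chunk)


def hybrid_sort(data: List[int], chunk_size: int) -> List[int]:
    """Sort `data` by producing sorted runs, then merging them.

    `chunk_size` is an assumed parameter modeling how much data a
    single accelerator can handle at once.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Phase 1 (FPGA in the paper's setting): sort fixed-size chunks
    # independently. In hardware these would run in parallel.
    runs = [
        sort_chunk(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]

    # Phase 2 (CPU): k-way merge of the sorted runs.
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    print(hybrid_sort([5, 3, 8, 1, 9, 2, 7], chunk_size=3))
```

The sketch sorts the chunks sequentially for simplicity; it shows only the data flow between the two phases and says nothing about the throughput figures reported above.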
