A hybrid design for high performance large-scale sorting on FPGA

Sorting is a key kernel in numerous big data applications, including database operations, graphs and text analytics. Due to low control overhead, parallel bitonic sorting networks are usually employed for hardware implementations to accelerate sorting. Although a typical implementation of a merge sort network can lead to low latency and small memory usage, it suffers from low throughput due to the lack of parallelism in the final stage. We analyze a pipelined merge sort network, showing its theoretical limits in terms of latency, memory, and throughput. To increase the throughput, we propose a merge sort based hybrid design in which the final few stages of the merge sort network are replaced with “folded” bitonic merge networks. In these “folded” networks, all the interconnection patterns are realized by streaming permutation networks (SPN). We present a theoretical analysis to quantify the latency, memory, and throughput of our proposed design. We evaluate performance experimentally on a Xilinx Virtex-7 FPGA using post place-and-route results. We demonstrate that our implementation achieves a throughput close to 10 GBps, outperforming the state-of-the-art implementation of sorting on the same hardware by 1.2x, while preserving lower latency and higher memory efficiency.
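The hybrid idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design evaluated in the paper: the function names (`merge_two`, `bitonic_merge`, `hybrid_sort`) and the `bitonic_stages` parameter are hypothetical, the input length is assumed to be a power of two, and neither the folding of the bitonic networks nor the streaming permutation networks are modeled. The early stages use an ordinary sequential two-way merge, and the final few stages use a bitonic merge network, whose compare-exchange operations within a stage are independent and could therefore run in parallel in hardware.

```python
def merge_two(a, b):
    """Sequential two-way merge: emits one element per step."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def bitonic_merge(a, b):
    """Merge two ascending runs of equal power-of-two length
    using the compare-exchange pattern of a bitonic merge network."""
    n = len(a)
    assert n == len(b) and n > 0 and n & (n - 1) == 0
    # Ascending run followed by a reversed run forms a bitonic sequence.
    data = list(a) + list(b)[::-1]
    dist = n
    while dist >= 1:
        # All compare-exchanges in this pass touch disjoint pairs,
        # so a hardware network could perform them concurrently.
        for i in range(len(data)):
            if i & dist == 0:
                j = i + dist
                if data[i] > data[j]:
                    data[i], data[j] = data[j], data[i]
        dist //= 2
    return data


def hybrid_sort(values, bitonic_stages=2):
    """Merge sort whose last `bitonic_stages` stages use bitonic merges.
    Assumes len(values) is a power of two (illustrative restriction)."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0
    runs = [[v] for v in values]
    total_stages = n.bit_length() - 1  # log2(n) merge stages
    for stage in range(total_stages):
        use_bitonic = stage >= total_stages - bitonic_stages
        merge = bitonic_merge if use_bitonic else merge_two
        runs = [merge(runs[k], runs[k + 1]) for k in range(0, len(runs), 2)]
    return runs[0]
```

For example, `hybrid_sort([5, 3, 8, 1, 9, 2, 7, 4], bitonic_stages=1)` runs two sequential merge stages and then a single bitonic merge of the two remaining runs of length four. The model only checks functional behavior; it says nothing about the latency, memory, or throughput figures reported above.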
