Computer Generation of High Throughput and Memory Efficient Sorting Designs on FPGA

Accelerating sorting with dedicated hardware that fully utilizes the memory bandwidth has gained much interest in the research community for Big Data applications. Recently, parallel sorting networks have been widely employed in hardware implementations due to their high data parallelism and low control overhead. In this paper, we propose a systematic methodology for mapping large-scale bitonic sorting networks onto FPGAs. To realize data permutations in the sorting network, we develop a novel RAM-based design by vertically “folding” the classic Clos network. Using the proposed data permutation design, we develop a hardware generator that automatically builds bitonic sorting architectures on FPGAs. For a given input size, data width, and data parallelism, the hardware generator specializes both the datapath and the control unit for sorting and generates a design in a high-level hardware description language. We demonstrate trade-offs among throughput, latency, and area using two illustrative sorting designs: a high throughput design and a resource efficient design.
With a data parallelism of $p$ ($2 \leq p \leq N/2$), the high throughput design sorts an $N$-key sequence with latency $6N/p + o(N)$, throughput $p$ results per cycle and uses $6N + o(N)$ memory. This achieves optimal memory efficiency (defined as the ratio of throughput to the amount of on-chip memory used by the design) and outperforms the state-of-the-art. Experimental results show that the designs obtained by our proposed hardware generator achieve 49 to 112 percent improvement in energy efficiency and 56 to 430 percent higher memory efficiency compared with the state-of-the-art.
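
To make the quantities above concrete, the following is a minimal software sketch in Python, not the hardware generator described in the paper. It models the textbook bitonic sorting network as a list of parallel compare-exchange stages, and evaluates only the leading terms of the stated latency and memory bounds (the $o(N)$ terms are dropped, so the numbers are approximations). The function names `bitonic_stages`, `bitonic_sort`, and `leading_order_metrics` are hypothetical and chosen for illustration; the unit of "memory" is left as in the abstract.

```python
from fractions import Fraction


def bitonic_stages(n):
    """Return the compare-exchange stages of a bitonic sorting network.

    n must be a power of two. Each stage is a list of (i, j, ascending)
    triples; all comparators within one stage are independent of each other.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = []
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between the two compared keys
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate direction until the final merge.
                    stage.append((i, partner, (i & k) == 0))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def bitonic_sort(keys):
    """Sort a power-of-two-length sequence by applying the network stage by stage."""
    data = list(keys)
    for stage in bitonic_stages(len(data)):
        for i, j, ascending in stage:
            # Swap when the pair is out of order for this comparator's direction.
            if (data[i] > data[j]) == ascending:
                data[i], data[j] = data[j], data[i]
    return data


def leading_order_metrics(n, p):
    """Leading terms only (assumption: o(N) terms ignored) for the high throughput design."""
    if not 2 <= p <= n // 2:
        raise ValueError("p must satisfy 2 <= p <= N/2")
    memory = 6 * n
    return {
        "latency_cycles": Fraction(6 * n, p),      # 6N/p
        "throughput_per_cycle": p,                 # p results per cycle
        "memory": memory,                          # 6N
        "memory_efficiency": Fraction(p, memory),  # throughput / memory
    }
```

For example, `bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])` runs the 8-key network, which has `log2(8) * (log2(8) + 1) / 2 = 6` stages of 4 comparators each, and `leading_order_metrics(1024, 4)` gives the approximate latency and memory efficiency for $N = 1024$, $p = 4$.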
