Energy-efficient architecture for stride permutation on streaming data

Stride permutation is widely used in digital signal processing algorithms implemented on FPGAs. Permuting a long data sequence through hardware wiring leads to high area consumption and routing complexity. A preferable approach is to build a hardware structure that permutes streaming data inputs. In this paper, we present an energy-efficient architecture to perform stride permutation on streaming data. The supported problem size and stride are powers of two. A three-stage structure, composed of two stages of interconnection networks and one stage of data buffers, is used as a baseline architecture. To improve the energy efficiency, we develop a data remapping technique which reduces the required memory by 50% at the expense of a small amount of extra logic. We also present a multiplexer-based cyclic shift interconnection network. Our proposed architecture is evaluated using two performance metrics: composite Energy × Area × Time (EAT) and energy efficiency (defined as points/Joule). The experimental results show that the proposed data remapping technique reduces dynamic power consumption by up to 40% compared with the baseline architecture. The proposed architecture achieves a high energy efficiency of up to 75.3 giga points/Joule, and has an EAT ratio of 0.31 to 0.35 relative to the baseline architecture for streaming widths w (2 ≤ w ≤ 32).
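As a point of reference, the operation itself can be sketched in software. The abstract does not state an indexing convention, so the sketch below assumes the common one, in which output position i of an N-point stride-s permutation reads input position (i·s mod N) + ⌊i·s / N⌋. The function names `stride_permutation` and `stream_frames` are hypothetical, and the code models only the input/output behavior, not the proposed three-stage hardware or the data remapping technique.

```python
def is_power_of_two(value):
    """Return True for 1, 2, 4, 8, ..."""
    return value > 0 and value & (value - 1) == 0


def stride_permutation(data, stride):
    """Reorder `data` by the given stride (assumed convention, see lead-in).

    Output position i takes the input at (i * stride) mod n + (i * stride) // n,
    which is equivalent to reading an (n / stride) x stride matrix column by column.
    """
    n = len(data)
    # The abstract restricts problem size and stride to powers of two.
    if not (is_power_of_two(n) and is_power_of_two(stride)) or stride > n:
        raise ValueError("size and stride must be powers of two, stride <= size")
    return [data[(i * stride) % n + (i * stride) // n] for i in range(n)]


def stream_frames(data, width):
    """Split a sequence into frames of `width` elements, one frame per cycle."""
    return [data[start:start + width] for start in range(0, len(data), width)]


if __name__ == "__main__":
    permuted = stride_permutation(list(range(8)), 2)
    print(permuted)                    # [0, 2, 4, 6, 1, 3, 5, 7]
    print(stream_frames(permuted, 2))  # [[0, 2], [4, 6], [1, 3], [5, 7]]
```

With a streaming width of w = 2, the frames show what a streaming permutation unit must emit on each cycle: elements that arrive in different input cycles (for example 0 and 2) have to leave together, which is why buffering between the interconnection stages is needed.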
