Optimal dynamic data layouts for 2D FFT on 3D memory integrated FPGA

FPGAs have been widely used for accelerating various applications. For many data intensive applications, the memory bandwidth limits the performance. 3D memories with through-silicon-via connections provide potential solutions to the latency and bandwidth limitations. In this paper, we revisit the classic 2D FFT problem to evaluate the performance of 3D memory integrated FPGA. To fully utilize the fine-grained parallelism in 3D memory, data layouts which take into account the structure and organization of the memory are required. We propose dynamic data layouts for optimizing the performance of the 3D architecture. In 2D FFT, data are accessed in row major order in the first phase, whereas the data are accessed in column major order in the second phase. This column major order results in high memory latency and low bandwidth due to high row activation overhead of memory. Using the proposed dynamic data layouts, we improve memory access performance in the second phase without degrading the performance of the first phase. With parallelism employed in the third dimension of the memory, data parallelism can be increased to further improve the performance. We adopt a model-based approach for 3D memory and we perform experiments on the FPGA to validate our analysis and evaluate the performance. Compared with the baseline architecture, our approach achieves up to 40× peak memory bandwidth utilization for columnwise FFT, thus resulting in approximately 97% improvement in throughput for the complete 2D FFT application.
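The row activation overhead described above can be illustrated with a small counting model. The sketch below is not the layout proposed in the paper. It assumes a single memory bank with an open-page policy, a hypothetical row buffer holding `row_size` matrix elements, and a generic square-tiled layout in which each `b × b` tile fills exactly one memory row. All function names and parameter values are illustrative assumptions.

```python
def count_row_activations(addresses, row_size):
    """Count row activations for an address stream, assuming one bank with
    an open-page policy: a new activation occurs whenever the row changes."""
    activations = 0
    open_row = None
    for addr in addresses:
        row = addr // row_size
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def row_major_addr(i, j, n):
    """Address of element (i, j) of an n x n matrix stored in row-major order."""
    return i * n + j


def tiled_addr(i, j, n, b):
    """Address of element (i, j) when the matrix is stored as contiguous
    b x b tiles (assumed layout for illustration; n must be a multiple of b)."""
    tiles_per_row = n // b
    tile = (i // b) * tiles_per_row + (j // b)
    return tile * b * b + (i % b) * b + (j % b)


def row_wise(n):
    """Access order of the first phase: one matrix row after another."""
    return ((i, j) for i in range(n) for j in range(n))


def column_wise(n):
    """Access order of the second phase: one matrix column after another."""
    return ((i, j) for j in range(n) for i in range(n))


def activations(order, addr_fn, n, row_size):
    """Row activations incurred by an access order under an address mapping."""
    return count_row_activations((addr_fn(i, j) for i, j in order(n)), row_size)


if __name__ == "__main__":
    n, row_size, b = 64, 64, 8  # assumed sizes; b * b == row_size

    def plain(i, j):
        return row_major_addr(i, j, n)

    def tiled(i, j):
        return tiled_addr(i, j, n, b)

    print("row-major layout, row-wise:   ", activations(row_wise, plain, n, row_size))
    print("row-major layout, column-wise:", activations(column_wise, plain, n, row_size))
    print("tiled layout, row-wise:       ", activations(row_wise, tiled, n, row_size))
    print("tiled layout, column-wise:    ", activations(column_wise, tiled, n, row_size))
```

Under these assumptions, column-wise access to a row-major layout activates a new memory row on every access, while the tiled layout gives both access orders the same, much lower, activation count. A plain tiled layout does raise the cost of the row-wise phase relative to row-major storage, which is the trade-off the paper's dynamic data layouts aim to avoid.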
