On-chip memory efficient data layout for 2D FFT on 3D memory integrated FPGA

3D memories are becoming a viable solution to the memory wall problem and to the bandwidth requirements of memory-intensive applications. However, the high bandwidth provided by 3D memories does not translate into a proportional increase in performance for all applications. For an application such as 2D FFT, which has strided access patterns, the data layout in memory has a significant impact on the total execution time of the implementation. In this paper, we present a data layout for 2D FFT on 3D memory integrated FPGA that is both on-chip memory efficient and throughput-optimal. Our data layout ensures that consecutive accesses to 3D memory are sufficiently interleaved among layers and vaults to absorb the latency due to activation overheads for both sequential (Row FFT) and strided (Column FFT) accesses. For an N × N FFT problem size and c columns in a 3D memory bank row, the current state-of-the-art implementation on 3D memory requires O(√c · N) on-chip memory to reduce the strided accesses and achieve maximum bandwidth. Our proposed data layout optimizes the throughput of both the Row FFT and Column FFT phases of 2D FFT with O(N) on-chip memory for the same problem size and memory parameters, without decreasing the memory bandwidth, thereby achieving a √c× reduction in on-chip memory. On architectures with limited on-chip memory, our data layout achieves a 2× to 4× improvement in execution time compared with the state-of-the-art 2D FFT implementation on 3D memory.
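
To illustrate why the layout matters for strided accesses, the following minimal Python sketch counts how many distinct vaults a run of consecutive accesses touches under two element-to-vault mappings. It is not the layout proposed in this paper: the function names, the element-granularity interleaving, and the diagonal skew `(r + c) % vaults` are assumptions chosen only to show the effect on a toy model.

```python
# Toy model (an assumption, not the paper's layout): each matrix element
# maps to exactly one vault, and a run of accesses proceeds in parallel
# only if it spreads across many vaults.

def row_major_vault(r, c, n, vaults):
    """Row-major addressing with element-granularity vault interleaving."""
    return (r * n + c) % vaults

def skewed_vault(r, c, n, vaults):
    """Hypothetical diagonal skew: the vault index shifts by one per row."""
    return (r + c) % vaults

def distinct_vaults(mapping, coords, n, vaults):
    """Number of different vaults touched by a sequence of (row, col) accesses."""
    return len({mapping(r, c, n, vaults) for r, c in coords})

N, VAULTS = 1024, 16                              # illustrative sizes only
row_run = [(0, c) for c in range(VAULTS)]         # sequential (Row FFT) accesses
col_run = [(r, 0) for r in range(VAULTS)]         # strided (Column FFT) accesses

for name, mapping in (("row-major", row_major_vault), ("skewed", skewed_vault)):
    rows = distinct_vaults(mapping, row_run, N, VAULTS)
    cols = distinct_vaults(mapping, col_run, N, VAULTS)
    print(f"{name}: row run hits {rows} vaults, column run hits {cols} vaults")
```

In this toy model, when N is a multiple of the vault count, the row-major mapping sends every element of a column to the same vault, whereas the skewed mapping spreads both row and column runs across all vaults. A real 3D memory layout must also account for layers, banks, and row activation timing, which this sketch ignores.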
