Dynamic data layouts for cache-conscious factorization of DFT

Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low level platform specific details, while keeping the data layout in memory static. In this paper we propose a high level optimization technique, dynamic data layout (DDL). In DDL, data reorganization is performed between computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III.

[1]  David H. Bailey Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015) , 1995, Sci. Program..

[2]  David H. Bailey Unfavorable strides in cache memory systems , 1992 .

[3]  Michael E. Wolf,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[4]  Monica S. Lam,et al.  Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..

[5]  Kevin R. Wadleigh,et al.  High Performance FFT Algorithms for Cache-Coherent Multiprocessors , 1999, Int. J. High Perform. Comput. Appl..

[6]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[7]  V. K. Prasanna,et al.  Utilizing the power of high-performance computing , 1998 .

[8]  Viktor K. Prasanna,et al.  Fast parallel implementation of DFT using configurable devices , 1997, FPL.

[9]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[10]  David H. Bailey,et al.  FFTs in external or hierarchical memory , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[11]  Margaret Martonosi,et al.  Characterizing the Memory Behavior of Compiler-Parallelized Applications , 1996, IEEE Trans. Parallel Distributed Syst..

[12]  Viktor K. Prasanna,et al.  Parallel implementation of synthetic aperture radar on high performance computing platforms , 1997, Proceedings of 3rd International Conference on Algorithms and Architectures for Parallel Processing.

[13]  Mithuna Thottethodi,et al.  Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.