Dynamic data layouts for cache-conscious implementation of a class of signal transforms

Effective utilization of cache memory is a key factor in achieving high performance when computing large signal transforms. Nonunit-stride access in these computations causes poor cache performance, which severely degrades overall performance. In this paper, we develop a cache-conscious technique, called a dynamic data layout, to improve the performance of large signal transforms. In our approach, data is reorganized between computation stages to reduce cache misses. We develop an efficient search algorithm that, given the size of the signal transform and the data access stride, determines the factorization tree with the minimum execution time among the possible factorization trees. We apply our approach to the fast Fourier transform (FFT) and the Walsh-Hadamard transform (WHT). Experiments were performed on the Alpha 21264, MIPS R10000, UltraSPARC III, and Pentium 4. The results show that our FFT and WHT achieve a performance improvement of up to 3.52 times over other state-of-the-art FFT and WHT packages. The proposed optimization is portable across platforms.
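
The idea of reorganizing data between computation stages can be illustrated with a minimal sketch, which is not the implementation evaluated in the paper. It assumes a WHT of size n1*n2 factored as (WHT_n1 kron I_n2)(I_n1 kron WHT_n2) and uses a plain transposition as the reorganization step, so that the second stage reads contiguous data instead of data at stride n2. The function names `wht_small`, `transpose`, and `wht_ddl` are hypothetical, and pure Python will not show the cache effect itself; the sketch only shows the restructuring and that the result is unchanged.

```python
def wht_small(x):
    """Iterative WHT (natural Hadamard order) of a contiguous list.

    The length of x is assumed to be a power of two.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


def transpose(x, rows, cols):
    """Reorganize a row-major rows x cols array into row-major cols x rows."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def wht_ddl(x, n1, n2):
    """WHT of size n1*n2 with a data reorganization between the two stages.

    Assumed factorization: WHT_(n1*n2) = (WHT_n1 kron I_n2)(I_n1 kron WHT_n2).
    """
    assert len(x) == n1 * n2
    # Stage 1: I_n1 kron WHT_n2 works on n1 contiguous blocks of length n2.
    y = []
    for r in range(n1):
        y.extend(wht_small(x[r * n2:(r + 1) * n2]))
    # Reorganization: elements that were n2 apart become adjacent.
    y = transpose(y, n1, n2)
    # Stage 2: WHT_n1 kron I_n2 now also works on contiguous blocks.
    z = []
    for c in range(n2):
        z.extend(wht_small(y[c * n1:(c + 1) * n1]))
    # Restore the natural output order.
    return transpose(z, n2, n1)


if __name__ == "__main__":
    data = list(range(16))
    # Both computations should agree regardless of how 16 is factored.
    print(wht_ddl(data, 4, 4) == wht_small(data))
```

In this sketch the split (n1, n2) is chosen by hand. Choosing it, and deciding at which nodes of the factorization tree a reorganization pays for itself, is the role of the search algorithm described above.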
