A Portable 3D FFT Package for Distributed-Memory Parallel Architectures

A parallel algorithm for 3D FFTs is implemented as a series of local 1D FFTs combined with data transposes. This structure allows the use of vendor-supplied (often fully optimized) sequential 1D FFTs. The FFTs are carried out in place by using an in-place data transpose across the processors.
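The decomposition described above can be illustrated with a minimal serial sketch. It is not the package's implementation: the function names (dft1d, rotate_axes, fft3d) are hypothetical, a naive 1D DFT stands in for the vendor-supplied 1D FFT, and an out-of-place axis rotation on nested lists stands in for the in-place transpose across processors.

```python
import cmath


def dft1d(x):
    """Naive O(n^2) 1D DFT; a stand-in for a vendor-supplied 1D FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def rotate_axes(a):
    """Return b with b[k][i][j] = a[i][j][k] (last axis moves to the front).

    Assumption: this serial, out-of-place rotation only mimics the role of
    the distributed in-place transpose; it does no communication.
    """
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [
        [[a[i][j][k] for j in range(n1)] for i in range(n0)]
        for k in range(n2)
    ]


def fft3d(a):
    """3D DFT as three passes of 1D transforms, each followed by a transpose."""
    for _ in range(3):
        # Transform every line along the last (locally contiguous) axis.
        a = [[dft1d(line) for line in plane] for plane in a]
        # Bring the next untransformed axis into the last position.
        a = rotate_axes(a)
    # Three rotations restore the original axis order.
    return a
```

Each pass transforms only along the last axis, so the 1D routine always sees contiguous lines; the transposes do all the data reordering. In a distributed setting the rotation would become an all-to-all exchange between processors, which this sketch does not model.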