Efficient algorithms for block-cyclic redistribution of arrays

We present new algorithmic techniques for a classical problem: runtime redistribution of an array from one block-cyclic layout to another. Our methodology for reducing communication overhead is based on a generalized circulant matrix formalism. Using this formalism, we derive direct, indirect, and hybrid communication schedules for the cyclic redistribution problem when the block size changes by an integer factor K. We have also developed formulae to estimate the execution time of each of these schedules for a given parallel machine and redistribution problem. In our indirect communication schedule, blocks are moved from a source processor to a destination processor through intermediate "relay" processors. This reduces the number of communication steps by an order of magnitude in comparison with previous approaches. The indirect schedule performs cyclic(x) to cyclic(Kx) redistribution on P processors in [log_2 K] + 2 steps. Implementations of these algorithms on the Cray T3D and the IBM SP-2 show superior performance over previous approaches. Because our algorithms are implemented using MPI, they can be easily ported to different application environments. Our techniques can be used in the design of scalable redistribution libraries, in efficient implementations of the REDISTRIBUTE directive of HPF, and in developing parallel algorithms for various HPC applications.
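To make the problem setup concrete, the following is a minimal Python sketch of the cyclic(x) to cyclic(Kx) redistribution itself, not of the direct, indirect, or hybrid schedules derived in the paper. It assumes the standard block-cyclic ownership rule, in which global element i of an array laid out as cyclic(b) on P processors belongs to processor (i div b) mod P. The function names `owner` and `direct_traffic` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def owner(i, block, nprocs):
    """Processor that owns global element i under a cyclic(block) layout.

    Assumes the standard block-cyclic rule: blocks of `block` consecutive
    elements are dealt out to processors 0..nprocs-1 in round-robin order.
    """
    return (i // block) % nprocs


def direct_traffic(n, x, k, nprocs):
    """Count the elements each source processor must deliver to each
    destination processor when an n-element array is redistributed
    from cyclic(x) to cyclic(k * x).

    Returns a dict mapping (source, destination) -> element count.
    """
    traffic = defaultdict(int)
    for i in range(n):
        src = owner(i, x, nprocs)
        dst = owner(i, k * x, nprocs)
        traffic[(src, dst)] += 1
    return dict(traffic)


if __name__ == "__main__":
    # Illustrative parameters only: 16 elements, cyclic(1) -> cyclic(2), P = 4.
    table = direct_traffic(n=16, x=1, k=2, nprocs=4)
    for (src, dst), count in sorted(table.items()):
        print(f"P{src} -> P{dst}: {count} elements")
```

The resulting table records which processor pairs exchange data and how much. A communication schedule, such as those discussed in the abstract, then decides in which steps and along which routes that data actually moves.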
