Efficient Algorithms for Block-Cyclic Redistribution of Arrays

Abstract. The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper, we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation, wherein the destination locations for individual data blocks are calculated, and data communication, wherein these blocks are exchanged between processors. The framework treats both of these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct, indirect, and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K, the redistribution can be performed in O(log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations.
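To make the index set computation step concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes a one-dimensional array of n elements, a cyclic(b) layout in which block j is owned by processor j mod P, and a brute-force scan in place of the circulant matrix formalism. The function names `owner`, `local_index`, and `send_sets` are hypothetical. The sketch only shows that each processor can work out what it must send, and to whom, from the layout parameters alone, without any interprocessor communication.

```python
from collections import defaultdict


def owner(i, block, nprocs):
    """Processor that owns global element i under a cyclic(block) layout."""
    return (i // block) % nprocs


def local_index(i, block, nprocs):
    """Position of global element i inside its owner's local array."""
    return (i // (block * nprocs)) * block + i % block


def send_sets(rank, n, nprocs, src_block, dst_block):
    """Map each destination processor to the global indices `rank` sends it.

    Brute-force illustration (assumption): scans all n indices and uses
    only the layout parameters, so no communication is needed.
    """
    out = defaultdict(list)
    for i in range(n):
        if owner(i, src_block, nprocs) == rank:
            out[owner(i, dst_block, nprocs)].append(i)
    return dict(out)


if __name__ == "__main__":
    # Redistribute 16 elements on 2 processors from cyclic(1) to cyclic(2),
    # i.e. the block size grows by a factor K = 2.
    for rank in range(2):
        print(rank, send_sets(rank, 16, 2, 1, 2))
```

In this example, processor 0 holds the even-indexed elements under cyclic(1) and splits them evenly between the two destination processors under cyclic(2). A practical implementation would replace the linear scan with closed-form index arithmetic and would schedule the resulting transfers so that no processor receives from two senders in the same step.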
