Efficient Algorithms for Block-Cyclic Array Redistribution between Processor Sets

Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors. The algorithm reduces the overall communication time by accounting for the data transfer, communication schedule, and index computation costs. It is based on a generalized circulant matrix formalism. The schedule it generates minimizes the number of communication steps and eliminates node contention in each step. Network bandwidth is fully utilized because equal-sized messages are transferred in each communication step. Furthermore, the procedure that computes the schedule and the index sets is fast: it takes O(max(P, Q)) time. The proposed algorithm is therefore suitable for run-time array redistribution. To evaluate the performance of our scheme, we implemented the algorithm in C with MPI and ran experiments on the IBM SP2. The experimental results show that the proposed algorithm outperforms well-known algorithms when the total communication time, including the data transfer, schedule computation, and index computation times, is considered.
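
To make the cyclic(x) on P to cyclic(Kx) on Q mapping concrete, the following C sketch computes which processor owns a global array element under a block-cyclic distribution and counts, by brute force, how many elements each source processor sends to each destination processor within one repeating period. This is only an illustration of the data movement pattern, not the circulant-matrix scheduling algorithm of the paper. The function names (bc_owner, bc_local_index, bc_count_period) and the row-major layout of the counts array are assumptions introduced here.

```c
#include <assert.h>

/* Owner of global element i under cyclic(b) on nprocs processors:
 * blocks of b consecutive elements are dealt round-robin. */
static int bc_owner(long i, long b, int nprocs) {
    assert(i >= 0 && b > 0 && nprocs > 0);
    return (int)((i / b) % nprocs);
}

/* Position of global element i in its owner's local array. */
static long bc_local_index(long i, long b, int nprocs) {
    assert(i >= 0 && b > 0 && nprocs > 0);
    return (i / (b * nprocs)) * b + i % b;
}

static long gcd_long(long a, long b) {
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Brute-force count of elements sent from each source processor p
 * (cyclic(x) on P) to each destination processor q (cyclic(K*x) on Q)
 * over one period of lcm(P*x, Q*K*x) global elements, after which the
 * pattern repeats. counts must hold P*Q entries; counts[p*Q + q] is
 * the number of elements p sends to q. Returns the period length. */
static long bc_count_period(long x, int P, long K, int Q, long *counts) {
    assert(x > 0 && K > 0 && P > 0 && Q > 0 && counts != 0);
    long src_span = x * P;      /* elements in one source round */
    long dst_span = K * x * Q;  /* elements in one destination round */
    long period = src_span / gcd_long(src_span, dst_span) * dst_span;

    for (int k = 0; k < P * Q; ++k) {
        counts[k] = 0;
    }
    for (long i = 0; i < period; ++i) {
        int p = bc_owner(i, x, P);
        int q = bc_owner(i, K * x, Q);
        counts[p * Q + q] += 1;
    }
    return period;
}
```

For example, with x = 2, P = 2, K = 2, Q = 2 the period is 8 elements and every source/destination pair exchanges 2 elements, a small instance of the equal-sized messages the abstract refers to. The loop costs time proportional to the period, so it is useful for checking a schedule on small cases rather than for run-time use.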
