Portable and scalable algorithms for irregular all-to-all communication

In this paper we develop portable and scalable algorithms for irregular all-to-all communication in High Performance Computing (HPC) systems. To minimize communication latency, the algorithms reduce the total number of messages transmitted, reduce the variance of the message lengths, and overlap communication with computation. Their performance is characterized using a simple model of HPC systems. Our implementations use the Message Passing Interface (MPI) standard and can be ported to various HPC platforms. We evaluate the algorithms on the CM5, T3D, and SP2. The results show the effectiveness of the techniques as well as the interplay between architectural features, machine size, and the variance of message lengths. The lessons from our study can be applied to other HPC systems to optimize the performance of collective communication operations.
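To illustrate why the variance of message lengths matters, the sketch below estimates the completion time of a stepwise all-to-all exchange. It is not the model used in the paper. It assumes a generic linear cost of `alpha + beta * m` per message of `m` bytes, and a hypothetical schedule in which processor `i` sends to processor `(i + k) % p` in step `k` and every step lasts as long as its longest message. The function name, the parameters `alpha` and `beta`, and the example matrices are all illustrative assumptions.

```python
def step_schedule_time(lengths, alpha, beta):
    """Estimate the completion time of a (p - 1)-step all-to-all exchange.

    lengths[i][j] is the number of bytes processor i sends to processor j.
    Assumed cost model (not taken from the paper): one message of m bytes
    costs alpha + beta * m, and a step ends when its longest message ends.
    """
    p = len(lengths)
    total = 0.0
    for k in range(1, p):
        # In step k, processor i sends to processor (i + k) % p.
        longest = max(lengths[i][(i + k) % p] for i in range(p))
        total += alpha + beta * longest
    return total


# Four processors, every pair exchanges 100 bytes.
uniform = [[0 if i == j else 100 for j in range(4)] for i in range(4)]

# Same total volume out of processor 0 (300 bytes), but unevenly split.
skewed = [row[:] for row in uniform]
skewed[0][1], skewed[0][2], skewed[0][3] = 250, 25, 25

print(step_schedule_time(uniform, alpha=10.0, beta=1.0))  # 330.0
print(step_schedule_time(skewed, alpha=10.0, beta=1.0))   # 480.0
```

Under these assumptions, the skewed pattern takes longer even though the total volume is unchanged, because the single long message determines the length of its step. This is one way to see why evening out the message lengths can reduce latency.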
