Efficient communication algorithms for parallel computing platforms

High Performance Computing (HPC) platforms serve a wide range of applications. Processor speeds on these platforms have increased rapidly, but the speed of data communication among processors, memory, and disk has not kept pace. Efficient communication algorithms are therefore critical for effective utilization of HPC platforms, and our work focuses on their design. As a foundation, we construct a simple and accurate model of HPC platforms that identifies three main communication costs: processor-processor, memory-disk, and processor-memory. Of these, we investigate the first two.

Using this model, we first develop a set of communication algorithms for the software task pipeline, which consists of several stages of processors. The general communication pattern in such a pipeline is M-to-N K-block-cyclic communication, where M is the number of source processors, N is the number of destination processors, and K is the number of consecutive blocks sent to the same processor. Our algorithm reduces the number of communication steps to as few as lg(N/M + 1), whereas a previous serial approach requires MN steps. Our experimental results show that the number of processors required to process Synthetic Aperture Radar (SAR) data is reduced by as much as 50%.

The second class of algorithms addresses memory-disk communication. Here we design algorithms for two operations: all-to-all broadcast and matrix transpose. With our transpose algorithm, execution time on the IBM SP2 is reduced by as much as 31.2% for a 64 MByte data set on a single processor. For all-to-all broadcast, execution time on the SGI/Cray T3E is reduced by as much as 86% with 64 processors and 256 KBytes of data per processor.

Finally, we implement several benchmarks that measure HPC performance. We choose a recently proposed benchmark for real-time performance and implement it using both our communication algorithms and the previous serial algorithm. We also implement our low-level benchmark on HPC platforms.
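The scheduling that achieves the lg(N/M + 1) step count is not detailed in this abstract, but the index arithmetic underlying K-block-cyclic redistribution can be sketched. The snippet below assumes the common convention that block i of a K-block-cyclic layout over P processors resides on processor (i // K) mod P; the function names are illustrative, not the dissertation's implementation.

```python
def dest_processor(block_index, K, P):
    """In a K-block-cyclic layout over P processors, consecutive
    groups of K blocks map to the same processor, wrapping around."""
    return (block_index // K) % P

def redistribution_pairs(num_blocks, K_src, M, K_dst, N):
    """For an M-to-N redistribution from a K_src- to a K_dst-block-cyclic
    layout, list the (source, destination) processor pair of each block.
    A scheduler would group these pairs into parallel communication steps."""
    return [(dest_processor(b, K_src, M), dest_processor(b, K_dst, N))
            for b in range(num_blocks)]
```

For example, redistributing 8 blocks from a 1-block-cyclic layout on 2 processors to a 2-block-cyclic layout on 4 processors pairs block 3 as (source 1, destination 1). A serial schedule would handle each of the up-to-MN distinct pairs in its own step, which is the MN-step baseline the abstract refers to.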
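The transpose algorithm itself is likewise not given in the abstract. As a minimal sketch of the general idea behind memory-disk (out-of-core style) transposition, the in-memory fragment below processes the matrix in b-by-b tiles, modeling the movement of one tile's worth of data between disk and memory per step rather than single elements; all names are hypothetical.

```python
def blocked_transpose(A, b):
    """Transpose the square matrix A (a list of lists) tile by tile.
    Each b-by-b tile is read once and its transpose written once,
    which is the access pattern out-of-core transposes aim for."""
    n = len(A)
    T = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            # Read tile (ii, jj); write its transpose to tile (jj, ii).
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    T[j][i] = A[i][j]
    return T
```

With data on disk, each tile read or write becomes one large sequential I/O request, so the number of I/O operations scales with the number of tiles, (n/b)^2, rather than with the n^2 elements.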
