HAN: a Hierarchical AutotuNed Collective Communication Framework

High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy ever-increasing computational needs, and this brings new challenges to the design of MPI libraries, especially with regard to collective operations. To address these challenges, we present "HAN," a new hierarchical autotuned collective communication framework in Open MPI. HAN selects suitable homogeneous collective communication modules as submodules for each hardware level, uses collective operations from the submodules as tasks, and organizes these tasks to perform efficient hierarchical collective operations. With this task-based design, HAN can easily swap out submodules, while keeping the tasks intact, to adapt to new hardware. This makes HAN well suited to current platforms and provides strong, flexible support for future HPC systems. To provide a fast and accurate autotuning mechanism, we present a novel cost model based on benchmarking the tasks instead of whole collective operations. This approach drastically reduces tuning time, since task costs can be reused across different message sizes, and it is more accurate than existing cost models. Our cost analysis suggests the autotuning component can find the optimal configuration in most cases. The evaluation of the HAN framework shows that our design significantly improves on the default Open MPI and achieves decent speedups against state-of-the-art MPI implementations on the tested applications.
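To make the hierarchical, task-based idea concrete, the sketch below (ours, not HAN's actual code) composes a broadcast from two per-level sub-collectives: processes are first split into an intra-node communicator and an inter-node communicator of node leaders, then the data flows through an inter-node broadcast task followed by an intra-node broadcast task. The communicator names and the assumption that the global root (rank 0) is also its node's leader are illustrative choices, not part of HAN's interface.

```c
/*
 * Minimal sketch of a two-level hierarchical broadcast, illustrating the
 * "sub-collectives as tasks" idea from the abstract. This is not HAN's
 * implementation; HAN builds such tasks from Open MPI submodules internally.
 * Assumes the global root (rank 0) is the lowest rank on its node, so it is
 * also that node's leader.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Low level: intra-node communicator, one per node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        world_rank, MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* High level: inter-node communicator containing only node leaders. */
    MPI_Comm leader_comm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD,
                   node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    double buf[4] = {0};
    if (world_rank == 0)
        for (int i = 0; i < 4; i++) buf[i] = i + 1.0;

    /* Task 1: inter-node broadcast among node leaders
     * (in HAN this would be served by an inter-node submodule). */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, 4, MPI_DOUBLE, 0, leader_comm);

    /* Task 2: intra-node broadcast from each node leader
     * (in HAN this would be served by a shared-memory submodule). */
    MPI_Bcast(buf, 4, MPI_DOUBLE, 0, node_comm);

    printf("rank %d received %.1f %.1f %.1f %.1f\n",
           world_rank, buf[0], buf[1], buf[2], buf[3]);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Under the task-based cost model described above, the cost of such a broadcast would be estimated as the sum of the costs of its tasks (here, one inter-node and one intra-node broadcast), with each task benchmarked on its own level and those measurements reused when estimating other collectives and message sizes, rather than benchmarking every full-collective configuration end to end.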
