Toward Exascale Resilience

Over the past few years, resilience has become a major issue for high-performance computing (HPC) systems, in particular with the prospect of large petascale systems and future exascale systems. These systems will typically gather from half a million to several million central processing unit (CPU) cores running up to a billion threads. Based on current knowledge and observations of existing large systems, exascale systems are expected to experience various kinds of faults many times per day. The current approach to resilience, which relies on automatic or application-level checkpoint/restart, is also expected to stop working, because the time needed to checkpoint and restart will exceed the mean time to failure of the full system. These projections leave the HPC fault-tolerance community with a difficult challenge: finding new approaches, possibly radically disruptive ones, that let applications run to normal termination despite the essentially unstable nature of exascale systems. Yet the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations and research issues considered decisive by several complementary experts in HPC applications, programming models, distributed systems and system management.
