Toward Exascale Resilience
Franck Cappello | Laxmikant V. Kalé | William Gropp | Marc Snir | Al Geist | Bill Kramer
[1] J. von Neumann. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components, 1956.
[2] Christian Engelmann,et al. Super-Scalable Algorithms for Computing on 100,000 Processors , 2005, International Conference on Computational Science.
[3] Chao Wang,et al. A tunable holistic resiliency approach for high-performance computing systems , 2009, PPoPP '09.
[4] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control , 1974, CACM.
[5] Carl E. Landwehr,et al. Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.
[6] Christian Engelmann,et al. Proactive process-level live migration in HPC environments , 2008, HiPC 2008.
[7] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[8] G. R. Liu,et al. Mesh Free Methods: Moving beyond the Finite Element Method, 2003.
[9] John A. Gunnels,et al. Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[10] Daniel Marques,et al. C3: A System for Automating Application-Level Checkpointing of MPI Programs , 2003, LCPC.
[11] Josep Torrellas,et al. SWICH: A Prototype for Efficient Cache-Level Checkpointing and Rollback , 2006, IEEE Micro.
[12] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst.
[13] Zhiling Lan,et al. Fault-Driven Re-Scheduling For Improving System-level Fault Resilience , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[14] Kevin Regimbal,et al. Report of the Workshop on Petascale Systems Integration for Large Scale Facilities, 2007.
[15] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[16] Zizhong Chen. Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[17] Mica R. Endsley,et al. Toward a Theory of Situation Awareness in Dynamic Systems , 1995, Hum. Factors.
[18] George Bosilca,et al. Redesigning the message logging model for high performance , 2010, Concurr. Comput. Pract. Exp.
[19] Kai Li,et al. Memory Exclusion: Optimizing the Performance of Checkpointing Systems , 1999, Softw. Pract. Exp.
[20] Bianca Schroeder,et al. Understanding failures in petascale computers, 2007.
[21] George Candea,et al. JAGR: an autonomous self-recovering application server , 2003, 2003 Autonomic Computing Workshop.
[22] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols, 2002.
[23] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing.
[24] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[25] R. Vilalta,et al. Providing Persistent and Consistent Resources through Event Log Analysis and Predictions for Large-scale Computing Systems, 2002.
[26] James Demmel,et al. Percu: a holistic method for evaluating high performance computing systems, 2008.
[27] Manuel Blum,et al. Designing programs that check their work , 1989, STOC '89.
[28] Richard P. Martin,et al. Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services , 2003, USENIX Symposium on Internet Technologies and Systems.
[29] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[30] Charng-Da Lu,et al. Assessing Fault Sensitivity in MPI Applications , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[31] Christian Engelmann,et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors, 2002.
[32] Laxmikant V. Kale,et al. Proactive Fault Tolerance in Large Systems, 2004.
[33] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[34] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[35] Charng-da Lu,et al. Scalable Diskless Checkpointing for Large Parallel Systems, 2005.