Computational Science — ICCS 2003

As the run time of an application approaches the mean time to interrupt (MTTI) for the system on which it is running, it becomes necessary to generate intermediate snapshots of the application's run state, known as checkpoint files or restart dumps. In the event of a system failure that halts program execution, these snapshots allow an application to resume computing from the most recently saved intermediate state instead of starting over at the beginning of the calculation. In this paper, three models for predicting the optimum compute interval between restart dumps are discussed. These models are evaluated by comparing their results to a simulation that emulates an application running on an actual system with interrupts. The results are then used to derive a simple method for calculating the optimum restart interval.
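To make the trade-off concrete, the sketch below emulates, under simplified assumptions, the kind of simulation the abstract describes: an application that checkpoints every tau hours of compute, with interrupts drawn from an exponential (Poisson) failure process and fixed dump and restart costs. The function names, parameter choices, and failure model are illustrative assumptions, not the paper's actual simulation, and the square-root expression shown is the commonly cited first-order approximation for the optimum interval rather than necessarily the method this paper derives.

```python
import math
import random

def simulated_wall_clock(solve_time, dump_time, restart_time, mtti, tau, trials=2000):
    """Estimate the mean wall-clock time to complete `solve_time` hours of
    useful computation when a restart dump costing `dump_time` hours is
    written after every `tau` hours of compute, interrupts arrive with mean
    time `mtti` (exponentially distributed), and each failure costs
    `restart_time` plus all work done since the last completed dump.
    All quantities are illustrative; units are hours."""
    totals = []
    for _ in range(trials):
        done = 0.0    # compute completed and safely checkpointed
        clock = 0.0   # elapsed wall-clock time
        while done < solve_time:
            work = min(tau, solve_time - done)
            segment = work + dump_time          # compute plus dump for this interval
            time_to_fail = random.expovariate(1.0 / mtti)
            if time_to_fail >= segment:
                # the interval and its dump finish before the next interrupt
                clock += segment
                done += work
            else:
                # interrupt: lose the partial segment, pay the restart penalty
                clock += time_to_fail + restart_time
        totals.append(clock)
    return sum(totals) / len(totals)

def first_order_optimum(dump_time, mtti):
    """Commonly cited first-order estimate of the optimum compute interval."""
    return math.sqrt(2.0 * dump_time * mtti)

if __name__ == "__main__":
    # Hypothetical example: 500 hours of compute, 5-minute dumps,
    # 10-minute restarts, 24-hour MTTI.
    solve, dump, restart, mtti = 500.0, 5.0 / 60.0, 10.0 / 60.0, 24.0
    for tau in (0.5, 1.0, 2.0, 4.0, 8.0):
        t = simulated_wall_clock(solve, dump, restart, mtti, tau)
        print(f"tau = {tau:4.1f} h  ->  mean wall clock ~ {t:7.1f} h")
    print(f"first-order optimum ~ {first_order_optimum(dump, mtti):.2f} h")
```

Sweeping tau in this way exposes the basic tension the models must capture: short intervals waste time writing dumps, while long intervals lose more recomputed work per failure, so the expected wall-clock time has a minimum at an intermediate interval.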