Checkpointing and restart techniques

Why bother checkpointing your application?

There are many good reasons for designing your applications with checkpointing and restart capabilities on HPC systems. The two most important ones are probably:

  • HPC systems are, as their name indicates, high-performance rather than high-availability systems. The risk of job interruptions, caused e.g. by a node failure or a network outage, must therefore be taken into account at all times.
  • The runtime limits imposed on the systems call for efficient, automated approaches to submitting long-running jobs.

Checkpointing procedures

There are currently two supported approaches for submitting a long-running job to the runtime limited HPC systems. Both procedures are, of course, highly application dependent. Please contact the ZID cluster administration if you need implementation assistance or if your application doesn't seem to fit into either of the two schemes.

Attention! Both approaches carry the risk of flooding the SGE batch scheduler with erroneous jobs or, even worse, of littering your working directories or causing other major damage to your account. Please be extra careful when applying such a scheme and test thoroughly before going into production.

Sub-dividing the program into appropriate pieces

The first and presumably easier way to overcome the runtime limitations is to sub-divide your program into a set of sequentially dependent portions. Strictly speaking, this is not a checkpoint-and-restart approach (unless you additionally arrange for a restart in some appropriate way): if one job-portion, i.e. one link in the chain, fails, all subsequent links will fail as well.
A prerequisite for this approach is that each portion of your job depends in some systematic way on its predecessor.

As an example, each job-portion might generate an output file data.final in the current working directory, which serves as the input file data.start of the subsequent job-portion. No matter how complicated the interdependency may be, the important step is to submit the portions to the SGE batch scheduler in such a way that the sequential dependencies are respected.

The following exemplary script depicts a possible implementation of the above scenario:

#!/bin/bash

# Backup initial input file data.start to data.initial, just in case
cp data.start data.initial

# Submit the first job-portion and save the qsub output
# ("Your job <job_id> ... has been submitted") to previous_job.info
qsub job.sh jobscript_args | tee previous_job.info

# Submit the next 100 job-portions sequentially and create the
# necessary dependencies
for i in $(seq 0 99); do

    # Extract the job id (third field of the qsub output)
    job_id=$(awk '{print $3}' previous_job.info)
    echo "$job_id"

    # Submit the next portion with a hold until the previous one has finished
    qsub -hold_jid "$job_id" job.sh jobscript_args | tee previous_job.info

done


The jobscript job.sh for the depicted purpose might look similar to the following script:

#!/bin/bash

# SGE options
# use current working directory
#$ -cwd
# redirect stdout of job to sge.stdout
#$ -o sge.stdout
#...

# Backup the previous input file
cp data.start bck_data.start

# Make the previous output file the current input file
# (data.final does not exist yet when the first portion runs)
[ -f data.final ] && mv data.final data.start

# Call your main routine or script
./my_prog my_prog_args


The important option for job-submission is

qsub -hold_jid <job_list> ...

with job_list containing a comma-separated list of SGE job ids. Instead of job ids, the prerequisite jobs can also be given by their SGE job names, but the names must be uniquely identifiable!

Note: All jobs submitted via the above method will remain in a hold state until all prerequisite jobs have finished. As resource reservation, activated with the -R yes submit option, does not apply to jobs in a hold state, the reservation will not take place before the job becomes eligible for execution.
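For illustration, the same dependency chain can be expressed via unique job names rather than job ids (the names part_1 and part_2 are arbitrary placeholders chosen here, not SGE conventions):

```shell
# Submit the first job-portion under a unique, explicitly chosen name
qsub -N part_1 job.sh jobscript_args

# The second portion is held until the job named part_1 has finished
qsub -N part_2 -hold_jid part_1 job.sh jobscript_args
```

Since this variant does not require parsing the qsub output for job ids, it can simplify the submission loop, as long as you guarantee that each name is used only once.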

Automatic checkpoint and restart procedure

If your application provides some kind of user-level checkpointing facility (many large user applications do have integrated checkpointing), there is a more elegant method available for restarting your program appropriately after a node failure or in case a runtime limit was exceeded.
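At its core, such a user-level checkpoint amounts to periodically writing the program state to disk and reading it back on startup. The following bash sketch illustrates the idea; the file name state.chk and the trivial counting loop are purely illustrative placeholders for real application state and work:

```shell
#!/bin/bash

# Resume from the last saved iteration if a checkpoint file exists,
# otherwise start from scratch
if [ -f state.chk ]; then
    i=$(cat state.chk)
else
    i=0
fi

while [ "$i" -lt 100 ]; do
    # ... one unit of real work would go here ...
    i=$((i + 1))

    # Save the current state; writing to a temporary file first and
    # renaming it keeps the checkpoint consistent even if the job is
    # killed in the middle of the write
    echo "$i" > state.chk.tmp
    mv state.chk.tmp state.chk
done
```

If the job is interrupted at any point, simply rerunning the script continues from the last completed iteration instead of starting from zero.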
Necessary submit options

To implement the automatic restart procedure, you first have to submit your job with two additional options: -notify, which makes the SGE daemon send a notification signal to your job script before the runtime limit is exceeded, and -r y, which enables restarting for your job. E.g.:

qsub -r y -notify ... job_script

The restart option -r y

The -r y restart option is the basic prerequisite for making your program eligible for a restart after a node failure or when a runtime limit is exceeded. Without setting this option explicitly, programs are treated as if started with the -r n setting, i.e. by default programs are not restarted.

The notification option -notify

The -notify option causes Grid Engine to send a SIGUSR2 (signal 12) to your program some time before the application is finally killed for a predictable reason (e.g. the queue's runtime limit being exceeded). The notification delay, i.e. the time between the notification signal and the final termination of the program, is currently 10 minutes. If this is not enough for an appropriate checkpoint of your program, please contact the ZID cluster administration.
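The interplay between the signal handler and the notification can be tried out locally, without the batch system, by sending SIGUSR2 to the script itself. In this sketch the "checkpoint" is just a marker file, standing in for a real checkpoint routine:

```shell
#!/bin/bash

# Install the handler as the job script would; here the checkpoint
# action merely writes a marker file
trap 'echo "checkpointing" > checkpoint.log' USR2

# Simulate the SGE daemon by sending SIGUSR2 to this very shell;
# bash executes the handler at the next safe point, i.e. after the
# currently running command has completed
kill -s USR2 $$

# By now the handler has run and created the marker file
cat checkpoint.log
```

Running the script should leave a checkpoint.log containing the marker line, confirming that the handler fires as expected.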

Implementation of the procedure

There are three main aspects to be taken into account, when implementing the checkpoint and restart procedure:

  • Trapping the notification signal SIGUSR2 and, if desired, initiating the checkpoint procedure (checkpoints could alternatively/additionally be done at regular intervals).
  • Signalling the SGE batch scheduler with exit code 99 that the program should be restarted.
  • Starting/restarting the application, if necessary differentiating between a restart and the initial startup with the aid of SGE's RESTARTED environment variable.

The following script exemplifies a potential implementation of this procedure:

#!/bin/bash

# SGE submit options
#$ -r yes
#$ -notify
#$ -cwd
#$ ...

# Trap SGE's notification signal SIGUSR2 (signal 12)
trap "{
    echo \"Notification signal caught, initiate checkpoint procedure...\"
    ./my_checkpoint_procedure
    echo \"Force application restart with exit status 99.\"
    exit 99
}" 12

# Start the application program depending on the restart condition
if [ "$RESTARTED" = "0" ]; then

    # Initialize the restart count file
    echo "0" > restart_count

    # Create/truncate the SGE output file
    echo "" > "$SGE_STDOUT_PATH"

    # Start the application program
    echo "Initial program startup..."
    ./app_start

elif [ "$RESTARTED" = "1" ]; then

    # Increment the restart count
    restart_nb=$(( $(cat restart_count) + 1 ))

    # Exit normally if the restart count is large enough
    # (an exit status other than 99 prevents a subsequent restart!)
    if [ "$restart_nb" -ge 99 ]; then
        echo "Restarted this job more than enough!...Bye."
        exit 0
    fi

    # Save the new restart count
    echo "$restart_nb" > restart_count

    # Restart the application program
    echo "Program restart..."
    ./app_restart

else

    echo "Something is going awfully wrong: RESTARTED=${RESTARTED}!"
    exit 1

fi


If you use a multilevel startup procedure with more than one script file, you need to include the signal handler trap : 12 in every parent script of the script hierarchy. E.g., a submit script containing all SGE submit options and subsequently calling an application startup script as depicted above would have to look similar to the following sample:

#!/bin/bash

# SGE submit options
#$ -r yes
#$ -notify
#$ -cwd
#$ ...

# Install a no-op handler for SIGUSR2 so that the parent script is
# not terminated by the notification signal
trap : 12

# The exit status of the child (99 when a restart is requested)
# automatically becomes the exit status of this script
./program_script