Checkpointing

Abstract

Checkpointing HPC applications has long been a challenging but highly desired capability for saving the state of long-running applications. It hedges against unexpected events that would otherwise cause an application to fail prematurely and lose all of its progress.
 

Checkpointing is currently available on Itasca.

        Log in to Itasca:

              ssh username@itasca.msi.umn.edu

Checkpoint serial jobs

Compile your job

       module load intel

       icc  -o my_test my_app.c  -lcr

or

       ifort -o my_test  my_app.f -lcr  
 

Run your job

      qsub -I -l nodes=1:ppn=8,mem=10gb,walltime=2:00:00

      cd $wrk   # the directory containing your job files

      cr_run ./my_test &

Find the job PID and checkpoint it

   PID=`ps -u $USER | grep my_test | awk '{print $1}'`

   cr_checkpoint --signal=2 --term $PID
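
Because cr_run execs the application in place, the backgrounded process keeps the same PID, so the shell's $! can replace the ps/grep parsing entirely. The sketch below uses sleep as a stand-in, since cr_run is only available on Itasca:

```shell
# On Itasca this would be:  cr_run ./my_test &
sleep 5 &

# $! is the PID of the most recent background job; with cr_run,
# that is the checkpointable application's PID.
PID=$!
kill -0 $PID && echo "running with PID $PID"

# On Itasca you would then run:  cr_checkpoint --signal=2 --term $PID
```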

To verify that checkpointing succeeded

    tail  context.$PID

To checkpoint again and terminate the job 

   cr_checkpoint  --term $PID

To restart the job from the last checkpoint

  cr_restart context.$PID

 

Checkpoint OpenMP jobs

Compile your job

       module load intel

       icc  -o my_test -openmp  my_app.c  -lcr

or

       ifort -o my_test   -openmp my_app.f -lcr  
 

Run your job

      qsub -I -l nodes=1:ppn=8,mem=10gb,walltime=2:00:00

      cd $wrk   # the directory containing your job files

      export OMP_NUM_THREADS=4

      cr_run ./my_test &

Find the job PID and checkpoint the job

   PID=`ps -u $USER | grep my_test | awk '{print $1}'`

   cr_checkpoint --signal=2 --term $PID

To verify that checkpointing succeeded

    tail  context.$PID

To checkpoint again and terminate the job 

   cr_checkpoint  --term $PID

To restart the job from the last checkpoint

  export OMP_NUM_THREADS=4

 cr_restart context.$PID

 

Checkpoint MPI jobs

Compile your job

       module load intel ompi/1.6.3-blcr/intel

       mpicc  -o my_test   my_app.c 

or

       mpif77 -o my_test    my_app.f
 

Where to store the checkpoint files

Create a directory named .openmpi in your home directory and, inside it, a file named mca-params.conf. This file should list the directories where you want the checkpoint files stored. Here is an example:

     cat /home/support/szhang/.openmpi/mca-params.conf
     snapc_base_global_snapshot_dir=/lustre/cr_files
     crs_base_snapshot_dir=/lustre/cr_files/local
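
The configuration above can be created as follows; the /lustre/cr_files paths are examples from this page and should point at a filesystem with enough space for the snapshots:

```shell
# Create the Open MPI MCA parameter file that tells the BLCR-enabled
# Open MPI where to write its global and per-node snapshots.
mkdir -p $HOME/.openmpi

cat > $HOME/.openmpi/mca-params.conf <<'EOF'
snapc_base_global_snapshot_dir=/lustre/cr_files
crs_base_snapshot_dir=/lustre/cr_files/local
EOF
```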

Run your job

      qsub -I -l nodes=4:ppn=8,mem=10gb,walltime=2:00:00

      cd $wrk   # the directory containing your job files

      mpirun  -am ft-enable-cr -np 32 ./my_test &

Find the job PID and checkpoint it:

      jid=`ps -u $USER | grep mpirun | awk '{print $1}'`

      ompi-checkpoint $jid

To verify that checkpointing succeeded

    ls -al /lustre/cr_files | grep $jid

To checkpoint again and terminate the job 

   ompi-checkpoint --term $jid

To restart the job from the last checkpoint

cd  /lustre/cr_files/

 ompi-restart  /lustre/cr_files/ompi_global_snapshot_$jid.ckpt/

 

 

  I/O Performance