Job Problem Solving

There are many things that can cause a job submitted to a queue to fail.  Fortunately, several methods are available to determine the cause of a job failure.  This page lists methods that commonly help users resolve the problems their jobs encounter.

Check that the Job Fits the Queue

One of the most common job problems is that a job requests resources that the queue it was submitted to cannot provide.  These resources include requested wall time, number of nodes, processors per node, and memory.  Each job queue has different limits on these resources, and these limits are outlined on our Job Queue Summary webpage.

Jobs listed as "Blocked" have probably requested more memory than the nodes in the queue can provide.  The compute nodes require some of their memory to run the operating system and basic services, and consequently jobs should request about 2 GB less memory per node than is physically present on the hardware.  For example, if a set of nodes has 24 GB of memory each, then a job should use less than 22 GB of memory on each.
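
As a sketch, a job bound for a queue whose nodes each have 24 GB of memory might request its resources with lines like the following (the wall time, node count, and processors per node shown here are placeholders; adjust them to fit the limits listed on our Job Queue Summary webpage):

#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=8
#PBS -l mem=22gb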

If a job does not fit the queue it was submitted to, it may fail or hang in the queue.

Check the Computing Environment

A job may fail because it cannot access required software or libraries, or because of an environment configuration error.  A simple first check is to verify that the system being used is the intended one.  The command hostname will show the name of the host being accessed.
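
For example, placing the following line near the top of a job script will record in the job's output file which node the job actually ran on:

hostname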

Many calculations require software modules to be loaded so that the appropriate libraries and executables are accessible.  Error messages about missing library files (e.g. cannot find libfftw.so.0) often stem from incorrectly loaded modules.  Check that the appropriate module load commands are being used.  Jobs begin with only the default modules, so each job script needs to contain module load commands for the required software.  The command module list will show the currently loaded modules.  The command env will show the values of all environment variables.
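
As an illustration, a job script that needs the FFTW library might include lines like the following before running the calculation (the module name and version are hypothetical; use module avail to see the modules available on your system):

module load fftw/3.3
module list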

Use Job Status Commands

If a job using the PBS system appears to be stuck in the queue, or is otherwise behaving strangely, there are some system commands available to gather more information. 

The command qstat will show a list of all queued jobs on a system.  The job number associated with each job will be shown on the far left.  To see all of the queued jobs submitted by a particular user name, use the command:

qstat -u username
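
Once a job number is known, qstat can also display the full record for a single job (12345 below is a placeholder job number):

qstat -f 12345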

More information regarding ways to use the qstat command can be found in its manual page (man qstat).

The command showq also lists all queued jobs.  This command is useful because the listed jobs are categorized as either Active (running), Eligible (waiting), or Blocked.  Jobs are usually Blocked because they request resources, often large amounts of memory, that could never be provided.  Sometimes jobs are blocked due to a system maintenance issue (see below).  To show a categorized list of all jobs associated with a particular user name, use the command:

showq -w user=username
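
The Moab/Maui showq can also list only the blocked jobs, which is useful when diagnosing this situation (the exact flags may vary by scheduler version):

showq -b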

The command checkjob can be used to view the details of a job submitted to a queue.  To use this command, the job ID number must first be determined using the qstat or showq commands.  The syntax to use the checkjob command is:

checkjob jobnumber

If a job is blocked, the information obtained with checkjob can help determine the reason.  Care must be used when interpreting the checkjob output: the command compares the job to the requirements of multiple queues, and error messages associated with queues that the job is not in do not negatively affect the job.  Also, a message stating that sufficient hardware is not available means that the job is still waiting in the queue for its turn.
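
If the standard checkjob output is not informative enough, Moab's checkjob also has a verbose mode that prints additional detail (12345 is again a placeholder job number):

checkjob -v 12345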

Use Job Status Emails

If you submit a job to one of MSI's system queues, the system can send you automatic job information emails.  This is particularly useful when a job encounters an error because you will receive an email with error information.

To use job status emails include the following lines in your PBS job script:

#PBS -m abe 
#PBS -M sample_email@umn.edu

The first line will cause the system to send email updates when the job begins, ends, or aborts.  The second line directs the emails to be sent to sample_email@umn.edu (this should be replaced with the email address where you wish to receive the updates).  If your job aborts you will receive an email with error information, including a job exit code.  Below is a summary of some possible exit codes and their meanings.

Exit Code   Meaning
0           Job completed correctly.
1           Job exited after experiencing an error; check outputs for information.
9, 64       Out of CPU time.
125, 127    Severe error.
130, 131    The job ran out of CPU time or swap.  If swap is the culprit, check for memory leaks.
134         The job was killed with an abort signal; possible program error.
137         The job was killed because it exceeded the time limit.
139         Segmentation violation; see the Debugging webpage.
140         The job exceeded the "wall clock" time limit.
271         Killed at the request of root (Moab); this can be due to exceeding the "wall clock" time limit, being cancelled by the user, or exceeding some other hardware limit.
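
If the exit code alone is not enough, a job script can also record the exit status of the program it runs directly in the job output file.  A minimal sketch, where my_program and input.dat are placeholders for your own executable and input:

./my_program input.dat
echo "my_program exited with status $?"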

Check Error Logs and Outputs

When a job that used the PBS queue system exits or aborts, it creates output and error log files.  If an error occurred, these files often contain useful information about it.  By default these files are located in the job working directory (the directory the PBS script was submitted from).

Output files by default have names in the format: scriptname.o12345 
Error log files by default have names in the format: scriptname.e12345 

Here scriptname stands for the name of the PBS job script, and the trailing number is the job ID number, which is unique to each job submission.
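
If you prefer different log names or a single combined log, PBS directives can control this.  A sketch (the job name and log file name are placeholders):

#PBS -N myjob
#PBS -j oe
#PBS -o myjob.log

Here -N sets the job name, -j oe merges the error stream into the output stream, and -o sets the name of the resulting log file.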

Check Service Units (Computation Time) Remaining

Jobs will be rejected from the High Performance queues if your group does not have sufficient service units (SUs) remaining.  To check your remaining service units, use the command:

acctinfo

More information about service units is available elsewhere on the MSI website.

Check Disk Space Remaining

A job will fail if it attempts to write files to a full disk or directory.  If your job writes files within your home directory, or within other directories owned by your research group, you may check the space remaining in these directories using the command:

groupquota

Jobs should not write output to the /tmp directory.  The /tmp directory has relatively little available space and fills quickly.  For temporary space, jobs should use the /scratch directories.  Itasca jobs may use the /lustre filesystem for temporary storage.
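
As a sketch, a job script can create its own scratch area and work there rather than in /tmp (the exact /scratch path layout is an assumption here; check your system's scratch conventions):

mkdir -p /scratch/$USER/$PBS_JOBID
cd /scratch/$USER/$PBS_JOBID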

Check for a System Maintenance Issue

During maintenance periods, or if a problem occurs, a system may become unavailable for computation.  Information about system issues is available on our System Status page.