Job Problem Solving
There are many things that can cause a job submitted to a queue to fail. Thankfully there are several methods available to determine the cause of a job failure. This page lists some methods that commonly help users resolve the problems their jobs encounter.
Check that the Job Fits the Queue
One of the most common job problems is that a job may require resources which cannot be provided in the queue it was submitted to. Resources could include requested wall time, number of nodes, processors per node, or memory. Each job queue has different limits with respect to these resources, and these limits are outlined in our Job Queue Summary webpage.
Jobs listed as "Blocked" have probably requested more memory than the nodes in the queue can provide. The compute nodes require some of their memory to run the operating system and basic services, and consequently jobs should request about 2 GB less memory per node than is physically present on the hardware. For example, if a set of nodes has 24 GB of memory each, then a job should use less than 22 GB of memory on each.
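The memory guideline above amounts to a simple subtraction, sketched here as a small shell function. The name max_mem_request_gb is hypothetical and is not an MSI or PBS command. The 2 GB default headroom comes from the paragraph above, and the actual per-node memory for each queue should be taken from the Job Queue Summary webpage.

```shell
# Sketch: upper bound for a per-node memory request that leaves headroom
# for the operating system. max_mem_request_gb is a hypothetical helper.
max_mem_request_gb() {
    physical_gb=$1        # memory physically installed on each node, in GB
    reserve_gb=${2:-2}    # headroom for the OS and basic services (default 2 GB)
    echo $(( physical_gb - reserve_gb ))
}
```

For the 24 GB nodes in the example above, max_mem_request_gb 24 prints 22, so the job should stay under that figure.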
If a job does not fit the queue it was submitted to, the job may fail or remain in the queue indefinitely.
Check the Computing Environment
A job may fail due to inability to access required software or libraries, or an environment configuration error. A simple first check is to verify the system being used is the intended one. Use the command hostname to show the name of the host being accessed.
Many calculations require software modules to be loaded so that the appropriate libraries and executables are accessible. Error messages about missing library files (e.g. cannot find libfftw.so.0) often stem from incorrectly loaded modules. Check that the appropriate module load commands are being used. Jobs begin with only the default modules, so each job script needs to contain module load commands for the required software. The command module list will show the currently loaded modules. The command env will show the values of all environment variables.
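When a missing library error appears, it can help to check whether any directory on the library search path actually contains the file. The function below is a minimal sketch of that check. The name find_lib is hypothetical, and the sketch assumes that the relevant module adds its library directory to the LD_LIBRARY_PATH environment variable, which is common but not guaranteed. The dynamic linker also searches other locations, so a "not found" result here is a hint rather than proof.

```shell
# Sketch: report which directory on LD_LIBRARY_PATH (if any) provides a library.
# find_lib is a hypothetical helper name, not a system command.
find_lib() {
    lib=$1
    old_ifs=$IFS
    IFS=:                                   # split the path list on ':'
    for dir in ${LD_LIBRARY_PATH:-}; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            echo "$dir/$lib"                # first match wins, as in the search order
            return 0
        fi
    done
    IFS=$old_ifs
    echo "$lib not found on LD_LIBRARY_PATH" >&2
    return 1
}
```

Running find_lib libfftw.so.0 before and after the module load command in a job script shows whether loading the module changed the result.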
Use Job Status Commands
If a job using the PBS system appears to be stuck in the queue, or is otherwise behaving strangely, there are some system commands available to gather more information.
The command qstat will show a list of all queued jobs on a system. The job number associated with each job will be shown on the far left. To see all of the queued jobs submitted by a particular user name use the command:
qstat -u username
More information regarding ways to use the qstat command can be found here.
The command showq also lists all queued jobs. This command is useful because the listed jobs are categorized as either Active (running), Eligible (waiting), or Blocked. Blocked jobs are usually blocked because they are requesting resources, often large amounts of memory, that could never be provided. Sometimes jobs are blocked due to a system maintenance issue (see below). To show a categorized list of all jobs associated with a particular user name use the command:
showq -w user=username
The command checkjob can be used to view the details of a job submitted to a queue. To use this command the job ID number must first be determined using the qstat or showq commands. The syntax to use the checkjob command is:
checkjob jobid
If a job is blocked the information obtained with checkjob can help determine the reason. Care must be used when interpreting the checkjob output: the command compares the job to the requirements of multiple queues, and error messages associated with queues that the job is not in do not negatively affect the job. Also, a message stating that sufficient hardware is not available means that the job is still waiting in the queue for its turn.
Use Job Status Emails
If you submit a job to one of MSI's system queues the system can send you automatic job information emails. This is particularly useful when a job encounters an error because you will receive an email with error information.
To use job status emails include the following lines in your PBS job script:
#PBS -m abe
#PBS -M firstname.lastname@example.org
The first line causes the system to send email updates when the job aborts, begins, or ends (the a, b, and e options). The second line sets the address the emails are sent to; replace firstname.lastname@example.org with the email address where you wish to receive the updates.
If your job aborts you will receive an email with error information, including a job exit code. The exit code is given by the program executed (not the job scheduler), but many exit codes have common meanings. The documentation for each program will need to be consulted to determine the exact meaning of an exit code. Below are some common meanings for different exit code values.
| Exit Code | Common Meaning |
|---|---|
| 0 | Job completed correctly. |
| 1 | Job exited after experiencing an error; check outputs for information. |
| 9, 64 | Out of CPU time. |
| 125, 127 | Severe error. |
| 130, 131 | The job ran out of CPU time or swap. If swap is the culprit check for memory leaks. |
| 134 | The job was killed with an abort signal, possible program error. |
| 137 | The job was killed because it exceeded the time limit. |
| 139 | Segmentation violation, see the Debugging webpage. |
| 140 | The job exceeded the time limit. |
| 271 | Killed at the request of root (moab), this can be due to exceeding the "wall clock" time limit, being cancelled by the user, or exceeding some other hardware limit. |
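Several of the codes in the table follow a common Unix shell convention: when a program is terminated by a signal, the reported exit code is 128 plus the signal number. This convention is an assumption not stated on this page, but it is consistent with the entries for 134 (signal 6, abort), 137 (signal 9, kill), and 139 (signal 11, segmentation violation). The sketch below applies that convention. The function name describe_exit_code is hypothetical, and codes outside the range 129 to 255 are left for the table and the program's own documentation.

```shell
# Sketch: interpret an exit code using the shell convention that values
# from 129 to 255 mean "terminated by signal (code - 128)".
# describe_exit_code is a hypothetical helper name.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
        echo "signal $(( code - 128 ))"
    else
        # No signal convention applies; consult the table and program docs.
        echo "other $code"
    fi
}
```

For example, describe_exit_code 139 prints "signal 11", and the signal number can then be looked up with the command kill -l.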
Check Error Logs and Outputs
When a job which used the PBS queue system exits or aborts it creates output and error log files. If an error occurred these files often have useful information regarding the error. These files will be located by default in the job working directory (the directory the PBS script was submitted from).
Output files by default have names in the format: scriptname.o12345
Error log files by default have names in the format: scriptname.e12345
Here scriptname stands for the name of the job PBS script, and the trailing number will be the job ID number which is unique for each time a job is submitted.
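The naming pattern above can be sketched as a small shell function that prints both default log names for a job. The name job_log_names is hypothetical, and the sketch assumes the default names have not been overridden in the job script.

```shell
# Sketch: build the default output and error log names for a job.
# job_log_names is a hypothetical helper name.
job_log_names() {
    script=$1   # name of the PBS job script, e.g. scriptname
    jobid=$2    # numeric job ID, e.g. 12345
    echo "$script.o$jobid"   # output file
    echo "$script.e$jobid"   # error log file
}
```

Once the names are known, a command such as tail -n 20 scriptname.e12345 shows the last lines of the error log, which is often where the failure is reported.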
Check Service Units (Computation Time) Remaining
Jobs will be rejected from the High Performance queues if your group does not have sufficient service units (SUs) remaining. To check your remaining service units use the command:
Check Disk Space Remaining
A job will fail if it attempts to write files to a full disk or directory. If your job writes files within your home directory, or within other directories owned by your research group, you may check the space remaining in these directories using the command:
Jobs should not write output to the /tmp directory. The /tmp directory has relatively little available space and fills quickly. For temporary space jobs should use the /scratch directories. Itasca jobs may use the /lustre filesystem for temporary storage.
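One way to keep temporary files out of /tmp is to create a uniquely named working directory under scratch at the start of a job. The function below is a minimal sketch of that idea. The names make_job_scratch and SCRATCH_ROOT are hypothetical, and the sketch assumes you may create directories directly under /scratch; the actual layout and permissions on a given system may differ.

```shell
# Sketch: create a private working directory under a scratch filesystem.
# make_job_scratch and SCRATCH_ROOT are hypothetical names.
make_job_scratch() {
    root=${SCRATCH_ROOT:-/scratch}   # override SCRATCH_ROOT to use another filesystem
    # mktemp -d creates a uniquely named directory and prints its path.
    mktemp -d "$root/job.XXXXXX"
}
```

In a job script this might be used as workdir=$(make_job_scratch) followed by cd "$workdir", with the directory removed once the results have been copied elsewhere.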
Check for a System Maintenance Issue
During maintenance periods, or if a problem occurs, a system may become unavailable for computation.
Information about system issues is available at our System Status page.