PBS Information for Labs and the Lab Queue


Introduction

The Portable Batch System (PBS) is a queuing system installed for lab batch processing. It matches job requirements with available resources, ensuring that machines are fully used and resources are distributed among all users. In contrast to the HPC system queues, the Lab system queues do not require Service Units (SUs) in order for jobs to run.

Lab users can submit to the lab queue. It is the default target for isub, and it is the default queue for jobs submitted from lab.msi.umn.edu or a lab workstation when no queue is specified. The lab-long queue suits jobs that need up to 150 hours of walltime.

All PBS jobs must be submitted with the qsub command and a submission script. Each queue's configuration sets a maximum wall clock time and a maximum number of CPUs per job. If your needs fall outside the limits listed below, please email help@msi.umn.edu for assistance.

Queue Configuration

Users must submit Lab queue jobs from the lab.msi.umn.edu interactive node using the instructions below.

Lab Queue Policies (maximum limits per job)

Queue            Option                        Nodes  Cores  Memory  Run-time  Running jobs per user
Default Queue    -q lab                        1      32     128GB   72hr      Up to 6
Long Queue       -q lab-long                   1      8      15GB    150hr     Up to 6
600 Hour Queue   -q lab-600                    1      8      128GB   600hr     Up to 1
Overclock        -q oc                         1      12     23GB    72hr      Up to 3
isub Defaults    "isub" with no parameters     1      1      4GB     2hr       Up to 6
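
As a rough illustration of how the Lab Queue Policies above can guide queue choice, the following sketch picks the first queue whose per-job limits fit a request. The function name pick_queue and its argument order are hypothetical and not part of PBS. The oc queue is deliberately left out, since this page does not say when it should be preferred.

```shell
# pick_queue HOURS CORES MEM_GB   (hypothetical helper, not a PBS command)
# Prints the first queue whose per-job limits fit the request,
# or returns 1 if no listed queue fits.
pick_queue() {
    hours=$1
    cores=$2
    mem_gb=$3
    if   [ "$hours" -le 72 ]  && [ "$cores" -le 32 ] && [ "$mem_gb" -le 128 ]; then
        echo lab
    elif [ "$hours" -le 150 ] && [ "$cores" -le 8 ]  && [ "$mem_gb" -le 15 ]; then
        echo lab-long
    elif [ "$hours" -le 600 ] && [ "$cores" -le 8 ]  && [ "$mem_gb" -le 128 ]; then
        echo lab-600
    else
        echo "no lab queue fits ${hours}h, ${cores} cores, ${mem_gb}GB" >&2
        return 1
    fi
}

pick_queue 100 8 12    # prints: lab-long
pick_queue 100 8 64    # prints: lab-600
```

The printed name can then be passed to qsub with the -q option.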

Lab Compute Node Details

You do not need to specify a particular compute node; the nodes are listed here only to describe the available resources. Use this information to decide how many resources to request for your job.

Node Details

Nodes              Model                         Cores  CPU                           Memory per node
labh01             PowerEdge R900                24     Intel Xeon 2.67 GHz           128GB
labh02             SunFire X4440                 16     AMD Opteron 8384 2.7 GHz      128GB
labh03-labh08      SunFire X4600                 32     AMD Opteron 8356 2.3 GHz      128GB
mirror1-mirror16   Altix XE 310                  8      Intel Xeon X5355 2.66 GHz     16GB
lab001-lab064      Altix XE 310                  8      Intel Xeon 2.66 GHz           16GB
laboc01-laboc04    LiquidCool Liquid Submerged   12     Intel Xeon X5690 @ 4.1 GHz    24GB

How to Create a Job Submission Script

This example is for the Lab queue. First log in to lab.msi.umn.edu, then write a PBS batch script like the example below, replacing username with your own username.

The following PBS script requests a 1-hour job on a single processor of a single node with 1 GB of memory. You can load any modules you need and run any software that can operate in batch mode, including your own code, as in this example. Save the script as script.pbs.

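#!/bin/bash -l
# Request 1 node, 1 processor per node, 1 GB of memory, and 1 hour of walltime
#PBS -l nodes=1:ppn=1,mem=1gb,walltime=01:00:00
# Send email when the job aborts (a), begins (b), and ends (e)
#PBS -m abe
# Change to the directory that holds the program and its input file
cd /home/msi/username/Testpbs
# Load the intel module
module load intel
# Run the program, reading input.dat and writing output.dat
./test < input.dat > output.dat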
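
As a variation on the script above, the following sketch writes a script for the lab-long queue and checks its shell syntax before submission. The file name script_long.pbs and the resource values (8 cores, 15 GB, 100 hours) are illustrative assumptions chosen to stay within the lab-long limits. The #PBS -q directive and the PBS_O_WORKDIR variable are standard PBS features, but whether they behave identically on the Lab system is an assumption.

```shell
# Write a hypothetical lab-long job script. The quoted 'EOF' keeps
# $PBS_O_WORKDIR unexpanded so that it is evaluated when the job runs.
cat > script_long.pbs <<'EOF'
#!/bin/bash -l
#PBS -q lab-long
#PBS -l nodes=1:ppn=8,mem=15gb,walltime=100:00:00
#PBS -m abe
# PBS_O_WORKDIR is the directory from which qsub was run
cd "$PBS_O_WORKDIR"
module load intel
./test < input.dat > output.dat
EOF

# Parse the script without executing it to catch shell syntax errors early.
bash -n script_long.pbs && echo "script_long.pbs: syntax OK"
```

Note that bash -n only checks shell syntax; it does not validate the #PBS resource requests, which are comments as far as the shell is concerned.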

How to Submit a Job

You may use PBS to submit jobs from lab workstations (for the lab queue), or from the interactive node: lab.msi.umn.edu.

Use the command qsub to submit a job to the queuing system. qsub takes a job submission script that contains special commands telling PBS what resources are needed. It also contains the commands necessary to run the submitted job.

For example, if you wrote your submission script for the default lab queue, log in to lab.msi.umn.edu and submit it as follows:

qsub script.pbs

To submit to a different queue from lab.msi.umn.edu, name the queue with the -q option:

qsub -q lab-long script.pbs
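
When submitting from a script, it can help to keep the job ID for later use with checkjob or qdel. The sketch below assumes that qsub prints the full job identifier (for example 98736.elmom.msi.umn.edu) on standard output, as PBS implementations commonly do. The helper name short_jobid is hypothetical.

```shell
# Strip everything after the first dot of a full job identifier,
# leaving the numeric ID (98736.elmom.msi.umn.edu -> 98736).
short_jobid() {
    printf '%s\n' "${1%%.*}"
}

# Only call qsub where it is installed, so the sketch is harmless elsewhere.
if command -v qsub >/dev/null 2>&1; then
    full_id=$(qsub script.pbs)
    echo "Submitted job $(short_jobid "$full_id")"
fi
```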

How to Check Job Status

You can check job status using the showq command.

Here is an example showq session showing running and queued jobs. The "PROCS" field shows the number of processors the user requested, and the "STATE" field gives the job status, such as Running, Idle, or Blocked. For more information, refer to the showq man page.

username@interactive [~/test] % showq -u username

active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME

98736 username Running 16 1:23:31:53 Thu Mar 31 14:32:38
98737 username Running 16 1:23:33:02 Thu Mar 31 14:33:47

2 active jobs 32 of 808 processors in use by local jobs (3.96%)
 26 of 112 nodes active (23.21%)

eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

98743 username Idle 16 2:00:00:00 Thu Mar 31 14:32:22
98742 username Idle 16 2:00:00:00 Thu Mar 31 14:32:16


2 eligible jobs 

blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

0 blocked jobs Total jobs: 4
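
To get a quick count instead of reading the full listing, showq output can be piped through a small filter. This is a minimal sketch that assumes the column layout shown in the session above, with the job state in the third field. The function name summarize_showq is hypothetical.

```shell
# Read showq output on stdin and count jobs by the STATE column (field 3).
summarize_showq() {
    awk '$3 == "Running" { running++ }
         $3 == "Idle"    { idle++ }
         END { printf "running=%d idle=%d\n", running, idle }'
}

# On the lab system (not run here):
#   showq -u username | summarize_showq
```

Only the Running and Idle states are counted here; other states would need their own patterns.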

To find more information about a running job, for example 98736, use "checkjob -v 98736", which prints output like the following:

username@interactive [~/test] % checkjob -v 98736
job 98736 (RM job '98736.elmom.msi.umn.edu')

AName: 3T.pbs
State: Running 
Creds: user:username group:mygroup class:mirror
WallTime: 00:29:17 of 2:00:00:00
SubmitTime: Thu Mar 31 14:30:50
 (Time Queued Total: 00:01:48 Eligible: 00:01:47)

StartTime: Thu Mar 31 14:32:38
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 16
Total Requested Nodes: 2

Req[0] TaskCount: 16 Partition: labpar 
Dedicated Resources Per Task: PROCS: 1 MEM: 875M
Utilized Resources Per Task: PROCS: 0.48 MEM: 199M SWAP: 8465M
Avg Util Resources Per Task: PROCS: 0.48
Max Util Resources Per Task: PROCS: 0.48 MEM: 199M SWAP: 8465M
Average Utilized Memory: 117.49 MB
Average Utilized Procs: 6.06
TasksPerNode: 8 NodeCount: 2

Allocated Nodes:
[mirror2:8][mirror3:8]


Task Distribution: mirror2,mirror2,mirror2,mirror2,mirror2,mirror2,mirror2,mirror2,mirror3,mirror3,mirror3,...

UMask: 0000 
OutputFile: elmo:/home/msi/username/myfile.out
ErrorFile: elmo:/home/msi/username/myfile.err
StartCount: 1
User Specified Partition List: elmopar,dellmopar,elmobpar,mirrorpar,labpar,SHARED,elmom
Partition List: elmopar,dellmopar,elmobpar,mirrorpar,labpar,elmom
SrcRM: elmom DstRM: elmom DstRMJID: 98736.elmom.msi.umn.edu
Submit Args: 3T.pbs
Flags: BACKFILL,RESTARTABLE,FSVIOLATION
Attr: BACKFILL,FSVIOLATION,checkpoint
StartPriority: -156396
PE: 16.00
Reservation '98736' (-00:29:42 - 1:23:30:18 Duration: 2:00:00:00)
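
If a script only needs the job state, a single field can be pulled out of the checkjob output. The sketch below assumes the "State:" line format shown in the output above. The function name job_state is hypothetical.

```shell
# Read checkjob output on stdin and print the value of the "State:" line.
job_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# On the lab system (not run here):
#   checkjob 98736 | job_state
```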


You can use the checkjob and showstart commands for detailed information on your job: checkjob reports the details of a job, and showstart estimates when a queued job will start. The syntax of these commands is:

checkjob <jobid>

e.g.: checkjob 689723

showstart <jobid>

e.g.: showstart 689723

How to Remove or Kill Jobs

Sometimes you may wish to stop a job before it ends on its own. Jobs are killed or removed from the queuing system by using the qdel command. There is a man page for qdel that lists the options you can use with it. If you wish to kill a running job, for example 98736.elmom.msi.umn.edu, type qdel 98736 at the command line.
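
To remove several queued jobs at once, the job IDs can be taken from showq output. This sketch assumes the showq column layout shown earlier on this page (job ID in the first field, state in the third). It prints the qdel commands rather than running them so that the list can be reviewed first. The function name idle_qdel_commands is hypothetical.

```shell
# Read showq output on stdin and print one "qdel <jobid>" line
# for each job whose STATE column (field 3) is Idle.
idle_qdel_commands() {
    awk '$3 == "Idle" { print "qdel " $1 }'
}

# On the lab system (not run here):
#   showq -u username | idle_qdel_commands        # review the commands
#   showq -u username | idle_qdel_commands | sh   # run them
```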

Interactive Jobs

Interactive jobs can be run on the lab queue via isub.