PBS Information for Labs and the Lab Queue


Introduction

The Portable Batch System (PBS) is a queuing system installed for lab batch processing. It matches job requirements with available resources, ensuring that machines are fully used and resources are distributed among all users. In contrast to the HPC system queues, the Lab system queues do not require Service Units (SUs) in order for jobs to run.

Lab users can submit to the lab queue, which is the default target for isub and the default for jobs submitted from lab.msi.umn.edu or a lab workstation when no queue is specified. The lab-long queue is intended for jobs requiring up to 150 hours of walltime.

All PBS jobs must be submitted via the qsub command and a submission script. Each queue sets a maximum wall clock time and a maximum number of CPUs per job. If your needs fall outside the parameters listed below, please email help@msi.umn.edu for assistance.

Queue Configuration

Users must submit Lab queue jobs from the lab.msi.umn.edu interactive node using the instructions below.

Lab Queue Policies (maximum limits per job)

QUEUE             SUBMIT WITH             NODE LIMIT   CORE LIMIT   MEMORY LIMIT   RUNTIME LIMIT   JOB LIMIT (running jobs per user)
Default Queue     -q lab                  1 node       32 cores     128GB          72hr            Up to 6
Long Queue        -q lab-long             1 node       8 cores      15GB           150hr           Up to 6
600 Hour Queue    -q lab-600              1 node       8 cores      128GB          600hr           Up to 1
Overclock         -q oc                   1 node       12 cores     23GB           72hr            Up to 3
isub Defaults     isub (no parameters)    1 node       1 core       2GB            2hr             Up to 6
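
The per-job limits above can be turned into a quick check before you submit. The following is a minimal sketch, not an MSI-provided tool: pick_queue is a hypothetical helper name, and it only considers the lab, lab-long, and lab-600 queues, using the core, memory, and runtime limits from the policy table.

```shell
#!/bin/bash
# pick_queue HOURS CORES MEM_GB
# Hypothetical helper: prints the first queue whose per-job limits
# (copied from the policy table) cover the request.
pick_queue() {
    local hours=$1 cores=$2 mem_gb=$3
    if   (( hours <= 72  && cores <= 32 && mem_gb <= 128 )); then echo lab
    elif (( hours <= 150 && cores <= 8  && mem_gb <= 15  )); then echo lab-long
    elif (( hours <= 600 && cores <= 8  && mem_gb <= 128 )); then echo lab-600
    else
        echo "no lab queue fits this request" >&2
        return 1
    fi
}

pick_queue 24 16 64     # prints: lab
pick_queue 100 4 10     # prints: lab-long
pick_queue 300 8 100    # prints: lab-600
```

A request that fits none of these queues returns a non-zero status, which is the point at which to email help@msi.umn.edu.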

Lab Compute Node Details

You do not need to request a particular compute node. The nodes are listed here only to describe the resources available, so that you can request an appropriate amount of resources for your job.

Node Details

Nodes              Model                          CPU cores                        Memory per node
labh01             PowerEdge R900                 24 Intel Xeon 2.67 GHz           128GB
labh02             SunFire X4440                  16 AMD Opteron 8384 2.7 GHz      128GB
labh03 - labh08    SunFire X4600                  32 AMD Opteron 8356 2.3 GHz      128GB
labq01 - labq64    Altix XE 310                   8 Intel Xeon 2.66 GHz            16GB
laboc01 - laboc04  LiquidCool Liquid Submerged    12 Intel Xeon X5690 @ 4.1 GHz    24GB

How to Create a Job Submission Script

This example is for the Lab queue. First log in to lab.msi.umn.edu, then write a PBS batch script like the example below, substituting your own group and username in the directory path.

The following PBS script requests one hour of walltime on a single processor of a single node with 1 GB of memory. You can load any modules you need and run any software that can operate in batch mode, including your own code as in this example. Save the script as script.pbs.

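#!/bin/bash -l
# Request 1 core on 1 node, 1 GB of memory, and 1 hour of walltime.
#PBS -l nodes=1:ppn=1,mem=1gb,walltime=01:00:00
# Send email when the job aborts (a), begins (b), and ends (e).
#PBS -m abe
# Change to the directory that holds the program and its input.
cd /home/mygroup/username/Testpbs
# Load the intel module.
module load intel
# Run the program, reading input.dat and writing output.dat.
./test < input.dat > output.dat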
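
The queue can also be named inside the script with a #PBS -q directive. The sketch below writes a hypothetical variant of the script for the lab-long queue and syntax-checks it without running it. The file name script_long.pbs and the resource values (4 cores, 8 GB, 100 hours) are assumptions chosen to stay inside the lab-long limits. The program ./test and its data files are carried over from the example above.

```shell
#!/bin/bash
# Write a hypothetical submission script for the lab-long queue into a
# scratch directory. Adjust the resource values for your own job.
workdir=$(mktemp -d)
cat > "$workdir/script_long.pbs" <<'EOF'
#!/bin/bash -l
#PBS -q lab-long
#PBS -l nodes=1:ppn=4,mem=8gb,walltime=100:00:00
#PBS -m abe
# PBS_O_WORKDIR is set by PBS to the directory qsub was run from.
cd "$PBS_O_WORKDIR"
module load intel
./test < input.dat > output.dat
EOF

# Syntax-check the script without executing it.
bash -n "$workdir/script_long.pbs" && echo "script_long.pbs: syntax OK"
```

Using PBS_O_WORKDIR instead of a hard-coded path means the script runs in whichever directory you submit it from.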

How to Submit a Job

You may use PBS to submit jobs from lab workstations (for the lab queue), or from the interactive node: lab.msi.umn.edu.

Use the command qsub to submit a job to the queuing system. qsub takes a job submission script that contains special commands telling PBS what resources are needed. It also contains the commands necessary to run the submitted job.

For example, if you wrote your submission script for the regular lab queue, log in to lab.msi.umn.edu and submit it as follows:

qsub script.pbs

To submit to a different queue from lab.msi.umn.edu, name the queue with the -q option:

qsub -q lab-long script.pbs
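
On success, qsub prints the identifier of the new job, in a form such as 98736.nokomis0015.msi.umn.edu. The sketch below captures that identifier and keeps only the numeric part. short_id is a hypothetical helper name, and the qsub call is guarded so the snippet does nothing harmful on a machine without PBS.

```shell
#!/bin/bash
# short_id FULL_JOB_ID
# Hypothetical helper: strips the server suffix, leaving the numeric job ID.
short_id() { echo "${1%%.*}"; }

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the new job's identifier on standard output.
    full_id=$(qsub -q lab-long script.pbs)
    echo "submitted job $(short_id "$full_id")"
else
    echo "qsub not found; run this on lab.msi.umn.edu" >&2
fi
```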

How to Check Job Status

You can check job status using the showq command.

Here is an example showq session showing two running jobs and two queued jobs. The "PROCS" field shows the number of processors requested by the user, and the "STATE" field gives the status of the job, such as Running, Idle, or Blocked. For more information, refer to the showq man page.

username@interactive [~/test] % showq -u username

active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME

98736 username Running 16 1:23:31:53 Thu Mar 31 14:32:38
98737 username Running 16 1:23:33:02 Thu Mar 31 14:33:47

2 active jobs 32 of 808 processors in use by local jobs (3.96%)
 26 of 112 nodes active (23.21%)

eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

98743 username Idle 16 2:00:00:00 Thu Mar 31 14:32:22
98742 username Idle 16 2:00:00:00 Thu Mar 31 14:32:16


2 eligible jobs 

blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

0 blocked jobs Total jobs: 4
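
If you have many jobs, a per-state summary can be easier to read than the full listing. The following sketch counts a user's jobs by the STATE column. count_states is a hypothetical helper name, and it assumes the column layout shown in the session above (job ID, username, state). The sample rows are copied from that session so the snippet runs without showq.

```shell
#!/bin/bash
# count_states USER
# Reads showq output on stdin and prints "STATE COUNT" for USER's jobs.
# Job rows are recognised by a numeric job ID followed by the username,
# which skips the header and summary lines.
count_states() {
    awk -v u="$1" '$1 ~ /^[0-9]+$/ && $2 == u { n[$3]++ }
                   END { for (s in n) print s, n[s] }' | sort
}

# On the lab system you would run:  showq -u username | count_states username
count_states username <<'EOF'
98736 username Running 16 1:23:31:53 Thu Mar 31 14:32:38
98737 username Running 16 1:23:33:02 Thu Mar 31 14:33:47
2 active jobs 32 of 808 processors in use by local jobs (3.96%)
98743 username Idle 16 2:00:00:00 Thu Mar 31 14:32:22
98742 username Idle 16 2:00:00:00 Thu Mar 31 14:32:16
EOF
# prints:
# Idle 2
# Running 2
```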

To see more detail about a job, for example job 181271, run "checkjob -v 181271", which produces output like the following:

username@interactive [~/test] % checkjob -v 181271
job 181271 (RM job '181271.nokomis0015.msi.umn.edu')

AName: STDIN
State: Running
Creds:  user:myuser  group:mygroup  class:lab
WallTime:   00:03:42 of 2:00:00
SubmitTime: Fri Jul 25 13:13:14
  (Time Queued  Total: 00:00:01  Eligible: 00:00:00)

StartTime: Fri Jul 25 13:13:15
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1

Req[0]  TaskCount: 1  Partition: nokomis0015
Dedicated Resources Per Task: PROCS: 1  MEM: 2000M  SWAP: 4000M
Utilized Resources Per Task:  PROCS: 0.08  MEM: 416M  SWAP: 573M
Avg Util Resources Per Task:  PROCS: 0.08
Max Util Resources Per Task:  PROCS: 0.18  MEM: 416M  SWAP: 573M
Average Utilized Memory: 289.82 MB
Average Utilized Procs: 0.27
TasksPerNode: 1  NodeCount:  1

Allocated Nodes:
[labq21.msi.umn.edu:1]

SystemID:   Moab
SystemJID:  181271
Notification Events: JobFail
Task Distribution: labq21.msi.umn.edu

UMask:          0000
OutputFile:     /dev/pts/0
ErrorFile:      /dev/pts/0
StartCount:     1
System Available Partition List: nokomis0015,labqpbs,labpar,gputpar,oc,galaxy
Partition List: nokomis0015,labqpbs,labpar,gputpar,oc,galaxy
SrcRM:          nokomis0015  DstRM: nokomis0015  DstRMJID: 181271.nokomis0015.msi.umn.edu
Submit Args:    -I -l nallocpolicy=cpuload -m a -q lab -l walltime=02:00:00 -l nodes=1:ppn=1 -l mem=2000MB,vmem=4000MB
Flags:          BACKFILL,INTERACTIVE,FSVIOLATION
Attr:           BACKFILL,INTERACTIVE,FSVIOLATION,checkpoint
StartPriority:  -96195
PE:             1.00
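
The "Dedicated Resources" and "Max Util Resources" lines show how much memory a job requested and how much it has actually used, which helps when choosing the mem value for later submissions. The sketch below pulls out those two figures. mem_usage is a hypothetical helper name, and it assumes the line labels appear exactly as in the output above.

```shell
#!/bin/bash
# mem_usage
# Reads "checkjob -v" output on stdin and prints the requested and peak
# memory per task, taken from the value that follows each "MEM:" label.
mem_usage() {
    awk '
        function mem_of(    i) {
            for (i = 1; i < NF; i++) if ($i == "MEM:") return $(i + 1)
            return "?"
        }
        /^Dedicated Resources Per Task:/ { req = mem_of() }
        /^Max Util Resources Per Task:/  { peak = mem_of() }
        END { print "requested=" req, "peak=" peak }
    '
}

# On the lab system you would run:  checkjob -v 181271 | mem_usage
mem_usage <<'EOF'
Dedicated Resources Per Task: PROCS: 1  MEM: 2000M  SWAP: 4000M
Utilized Resources Per Task:  PROCS: 0.08  MEM: 416M  SWAP: 573M
Max Util Resources Per Task:  PROCS: 0.18  MEM: 416M  SWAP: 573M
EOF
# prints: requested=2000M peak=416M
```

In this sample the job has peaked at 416M of the 2000M it requested, so a smaller mem request would likely have been enough.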




The checkjob command can also be run without -v for a shorter report, and the showstart command reports the estimated start time of a queued job. Both take a job ID. The syntax of these commands is:

checkjob <jobid>

e.g.: checkjob 689723

showstart <jobid>

e.g.: showstart 689723

How to Remove or Kill Jobs

Sometimes you may wish to stop a job before it ends on its own. Jobs are killed or removed from the queuing system by using the qdel command. There is a man page for qdel that lists the options you can use with it. If you wish to kill a running job, for example 98736.nokomis0015.msi.umn.edu, type qdel 98736 at the command line.
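
To remove several queued jobs at once, you can select their IDs from showq output and pass each one to qdel. The following is a cautious sketch: job_ids_in_state is a hypothetical helper name, it assumes the showq column layout shown earlier, and it only prints the qdel commands so you can review them before running anything.

```shell
#!/bin/bash
# job_ids_in_state USER STATE
# Reads showq output on stdin and prints the IDs of USER's jobs in STATE.
job_ids_in_state() {
    awk -v u="$1" -v s="$2" \
        '$1 ~ /^[0-9]+$/ && $2 == u && $3 == s { print $1 }'
}

# Dry run: print one qdel command per idle job instead of executing it.
if command -v showq >/dev/null 2>&1; then
    showq -u "$USER" | job_ids_in_state "$USER" Idle | sed 's/^/qdel /'
else
    echo "showq not found; run this on lab.msi.umn.edu" >&2
fi
```

Once the printed list looks right, run the qdel commands yourself.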

Interactive Jobs

Interactive jobs can be run on the lab queue via isub.
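
The options that isub itself accepts are not listed here. As an assumption based on the "Submit Args" line in the checkjob output above, an isub session appears to correspond to a qsub -I (interactive) request. The sketch below builds such a request with the isub default limits from the policy table (1 core, 2GB, 2 hours) and prints it rather than starting a session.

```shell
#!/bin/bash
# Interactive request mirroring the isub defaults from the policy table.
# This qsub form is an assumption, not the documented behaviour of isub.
interactive_cmd=(qsub -I -q lab -l nodes=1:ppn=1,mem=2gb,walltime=02:00:00)

# Print the command for review.
printf '%s ' "${interactive_cmd[@]}"; echo

# To start the session on lab.msi.umn.edu, run:
#   "${interactive_cmd[@]}"
```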