Job Queues

MSI uses job scheduling queues to share its computing resources efficiently and fairly. The queues available on our systems often manage different sets of hardware and have different limits for quantities such as walltime, available processors, and available memory. When submitting a calculation it is important to choose a queue whose hardware and resource limits suit the job.

Below is a summary of the available queues and their limits, organized by system. The quantities listed are totals or upper limits.

In the tables below: ppn= is the number of processor cores per node; walltime= is the wallclock limit; mem= is the memory limit per node (for multi-node jobs, multiply by the number of nodes requested); pmem= is the per-core memory limit; Scratch is the local scratch space per node; Run is the limit on simultaneously running jobs (soft limit / hard limit where two values are given); Idle is the number of simultaneously idle jobs gaining priority in the queue.

Itasca

Itasca is an HP Linux cluster. Most nodes use Intel Xeon 5560 "Nehalem EP" processors, while the "Sandy Bridge" (sb) nodes use Intel Xeon E5-2670 processors.

Queue            | Nodes (cores)     | ppn= | walltime= | mem=  | pmem=   | Scratch | Run   | Idle
batch (default)  | 1086 (8688 cores) | 8    | 24 hours  | 22gb  | 2750mb  | 90 GB   | 2 / 5 | 8
devel (-q devel) | 32 (256 cores)    | 8    | 2 hours   | 22gb  | 2750mb  | 90 GB   |       |
long (-q long)   | 28 (224 cores)    | 8    | 48 hours  | 22gb  | 2750mb  | 90 GB   |       |
sb (-q sb)       | 35 (560 cores)    | 16   | 48 hours  | 60gb  | 3750mb  | 112 GB  |       |
sb128 (-q sb128) | 8 (128 cores)     | 16   | 96 hours  | 120gb | 7500mb  | 534 GB  |       |
sb256 (-q sb256) | 8 (128 cores)     | 16   | 96 hours  | 240gb | 15000mb | 534 GB  |       |

Service Unit (SU) rate: 1.5 CPU hours / SU
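For example, assuming SUs are charged per processor-core hour at this rate, a job that uses four batch nodes (32 cores) for 10 hours consumes 320 CPU hours, or roughly 213 SUs.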

Node sharing is not allowed on Itasca; jobs must use whole nodes. Itasca jobs should always request 8 processors per node (ppn=8) in the batch, devel, and long queues, and 16 processors per node (ppn=16) in the sb, sb128, and sb256 queues. Special compiler optimization options may give better performance on the Sandy Bridge nodes, as described on the ItascaSB webpage.
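
For reference, a whole-node job in the Itasca batch queue could be requested with a PBS script along the lines of the sketch below; the resource values and the program name are illustrative assumptions, not a prescription.

    #!/bin/bash -l
    #PBS -l nodes=2:ppn=8         # two whole Itasca nodes (always ppn=8 in batch, devel, and long)
    #PBS -l walltime=04:00:00     # must stay within the 24 hour batch limit
    #PBS -l pmem=2750mb           # per-core memory request
    #PBS -q batch                 # optional; batch is the default queue

    cd $PBS_O_WORKDIR             # run from the directory the job was submitted from
    mpirun -np 16 ./my_program    # hypothetical MPI executable; 2 nodes x 8 cores = 16 ranks

The script would then be submitted with qsub (for example, qsub myjob.pbs).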

In addition to the local /scratch directories, Itasca has a high-performance Lustre filesystem for temporary files. It is located at /lustre and is accessible from all Itasca nodes. Unused temporary files are subject to deletion after a period of time, as described on the Scratch and Temporary Space webpage.
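
As a sketch, a job might stage temporary files on the Lustre filesystem along the following lines; the per-job directory layout under /lustre is an assumption, so check the Scratch and Temporary Space webpage for the actual conventions.

    SCRATCHDIR=/lustre/$USER/$PBS_JOBID   # assumed per-job scratch directory; adjust to local policy
    mkdir -p $SCRATCHDIR
    cd $SCRATCHDIR
    ./my_program > output.log             # hypothetical executable writing its temporary files here
    cp output.log $PBS_O_WORKDIR/         # copy results back before the job ends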

Calhoun

Calhoun is an SGI Linux cluster using Intel Xeon Clovertown-class processors.

Queue              | Nodes (cores)    | ppn= | walltime= | mem= | pmem=  | Scratch | Run      | Idle
batch (default)    | 180 (1440 cores) | 8    | 48 hours  | 14gb | 1750mb | 174 GB  | No limit | No limit
devel (-q devel)   | 8 (64 cores)     | 8    | 2 hours   | 14gb | 1750mb | 174 GB  |          |
medium (-q medium) | 64 (512 cores)   | 8    | 96 hours  | 14gb | 1750mb | 174 GB  |          |
long (-q long)     | 16 (128 cores)   | 8    | 192 hours | 14gb | 1750mb | 174 GB  |          |
max (-q max)       | 2 (16 cores)     | 8    | 600 hours | 14gb | 1750mb | 174 GB  |          |

Service Unit (SU) rate: 3.5 CPU hours / SU
Calhoun allows node sharing, so fractions of a node may be requested (ppn values other than 8 may be used).
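
For instance, a job needing only part of a Calhoun node might be requested roughly as follows (the values are illustrative):

    #PBS -l nodes=1:ppn=4         # 4 of the 8 cores on one Calhoun node
    #PBS -l walltime=12:00:00     # within the 48 hour batch limit
    #PBS -l pmem=1750mb           # per-core memory request
    #PBS -q batch                 # optional; batch is the default queue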

Cascade

Cascade is a heterogeneous cluster containing both conventional CPUs and accelerators (NVIDIA GPU cards and Intel Phi coprocessors).

Queue              | Nodes (cores, accelerators)         | ppn= | Accelerators/node  | walltime= | mem=  | pmem=  | Scratch | Run | Idle
cascade (default)  | 8 (96 CPU cores, 32 GPGPU cards)    | 12   | 4 Tesla GPU cards  | 120 hours | 90gb  | 7500mb | 450 GB  | 4   | No limit
phi (-q phi)       | 2 (32 CPU cores, 2 Intel Phi cards) | 16   | 1 Phi coprocessor  | 24 hours  | 124gb | 7750mb | 880 GB  |     |
kepler (-q kepler) | 4 (32 CPU cores, 8 GPGPU cards)     | 16   | 2 Kepler GPU cards | 24 hours  | 124gb | 7750mb | 880 GB  |     |

Service Unit (SU) rate: 1.5 CPU hours / SU
There is currently no additional SU charge for GPU accelerator use.
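
The sketch below shows one way a Cascade GPU job might be requested. The gpus= node attribute is standard TORQUE syntax, but the exact accelerator request syntax used on Cascade may differ, so treat it (and the program name) as an assumption.

    #PBS -l nodes=1:ppn=12:gpus=4   # one cascade-queue node with its 4 Tesla cards (gpus= syntax assumed)
    #PBS -l walltime=24:00:00       # within the 120 hour cascade limit
    #PBS -q cascade                 # optional; cascade is the default queue

    cd $PBS_O_WORKDIR
    ./my_gpu_program                # hypothetical GPU-enabled executable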

Lab Servers (isub)

Queue                  | Nodes (cores) | ppn= | walltime= | mem=  | pmem=  | Scratch | Run | Idle
lab (default)          | 1 (32 cores)  | 32   | 72 hours  | 120gb | 3750mb | 100 GB  | 6   | 8
lab-long (-q lab-long) | 1 (8 cores)   | 8    | 150 hours | 13gb  | 1625mb | 100 GB  | 6   | 8
lab-600 (-q lab-600)   | 1 (8 cores)   | 8    | 600 hours | 128gb | 1625mb | 100 GB  | 1   | 8
oc (-q oc)             | 1 (12 cores)  | 12   | 72 hours  | 22gb  | 1800mb | 100 GB  | 3   | 8

The Lab Servers are intended for smaller jobs, and calculations on the lab servers do not consume Service Units (SUs). In some cases more nodes are physically present than listed here, but each job may request only a single node, so the table represents the queue submission limits for individual jobs.
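
Assuming batch submission with qsub works the same way as on the other systems, a minimal script for the default lab queue might look like the following sketch; the resource values and program name are illustrative.

    #PBS -l nodes=1:ppn=32        # the full 32-core lab node
    #PBS -l walltime=48:00:00     # within the 72 hour lab limit
    #PBS -l pmem=3750mb           # per-core memory request
    #PBS -q lab                   # optional; lab is the default queue on the lab servers

    cd $PBS_O_WORKDIR
    ./my_program                  # hypothetical executable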

The "overclock" (oc) queue has liquid cooled processors operating in an overclocked state which may give performance benefits for certain types of 1 or 2 core calculations as described on the Lab Overclock webpage.