Job Queues

MSI uses job scheduling queues to share its systems efficiently and fairly. The queues available on our systems often manage different sets of hardware and have different limits for quantities such as walltime, available processors, and available memory. When submitting a calculation, it is important to choose a queue whose hardware and resource limits suit the job.

Below is a summary of the available queues, organized by system, along with their associated limits.
The quantities listed are totals or upper limits.

Mesabi

Mesabi is an HP Linux cluster; most nodes use Intel Xeon E5-2680v3 (Haswell) processors.

| Queue name | Node Sharing | Max Nodes Per Job | Min Nodes Per Job | Processor Cores per Node (ppn=) | Wallclock Limit (walltime=) | Total Node Memory Limit | Per-core Memory Limit (pmem=) | Local Scratch (GB/node) | Per User Limits | Per Group Limits |
|---|---|---|---|---|---|---|---|---|---|---|
| small | Yes | 9 | None | 24 | 96 hours | 62gb | 2580mb | 390 GB | None | 1800 total cores |
| large | No | 48 | 10 | 24 | 24 hours | 62gb | 2580mb | 390 GB | 4 Jobs | 16 Jobs |
| widest | No | 360 | 49 | 24 | 24 hours | 62gb | 2580mb | 390 GB | 4 Jobs | 16 Jobs |
| max | Yes | 1 (single core per job) | None | 1 | 696 hours | 62gb | 62gb | 390 GB | 4 Jobs | 16 Jobs |
| ram256g | Yes | 2 | None | 24 | 96 hours | 252gb | 10580mb | 390 GB | 2 nodes | 1800 total cores |
| ram1t | Yes | 2 | None | 32 | 96 hours | 998gb | 31180mb | 390 GB | 2 nodes | 1800 total cores |
| k40 (GPU nodes*) | No | 40 | None | 24 | 24 hours | 126gb | 5290mb | 390 GB | None | 1800 total cores |

The default queue, mesabi, is a meta-queue that automatically routes jobs to the small, large, widest, or max queue, according to where each job best fits based on its resource request.
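
For illustration, below is a minimal sketch of a job script for the small queue; the module name, email address, and executable are placeholders, and the resource values are taken from the table above. Omitting the -q line submits the job to the default mesabi meta-queue instead.

```bash
#!/bin/bash -l
#PBS -l nodes=1:ppn=24,walltime=24:00:00,pmem=2580mb
#PBS -q small
#PBS -m abe
#PBS -M username@umn.edu       # placeholder address for job status email

cd $PBS_O_WORKDIR              # start in the directory the job was submitted from
module load intel              # placeholder module
./my_program                   # placeholder executable
```

The script would then be submitted with qsub (for example, qsub myjob.pbs).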

Service Unit (SU) rate: 1.5 CPU hours / SU
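
For example, at this rate a job that uses 24 cores for 10 hours consumes 24 x 10 = 240 CPU hours, which corresponds to 240 / 1.5 = 160 SUs.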

*Note: The k40 queue is for calculations that perform GPU computations. Each k40 node contains two NVIDIA K40m GPUs.
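
A k40 job also needs to request the GPUs in its node specification. The sketch below assumes the standard TORQUE gpus= node property is used for this; the module name and executable are placeholders.

```bash
#!/bin/bash -l
#PBS -l nodes=1:ppn=24:gpus=2,walltime=24:00:00    # gpus= node property is an assumption
#PBS -q k40

cd $PBS_O_WORKDIR
module load cuda               # placeholder module
./my_gpu_program               # placeholder executable
```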

Note: The ram1t nodes contain Intel Ivy Bridge processors, which do not support all of the optimized instructions available on the Haswell processors. Programs compiled with Haswell-specific instructions will run only on Haswell processors.
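
One way to handle this, if the Intel compilers are used, is to build a binary with a portable baseline instruction set plus an automatically dispatched Haswell code path, roughly as follows (the source and output names are placeholders):

```bash
# Baseline AVX code runs on the Ivy Bridge ram1t nodes; the -ax option adds
# an AVX2 code path that is selected automatically on the Haswell nodes.
icc -O3 -xAVX -axCORE-AVX2 -o my_program my_program.c
```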

Note: The 1800-core limit applies collectively to all of a group's jobs in the small, ram256g, ram1t, and k40 queues. For example, a group simultaneously using 1798 cores in the small queue and 2 cores in the ram1t queue could run no further simultaneous jobs in the small, ram256g, ram1t, or k40 queues.

Itasca

Itasca is an HP Linux cluster. Most nodes use Intel Xeon 5560 (Nehalem EP) processors, while the "Sandy Bridge" (sb) nodes use Intel Xeon E5-2670 processors.

| Queue name | Number of Nodes | Processor Cores per Node (ppn=) | Wallclock Limit (walltime=) | Total Node Memory Limit | Per-core Memory Limit (pmem=) | Local Scratch (GB/node) | Per User Running Jobs (soft limit / hard limit) | Per User Idle Jobs (gaining priority in queue) |
|---|---|---|---|---|---|---|---|---|
| batch (default) | 1086 nodes (8688 cores) | 8 | 24 hours | 22gb | 2750mb | 90 GB | 2 / 5 | 8 |
| devel | 32 nodes (256 cores) | 8 | 2 hours | 22gb | 2750mb | 90 GB | | |
| long | 28 nodes (224 cores) | 8 | 48 hours | 22gb | 2750mb | 90 GB | | |
| sb | 35 nodes (560 cores) | 16 | 48 hours | 62gb | 3875mb | 112 GB | | |
| sb128 | 8 nodes (128 cores) | 16 | 96 hours | 126gb | 7875mb | 534 GB | | |
| sb256 | 8 nodes (128 cores) | 16 | 96 hours | 254gb | 15875mb | 534 GB | | |

Service Unit (SU) rate: 1.5 CPU hours / SU

On Itasca node sharing is not allowed; Itasca jobs must use whole nodes. Itasca jobs should always request 8 processors per node (ppn=8) in the batch, devel, and long queues, and 16 processors per node (ppn=16) in the sb, sb128, and sb256 queues. Special compiler optimization options may give better performance on the Sandy Bridge nodes, as described on the ItascaSB webpage.
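
As a sketch, a whole-node MPI job in the sb queue could request resources as follows; the module and executable names are placeholders.

```bash
#!/bin/bash -l
#PBS -l nodes=2:ppn=16,walltime=48:00:00,pmem=3875mb
#PBS -q sb

cd $PBS_O_WORKDIR
module load intel ompi/intel       # placeholder modules
mpirun -np 32 ./my_mpi_program     # 2 nodes x 16 cores per node = 32 MPI ranks
```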

Lab Servers (isub)

| Queue name | Number of Nodes | Processor Cores per Node (ppn=) | Wallclock Limit (walltime=) | Total Node Memory Limit (mem=) | Per-core Memory Limit (pmem=) | Local Scratch (GB/node) | Per User Running Jobs | Per User Idle Jobs (gaining priority in queue) |
|---|---|---|---|---|---|---|---|---|
| lab (default) | 1 node (8 cores) | 8 | 24 hours | 15gb | 1850mb | 100 GB | 6 | 8 |
| lab-long (-q lab-long) | 1 node (8 cores) | 8 | 150 hours | 15gb | 1850mb | 100 GB | 6 | 8 |
| lab-600 (-q lab-600) | 1 node (8 cores) | 8 | 600 hours | 15gb | 1850mb | 100 GB | 1 | 8 |
| oc (-q oc) | 1 node (12 cores) | 12 | 72 hours | 22gb | 1800mb | 100 GB | 3 | 8 |

The Lab Servers are for smaller jobs. Calculations on the lab servers do not consume Service Units (SUs). In some cases more nodes are physically present than listed here, but each job may request only a single node, so the table gives the queue submission limits for individual jobs.
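
For example, a minimal sketch of a lab-long job script (the executable is a placeholder; the queue is selected with -q as listed in the table above):

```bash
#!/bin/bash -l
#PBS -l nodes=1:ppn=8,walltime=150:00:00,mem=15gb
#PBS -q lab-long

cd $PBS_O_WORKDIR
./my_program                   # placeholder executable
```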

The "overclock" (oc) queue has liquid cooled processors operating in an overclocked state which may give performance benefits for certain types of 1 or 2 core calculations as described on the Lab Overclock webpage.