Choosing a Job Queue

Note: This page contains guidelines both for choosing a job queue under MSI's current job scheduler, PBS/TORQUE, and for choosing a partition under MSI's new job scheduler, Slurm. The migration of MSI systems to the Slurm scheduler will take place over Quarter 4 of 2020, and PBS/TORQUE will be discontinued on January 6th, 2021. Click on the following links to jump to the section for the scheduler you are using to submit your job.

Choosing a Job Queue (PBS/TORQUE) 

Summary

Most MSI systems use job queues to efficiently and fairly manage when computations are executed. A job queue is an automated waiting list for use of a particular set of computational hardware. When computational jobs are submitted to a job queue, they wait in line until the appropriate resources become available. Different job queues have different resources and limitations. When submitting a job, it is very important to choose a job queue whose resources and limitations suit the particular calculation.

This document outlines factors to consider when choosing a job queue. It applies to all MSI systems and is best used in conjunction with the Queues page, which outlines the resource limitations for each queue.

Please note that Mesabi's "widest" queue requires special permission to use. Please submit your code for review to help@msi.umn.edu.

Guidelines

There are several important factors to consider when choosing a job queue for a specific program or custom script. In most cases, jobs are submitted via PBS scripts as described in Job Submission and Scheduling.
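
As a point of reference, a minimal PBS script might look like the following sketch; the queue name, module name, and program are placeholders for illustration and should be replaced with the values appropriate to your job:

#!/bin/bash -l
#PBS -l walltime=8:00:00,nodes=1:ppn=8,mem=16gb
#PBS -q small
cd $PBS_O_WORKDIR
module load mymodule
./myprogram < input.txt > output.txt

The script is then submitted with "qsub scriptname".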

Overall System

Each MSI system contains job queues managing sets of hardware with different resource and policy limitations. MSI currently has two primary systems: the supercomputer Mesabi and Mesabi's expansion, Mangi. Mesabi has a wide variety of queues suitable for many different job types. Mangi is a heterogeneous system suitable for even more job types and should be your first choice for any computation at MSI. The Mesabi interactive queue is primarily used for testing and for interactive, graphical software. Which system to choose depends largely on which system has queues appropriate for your software or script. Examine the Queues page to determine the most appropriate system.

Job Walltime (walltime=)

The job walltime is the time from the start to the finish of a job (as you would measure it using a clock on a wall), not including time spent waiting to run. This is in contrast to cputime, which measures the cumulative time all cores spent working on a job. Different job queues have different walltime limits, and it is important to choose a queue with a walltime limit high enough for your job to complete. Jobs that exceed the requested walltime are killed by the system to make room for other jobs. Walltime limits are maximums only; you can always request a shorter walltime, which will reduce the amount of time your job waits in the queue before it starts. If you are unsure how much walltime your job will need, start with the queues with shorter walltime limits and only move to others if needed.
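
For example, a PBS job expected to finish within eight hours could request:

#PBS -l walltime=8:00:00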

Job Nodes and Cores (nodes=X:ppn=Y)

Many calculations can use multiple cores (ppn), or (less often) multiple nodes, to improve calculation speed. Certain job queues have maximum or minimum values for the number of nodes and cores a job may use. If Node Sharing is enabled for a queue, you can request fewer cores (ppn) than exist on an entire node. If Node Sharing is not enabled, you must request resources equivalent to a multiple of an entire node. Mesabi's widest and large queues do not allow Node Sharing.
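
For example, assuming a queue whose nodes have 24 cores each, requesting two entire nodes looks like:

#PBS -l nodes=2:ppn=24

while a four-core request on a queue with Node Sharing enabled looks like:

#PBS -l nodes=1:ppn=4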

Job Memory (mem=)

The memory a job requires is an important factor when choosing a queue. The largest amount of memory (RAM) that can be requested for a job is limited by the memory on the hardware associated with that queue. Mesabi has two queues (ram256g and ram1t) with high-memory hardware; the largest-memory hardware is available through the ram1t queue.
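
For example, a job needing roughly 500 GB of memory could request:

#PBS -l mem=500gb

and would need to be submitted to a high-memory queue such as ram1t.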

User and Group Limitations

To efficiently share resources, many queues have limits on the number of jobs or cores a particular user or group may simultaneously use. If a workflow requires many jobs to complete, it can be helpful to choose queues which will allow many jobs to run simultaneously. 

Special Hardware

Some queues contain nodes with special hardware, GPU accelerators and solid-state scratch drives being the most common. If a calculation needs to use special hardware, then it is important to choose a queue with the correct hardware available. Furthermore, those queues may require additional resources to be specified (e.g., GPU nodes require ":gpus=X").
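
For example, a request for one node with four cores and one GPU, submitted to a GPU queue (the queue name here is illustrative), might look like:

#PBS -l nodes=1:ppn=4:gpus=1
#PBS -q k40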

Queue Congestion

At certain times particular queues may become overloaded with submitted jobs. In such a case, it can be helpful to send jobs to queues with lower utilization (node status). Sending jobs to lower utilization queues can decrease wait time and improve throughput. Care must be taken to make sure calculations will fit within queue limitations.
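
One way to gauge congestion from the command line is to list the queues along with their counts of running and queued jobs, for example:

qstat -q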

 

Choosing a Partition (Slurm)

Summary

Most MSI systems use partitions to efficiently and fairly manage when computations are executed. A partition is an automated waiting list for use of a particular set of computational hardware. When computational jobs are submitted to a partition, they wait in line until the appropriate resources become available. Different partitions have different resources and limitations. When submitting a job, it is very important to choose a partition whose resources and limitations suit the particular calculation.
 
This document outlines factors to consider when choosing a partition. It applies to all MSI systems and is best used in conjunction with the Partitions page, which outlines the resource limitations for each partition.

Guidelines

There are several important factors to consider when choosing a partition for a specific program or custom script. In most cases, jobs are submitted via Slurm scripts as described in Job Submission and Scheduling (Slurm).
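
As a point of reference, a minimal Slurm batch script might look like the following sketch; the partition name, module name, and program are placeholders for illustration and should be replaced with the values appropriate to your job:

#!/bin/bash -l
#SBATCH --time=8:00:00
#SBATCH --ntasks=8
#SBATCH --mem=16gb
#SBATCH -p small
module load mymodule
./myprogram < input.txt > output.txt

The script is then submitted with "sbatch scriptname".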

Overall System

Each MSI system contains partitions managing sets of hardware with different resource and policy limitations. MSI currently has two primary systems: the supercomputer Mesabi and Mesabi's expansion, Mangi. Mesabi has a wide variety of partitions suitable for many different job types. Mangi is a heterogeneous system suitable for even more job types. The Mesabi interactive partition is primarily used for testing and for interactive, graphical software. Which system to choose depends largely on which system has partitions appropriate for your software or script. Examine the Partitions page to determine the most appropriate system.

Job Walltime (--time=)

The job walltime is the time from the start to the finish of a job (as you would measure it using a clock on a wall), not including time spent waiting to run. This is in contrast to cputime, which measures the cumulative time all cores spent working on a job. Different partitions have different walltime limits, and it is important to choose a partition with a walltime limit high enough for your job to complete. Jobs that exceed the requested walltime are killed by the system to make room for other jobs. Walltime limits are maximums only; you can always request a shorter walltime, which will reduce the amount of time your job waits in the partition before it starts. If you are unsure how much walltime your job will need, start with the partitions with shorter walltime limits and only move to others if needed.
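
For example, a job expected to finish within 24 hours could request:

#SBATCH --time=24:00:00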

Job Nodes and Cores (--nodes= and --ntasks= )

Many calculations can use multiple cores, or (less often) multiple nodes, to improve calculation speed. Certain partitions have maximum or minimum values for the number of nodes and cores a job may use. If Node Sharing is enabled for a partition, you can request fewer cores than exist on an entire node. If Node Sharing is not enabled, you must request resources equivalent to a multiple of an entire node. Mesabi's large partition does not allow Node Sharing.
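
For example, to request eight cores confined to a single node:

#SBATCH --nodes=1
#SBATCH --ntasks=8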

Job Memory (--mem=)

The memory a job requires is an important factor when choosing a partition. The largest amount of memory (RAM) that can be requested for a job is limited by the memory on the hardware associated with that partition. Mesabi has two partitions (ram256g and ram1t) with high-memory hardware; the largest-memory hardware is available through the amd2tb partition.
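
For example, a job needing roughly 500 GB of memory could request:

#SBATCH --mem=500gb

and would need to be submitted to a high-memory partition such as ram1t or amd2tb.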

User and Group Limitations

To efficiently share resources, many partitions have limits on the number of jobs or cores a particular user or group may simultaneously use. If a workflow requires many jobs to complete, it can be helpful to choose partitions which will allow many jobs to run simultaneously. 

Special Hardware

Some partitions contain nodes with special hardware, GPU accelerators and solid-state scratch drives being the most common. If a calculation needs to use special hardware, then it is important to choose a partition with the correct hardware available. Furthermore, those partitions may require additional resources to be specified (e.g., V100 GPU nodes require "--gres=gpu:v100:1").
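
For example, a request for a single V100 GPU, submitted to a GPU partition (the partition name here is illustrative), might look like:

#SBATCH -p v100
#SBATCH --gres=gpu:v100:1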

Partition Congestion

At certain times particular partitions may become overloaded with submitted jobs. In such a case, it can be helpful to send jobs to partitions with lower utilization (node status). Sending jobs to lower utilization partitions can decrease wait time and improve throughput. Care must be taken to make sure calculations will fit within partition limitations.
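
From the command line, sinfo summarizes the state of the nodes in a partition and squeue lists the jobs running or waiting in it; for example (the partition name here is illustrative):

sinfo -p small
squeue -p small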

Preemptable Partitions 

The preempt and preempt-gpu partitions are special partitions that allow jobs to use idle interactive resources. Jobs submitted to these partitions may be killed at any time to make room for an interactive job. Care must be taken to use these partitions only for jobs that can easily restart after being killed. An example job script is shown below.
 
#!/bin/bash -l
# Request 24 hours of walltime, 20 GB of memory, and 12 tasks on the
# preemptable GPU partition with one K40 GPU; --requeue allows the job
# to be resubmitted automatically if it is preempted.
#SBATCH --time=24:00:00
#SBATCH --mem=20gb
#SBATCH -n 12
#SBATCH --requeue
#SBATCH -p preempt-gpu
#SBATCH --gres=gpu:k40:1

module load singularity
singularity exec --nv \
    /home/support/public/singularity/gromacs_2018.2.sif \
    gmx mdrun -s benchMEM.tpr -cpi state.cpi -append