Supercomputing Institute Technical IBM

Introduction to LoadLeveler

Introduction

This tutorial provides an introduction to LoadLeveler, IBM's batch scheduling system. The Institute uses the Maui Scheduler instead of the default LoadLeveler scheduler. The Maui Scheduler is a backfill scheduler, which means it first reserves a time for jobs requiring a large number of processors, and then uses smaller jobs to fill in the gaps created by that reservation. Jobs that specify an accurate wall clock time limit can therefore be scheduled promptly. The Maui Scheduler also uses "fair-share" when scheduling jobs: a group's jobs get a higher priority while the group's usage is below its fair-share, and a lower priority once its usage exceeds its fair-share. Fair-share is based on the number of Service Units that each group has been awarded.


I. Overview of LoadLeveler

LoadLeveler is a batch scheduling system available from IBM for the SP, Power4 and workstation clusters for submitting serial and parallel jobs. LoadLeveler matches job requirements with the machine resources. These job requirements are specified in a job command file.

The collection of computers available for job scheduling with LoadLeveler is called a pool. This term should not be confused with node pools discussed in the POE tutorial. LoadLeveler pools may consist of all the nodes on a machine or only a subset. On our machine, the LoadLeveler pool consists of all the batch nodes. Every machine in a pool has a LoadLeveler daemon running on it.

II. Queue Definition

LoadLeveler is used on the Institute's IBM SP and on the Institute's IBM Power4.

II-A. IBM SP Queue Definition

Currently, there are two SPs.

The SP consists of 74 WinterHawk+ nodes, 11 NightHawk nodes, 4 NightHawk nodes for interactive use and 4 additional WinterHawk+ nodes for file servers. Each of the 74 WinterHawk+ nodes has 4 processors (375 MHz Power3). 13 of the nodes have 8 GB memory. 62 of the nodes have 4 GB memory. Each of the 15 NightHawk nodes has 4 processors (222 MHz Power3) sharing 16 GB memory.

The following table summarizes the queues on the SP.

Queue       Description
q1 The default queue on the WinterHawk+ nodes. For this queue, you do not have to specify a queue (i.e. #@ class = q1) in your command file. All but 14 nodes have 4 GB, the remaining 14 nodes have 8 GB of memory. The maximum ConsumableMemory is 3.5 GB and 7.5 GB (some of the memory is reserved for the operating system.)

Some of the nodes are configured so that the maximum wall clock time is 24 hours, the remaining nodes have a maximum wall clock time of 150 hours. Jobs requiring no more than 24 hours will run on all the nodes. The jobs that require more than 24 hours will run only on nodes set up specifically for longer running jobs. Type llavail to get information on the current availability, including number of nodes set aside for short jobs.

To use the larger memory WinterHawk+ nodes request consumable memory greater than 3.5 GB per node. For example, for a single process job #@ resources = ConsumableMemory(4000) will schedule the job on the larger memory WinterHawk+ node. If you ask for more than one task per node and the total memory per node is greater than 3.5 GB per node, the job will go to the larger memory nodes.

nighthawk For the NightHawk nodes. To use this queue, add #@ class = nighthawk to your command file. Please do not use these nodes unless you have large memory requirements. The maximum ConsumableMemory is 15.5 GB. The maximum wall clock time is 300 hours. However, only 7 nodes are set up to handle 300 hour jobs. The other nodes are reserved for jobs requesting at most 150 hours. Before you submit jobs to this queue, load the nhloadl module (module add nhloadl). To use the other queues afterwards, unload this module (module delete nhloadl).
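The memory rules in the table above can be sketched as a small Python helper. This is an illustration only and not part of LoadLeveler: the function name pick_nodes is hypothetical, and the limits are the maximum ConsumableMemory values quoted above, converted to MB under the assumption that 1 GB = 1000 MB.

```python
# Illustrative sketch only; pick_nodes is a hypothetical helper,
# not a LoadLeveler command or API.
# Limits are the maximum ConsumableMemory values from the queue table,
# converted to MB assuming 1 GB = 1000 MB.
Q1_SMALL_MB = 3500     # 4 GB WinterHawk+ nodes
Q1_LARGE_MB = 7500     # 8 GB WinterHawk+ nodes
NIGHTHAWK_MB = 15500   # NightHawk nodes


def pick_nodes(consumable_mb, tasks_per_node=1):
    """Return the nodes that can hold consumable_mb per task."""
    # ConsumableMemory is requested per task, so the node must hold
    # the request multiplied by the number of tasks placed on it.
    per_node = consumable_mb * tasks_per_node
    if per_node <= Q1_SMALL_MB:
        return "q1 (any WinterHawk+ node)"
    if per_node <= Q1_LARGE_MB:
        return "q1 (8 GB WinterHawk+ nodes only)"
    if per_node <= NIGHTHAWK_MB:
        return "nighthawk"
    raise ValueError("request exceeds the memory of any node")
```

For example, pick_nodes(4000) selects the 8 GB WinterHawk+ nodes, and so does pick_nodes(2000, tasks_per_node=2), because the per-node total is what matters. Note that the nighthawk queue is never chosen automatically by LoadLeveler; it still has to be requested with #@ class = nighthawk.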

Note: The wall_clock_limit specified in the script is the maximum wall clock time of each process. For example with 4 processors and a 10 hour limit, a perfectly balanced job that uses 100% of the processors will use a total of 40 CPU-hours.

To submit jobs that use commercial software with a license manager (e.g. matlab, hypermesh), you need to specify the Ethernet feature in your command file. The nodes with the Ethernet feature are

There are no Ethernet nodes for jobs requesting more than 24 hours.

LoadLeveler uses a "backfill" scheduler, which means smaller jobs are used to fill in the gaps between larger jobs. By specifying accurate wall clock time limits, you help LoadLeveler schedule your jobs promptly.

II-B. IBM Power4 Queue Definition

There is one queue for the IBM Power4. For more information, see the IBM Power4 Quickstart Guide.

Note: The maximum number of jobs that a person can have in the queue on the IBM Power4 is 32.

III. Basic Tasks

The following tasks are discussed:

Creating a Command File

As stated above, all LoadLeveler jobs must be submitted with a command file. This file resembles a shell script and specifies information such as executable name, class, resource requirements, input/output files, number of processors required, and job type. An example is test1.cmd, listed below, which uses an input file test1.in. Use an input file if your program reads input that would otherwise be typed at the command line. If your program requires no such input, delete the line
#@ input      = test1.in

from the example below.


#!/bin/csh
# Note: the line above is needed if C shell commands follow
#       the #@ queue statement, as is the case below.
#
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ input      = test1.in
#@ output     = test1.out
#@ error      = test1.err
#@ wall_clock_limit = 10:00:00 
#@ resources = ConsumableMemory(500) 
#@ queue
test1
echo job done

The following rules apply to job command files:

For all the queues, you are required to specify the maximum amount of wall clock time in your command file. Jobs not specifying a wall clock time limit will be rejected.

You need to specify the maximum amount of memory per process that your job requires. To request 500 MB of memory per process, add

#@ resources = ConsumableMemory(500)

to your command file. ConsumableMemory is per process, so if you are running two tasks per node and requesting 500 MB of ConsumableMemory, you are really asking for 1000 MB of memory per node.
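The two rules above can be checked before submission with a short script. The following is a minimal sketch, not a LoadLeveler tool: the function names read_keywords and missing_requirements are hypothetical, and the parser only understands simple single-step files written as "#@ keyword = value" lines like the example above.

```python
import re

# Hypothetical helpers for illustration; they are not part of LoadLeveler.
# Matches lines such as "#@ output = test1.out" and "#@ queue".
KEYWORD_LINE = re.compile(r"\s*#\s*@\s*(\w+(?:\.\w+)?)\s*(?:=\s*(.*?))?\s*$")


def read_keywords(text):
    """Collect the '#@ keyword = value' lines of a command file into a dict."""
    keywords = {}
    for line in text.splitlines():
        match = KEYWORD_LINE.match(line)
        if match:
            # Keywords without a value (such as queue) map to "".
            keywords[match.group(1).lower()] = match.group(2) or ""
    return keywords


def missing_requirements(text):
    """Return the required settings that the command file does not contain."""
    keywords = read_keywords(text)
    problems = []
    if "wall_clock_limit" not in keywords:
        problems.append("wall_clock_limit")
    if "consumablememory(" not in keywords.get("resources", "").lower():
        problems.append("resources = ConsumableMemory(...)")
    return problems
```

Calling missing_requirements(open("test1.cmd").read()) on the example file above would return an empty list, since it sets both wall_clock_limit and ConsumableMemory. Ordinary comment lines and shell commands are ignored because they do not start with #@.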

Submitting a Job

To submit a job, type llsubmit followed by the name of the command file. For example:

llsubmit test1.cmd

LoadLeveler will respond with:

llsubmit: The job "sp81css0.msi.umn.edu.672" has been submitted.

The first part of this, sp81css0 (node 81 using css0, i.e. the SP switch interface), is the name of the machine that submitted the job. The second part, 672, counts the number of jobs submitted from sp81.
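If you script around llsubmit, the two parts of the job name can be separated at the last dot. This is a minimal sketch; split_job_id is a hypothetical helper name, and it only handles names in the form printed by llsubmit above (the names shown by other commands may carry extra fields).

```python
def split_job_id(job_id):
    """Split 'sp81css0.msi.umn.edu.672' into (submitting machine, job number)."""
    # The job number is everything after the last dot; the rest is the host.
    host, _, number = job_id.rpartition(".")
    return host, int(number)


host, number = split_job_id("sp81css0.msi.umn.edu.672")
print(host, number)
```

The same function also accepts the short form used with the other commands in this tutorial, for example sp81css0.672.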

Monitoring a Job's Status

Our preferred command for monitoring jobs is showq (a Maui Scheduler command). It displays active jobs, idle jobs and non-queued jobs.


runesha@sp65 [~] % showq
ACTIVE JOBS--------------------
           JOBNAME USERNAME      STATE  PROC   REMAINING            STARTTIME

 sp81css0.114704.0  umemoto    Running     8     1:03:33  Tue Jul 16 12:57:21
 sp81css0.114722.0    tyler    Running     1     1:57:59  Tue Jul 16 14:51:47
 sp81css0.114616.0    duany    Running    16     4:24:50  Mon Jul 15 19:18:38
 sp81css0.114196.0 mcgrathm    Running     4  1:01:35:27  Sun Jul 14 16:29:15
 sp83css0.223993.0  khchang    Running     1  1:02:11:17  Thu Jul 11 11:05:05
 ...

   100 Active Jobs     293 of  312 Processors Active (93.91%)
                        74 of   78 Nodes Active      (94.87%)

IDLE JOBS----------------------
           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

 sp83css0.226796.0  druguet       Idle    28    10:00:00  Tue Jul 16 13:58:08
 sp81css0.114696.0 fagernes       Idle     9    23:00:00  Tue Jul 16 11:54:59
 sp83css0.226743.0 seethara       Idle    16    12:00:00  Tue Jul 16 00:09:05
 sp81css0.114679.0     olaf       Idle     1  1:00:00:00  Tue Jul 16 09:48:53
 sp83css0.226810.0    tyler       Idle     1  1:00:00:00  Tue Jul 16 14:52:56
 ...

47 Idle Jobs

NON-QUEUED JOBS----------------
           JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

 sp83css0.207270.0   elorin  BatchHold    32    20:00:00  Tue Jun  4 14:55:14
 sp83css0.201907.0  pradeep  BatchHold    64    21:00:00  Mon May 20 09:58:17
  sp81css0.90952.0  pkiprof  BatchHold     4  6:06:00:00  Thu May 23 10:42:32
 sp81css0.114705.0     jjyi       Idle     1  1:12:00:00  Tue Jul 16 13:09:23
 ...

Total Jobs: 180   Active Jobs: 100   Idle Jobs: 47   Non-Queued Jobs: 33
         

To view only your own jobs, filter the showq output by user name. For example, to view jobs owned by runesha:

  runesha@sp65 [~] % showq | grep runesha  
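A plain grep matches the user name anywhere on the line, so it can also pick up other users whose names contain yours. If that matters, the USERNAME column can be compared exactly. The sketch below is illustrative only: jobs_for_user is a hypothetical helper, and it assumes the column layout of the showq output shown above.

```python
def jobs_for_user(showq_text, user):
    """Return (job name, state) for every showq line whose USERNAME equals user."""
    jobs = []
    for line in showq_text.splitlines():
        fields = line.split()
        # Job lines have the job name, user name and state as the first
        # three columns; headers and summary lines never match a real user.
        if len(fields) >= 4 and fields[1] == user:
            jobs.append((fields[0], fields[2]))
    return jobs
```

The text passed in would be the captured output of showq, for instance obtained with Python's subprocess.run(["showq"], capture_output=True, text=True).stdout on a machine where the command is installed.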

Holding and Releasing a Job

Jobs may be put on hold with the llhold command by typing llhold job_id (e.g. llhold sp81css0.672). Held jobs can be released using the "-r" option: llhold -r job_id. As with the llprio command, llhold applies only to jobs in the queue.

Displaying a Machine's Status

You can display the status of the SP. You can often use this information to customize the size of your job so that it runs immediately. The two commands llavail and showbf are extremely useful. The command llavail summarizes the number of processors available on the nodes. For example,
sp68 [~] % llavail
                      CPUs
     Class   #Nodes  Avail
        q1       36      0
                 10      1
                  7      2
                  9      3

 nighthawk        8      0
                  3      2
                  1      3
                  3      4

shows that in the default queue (q1), 10 nodes have 1 processor available, 7 nodes have 2 processors available, and 9 nodes have 3 processors available. No nodes currently have all 4 processors available. On the NightHawk nodes, 3 nodes have 2 processors available, 1 node has 3 processors available, and 3 nodes have all 4 processors available.
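The llavail table can also be summarized programmatically, for example to count how many nodes could host a given number of tasks. This is a sketch under the assumption that the output keeps the three-column layout shown above; parse_llavail, free_cpus and nodes_with_at_least are hypothetical helper names.

```python
def parse_llavail(text):
    """Return {class: [(number of nodes, CPUs available on each), ...]}."""
    table = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        # Data rows end in two integers; anything else is a header or blank.
        if len(fields) not in (2, 3) or not all(f.isdigit() for f in fields[-2:]):
            continue
        if len(fields) == 3:
            current = fields[0]  # the first row of a class carries its name
        if current is not None:
            table.setdefault(current, []).append((int(fields[-2]), int(fields[-1])))
    return table


def free_cpus(rows):
    """Total number of available CPUs in one class."""
    return sum(nodes * cpus for nodes, cpus in rows)


def nodes_with_at_least(rows, cpus_needed):
    """Number of nodes that have at least cpus_needed CPUs available."""
    return sum(nodes for nodes, cpus in rows if cpus >= cpus_needed)
```

Applied to the output above, free_cpus gives 51 for q1 and 21 for nighthawk, and nodes_with_at_least(rows, 2) gives 7 for the nighthawk rows, which is the number of nodes that can each take two tasks right now.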

The showbf command can be used to give some indications as to how long these processors are available. For example,

sp68 [~] % showbf
backfill window (user: 'user01' group: 'user01' partition: ALL) Thu Feb  8 09:50:56

 92 procs available for    4:01:01:27
 73 procs available with no timelimit

This output shows that 92 processors are available for up to 4 days, 1 hour, 1 minute and 27 seconds, and that 73 processors are available with no time limit.

Therefore, if you submit a job requesting 10 nodes with one task per node, and specifying a walltime of 4 days, this job will run immediately. This is just one example of a configuration that would run immediately.
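The comparison behind this reasoning can be written down as a short check. The sketch below is only an illustration: to_seconds and fits_window are hypothetical helpers, and they assume durations are written as [days:]hours:minutes:seconds, as in the showbf output above. The scheduler itself makes the real decision and also takes memory, class and priority into account.

```python
def to_seconds(duration):
    """Convert '[days:]hours:minutes:seconds' to a number of seconds."""
    parts = [int(p) for p in duration.split(":")]
    # Pad on the left so that shorter forms such as '10:00:00' also work.
    parts = [0] * (4 - len(parts)) + parts
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fits_window(procs_needed, wall_clock_limit, procs_free, window):
    """True if the request fits inside one backfill window reported by showbf."""
    return (procs_needed <= procs_free
            and to_seconds(wall_clock_limit) <= to_seconds(window))
```

With the numbers above, fits_window(10, "96:00:00", 92, "4:01:01:27") is true: 10 processors for 4 days (96 hours) fit inside the window of 92 processors for 4 days, 1 hour, 1 minute and 27 seconds.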

For information on the NightHawk nodes, use the showbf -m '>8000' command. This command asks for information on the nodes which have more than 8000 MB of memory available, which are the NightHawk nodes. For example,

sp68 [~] % showbf -m '>8000'
backfill window (user: 'schaudt' group: 'tech' partition: ALL) Thu Feb  8 09:56:00

 20 procs available for    4:00:56:23
  1 proc available with no timelimit

Combined with the llavail output above, this shows that a job requesting 7 nodes with two tasks per node and less than 4 days of walltime will run immediately.

Canceling a Job

The llcancel command cancels running or queued jobs. Type llcancel job_id to cancel a job, where job_id is, for example, sp81css0.672. There are options to cancel jobs by user id, host name, or process id. Only the LoadLeveler System Administrator can cancel other people's jobs. Type man llcancel for more information.

IV. Advanced Tasks

The following advanced tasks are discussed:

Using the Command File like a Unix Shell

Suppose one needed to do some "pre-execution" setup work, for instance, copying input files to a scratch directory. This can be done within the command file, just as you would with an ordinary Unix shell script. Consider the following example:



#!/bin/csh
#
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ input      = test2.in
#@ output     = test2.out
#@ error      = test2.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ queue
# Beginning of script commands

cp /homes/sp1/user_name/Loadleveler_examples/test2.in  /scratch/user_name/test2.in
cp /homes/sp1/user_name/Loadleveler_examples/test2  /scratch/user_name/test2
/scratch/user_name/test2
cp /scratch/user_name/test2.out /homes/sp1/user_name/Loadleveler_examples/test2.out
cp /scratch/user_name/test2.err /homes/sp1/user_name/Loadleveler_examples/test2.err
rm  /scratch/user_name/test2
rm  /scratch/user_name/test2.in
rm  /scratch/user_name/test2.out
rm  /scratch/user_name/test2.err
echo job done

In this example, the user's input file and executable are copied to scratch space on the executing host before execution. The job is then run, the output files are copied back, and finally the files on scratch are removed. This pre-execution work saves the performance cost of accessing the input file over the network, because using the local disk is more efficient.

Submitting Parallel (OpenMP) Jobs

Here is an example of an OpenMP LoadLeveler submission script:

#!/bin/csh
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ job_type     = parallel
#@ output       = test.out
#@ error        = test.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ node = 1
#@ tasks_per_node = 2
#@ node_usage = shared
#@ queue

setenv OMP_NUM_THREADS 2
a.out < input

This script will allow you to run the job using 2 CPUs on a single node.

The essential keywords for OpenMP jobs are job_type and tasks_per_node. Since OpenMP is a shared memory parallel paradigm, the job can only be run on a single node, which requires node = 1.

Submitting Parallel (MPI) Jobs

Here is an example of an MPI LoadLeveler submission script:


#!/bin/csh
#@ output       = test.out
#@ error        = test.err
#@ job_type     = parallel
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ network.MPI = css0,shared,US
#@ blocking = unlimited
#@ total_tasks = 4
#@ node_usage = shared 
#@ queue 

/home/sp1/user_name/a.out

The above command file requests a total of 2000 MB of memory (500 MB times 4 tasks).

The following LoadLeveler keywords are used for MPI LoadLeveler submission scripts:

Submitting Multiple Jobs

A single job command file can be used to submit multiple jobs by using multiple #@ queue statements. This is particularly useful if the same executable is used for multiple jobs that have different input and output files.

An example command file is:


#!/bin/csh
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ executable = test
#@ input      = test1.in
#@ output     = test1.out
#@ error      = test1.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ queue
#@ executable = test
#@ input      = test2.in
#@ output     = test2.out
#@ error      = test2.err
#@ wall_clock_limit = 15:00:00
#@ resources = ConsumableMemory(500) 
#@ queue
#@ executable = test
#@ input      = test3.in
#@ output     = test3.out
#@ error      = test3.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ queue

Submitting Multiple Jobs with Dependency

A useful extension to the multiple job script described above is the case where some of the jobs depend on the outcome of other jobs. For example, suppose that in our 3 job step example above we want (i) job step 2 to run after job step 1 and (ii) job step 3 to run after job step 2. The command file can then be modified as follows:


#!/bin/csh
#@ step_name = step1
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ executable = test
#@ input      = test1.in
#@ output     = test1.out
#@ error      = test1.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ queue
#@ step_name = step2
#@ dependency = step1 == 0
#@ executable = test
#@ input      = test2.in
#@ output     = test2.out
#@ error      = test2.err
#@ wall_clock_limit = 20:00:00
#@ resources = ConsumableMemory(500) 
#@ queue
#@ step_name = step3
#@ dependency = step2 == 0
#@ executable = test
#@ input      = test3.in
#@ output     = test3.out
#@ error      = test3.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ queue

Please note the different input file names in the dependency example above. One can also use one input file for a multi-step job. However, the single input file must then contain the inputs for all job steps (duplicate the data if all job steps use the same data).
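When a chain has many steps, writing the command file by hand becomes repetitive. The following sketch generates a chained file in the same form as the dependency example above. It is an illustration only: chained_steps is a hypothetical helper, it uses only keywords that appear in that example, and it leaves out initialdir, which you would add for your own directory.

```python
import os


def chained_steps(executable, inputs, wall_clock_limit, memory_mb):
    """Build a command file in which each step waits for the previous one."""
    lines = ["#!/bin/csh"]
    for number, input_file in enumerate(inputs, start=1):
        # test1.in -> test1, which is reused for the .out and .err names.
        stem = os.path.splitext(input_file)[0]
        lines.append(f"#@ step_name = step{number}")
        if number > 1:
            # Run only if the previous step finished with exit status 0.
            lines.append(f"#@ dependency = step{number - 1} == 0")
        lines += [
            f"#@ executable = {executable}",
            f"#@ input      = {input_file}",
            f"#@ output     = {stem}.out",
            f"#@ error      = {stem}.err",
            f"#@ wall_clock_limit = {wall_clock_limit}",
            f"#@ resources = ConsumableMemory({memory_mb})",
            "#@ queue",
        ]
    return "\n".join(lines) + "\n"
```

For example, chained_steps("test", ["test1.in", "test2.in", "test3.in"], "10:00:00", 500) produces three steps, the second and third each depending on the step before it. The result can be written to a file and submitted with llsubmit as usual.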

How node_usage Options Affect Charging of CPU Time

There are 4 CPUs per node of the SP. For effective usage of the machine, it is desirable to have up to four tasks per node. Thus the default setting for node_usage is shared. It is recognised, however, that for some applications (for example large-memory serial jobs, shared-memory multi-threaded jobs using directives or pthreads, or even MPI jobs) a user may want exclusive use of a node. This is where the node_usage option not_shared becomes useful. However, using this option is not without a price. To avoid system abuse and to ensure that the option is used only when necessary, a user that chooses this option will be CHARGED 4 times the wall-clock usage.
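The effect on charging can be illustrated with a small calculation. This is a sketch, not the Institute's accounting code: charged_cpu_hours is a hypothetical helper, and it assumes that a shared job is charged wall-clock hours times the number of processors it uses, while a not_shared job is charged for all 4 CPUs of every node it occupies.

```python
CPUS_PER_NODE = 4  # each SP node has 4 CPUs, as stated above


def charged_cpu_hours(wall_hours, nodes, tasks_per_node, node_usage="shared"):
    """Estimate the CPU-hours charged for a job (assumptions in the text above)."""
    if node_usage == "not_shared":
        # A dedicated node is charged as if all of its CPUs were in use.
        return wall_hours * nodes * CPUS_PER_NODE
    # Assumption: a shared job is charged only for the processors it uses.
    return wall_hours * nodes * tasks_per_node
```

Under these assumptions a 10 hour serial job costs 10 CPU-hours with node_usage = shared, but 40 CPU-hours with node_usage = not_shared.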

How to Request a Dedicated Node for a non-MPI (e.g. Serial or Shared-Memory Multi-threaded) Job

Setting the LoadLeveler keyword node_usage to not_shared ensures that a node is not shared with another user. However, the node_usage keyword is only honored for jobs with job_type set to parallel (parallel in this sense refers to MPI jobs). Running a serial or shared-memory multi-threaded job with node_usage set to not_shared therefore also requires setting job_type to parallel, even though the job is not MPI parallel. The example script below shows how to run a serial job called serial_exec so that the node assigned by LoadLeveler is used in dedicated mode:


#!/bin/csh
#@ output       = serial_out.$(jobid)
#@ error        = serial_err.$(jobid)
#@ job_type     = parallel
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500) 
#@ node = 1
#@ tasks_per_node = 1
#@ node_usage = not_shared
#@ queue

/home/sp1/user_name/serial_exec

VERY IMPORTANT: Use a node in dedicated mode only when you need it because you will be charged 4 times the wall-clock, as explained under How node_usage Options Affect Charging of CPU Time above.

V. Additional Information

A list of LoadLeveler commands and LoadLeveler job command file keywords can be found in the IBM LoadLeveler manuals at http://sp81.msi.umn.edu:999/loadl/index.html

This information is available in alternative formats upon request by individuals with disabilities. Please send email to alt-format@msi.umn.edu or call 612-624-0528.

URL: http://www.msi.umn.edu/tutorial/scicomp/sp/LoadLeveler/content.html
This page last modified on Friday, 23-May-2008 13:00:12 CDT  
Please direct questions or problems to help@msi.umn.edu  