Itasca - Quickstart Guide

 

This guide provides the basic information needed to get up and running with jobs on Itasca (an HP Cluster Platform 3000 BL280c G6).

Login Procedure

Please connect through login.msi.umn.edu or nx.msi.umn.edu. See MSI's interactive connections FAQ, and use itasca.msi.umn.edu as the hostname.

Available Software:

A module system is used on Itasca to control the run-time environment for individual applications. Please type module avail to see the software available on Itasca.
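
For example, a typical sequence of module commands might look like the following (the intel module appears later in this guide; other module names vary, so check module avail for the exact names on Itasca):

module avail           # list all software modules available on Itasca
module load intel      # load the Intel compiler environment
module list            # show which modules are currently loaded
module unload intel    # remove a module from your environment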

Compilers and MPI libraries

The following tables summarize the available compilers and MPI implementations. The subsequent commands and scripts use the Intel compilers for Fortran, C, and C++.

Compiler     Commands              Module
GNU 4.4.1    gcc, g++, gfortran    gcc
Intel        icc, icpc, ifort      intel

MPI implementations and their corresponding modules

Compiler     Platform MPI        OpenMPI             Intel MPI
Intel        intel pmpi/intel    intel ompi/intel    intel impi/intel
GNU          gcc pmpi/gnu        gcc ompi/gnu        gcc impi/gnu

Please note that mpif90, mpif77, mpicc, and mpicxx are generic wrapper scripts for compiling Fortran 90, Fortran 77, C, and C++ codes, respectively, whichever MPI implementation is used. The wrapper sets the necessary paths to the include files and MPI libraries so that the MPI code can be compiled. Please do not hard-code any paths inside the MPI build unless this has been well tested and gives better performance with the specified MPI implementation. Different versions of each MPI implementation are available for different versions of the compilers; again, type module avail to see the details. User manuals and documentation can be found in /opt/platform_mpi/doc for Platform MPI and /soft/intel/ict/3.2/impi/3.2.2.006/doc for Intel MPI.
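
If you are unsure which MPI implementation a wrapper script will use, a quick check after loading your modules can help (shown here with the intel and pmpi/intel modules from the table above; any compiler/MPI pair works the same way):

module load intel pmpi/intel
module list                    # confirm which compiler and MPI modules are loaded
which mpicc                    # shows the wrapper provided by the loaded MPI module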

Compiling Code:

Serial codes

Single-core serial jobs are not allowed to run on Itasca; they should be run on other systems. Please feel free to contact user support for assistance in running these kinds of jobs (email: help@msi.umn.edu or call 612-626-0802).

OpenMP codes

C

module load intel
icc -o test -O3 -openmp openmp1.c

Fortran

module load intel
ifort -o test -O3 -openmp openmp1.f

Users can select different compiler options to optimize performance. Please see the man page (e.g., man ifort or man icc) for the available options. For example, for jobs that will run on the Sandy Bridge nodes, users should add the -xAVX flag to the above compile commands.
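
As an illustration, a build of the same OpenMP example targeted at the Sandy Bridge nodes might look like the following (openmp1.c is the example file used above):

module load intel
# -xAVX generates AVX instructions for the Sandy Bridge nodes
icc -o test -O3 -openmp -xAVX openmp1.c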

 

MPI codes

For compiling MPI code with one of the MPI implementations, one needs to load the corresponding MPI modules.

To use Platform MPI and Intel Compilers:

C

module load intel pmpi/intel
mpicc -o test mpi_code.c
mpiCC -o test mpi_code.cpp

Fortran

module load intel pmpi/intel
mpif90 -o test mpi_code.f

To use Open MPI and Intel Compilers:

C

module load intel ompi/intel
mpicxx -o test mpi_code.cpp
mpicc -o test mpi1.c

Fortran

module load intel ompi/intel
mpif90 -o test mpi1.f

To use Intel MPI and Intel Compilers:

C

module load intel impi/intel
mpiicpc -o test mpi_code.cpp
mpiicc -o test mpi1.c

Fortran

module load intel impi/intel
mpiifort -o test mpi1.f

To use Intel MPI and GNU Compilers:

C

module load gcc impi/gnu
mpicxx -o test mpi_code.cpp
mpicc -o test mpi1.c

Fortran

module load gcc impi/gnu
mpif90 -o test mpi1.f

Run Jobs Interactively:

OpenMP jobs

export OMP_NUM_THREADS=4
./test < input.dat > output.dat

MPI jobs

mpirun -np 4 ./test > run.out

Submit Jobs to the Queue

We use PBS to ensure the machine is used to its full potential and fairly by every user. You need to create a script file and use the qsub command to submit jobs, e.g., qsub myscript.pbs. Use the -q option to select which queue you will submit to (examples below).

For detailed information about the queues, please see the Job Queues page.

Serial jobs

On Itasca, no compute nodes are shared by two or more jobs. Single-core serial jobs are not allowed to run on Itasca; they should run on other systems. Please feel free to contact user support for assistance in running these kinds of jobs (email: help@msi.umn.edu or call 612-626-0802). However, if a simulation runs multiple copies of a serial job with different inputs, those copies can be packed to run on one or more nodes, as in the sketch below.
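
A minimal sketch of packing eight serial runs onto one node follows; the executable name serial_app and the input/output file names are hypothetical placeholders for your own program and data.

#!/bin/bash -l
#PBS -l walltime=01:00:00,nodes=1:ppn=8
#PBS -m abe
cd /lustre/Your_username
# Start one copy of the serial program per core, each with its own input file,
# then wait for all background copies to finish before the job exits.
for i in 1 2 3 4 5 6 7 8; do
    ./serial_app < input_$i.dat > output_$i.dat &   # hypothetical executable and inputs
done
wait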

 

OpenMP jobs

The following is a PBS script for a 1-hour OpenMP job that will run on 1 node using all 8 cores. This job will need 10GB of memory. Save the script as 'myscript.pbs'.

#!/bin/bash -l
#PBS -l walltime=01:00:00,mem=10gb,nodes=1:ppn=8
#PBS -m abe
cd /lustre/Your_username
module load intel
export OMP_NUM_THREADS=8
./test < input.dat > output.dat

MPI jobs

The following is a PBS script for a 1024-process MPI job that will run for 1 hour on 128 nodes using the Intel compiler with Platform MPI (pmpi/intel). The script uses pmem=1500mb to request 1500MB of memory per core. Please note the difference between mem and pmem: mem requests the total memory for a job, while pmem requests memory per core. For the following example, the equivalent mem is 128 x 8 x 1500 MB.

#!/bin/bash -l
#PBS -l walltime=01:00:00,pmem=1500mb,nodes=128:ppn=8
#PBS -m abe
# Verify that all assigned nodes respond before launching the MPI job
a1=$(cat $PBS_NODEFILE | sort | uniq)
time pdsh -w `echo $a1 | sed 's/ /,/g'` date >& check_node

cd /lustre/Your_username
module load intel
module load pmpi/intel
mpirun -np 1024 -hostfile $PBS_NODEFILE ./test > run.out

The following is a PBS script for a 32-process MPI job that will run for 1 hour on 4 nodes using the Intel compiler with Open MPI (ompi/intel). The script uses pmem=500mb to request 500MB of memory per core. Save the script as 'myscript.pbs'.

#!/bin/bash -l
#PBS -l walltime=01:00:00,pmem=500mb,nodes=4:ppn=8
#PBS -m abe
a1=$(cat $PBS_NODEFILE | sort | uniq)
time pdsh -w `echo $a1 | sed 's/ /,/g'` date >& check_node
cd /lustre/Your_username
module load intel
module load ompi/intel
mpirun -np 32 ./test > run.out

The following is a PBS script for a 256-process MPI job that will run for 2 hours on 32 nodes using the Intel compiler with Intel MPI (impi/intel). The script uses pmem=500mb to request 500MB of memory per core. Save the script as 'myscript.pbs'.

#!/bin/bash -l
#PBS -l walltime=02:00:00,pmem=500mb,nodes=32:ppn=8
#PBS -m abe
a1=$(cat $PBS_NODEFILE | sort | uniq)
time pdsh -w `echo $a1 | sed 's/ /,/g'` date >& check_node

cd /lustre/Your_username
module load intel
module load impi/intel
mpirun -r ssh -f $PBS_NODEFILE -np 256 ./test > run.out
# Please note that the default impi module is for Intel MPI version 4. For code compiled with an earlier version of Intel MPI, use the following instead:
module load impi/intel3.2.1
mpdboot -r ssh -n 32 -f $PBS_NODEFILE
mpiexec -perhost 8 -n 256 -env I_MPI_DEVICE rdssm ./test > run.out

Useful Commands

Submitting a standard batch job:

qsub myscript.pbs

Submitting a job to the "long" queue (for extended walltimes):

qsub -q long myscript.pbs

Monitoring queue status:

showq
qstat -f

Monitoring a job's status:

checkjob <jobid>

Canceling a job:

qdel <jobid>

Recommendations

Where to run your jobs? If your application needs to read or write large volumes of data, you should run your jobs in a directory under the high-performance, high-capacity scratch partition /lustre, which is mounted on all of Itasca's login and compute nodes via the Lustre file system. To use /lustre, you should create a subdirectory under /lustre named after your user name. For example, you could do the following:

cd /lustre
mkdir your_username
cd your_username
cp ~/your_job .
mpirun -np 4 ./your_job

Where to load the modules that are needed for running jobs? Load the needed modules in your .bashrc file rather than in the job script. This is required for jobs that link against shared objects that are not local to the compute nodes; the job will fail if it cannot find the shared objects in its working environment.
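
For example, a .bashrc might contain lines like the following (shown for the Intel compiler with Platform MPI; substitute whichever modules your jobs actually use):

# Load compiler and MPI modules at login so that the same environment,
# including paths to shared libraries, is available on the compute nodes.
module load intel
module load pmpi/intel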

I/O performance: jobs that write relatively small amounts of data at a time should use buffered I/O. For such Fortran jobs, please set the following

export FORT_BUFFERED=1

in your job script.

Debugging - submitting an interactive batch job for debugging:

qsub -I myscript.pbs

where myscript.pbs contains only the #PBS directives listing the requested resources, e.g.:

#PBS -l pmem=2150mb,nodes=20:ppn=8
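
A complete minimal interactive script following the pattern of the batch scripts above might therefore look like this (the walltime value is an illustrative assumption):

#!/bin/bash -l
#PBS -l walltime=01:00:00,pmem=2150mb,nodes=20:ppn=8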

Use of PMPI for special need

Platform MPI has a useful feature that allows users to conveniently reorder the hosts contained in $PBS_NODEFILE to meet special needs. The recommended procedure is as follows:

Set the following environment variables in your .bashrc file:

module load intel pmpi/intel
export MPI_MAX_REMSH=16
export MPI_MAX_MPID_WAITING=64
export MPI_BUNDLE_MPIDS=N
export OMP_NUM_THREADS=1

Generate your own host list, e.g., a file named myhostlist (one possible sketch is shown below).
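
One possible sketch for building myhostlist from the PBS-assigned nodes is shown below; the simple sort used here is only an illustration, and you should apply whatever ordering your application actually needs.

# Reorder the host entries from $PBS_NODEFILE and save them as myhostlist;
# replace the sort with your own reordering logic.
sort $PBS_NODEFILE > myhostlist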

Run the job with your own hostfile:

mpirun -np 128 -hostfile myhostlist ./a.out

 

Low Priority Jobs: qos=weightlessqos

Itasca queues support low priority jobs, which allow you to exceed the normal job number limit, provided that the extra jobs can run as backfill. Low priority jobs work the same and are charged the same SUs as normal jobs, but will only run if they fit onto a set of nodes that would not have been running any normal priority job.

You can have as many as 12 low priority jobs running at the same time in addition to your normal priority jobs. These low priority jobs do not count against your personal job number limit. However, they are given extremely low priority for starting, and will only run if they fit into the job schedule as backfill, filling a gap when some nodes would otherwise be idle because they were being drained for a larger job.

Submit a PBS script as a low priority job by specifying the weightless quality of service with the resource option qos=weightlessqos in the qsub command. For example:

qsub -l qos=weightlessqos job.pbs

There is no guarantee that a weightless job will ever run, because it is impossible to predict whether there will be a gap with a given number of nodes in the job schedule. However, gaps appear frequently because the scheduler often needs to drain the queue in preparation for larger core count jobs. If the number of nodes and the wall time requested by a weightless job fit into a gap when it is submitted, then it should run immediately. The following section explains how to find backfill opportunities.
 

Find Backfill Opportunities: showbf

Many factors go into when jobs are scheduled to run. Frequently, some nodes are idle because they have been reserved for a high core count job that is still waiting for its full set of nodes to be free. If a job is submitted that fits on currently idle nodes, and if the wall time of that job ensures that it will complete before the high core count job would run, then that smaller job can run as backfill. You can display a summary of the number of nodes available for backfill in a given queue class with the showbf command. The display is tabulated by the maximum wallclock duration.
For example, on Itasca, to see the nodes available as backfill opportunities in the batch queue, use the following command

showbf -c batch


That command will produce a report like
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL            1544    193       3:52:59      00:00:00  08:45:31_07/07
ALL            1072    134       4:17:26      00:00:00  08:45:31_07/07
ALL             704     88       7:40:29      00:00:00  08:45:31_07/07
ALL             488     61       7:42:20      00:00:00  08:45:31_07/07
ALL             480     60      14:31:26      00:00:00  08:45:31_07/07
node1081       1544    193       3:52:59      00:00:00  08:45:31_07/07
node1081       1072    134       4:17:26      00:00:00  08:45:31_07/07
node1081        704     88       7:40:29      00:00:00  08:45:31_07/07
node1081        488     61       7:42:20      00:00:00  08:45:31_07/07
node1081        480     60      14:31:26      00:00:00  08:45:31_07/07

To see the nodes available as backfill opportunities for the Sandy Bridge queue, use the following command

showbf -c sb

That command will produce a report like
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL             304     19      INFINITY      00:00:00  08:46:05_07/07
sbpar           304     19      INFINITY      00:00:00  08:46:05_07/07

Note: in both of these examples, showbf shows all the PBS queue partitions associated with the queue class name that you typically submit your job with (e.g., batch or sb). All queue classes are subsets of the "ALL" partition, so it typically shows redundant information. Once you know the name of the partition that is specific to the class, you can use that partition name to remove the redundant listing. Since node1081 is the unique partition name for the batch queue on Itasca, you can get a more concise list with

showbf -p node1081

For the Sandy Bridge queue, you can get a more concise list with

showbf -p sbpar