Koronis Quickstart

NOTE: Koronis was decommissioned on March 17, 2014. Please contact help@msi.umn.edu with any questions, including questions about retrieving and transferring Koronis data.

Overview

Koronis is a constellation of SGI systems, foremost among them an Altix UV1000 server with 1140 compute cores (190 six-core Intel Xeon X7542 "Westmere" processors at 2.66 GHz) and 2.96 TiB of globally addressable shared memory in a single system image. OpenMP and other threaded codes should run well on this resource.

This guide will provide you with the basic information necessary to get your jobs up and running on Koronis.

 

Login Procedure

Please connect through login.msi.umn.edu or nx.msi.umn.edu and then ssh to Koronis from there, i.e.,

        ssh login.msi.umn.edu

        ssh koronis.msi.umn.edu

Available Software

The command

module avail

will list software packages that have been compiled and installed for Koronis.

module load name_of_software_package

will set the appropriate environment variables and add the software's run scripts and binaries to your path. For example,

module load intel

must be run to make the icc and ifort commands available.

To get more information about a module:

module help name_of_software_package

To see how the module will affect your execution environment:

module show name_of_software_package
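A typical session might string these commands together as follows (a minimal sketch; intel and mpt are the package names used elsewhere in this guide, and module list simply reports what is currently loaded):

module avail
module load intel mpt
module list
module show mpt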

Compiling Codes

Please compile your codes on the login nodes, where compilation finishes quickly. Compiling on the compute nodes may be noticeably slower.

OpenMP Codes

C
module load intel
icc -o test -O3 -openmp openmp_code.c
Fortran
module load intel
ifort -o test -O3 -openmp openmp_code.f

Please add "-shared-intel -mcmodel=large -i-dynamic" flags to the compiling options if the job needs memory more than 2 GB. Users can select different compiling options for optimizing the performance. Please see the man page (e.g., man ifort or man icc) for the available options.

MPI Codes

C / C++
module load intel mpt
icc -o test -lmpi mpi_code.c
icpc -o test -lmpi++abi1002 -lmpi mpi_code.cpp
Fortran
module load intel mpt
ifort -o test -lmpi mpi_code.f

Please see the MPI man page (man mpi) for more information on SGI's MPI implementation.
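To confirm that an executable was actually linked against SGI's MPI library, a quick check such as the following can be useful (a sketch; test is the executable name used in the examples above):

module load intel mpt
icc -o test -lmpi mpi_code.c
ldd ./test | grep -i mpi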

 

Running Jobs Interactively

OpenMP jobs

module load intel
export OMP_NUM_THREADS=4
./test

MPI Jobs

module load intel mpt
mpirun -np 4 ./test 
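When testing interactively, it can be handy to wrap either kind of run in /usr/bin/time (also used in the batch example later in this guide) to get a quick sense of run time. A minimal sketch, with illustrative thread and rank counts:

module load intel mpt
export OMP_NUM_THREADS=4
/usr/bin/time ./test
/usr/bin/time mpirun -np 4 ./test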

Submitting Jobs to the Queue

There are currently two queues on the system. The default queue submits to the UV100 systems, used for development. There is also a queue for the UV1000 system, specified via the -q uv1000 option on job submission. The maximum run-time is currently set to 24 hours, and there are no limits on the number of queued or running jobs.

The minimum size of a job on a UV100 is 6 processes and 32 GiB of memory. The minimum size of a job on the UV1000 is 6 processes and 16 GiB of memory. Jobs should request resources in sets of 6 processor cores (called ncpus by PBS). Jobs do not need to request memory, as it is implicitly allocated based on the number of processor cores requested. See the CPU sets section below for more information.

Queue     Memory (per system)   Cores (per system)   Walltime
uv1000    2.96 TiB              1140                 24:00:00
uvdev     352 GiB               66                   24:00:00

Submit a script to PBS

qsub yourscript.pbs
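The queue can also be chosen on the qsub command line instead of inside the script; for example, to submit to the UV1000 queue mentioned above (a small sketch using the standard -q option):

qsub -q uv1000 yourscript.pbs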

The following is an example of a submission script for a 1-hour, 12-core, OpenMP job submitted to a UV100 node.

#PBS -l select=2:ncpus=6
#PBS -l walltime=01:00:00
#PBS -l place=excl:group=board

cd $PBS_O_WORKDIR

module load intel
export OMP_NUM_THREADS=12

dplace -c 0-11 -x2 ./a.out

Here is a submission script for a 24-hour, 192-core, OpenMP job submitted to the UV1000.

#PBS -l select=32:ncpus=6
#PBS -l place=excl:group=iru
#PBS -l walltime=24:00:00
#PBS -q uv1000

cd $PBS_O_WORKDIR

module load intel
export OMP_NUM_THREADS=192
dplace -c 0-191 -x2 ./a.out 

Here is a submission script for a 24-hour, 192-core, MPI job submitted to the UV1000.

#PBS -l select=32:ncpus=6:mpiprocs=6
#PBS -l place=excl:group=iru
#PBS -l walltime=24:00:00
#PBS -q uv1000

cd $PBS_O_WORKDIR

module load intel mpt
mpiexec_mpt -np 192 dplace -c 0-191 -x2 ./a.out 

MSI staff have developed a script that, given the number of cores required, automatically determines which group to select in a batch job along with the number of NUMA nodes needed. The script is located in /soft/koronis/msi/bin/ProcToGroup.sh, and an example of how to call it is given in /soft/koronis/msi/bin/GroupExample.sh for your reference. For example, executing:

/soft/koronis/msi/bin/ProcToGroup.sh 24

returns:

boardpair 4
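The first field is the group to place on, and the second is the number of 6-core chunks (NUMA nodes) to request; for the 24-core example above, that would translate into PBS directives along these lines (a sketch following the pattern of the earlier examples):

#PBS -l select=4:ncpus=6
#PBS -l place=excl:group=boardpair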

Jobs can be placed on specific components of the UV1000 according to the table below.

Resource      Cores   Memory (GiB)   NUMA nodes   Description
rack          384     1024           64           One rack contains two irus
iru           192     512            32           One iru contains two iruhalves
iruhalf       96      256            16           One iruhalf contains two iruquadrants
iruquadrant   48      128            8            One iruquadrant contains two boardpairs
boardpair     24      64             4            One boardpair contains two boards
board         12      32             2            One board contains two sockets
socket        6       16             1            A single socket on the system
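For example, based on this table, a 96-core job could be placed on an iruhalf with a resource request following the same pattern as the earlier examples:

#PBS -l select=16:ncpus=6
#PBS -l place=excl:group=iruhalf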

Checking job status

Job status can be checked using the qstat command.

To get a summary of all of your jobs:

qstat -a

To get detailed status of one of your jobs:

qstat -f jobID

To see the estimated time at which your job will start running:

qstat -Tw jobID

NOTE: You will not see all jobs in the queue, only your own jobs. Thus it may appear that Koronis is "empty," but it very rarely is. We are working on a way to provide more useful information about queue and node status within Koronis. Once complete, the commands will be documented here.

To get the status of the uv1000 queue:

qstat -Qf uv1000

CPU sets

As you can see in the above examples, resource specifications on Koronis are much different than on other MSI systems. This is primarily the result of using CPU sets. CPU sets are a technology available on SGI's SMP systems that allow multiple jobs to run on the same node without impacting each other's resources. That is, a CPU set is a resource container in which a job runs. CPU sets are required in order to achieve maximum performance on the SGI UV systems in Koronis.

When a job runs on a UV system, a CPU set is dynamically created for that job. The CPU set will be made up of NUMA nodes within the UV system. In this context "node" does not refer to a compute node, but it refers to a processor socket and its associated memory within the UV system. The minimum size of a CPU set is one processor socket. In the UV1000 system, each processor socket contains a processor with 6 cores, and has an associated 16 GiB of memory. Thus, the minimum resources consumed by any job on the UV1000 are 6 cores and 16 GiB of memory. In the UV100 systems, each processor socket contains a processor with 6 cores, and has an associated 32 GiB of memory. Thus, the minimum resources consumed by any job on a UV100 system are 6 cores and 32 GiB of memory.

Because cores and memory are so tightly coupled, take care to request the number of cores that will provide the memory your job requires. For example, if your job on the UV1000 has only 1 process (and thus runs on only 1 core) but requires 300 GiB of memory, you will need to request at least 19 NUMA nodes, which is equivalent to 114 cores and 304 GiB of memory. This is where the select statement comes in, as shown in the examples above; the select statement for this job would be select=19:ncpus=6.
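If it helps, this arithmetic can be scripted. The following sketch computes the minimum select count for a given memory requirement on the UV1000, assuming 16 GiB and 6 cores per NUMA node as described above (the function name is made up for illustration):

# Round a UV1000 memory requirement (in GiB) up to whole NUMA nodes.
uv1000_select_for_mem() {
    local mem_gib=$1
    local numa_nodes=$(( (mem_gib + 15) / 16 ))   # ceil(mem / 16 GiB per NUMA node)
    echo "select=${numa_nodes}:ncpus=6   # $((numa_nodes * 6)) cores, $((numa_nodes * 16)) GiB"
}

uv1000_select_for_mem 300   # prints: select=19:ncpus=6   # 114 cores, 304 GiB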

Here is an example script showing the use of thread pinning within a CPU set for an OpenMP job using 48 cores:

#PBS -l select=8:ncpus=6
#PBS -l place=excl:group=iruquadrant
#PBS -l walltime=24:00:00
#PBS -q uv1000

module load intel

# Show the resources allocated to my CPU set
cpuset -d .

# Turn on some debugging
set -xv

# Set the stack size
ulimit -s unlimited

export OMP_NUM_THREADS=48
export KMP_AFFINITY=disabled
export KMP_LIBRARY=turnaround
export KMP_BLOCKTIME=infinite

cd working_directory
/usr/bin/time dplace -c 0-47 -x2 ./a.out

Note that the keyword group specifies the resource component needed, as described in the table above.

MSI staff testing on Koronis has determined that, even without using CPU sets, performance can sometimes be improved by 50% to 100% by setting the KMP_AFFINITY environment variable in your batch job before executing an application. An example bash command would be:

export KMP_AFFINITY="granularity=fine,compact,1,0"
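In a batch script this setting is placed alongside the other OpenMP environment variables before the application is launched; a minimal sketch, with an illustrative thread count and executable name:

module load intel
export OMP_NUM_THREADS=48
export KMP_AFFINITY="granularity=fine,compact,1,0"
./a.out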

Over time, MSI will refine its use of CPU sets to ensure that Koronis is being properly utilized. We ask you for your patience as we explore the nuances of this technology.

Storage

There are three categories of storage within Koronis: Home, Project, and Scratch.

Home directories

Home directories are located at /home/koronis and are suitable for storing source code, small files, text-based job results, and other similar data. Home directories are available on all nodes via the network filesystem, NFS. Koronis has 16 TB of storage allocated for home directories.

Koronis-only project spaces

Koronis-only project spaces are located at /cxfs/project[1-9] and are suitable for storing large datasets. As with MSI's central project spaces, a research group should email help@msi.umn.edu requesting a project space within Koronis. Koronis-only project spaces are on a clustered filesystem called CXFS. CXFS allows all of the Koronis-only project spaces within Koronis to be shared to all Koronis systems at very high bandwidth. However, care must be taken to properly utilize this bandwidth.

Reads from CXFS are very fast, so we recommend using it to store large input for jobs. Writes to CXFS require large-block write operations (8 MB and larger) to perform well; most applications by default use 8 KB to 32 KB write operations, which will be quite slow on CXFS. We therefore recommend that jobs write their output to scratch space and then copy it to Koronis-only project space, if desired. A properly tuned application can achieve 1-3 GB/s of bandwidth to CXFS. Koronis has 500 TB of storage allocated for Koronis-only project spaces.
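As a rough illustration of the block sizes involved, a tool such as dd can be told to write in 8 MB blocks (a sketch to illustrate the block size only; the paths are the example paths used in the bbcp section below, and bbcp, not dd, is the recommended transfer tool):

dd if=/scratch/user/job/output.dat of=/cxfs/project5/group/output.dat bs=8M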

Central MSI project spaces can be made available on the Koronis interactive nodes by request.

Scratch space

Scratch spaces on Koronis are located on each compute system at /scratch and are suitable for heavy writing during the run time of a job. Scratch spaces use a fast filesystem called XFS. Unlike its clustered counterpart, XFS is capable of very high bandwidth at a much wider range of block sizes. It is for this reason that we recommend most jobs use scratch. Applications writing a lot of data to scratch can achieve 3 to 7.5 GB/s of bandwidth. The scratch spaces on the UV1000 and UV100 systems are 96 TB and 20 TB in size, respectively.

As with scratch on other MSI HPC resources, any data in scratch that is over 14 days old will be purged. To aid users in deleting their unneeded data in scratch, we have made each compute system's scratch space accessible via NFS on Koronis' interactive nodes. The scratch spaces for the UV1000 and UV100 systems are accessible at /scratch/uv1000, /scratch/uvdev1, and /scratch/uvdev2t, respectively, on the interactive nodes.

Ideally, you should migrate your job's useful output from scratch to CXFS project space with a command at the end of your submission script. The simple UNIX cp command can be used to copy very small output to your home directory, but more intelligent tools are required for a high-bandwidth transfer to CXFS. At this time, we recommend the bbcp command for copying large output from scratch to CXFS project space.

To use bbcp, you must first load the bbcp module:

module load bbcp

and then use bbcp similarly to how you would use cp:

bbcp /scratch/user/job/output.dat /cxfs/project5/group/output.dat

For very large amounts of data (10s of GB to multiple TB in size), some additional bbcp parameters can be used to improve throughput:

bbcp -B 8M -s 8 /scratch/user/job/output.dat /cxfs/project5/group/output.dat

The above options increase the block size and the number of data streams used by bbcp.
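Putting this together with the earlier advice about migrating output at the end of a job, the tail of a submission script might look like the following (a sketch using the example paths above):

module load bbcp
bbcp -B 8M -s 8 /scratch/user/job/output.dat /cxfs/project5/group/output.dat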

Backups

Files in Koronis home and project spaces are backed up nightly at 8pm Central time. If you delete a file that did not exist the last time a backup was run, that file cannot be restored. Please email help@msi.umn.edu with restore requests; include the specific location(s) of the file(s) that need to be restored as well as the time frame from which you'd like them restored. Koronis storage does not use the snapshots you may know from central MSI home and project spaces.