Intel Phi - Quickstart

How to use the Phi nodes on Cascade interactively?
How to compile programs for execution on a Phi accelerator?
How to run jobs on a Phi accelerator (MIC) in native mode?
How to run jobs on a Phi accelerator (MIC) in offload mode?
Good practice guidelines
How to submit batch jobs?

How to use the Phi nodes on Cascade interactively?

Two Phi nodes are available on Cascade. Each is equipped with Intel Xeon E5-2670 (Sandy Bridge) processors at 2.60 GHz (16 cores), 124 GB of memory, and a single Xeon Phi card based on the Intel Many Integrated Core (MIC) architecture.  Each Phi card features 60 cores at 1.053 GHz (240 hardware threads), 8 GB of memory, and 320 GB/s of memory bandwidth.

To use one of the Phi nodes interactively, users first need to log in to Cascade:

             ssh cascade.msi.umn.edu

Next, request a node with a Phi accelerator using the command:
qsub -I -l walltime=2:00:00,nodes=1:ppn=12:phi,pmem=200mb

Specific nodes may be requested by name using the commands:
qsub -I -l walltime=2:00:00,nodes=cas013:ppn=12:phi,pmem=200mb
or
qsub -I -l walltime=2:00:00,nodes=cas014:ppn=12:phi,pmem=200mb

The time and memory specifications in these commands can be changed to suit your job's needs.  More details on job submission commands, and their meaning, are available on our webpage here.

How to compile programs for Phi execution?

The Phi accelerators have a very limited software stack.  Only programs compiled using the Intel compilers with the -mmic option can currently execute on the Phi cards.  Programs must be compiled on a CPU before being executed on a Phi (compiling directly on a Phi is not recommended).  Additionally, runtime library paths and environment variables must be set up manually when using a Phi card (no modules are available on the Phi cards).

When compiling programs for execution on a Phi card it is recommended to use the latest Intel compilers.  Currently the newest installed version of the Intel compiler package is Composer_xe_2013_sp1.1.106.

To compile a program on a CPU in preparation for execution on a Phi accelerator first load the newest Intel compilers with:

module load intel/2013/sp1_update1

The code can then be compiled using one of the following commands (please note the -mmic option which is required for Phi card compatibility):
    FORTRAN   ifort -mmic -openmp -O3 your.f90
    C         icc -mmic -openmp -O3 your.c
    C++       icpc -mmic -openmp -O3 your.cpp
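
As a point of reference, a minimal OpenMP program that could serve as the your.c above is sketched below; the program body is only an illustration, not MSI-provided code.  It prints the thread count and a message from each thread, which makes it easy to verify the OMP_NUM_THREADS setting described later:

    #include <stdio.h>
    #include <omp.h>

    /* Minimal OpenMP test program (illustrative your.c).
       The number of threads is taken from OMP_NUM_THREADS at run time. */
    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("Running with %d threads\n", omp_get_num_threads());
            printf("Hello from thread %d\n", omp_get_thread_num());
        }
        return 0;
    }

Note that an executable built with -mmic (a.out by default) runs only on the Phi card, not on the host CPU.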

How to run jobs on a Phi accelerator in native mode?

The following gives the procedure to run a calculation natively on a Phi card:

0) Prior to accessing a Phi card in native mode you must first set up ssh keys.  This can be done as follows:

  1. Generate an ssh key using the command:
    ssh-keygen -t rsa
  2. Accept the default settings for the file name and location.  Enter a passphrase when
    prompted. This passphrase is only used with the ssh key (it is different from your account
    password). It is recommended that the passphrase be more than 8 characters long (it may
    contain letters, numbers, spaces, and symbols).
  3. Two files will be created in the .ssh directory in your home directory.  One of them has
    the .pub suffix (id_rsa.pub if you accepted the defaults). Append it to your authorized_keys
    file with the following command:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    This makes it possible for you to authenticate via the ssh keys. Your ssh key password will be
    the passphrase you entered earlier.

1) After beginning an interactive job on one of the Phi nodes (either cas013 or cas014), ssh into the Phi card using the command: 

         ssh mic0

 After you ssh to the Phi card, the card will reboot, which may take a few seconds.  The entire procedure, from job submission to Phi card access, is shown in the screenshot below:

2) Set the number of OpenMP threads the application should use, e.g.,

       export OMP_NUM_THREADS=30

3) When executing your program on a Phi card, the Intel compiler library paths must be specified manually (for whichever compiler version was used to compile the program).  If the newest Intel compilers were used, the library path can be specified with the command:

export LD_LIBRARY_PATH=/soft/intel/x86_64/2013/composer_xe_2013/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH

When using the MKL libraries you will additionally need to add the MKL library path.  For the newest Intel compiler package this can be done with the command:

export LD_LIBRARY_PATH=/soft/intel/x86_64/2013/composer_xe_2013/composer_xe_2013_sp1.1.106/mkl/lib/mic:$LD_LIBRARY_PATH

4) Change to the directory containing the executable:

Since the Phi cards have access to user home directories, many calculations can be launched directly if they do not need other external libraries.  For example:

   cd /panfs/roc/groups/$user/mic0_dir

    ./a.out

  Suggestion: Controlling thread affinity using KMP_AFFINITY may increase performance.

The environment variable KMP_AFFINITY controls the placement of program threads on the accelerator cores.  The variable can be set with:
export KMP_AFFINITY=verbose,${type}

where type can be compact or scatter.

More information about KMP_AFFINITY can be found at:
http://software.intel.com/en-us/articles/openmp-thread-affinity-control

How to run jobs on a Phi accelerator in offload mode?

The offload procedure for running multi-threaded jobs consists of three steps:

1. Insert offload directives into the code: Portions of an application can be offloaded by placing an offload directive before a block of code. The following examples show the use of offload directives for offloading calculations onto the Phi cards.

          C/C++ OpenMP example:
                  #pragma offload target (mic)
                  #pragma omp parallel for reduction(+:pi)
                  for (i=0; i<count; i++) {
                      float t = (float)((i+0.5)/count);
                      pi += 4.0/(1.0+t*t);   
                   }
                  pi /= count;

          Fortran OpenMP example:
                   !dir$ offload target(mic)
                   !$omp parallel do
                   do i=1,10
                      A(i) = B(i) * C(i)
                   enddo

           More options can be found  on this webpage: Effective Use of the Intel Compiler's Offload Features
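
For reference, a complete, self-contained version of the C fragment above is sketched below (the value of count is an illustrative assumption).  The loop is offloaded to the Phi card and parallelized with OpenMP; by default the scalar variables used in the offload region are copied to and from the card by the offload runtime:

    #include <stdio.h>

    /* Approximate pi by numerical integration.  The loop below is
       offloaded to the Phi card and parallelized with OpenMP. */
    int main(void)
    {
        const int count = 100000000;   /* illustrative problem size */
        float pi = 0.0f;
        int i;

        #pragma offload target(mic)
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < count; i++) {
            float t = (float)((i + 0.5) / count);
            pi += 4.0f / (1.0f + t * t);
        }
        pi /= count;

        printf("pi = %f\n", pi);
        return 0;
    }

This program can be built with the offload-mode compile commands shown in step 2 below.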

2. Compile the code: Intel compilers version 13 or newer can detect the presence of the Phi card.

    Users need to load the corresponding module before issuing the compile command, i.e.,
                module load intel/2013/sp1_update1
                icc -openmp -O3 -vec  code.c
                icpc -openmp -O3 -vec code.cpp
                ifort -openmp -O3 -vec code.f90

3. Set up the run-time environment on the host that controls the use of the Phi card

    To offload a job onto the Phi card, users need to first set up the working environment on the host (e.g., cas014) and then launch the job:

    source /soft/intel/x86_64/2013/composer_xe_2013/composer_xe_2013_sp1.1.106/bin/compilervars.sh

    ./MIC_offload.exe

     Since offload mode has the option of running the job on the CPU when the Phi card is not available, it is necessary to have a means of checking whether the job runs as desired, unless the mandatory flag is activated in the code.  The environment variable OFFLOAD_REPORT can be used to generate information about the Phi card's behavior.  To set this variable, one of the following commands may be used:

     export OFFLOAD_REPORT=1
  or  
     export OFFLOAD_REPORT=2

When this environment variable is set to 1 or 2, you should see offload report information from the Phi card.
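
In addition to OFFLOAD_REPORT, a program can check for itself where an offload region actually executed.  The sketch below relies on the __MIC__ preprocessor macro, which the Intel compiler defines when it generates the coprocessor-side version of an offload region; the program itself is an illustration rather than MSI-provided code:

    #include <stdio.h>

    /* Reports whether the offload region ran on the Phi card
       or fell back to the host CPU. */
    int main(void)
    {
        int on_mic = 0;

        #pragma offload target(mic)
        {
        #ifdef __MIC__
            on_mic = 1;    /* executed the MIC-compiled path */
        #else
            on_mic = 0;    /* offload fell back to the host */
        #endif
        }

        printf("Offload region ran on %s\n",
               on_mic ? "the Phi card" : "the host CPU");
        return 0;
    }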

Good practice guidelines

Intel has provided a great deal of information about good Phi practices at this webpage: Best Practice Guide - Intel Xeon Phi.  Here we list some commonly used tips for performance enhancement and features specific to MSI's Phi nodes.

1. Vectorize your code as much as possible

Due to the large SIMD width of 64 bytes, vectorization is even more important on the MIC architecture than on standard Intel Xeon processors. Users are encouraged to watch a video provided by Intel about how to vectorize code.
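
As an illustration, a loop written so that the compiler can vectorize it might look like the sketch below (the function name saxpy is just an example).  The unit-stride accesses and the restrict qualifiers (build as C99, e.g. with -std=c99) tell the compiler that the arrays do not overlap, and a vectorization report (e.g., -vec-report2) can be used to confirm that the loop was vectorized:

    /* Simple loop the compiler can auto-vectorize: unit-stride array
       accesses and no pointer aliasing thanks to restrict. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }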

2. Minimize data transfer and communication between the host and the Phi, as the overhead over the PCIe bus is costly.

As with GPGPU accelerators, data transfers to and from the Phi cards go through the relatively slow PCIe bus.  Hence, keeping data on the Phi cards as much as possible is key to achieving good performance.
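
When using the offload directives, one common way to avoid repeated transfers is to allocate the data on the card once and reuse it across offload regions.  The sketch below uses the alloc_if/free_if and nocopy clauses of the Intel offload pragmas for this purpose; the problem size and the arithmetic are illustrative only:

    #include <stdio.h>

    #define N 1000000
    static float data[N];

    int main(void)
    {
        int iter;

        /* Copy the array to the card once and keep it allocated there. */
        #pragma offload_transfer target(mic:0) in(data : length(N) alloc_if(1) free_if(0))

        for (iter = 0; iter < 10; iter++) {
            /* Reuse the resident copy: no transfer in either direction. */
            #pragma offload target(mic:0) nocopy(data : length(N) alloc_if(0) free_if(0))
            {
                int i;
                #pragma omp parallel for
                for (i = 0; i < N; i++)
                    data[i] = data[i] * 2.0f + 1.0f;
            }
        }

        /* Copy the result back and release the buffer on the card. */
        #pragma offload_transfer target(mic:0) out(data : length(N) alloc_if(0) free_if(1))

        printf("data[0] = %f\n", data[0]);
        return 0;
    }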

3. Take advantage of libraries tuned for the Phi card

Newer versions of Intel MKL (version 11) support the MIC architecture in both offload and native modes. Users can check the availability of MKL on the Phi nodes using the module avail command, i.e.,

     module avail mkl

and then choose the right MKL library.
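
For example, in native mode a program can call MKL routines directly on the card.  The sketch below multiplies two matrices with cblas_dgemm; it is only an illustration (the file name dgemm_test.c and the matrix size are assumptions) and could be built for the card with something like icc -mmic -openmp -O3 -mkl dgemm_test.c:

    #include <stdio.h>
    #include <mkl.h>

    /* Small dense matrix multiply, C = A * B, using MKL's cblas_dgemm. */
    int main(void)
    {
        const int n = 512;   /* illustrative matrix size */
        double *A = (double *)mkl_malloc(n * n * sizeof(double), 64);
        double *B = (double *)mkl_malloc(n * n * sizeof(double), 64);
        double *C = (double *)mkl_malloc(n * n * sizeof(double), 64);
        int i;

        for (i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        printf("C[0] = %f (expected %f)\n", C[0], 2.0 * n);

        mkl_free(A);
        mkl_free(B);
        mkl_free(C);
        return 0;
    }

MKL version 11 also provides an automatic offload mode for selected routines; see the Intel MKL documentation for details.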

4. Control the runtime environment

The optimal run-time environment settings depend on the application being used.  Although no single set of settings is optimal for all simulations, the environment variables described above (for example OMP_NUM_THREADS, KMP_AFFINITY, and OFFLOAD_REPORT) give users a way to test and adjust settings toward optimal performance.

How to submit batch jobs?

Batch mode is not yet enabled on the Phi cards.