Introduction to MPI

This tutorial will provide an introduction to the Message Passing Interface (MPI) library. The goal of this tutorial is to enable people to write, compile, and execute simple MPI programs on parallel computers.

1. Introduction

  • 1.1 What is MPI?

    • A message-passing library specification
      • message-passing model
      • not a compiler specification
    • For parallel computers, clusters, and heterogeneous networks
    • Designed to permit the development of parallel software libraries
    • Designed to provide access to advanced parallel hardware
      • end users
      • library writers
      • tool developers
  • 1.2 Motivation for a "New Design"

    • Message Passing now mature as programming paradigm
      • Well understood
      • efficient match to hardware
      • many applications
    • Vendor systems not portable
    • Portable systems are mostly research projects
      • incomplete
      • lack vendor support
      • not at most efficient level
    • Few systems offer the full range of desired features, such as modularity for libraries, access to peak performance, portability, heterogeneity, performance measurement tools, etc.
  • 1.3 The MPI Process

    • Began at the Williamsburg Workshop in April 1992
    • Organized at Supercomputing '92 (November)
    • Followed HPF format and process
    • Met every six weeks for two days
    • Extensive, open email discussions
    • Drafts, readings, votes
    • Pre-final draft distributed at Supercomputing '93
    • Two-month public comment period
    • Final version of draft in May 1994
    • Widely available on the web, ftp sites, netlib
    • Public implementations available
    • Vendor implementations available
  • 1.4 Who designed MPI?

    • Broad participation
    • Vendors: IBM, Intel, TMC, Meiko, Cray, Convex, Ncube
    • Library writers: PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda
    • Application specialists and consultants: Companies, Laboratories, Universities
  • 1.5 Is MPI Large or Small?

    • MPI is large (125 functions)
      • MPI's extensive functionality requires many functions
      • Number of functions not necessarily a measure of complexity
    • MPI is small (6 functions)
      • Many parallel programs can be written with just 6 basic functions

2. MPI Programs

To get a feel for writing MPI programs, we'll focus on the simplest program imaginable, the standard Hello world program. The basic outline of an MPI program follows these general steps:

  • initialize for communications
  • do the communications necessary for coordinating computation
  • exit in a "clean" fashion from the message-passing system when done communicating

MPI has over 125 functions. However, a beginning programmer can usually make do with only six of them, and these six functions are illustrated in our sample program.

2.1. Format of MPI calls

First, we'll look at the actual calling formats used by MPI.

  • C bindings

    For C, the general format is

    rc = MPI_Xxxxx(parameter, ... )

    Note that case matters here: the MPI prefix must be capitalized, as must the first character after the underscore, and everything after that must be lower case. The return value rc is an integer error code.

    C programs should include the file "mpi.h". This contains definitions for MPI constants and functions.
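
    As a minimal sketch (assuming this fragment sits inside main, after the usual includes), a return code can be compared with the predefined constant MPI_SUCCESS:

        int rc;

        rc = MPI_Init(&argc, &argv);        /* returns MPI_SUCCESS on success */
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Init failed with code %d\n", rc);
            MPI_Abort(MPI_COMM_WORLD, rc);  /* abort all processes */
        }

    Note that MPI's default error handler aborts the program on any error, so checks like this only take effect if that handler is changed.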

  • Fortran bindings

    For Fortran, the general form is

    Call MPI_XXXXX(parameter,..., ierror)

    Note that case is not important here. So, an equivalent form would be

    call mpi_xxxxx(parameter,..., ierror)

    Instead of the function returning with an error code, as in C, the Fortran versions of MPI routines usually have one additional parameter in the calling list, ierror, which is the return code.

    Fortran programs should include 'mpif.h'. This contains definitions for MPI constants and functions.

  • Both C and Fortran

    The exceptions to the above formats are the timing routines (MPI_WTIME and MPI_WTICK), which are functions in both C and Fortran and return double-precision floating-point values rather than an error code.
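
    For example, here is a small C fragment (a sketch; the work being timed is left as a comment) that times a region of code and reports the clock resolution:

        double t0, t1;

        t0 = MPI_Wtime();                  /* wall-clock time, in seconds */
        /* ... code to be timed ... */
        t1 = MPI_Wtime();
        printf("elapsed: %f s (clock resolution %g s)\n", t1 - t0, MPI_Wtick());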

2.2. An MPI sample program

As you look at the code below, note the various calls to MPI routines; each of the six basic functions appears at least once.

C version

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  int rank, size, tag, rc, i;
  MPI_Status status;
  char message[20];

  rc = MPI_Init(&argc, &argv);
  rc = MPI_Comm_size(MPI_COMM_WORLD, &size);
  rc = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  tag = 100;

  if (rank == 0) {
    strcpy(message, "Hello, world");
    for (i = 1; i < size; i++)
      rc = MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
  }
  else
    rc = MPI_Recv(message, 13, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);

  printf("node %d : %.13s\n", rank, message);
  rc = MPI_Finalize();
  return 0;
}

Fortran version


      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, i
      integer status(MPI_STATUS_SIZE)
      character(12) message

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message

      call MPI_FINALIZE(ierror)
      end

To summarize the program: This is a SPMD code, so copies of this program are running on multiple nodes. Each process initializes itself with MPI (MPI_INIT), determines the number of processes (MPI_COMM_SIZE), and learns its rank (MPI_COMM_RANK). Then one process (with rank 0) sends messages in a loop (MPI_SEND), setting the destination argument to the loop index to ensure that each of the other processes is sent one message. The remaining processes receive one message (MPI_RECV). All processes then print the message, and exit from MPI (MPI_FINALIZE).


3. MPI messages

MPI messages consist of two basic parts: the actual data that you want to send/receive, and an envelope of information that helps to route the data. There are usually three calling parameters in MPI message-passing calls that describe the data, and another three parameters that specify the routing:

Message = data (3 parameters) + envelope (3 parameters)

   startbuf, count, datatype      dest, tag, comm
   \_______________________/      \_____________/
             DATA                    ENVELOPE

Let's look at the data and envelope in more detail. We'll describe each parameter, and discuss whether these must be coordinated between the sender and receiver.

3.1. Data

The buffer (location in your program where data are to be sent from or stored to) is specified by three calling parameters:

  • Startbuf

    The address where the data start. For example, this could be the start of an array in your program.

  • Count

    The number of elements (items) of data in the message. Note that this is elements, not bytes. This makes for portable code, since you don't have to worry about different representations of data types on different computers. The software implementation of MPI determines the number of bytes automatically.

    The count specified by the receive call should be greater than or equal to the count specified by the send. If more data arrives than the receive buffer can hold, an error occurs. (A sketch showing how the receiver can check how many elements actually arrived follows the datatype tables below.)

  • Datatype

    The type of data to be transmitted. For example, this could be floating point. The datatype should be the same for the send and receive calls. As an aside, an exception to this rule is the datatype MPI_PACKED, which is one method of handling mixed-type messages (the preferred method is to use derived datatypes).

    The types of data already defined for you are called "basic datatypes," and are listed below. WARNING: the names differ slightly between the C and Fortran bindings.

MPI Basic Datatypes for C

        ---------------------------------------
        MPI Datatype         C datatype          
        ---------------------------------------
        MPI_CHAR             signed char         
        MPI_SHORT            signed short int    
        MPI_INT              signed int          
        MPI_LONG             signed long int     
        MPI_UNSIGNED_CHAR    unsigned char       
        MPI_UNSIGNED_SHORT   unsigned short int  
        MPI_UNSIGNED         unsigned int        
        MPI_UNSIGNED_LONG    unsigned long int   
        MPI_FLOAT            float               
        MPI_DOUBLE           double              
        MPI_LONG_DOUBLE      long double         
        MPI_BYTE                                 
        MPI_PACKED                               
        ---------------------------------------

MPI Basic Datatypes for Fortran

        ---------------------------------------
        MPI Datatype           Fortran Datatype  
        ---------------------------------------
        MPI_INTEGER            INTEGER           
        MPI_REAL               REAL              
        MPI_DOUBLE_PRECISION   DOUBLE PRECISION  
        MPI_COMPLEX            COMPLEX           
        MPI_LOGICAL            LOGICAL           
        MPI_CHARACTER          CHARACTER(1)      
        MPI_BYTE                                 
        MPI_PACKED                               
        ---------------------------------------
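
As promised under Count above, here is a sketch of the count rule (the buffer size, source rank, and tag are arbitrary choices): the receiver posts room for up to 100 doubles, then uses MPI_Get_count to ask how many elements were actually delivered.

        double buf[100];
        MPI_Status status;
        int received;

        /* Accept up to 100 elements; the sender may send fewer. */
        MPI_Recv(buf, 100, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);

        /* Ask how many elements of type MPI_DOUBLE actually arrived. */
        MPI_Get_count(&status, MPI_DOUBLE, &received);
        printf("received %d elements\n", received);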


3.2 Envelope

As mentioned earlier, a message consists of the actual data and the message envelope. The envelope provides information on how to match sends to receives. The three parameters used to specify the message envelope are:

  • Destination or source

    These arguments are set to a rank in a communicator (see below). Ranks range from 0 to (size-1) where size is the number of processes in the communicator.

    Destination is specified by the send and is used to route the message to the appropriate process. Source is specified by the receive. Only messages coming from that source can be accepted by the receive call. The receive can set source to MPI_ANY_SOURCE to indicate that any source is acceptable.

  • Tag

    An arbitrary number to help distinguish among messages. The tags specified by the sender and receiver must match. The receiver can specify MPI_ANY_TAG to indicate that any tag is acceptable. (A short sketch of a receive using these wildcards follows the billing analogy below.)

  • Communicator

    The communicator specified by the send must equal that specified by the receive. We'll describe communicators in more depth later in the module. For now, we'll just say that a communicator defines a communication "universe", and that processes may belong to more than one communicator. In this module, we will only be working with the predefined communicator MPI_COMM_WORLD which includes all processes in the application.

    To help understand the message envelope, let's consider the analogy of a bill collection agency that collects for several utilities. When sending a bill the agency must specify:


    1. The person receiving the bill (more specifically, their ID number). This is the destination.
    2. What month the bill covers. Since the person will get twelve bills each year, they need to know which one this is. This is the tag.
    3. Which utility is being collected for. The person needs to know whether this is their electric or phone bill. This is the communicator.
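
Putting the three envelope parameters together, here is a sketch in C (the message content and datatype are our own choices) of a receive that accepts any source and any tag, then inspects the envelope information recorded in the status object:

        int value;
        MPI_Status status;

        /* Wildcards: accept a message from any sender, with any tag. */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        /* The status object records the actual source and tag. */
        printf("received %d from rank %d with tag %d\n",
               value, status.MPI_SOURCE, status.MPI_TAG);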


4. Communicators


4.1 Why have communicators?

As promised earlier, we are now going to explain a little more about communicators. We will not go into great detail -- only enough that you will have some understanding of how they are used.

As described above, a message's eligibility to be picked up by a specific receive call depends on its source, tag, and communicator. Tag allows the program to distinguish between types of messages. Source simplifies programming. Instead of having a unique tag for each message, each process sending the same information can use the same tag. But why is a communicator needed?

An example

Suppose you are sending messages between your processes, but you are also calling a library you obtained elsewhere, which itself runs on multiple nodes and communicates internally using MPI. In this case, you want to make sure that the messages you send go to your processes and do not get confused with the messages being sent internally within the library routines.

In this example, we have three processes communicating with each other. Each process also calls a library routine, and the three library routines communicate with each other. We want to have two different message "spaces" here, one for our messages, and one for the library's messages. We do not want any intermingling of the messages.

Consider what we would like to happen: our own routines (the "caller" code) exchange messages only with each other, and the library routines (the "callee" code) exchange messages only with each other, with each send matched to its intended receive. In that case, everything works as intended.

However, there is no guarantee that events will occur in this order, because the relative scheduling of processes on different nodes is non-deterministic. Suppose, for example, that we change the third process by adding some computation at the beginning.

Communications may then fail to occur as intended. The first receive in process 0 can match the send from the library routine in process 1, rather than the intended (and now delayed) send from process 2. As a result, all three processes hang.

The library developer solves this problem by defining a new communicator and specifying it in all send and receive calls made by the library. This creates a library ("callee") message space separate from the user's ("caller") message space. Tags cannot be used to separate the message spaces, because tags are chosen by the programmer: how can you avoid using the same tags as the library when you don't know which tags the library uses? Communicators, in contrast, are returned by the system and incorporate a system-wide unique identifier.
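
One standard way for a library to obtain such a private message space is MPI_Comm_dup, which creates a new communicator containing the same processes as an existing one. This is only a sketch of the idea (the variable name libcomm is ours):

        MPI_Comm libcomm;

        /* Duplicate the caller's communicator.  Messages sent on libcomm  */
        /* can never match receives posted on MPI_COMM_WORLD, and vice     */
        /* versa, even if the ranks and tags happen to coincide.           */
        MPI_Comm_dup(MPI_COMM_WORLD, &libcomm);

        /* ... all of the library's internal sends and receives use libcomm ... */

        MPI_Comm_free(&libcomm);   /* release the duplicate when finished  */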


4.2 Communicators and process groups

In addition to development of parallel libraries, communicators are useful in organizing communication within an application.

Up to this point, we've looked at communicators that include all processes in the application. But the programmer can also define a subset of processes, called a process group, and attach one or more communicators to the process group. Communication specifying that communicator is now restricted to those processes.

This also ties directly into use of collective communications.

To revisit the bill collection analogy: one person may have an account with the electric and phone companies (2 communicators) but not with the water company. The electric communicator may contain different people than the phone communicator. A person's ID number (rank) may vary with the utility (communicator). So, it is critical to note that the rank given as message source or destination is the rank in the specified communicator.
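
As a sketch of these ideas (the even/odd split is an arbitrary example), MPI_Comm_split builds a communicator over a subset of processes, and it also shows that a process's rank can differ from one communicator to another:

        int world_rank, sub_rank;
        MPI_Comm subcomm;

        /* Split MPI_COMM_WORLD into two communicators: one holding the    */
        /* even world ranks, the other holding the odd world ranks.        */
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);

        /* A process's rank in subcomm is generally different from its     */
        /* rank in MPI_COMM_WORLD (world rank 3 becomes rank 1 in the      */
        /* "odd" communicator, for example).                               */
        MPI_Comm_rank(subcomm, &sub_rank);
        printf("world rank %d has rank %d in its subcommunicator\n",
               world_rank, sub_rank);

        MPI_Comm_free(&subcomm);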

5. Summary

Although MPI provides for an extensive, and sometimes complex, set of calls, you can begin with just the six basic calls:

  • MPI_INIT
  • MPI_COMM_RANK
  • MPI_COMM_SIZE
  • MPI_SEND
  • MPI_RECV
  • MPI_FINALIZE

However, for programming convenience and optimization of your code, you should consider using other calls, such as those described in the more advanced references.
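
For instance, the send/receive loop in the sample program of section 2.2 could be replaced by a single collective call, MPI_Bcast, which delivers a buffer from one root rank to every process in the communicator (a sketch, reusing the variables from that program):

        /* Called by every process; rank 0 supplies the data, and all      */
        /* ranks, including 0, end up with a copy of the message.          */
        if (rank == 0)
            strcpy(message, "Hello, world");
        MPI_Bcast(message, 13, MPI_CHAR, 0, MPI_COMM_WORLD);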

MPI Messages

MPI messages consist of two parts:

data (startbuf, count, datatype)
envelope (destination/source, tag, communicator)

The data defines the information to be sent or received. The envelope is used in routing messages to the receiver, and in matching send calls to receive calls.

Communicators

Communicators guarantee unique message spaces. In conjunction with process groups, they can be used to limit communication to a subset of processes.


6. Acknowledgements and References

This talk was based on the "Basics of MPI Programming" developed at the Cornell Theory Center and on the tutorial by Ewing Lusk.

Good MPI references are:

William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994.
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, June 12, 1995.