Collective Communications with MPI

1. Introduction

Collective communication involves sending and receiving data among all the processes in a "communicator universe" (communicators were discussed in the Introduction to MPI tutorial). MPI's collective communication routines are built from point-to-point communication routines and tuned by the vendors to deal effectively with the underlying communication complexity. Certainly, you can build your own collective communication routines from point-to-point calls, but that could involve a lot of extra work. There is no sense in reinventing the wheel.

Although other message-passing libraries provide some collective communication calls, none of them provides a set of collective communication routines as complete and robust as those provided by MPI. This page introduces the basic characteristics of collective communication, the collective operation routines, and new information from vendors on their collective communication implementations.


2. MPI Collective Communication Routines

  • Barrier synchronization
  • Broadcast from one member to all other members
  • Gather data from all members to one member
  • Scatter data from one member to all other members
  • All-to-all exchange of data
  • Global reduction (e.g., sum, min of "common" data element to one node)
  • Scan across all members of a group

2.1 Characteristics

MPI collective communication routines differ in many ways from MPI point-to-point communication routines, which were introduced in MPI Point-to-Point Communication I. Here are the characteristics of MPI collective communication routines:

  • Involve coordinated communication within a group of processes identified by an MPI communicator.
  • Substitute for a more complex sequence of point-to-point calls.
  • All routines block until they are locally complete.
  • Communications may or may not be synchronized, depending on how the vendor chose to implement them. Thus, although the caller is free to modify the buffer once the call returns, other processes in the group may not yet have started the operation. The one exception is the barrier routine, which by definition enforces synchronization.
  • In some cases, a root process originates or receives all data.
  • The amount of data sent must exactly match the amount of data specified by the receiver.
  • There are many variations on the basic categories.
  • No message tags are needed.

 


2.2 Barrier Synchronization Routines

In parallel applications in a distributed-memory environment, explicit or implicit synchronization is sometimes required. As with other message-passing libraries, MPI provides a function call, MPI_BARRIER, to synchronize all processes within a communicator. A barrier is simply a synchronization primitive: a node calling it is blocked until all the nodes within the group have called it. The syntax of MPI_BARRIER for both C and FORTRAN programs is given below:

C
int MPI_Barrier (MPI_Comm comm)

where:

MPI_Comm is an MPI predefined structure for communicators, and
comm is a communicator.
FORTRAN
MPI_BARRIER (comm, ierr)

where:

comm is an integer denoting a communicator, and
ierr is an integer return error code.
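As a minimal illustration, the following C sketch uses MPI_Barrier to make sure every process has finished its setup before a timed phase begins; the use of MPI_Wtime for timing is simply one common pattern, not part of the barrier itself:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... each process does some independent setup work here ... */

    /* Block until every process in MPI_COMM_WORLD has reached this call. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* A common use: start a timer only after all processes are synchronized. */
    t0 = MPI_Wtime();

    /* ... timed parallel phase ... */

    printf("Process %d passed the barrier at t0 = %f\n", rank, t0);

    MPI_Finalize();
    return 0;
}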

 


2.3 Data Movement Routines

MPI provides five types of collective data movement routines that come in two variants: "simple" in which all communicated items are the same size and "vector" where each item can be a different size. The MPI collective data movement routines are (the 'v' at the end indicates vector variant):

  • broadcast - send data from one member to all members of a group
  • gather, gatherv - gather data from all group members to one member
  • scatter, scatterv - scatter data from one member to all members of a group
  • allgather, allgatherv - a variation on gather where all members of the group receive the result
  • alltoall, alltoallv - another variation in which data is scattered or gathered from all members to all members

Now, let's take a look at the functionality and syntax of these routines.

 


2.3.1 Broadcast

In many cases, one processor needs to send (broadcast) some data (either scalar or vector) to all the processes in a group. MPI provides the broadcast primitive MPI_BCAST to accomplish this task. The syntax of the MPI_BCAST is given by:

C
int MPI_Bcast (void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
FORTRAN
MPI_BCAST (buffer, count, datatype, root, comm, ierr)

where:

buffer is the starting address of a buffer,
count is an integer indicating the number of data elements in the buffer,
datatype is an MPI defined constant indicating the data type of the elements in the buffer,
root is an integer indicating the rank of broadcast root process, and
comm is the communicator.

The MPI_BCAST must be called by each node in the group, specifying the same comm and root. The message is sent from the root process to all processes in the group, including the root process.
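As a minimal illustration, the following C sketch broadcasts a single integer, assumed here to be a problem size known initially only to the root, to every process in MPI_COMM_WORLD (the value 100 and the variable name nvals are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nvals = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        nvals = 100;               /* e.g., a problem size known only to the root */

    /* Every process, including the root, makes the same call. */
    MPI_Bcast(&nvals, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d now has nvals = %d\n", rank, nvals);

    MPI_Finalize();
    return 0;
}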

2.3.2 Gather and Scatter

Often, an array is distributed across all the processes in a group, and one wants to collect the pieces of the array onto one specified process. At other times one wants to do the reverse: distribute the data in n equal segments to the processes in a group. MPI provides two functions, GATHER and SCATTER, to perform this kind of communication. In addition, each comes in two "flavors": a vector version in which the number of data items collected from (or sent to) each node may differ, and a more efficient simple version for the special case where the number per node is the same. Their syntax is given below:

C
int MPI_Gather (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm )
int MPI_Scatter (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN
MPI_GATHER (sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
MPI_SCATTER (sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)

where, for the Gather routines:

sbuf is the starting address of send buffer,
scount is the number of elements in the send buffer,
stype is the data type of send buffer elements,
rbuf is the starting address of the receive buffer,
rcount is the number of elements for any single receive,
rtype is the data type of the receive buffer elements,
root is the rank of receiving process, and
comm is the communicator.
    Note: rbuf, rcount, and rtype are significant for root process only.

and for the Scatter routines:

sbuf is the address of the send buffer,
scount is the number of elements sent to each process,
stype is the data type of the send buffer elements,
rbuf is the address of the receive buffer,
rcount is the number of elements in the receive buffer,
rtype is the data type of the receive buffer elements,
root is the rank of the sending process, and
comm is the communicator.
    Note: sbuf is significant for root process only.

In the gather operation, each process (root process included) sends the scount elements of type stype in its sbuf to the root process. The root process receives the messages and stores them in rank order in rbuf. For scatter, the reverse holds: the root process sends a buffer of N chunks of data (N = number of processes in the group) so that the process with rank 0 gets the first chunk, rank 1 gets the second chunk, and so on.
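To make the symmetry concrete, here is a minimal C sketch, with illustrative values and variable names, in which the root scatters one integer to each process, every process does a little local work, and the root then gathers the results back in rank order:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, myval;
    int *sendbuf = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Only the root needs the full-size buffers. */
        sendbuf = malloc(nprocs * sizeof(int));
        recvbuf = malloc(nprocs * sizeof(int));
        for (int i = 0; i < nprocs; i++)
            sendbuf[i] = 10 * i;           /* chunk i is destined for rank i */
    }

    /* Rank i receives sendbuf[i]; each chunk here is a single MPI_INT. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &myval, 1, MPI_INT, 0, MPI_COMM_WORLD);

    myval += 1;                            /* local work on the received chunk */

    /* The root collects one value from every process, stored in rank order. */
    MPI_Gather(&myval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("recvbuf[%d] = %d\n", i, recvbuf[i]);
        free(sendbuf);
        free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}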

As a further illustration, consider the following example for gather:

  • Matrix-vector multiplication of matrix distributed by rows
  • Output vector needed in entirety by one processor

 

Sample Code

      INCLUDE 'mpif.h'
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DATA root/0/

      DO I=1,25
            cpart(I) = 0.
            DO K=1,100
                  cpart(I) = cpart(I) + A(I,K)*b(K)
            END DO
      END DO
      call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &                root, MPI_COMM_WORLD, ierr)

A: matrix distributed by rows

b: vector shared by all processors

c: result vector, computed in pieces (cpart) by each processor independently and gathered into ctotal on the root



2.3.3 Allgather

MPI_ALLGATHER can be thought of as MPI_GATHER where all processes, not just the root, receive the result. The jth block of the receive buffer is the block of data sent from the jth process. A similar relationship holds for MPI_ALLGATHERV and MPI_GATHERV. The syntax of MPI_ALLGATHER and MPI_ALLGATHERV are similar to MPI_GATHER and MPI_GATHERV, respectively. However, the argument root is dropped from MPI_ALLGATHER and MPI_ALLGATHERV.

C
int MPI_Allgather (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm )
int MPI_Allgatherv (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)
FORTRAN
MPI_ALLGATHER (sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
MPI_ALLGATHERV (sbuf, scount, stype, rbuf, rcount, displs, rtype, comm, ierr)

The variables for Allgather are:

sbuf is the starting address of send buffer,
scount is the number of elements in send buffer,
stype is the data type of send buffer elements,
rbuf is the address of receive buffer,
rcount is the number of elements received from any process,
rtype is the data type of receive buffer elements,
comm is the group communicator.

Note: The arguments are the same as MPI_GATHER or MPI_GATHERV except for no root argument.

 

 

Example:

  • Matrix-vector multiplication of matrix distributed by rows
  • Output vector needed in entirety by all processors

Sample Code


      INCLUDE 'mpif.h'
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)

      DO I=1,25
            cpart(I) = 0.
            DO K=1,100
                  cpart(I) = cpart(I) + A(I,K)*b(K)
            END DO
      END DO
      call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &                   MPI_COMM_WORLD, ierr)

 


2.3.4 AlltoAll

In applications such as matrix transpose and FFT, an MPI_ALLTOALL call is very helpful. It is an extension of ALLGATHER in which each process sends distinct data to each receiver: the jth block sent from process i is received by process j and stored in the ith block of its receive buffer.

The syntax of MPI_ALLTOALL is:

C
int MPI_Alltoall (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm )
FORTRAN
MPI_ALLTOALL (sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)

The variables for Alltoall are:

sbuf is the starting address of send buffer,
scount is the number of elements sent to each process,
stype is the data type of send buffer elements,
rbuf is the address of receive buffer,
rcount is the number of elements received from any process,
rtype is the data type of receive buffer elements, and
comm is the group communicator.

Note: Same specification as ALLGATHER, except sbuf must contain scount*NPROC elements.
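As a minimal illustration (the counts and values are assumed purely for demonstration), the following C sketch has every process send one distinct integer to every other process; note that sendbuf holds scount*NPROC elements, as required:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendbuf = malloc(nprocs * sizeof(int));
    int *recvbuf = malloc(nprocs * sizeof(int));

    /* sendbuf must hold scount*NPROC elements: block j is destined for rank j. */
    for (int j = 0; j < nprocs; j++)
        sendbuf[j] = 100 * rank + j;

    /* After the call, recvbuf[i] holds the block that rank i sent to this rank. */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    for (int i = 0; i < nprocs; i++)
        printf("rank %d received %d from rank %d\n", rank, recvbuf[i], i);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}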


2.4 Advanced Features

 

MPI collective communication provides powerful functionality for handling complicated data movement, for example when the data on the root process is not contiguous, when different processes need to send or receive messages of different sizes, or when the order in which data is assigned to (or collected from) processes differs from rank order. The routines that carry out this kind of advanced data movement are:


  • Gatherv
  • Scatterv
  • Allgatherv
  • Alltoallv

What does the "v" stand for? Varying: the size and relative location of the messages may vary from process to process. The advantages are:

  • more flexibility in writing code
  • less need to copy data into temporary buffers
  • more compact final code
  • the vendor's implementation may be optimal (if not, you may be trading performance for convenience)

2.4.1 Gatherv and Scatterv

MPI_GATHERV and MPI_SCATTERV are the vector versions of MPI_GATHER and MPI_SCATTER. MPI_GATHERV extends the functionality of MPI_GATHER by changing the receive count from a single integer to an integer array and by adding a new array argument, displs. This allows a varying amount of data to be gathered from each process and gives some flexibility in where the gathered data is placed on the root. MPI_SCATTERV extends MPI_SCATTER in the same way.

C
int MPI_Gatherv (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Scatterv (void* sbuf, int* scount, int* displs, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN
MPI_GATHERV (sbuf, scount, stype, rbuf, rcount, displs, rtype, root, comm, ierr)
MPI_SCATTERV (sbuf, scount, displs, stype, rbuf, rcount, rtype, root, comm, ierr)

The variables for Gatherv are:

sbuf is the starting address of send buffer,
scount is the number of elements in send buffer,
stype is the data type of send buffer elements,
rbuf is the starting address of receive buffer,
rcount is the array containing number of elements to be received from each process,
displs is the array specifying the displacement relative to rbuf at which to place the incoming data from corresponding process,
rtype is the data type of receive buffer elements,
root is the rank of receiving process, and
comm is the group communicator.
    Note: rbuf, rcount, and rtype are significant for the root process only. Thus, one can allocate space on the root only for the receive buffer and save memory on the other processes.

The variables for Scatterv are:

sbuf is the address of send buffer,
scount is the integer array specifying the number of elements to send to each process,
displs is the array specifying the displacement relative to sbuf from which to take the data going out to the corresponding process,
stype is the data type of send buffer elements,
rbuf is the address of receive buffer,
rcount is the number of elements in receive buffer,
rtype is the data type of receive buffer elements,
root is the rank of the sending process, and
comm is the group communicator.
    Note: sbuf is significant for the root process only.

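A minimal C sketch of MPI_GATHERV, assuming rank r contributes r+1 integers and the root packs the contributions contiguously in rank order (the counts, displacements, and values are purely illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int *rcounts = NULL, *displs = NULL, *rbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process contributes a different amount: rank r sends r+1 integers. */
    int scount = rank + 1;
    int *sbuf = malloc(scount * sizeof(int));
    for (int i = 0; i < scount; i++)
        sbuf[i] = rank;

    if (rank == 0) {
        /* Only the root needs rcounts, displs, and the receive buffer. */
        rcounts = malloc(nprocs * sizeof(int));
        displs  = malloc(nprocs * sizeof(int));
        int total = 0;
        for (int i = 0; i < nprocs; i++) {
            rcounts[i] = i + 1;        /* how many elements come from rank i */
            displs[i]  = total;        /* where rank i's data lands in rbuf  */
            total += rcounts[i];
        }
        rbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sbuf, scount, MPI_INT,
                rbuf, rcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int total = nprocs * (nprocs + 1) / 2;
        for (int i = 0; i < total; i++)
            printf("%d ", rbuf[i]);
        printf("\n");
        free(rcounts); free(displs); free(rbuf);
    }
    free(sbuf);
    MPI_Finalize();
    return 0;
}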

Scatter vs. Scatterv

Purpose of the scatter operation:
  • to send different chunks of data to different processes
  • not the same as broadcast (where the same chunk goes to all)

Extra capabilities in scatterv:
  • gaps are allowed between messages in the source data (but each individual message must still be contiguous)
  • irregular message sizes are allowed
  • data can be distributed to processes in any order

It is possible to work around the "contiguous" requirement by using MPI derived datatypes: one can define a user datatype that contains gaps or skips and use it as one's SENDTYPE or RECVTYPE, as sketched below. The disadvantage of this trick is that the full length (extent) of that derived datatype, including the skips, becomes the basic counting unit when measuring counts and displacements for messages. As a consequence, it is not possible to interleave data objects, because one cannot count in fractions of a type.
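A minimal C sketch of this idea, under the illustrative assumption that the root wants the value gathered from rank j to land every STRIDE integers apart in its receive array; the trailing gap is built into the receive type by resizing its extent with MPI_Type_create_resized:

#include <mpi.h>
#include <stdio.h>

#define STRIDE 4   /* illustrative: leave a gap of 3 ints after each received value */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int rbuf[STRIDE * 64];             /* assumes at most 64 processes in this sketch */
    MPI_Datatype strided_int;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* An int whose extent is STRIDE ints: the gap is part of the type itself,
       so the type's extent becomes the counting unit for placement at the root. */
    MPI_Type_create_resized(MPI_INT, 0, (MPI_Aint)(STRIDE * sizeof(int)), &strided_int);
    MPI_Type_commit(&strided_int);

    int myval = rank + 1;

    /* Block j is placed at rbuf + j * extent(strided_int), i.e. every STRIDE ints. */
    MPI_Gather(&myval, 1, MPI_INT, rbuf, 1, strided_int, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int j = 0; j < nprocs; j++)
            printf("rbuf[%d] = %d\n", j * STRIDE, rbuf[j * STRIDE]);

    MPI_Type_free(&strided_int);
    MPI_Finalize();
    return 0;
}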
       

2.4.2 Allgatherv

The MPI_ALLGATHERV routine collects individual messages from every task covered by the same communicator and distributes the resulting message to all tasks. The messages can have different sizes and displacements.

C
int MPI_Allgatherv (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)
FORTRAN
MPI_ALLGATHERV (sbuf, scount, stype, rbuf, rcount, displs, rtype, comm, ierr)

The parameters for Allgatherv are:

sbuf is the starting address of send buffer,
scount is the number of elements in send buffer,
stype is the data type of send buffer elements,
rbuf is the address of receive buffer,
rcount is the integer array containing the number of elements received from each process,
displs is the integer array specifying the displacement relative to rbuf at which to place the incoming data from the corresponding process,
rtype is the data type of receive buffer elements, and
comm is the group communicator.

2.4.3 Alltoallv

MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that the location of data for the send is specified by sdispls and the location of the placement of the data on the receive side is specified by rdispls. The jth block sent from process i is received by process j and is placed in the ith block of recvbuf. These blocks need not all have the same size.

The type signature associated with sendcount[j], sendtype at process i must be equal to the type signature associated with recvcount[i], recvtype at process j. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. Distinct type maps between sender and receiver are still allowed.

C
int MPI_Alltoallv (void* sendbuf, int* sendcounts, int* sdispls, MPI_Datatype sendtype, void* recvbuf, int* recvcounts, int* rdispls, MPI_Datatype recvtype, MPI_Comm comm)
FORTRAN
MPI_ALLTOALLV (sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm, ierr)

With the following variables:

sendbuf is the starting address of send buffer (choice),
sendcounts is the integer array (of length group size) specifying the number of elements to send to each process,
sdispls is the integer array (of length group size); entry j specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process j,
sendtype is the data type of send buffer elements (handle),
recvbuf is the address of receive buffer (choice),
recvcounts is the integer array (of length group size) specifying the number of elements that can be received from each process,
rdispls is the integer array (of length group size); entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i,
recvtype is the data type of receive buffer elements (handle),
comm is the communicator, and
ierr is the integer return error code (FORTRAN only).
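A minimal C sketch of MPI_ALLTOALLV, under the illustrative assumption that every rank sends j+1 integers to rank j (so rank r receives r+1 integers from every source); the counts, displacements, and values are purely for demonstration:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendcounts = malloc(nprocs * sizeof(int));
    int *sdispls    = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *rdispls    = malloc(nprocs * sizeof(int));

    /* Every rank sends j+1 integers to rank j, packed back to back in sendbuf. */
    int stotal = 0, rtotal = 0;
    for (int j = 0; j < nprocs; j++) {
        sendcounts[j] = j + 1;
        sdispls[j]    = stotal;
        stotal       += sendcounts[j];

        /* Therefore this rank receives rank+1 integers from every source j. */
        recvcounts[j] = rank + 1;
        rdispls[j]    = rtotal;
        rtotal       += recvcounts[j];
    }

    int *sendbuf = malloc(stotal * sizeof(int));
    int *recvbuf = malloc(rtotal * sizeof(int));
    for (int k = 0; k < stotal; k++)
        sendbuf[k] = rank;                 /* tag outgoing data with the sender's rank */

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d integers in total\n", rank, rtotal);

    free(sendcounts); free(sdispls); free(recvcounts); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}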

2.4.4 Alltoallw

MPI_ALLTOALLW is an extension of MPI_ALLTOALLV defined in MPI-2. MPI_ALLTOALLW allows separate specification of count, displacement, and datatype for each block. In addition, to allow maximum flexibility, the displacement of blocks within the send and receive buffers is specified in bytes. MPI_ALLTOALLW generalizes several MPI functions when its input arguments are chosen carefully; for example, making all but one process have sendcounts[i] = 0 achieves an MPI_SCATTERW-like function.

C
int MPI_Alltoallw (void* sendbuf, int* sendcounts, int* sdispls, MPI_Datatype* sendtypes, void* recvbuf, int* recvcounts, int* rdispls, MPI_Datatype* recvtypes, MPI_Comm comm)
FORTRAN
MPI_ALLTOALLW (sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm, ierr)

The variables are defined as:

sendbuf is the starting address of send buffer (choice),
sendcounts is the integer array equal to the group size specifying the number of elements to send to each processor,
sdispls is the integer array (of length group size). Entry j specifies the displacement in bytes (relative to sendbuf) from which to take the outgoing data destined for process j,
sendtypes is the array of datatypes (of length group size). Entry j specifies the type of data to send to process j (handle),
recvbuf is the address of receive buffer (choice),
recvcounts is the integer array equal to the group size specifying the number of elements that can be received from each process,
rdispls is the integer array (of length group size). Entry i specifies the displacement in bytes (relative to recvbuf) at which to place the incoming data from process i ,
recvtypes is the array of datatypes (of length group size). Entry i specifies the type of data received from process i (handle)
comm is the communicator (handle).

2.5 Global Reduction Routines

MPI_REDUCE

One of the most useful collective operations is a global reduction, or combine, operation. The partial result from each process in the group is combined on one specified process, or on all processes, using some desired function. If there are n processes in the process group, and D(i,j) is the jth data item in process i, then the jth item of data in the root process resulting from a reduce operation is given by:

D(j) = D(0,j) * D(1,j) * ... * D(n-1,j)

where * is the reduction function, which is always assumed to be associative. All MPI predefined functions are also assumed to be commutative; one may define functions that are associative but not commutative. Each process can provide either one element or a sequence of elements; in both cases the combine operation is executed element-wise on each element of the sequence. There are three versions of reduce: MPI_REDUCE, MPI_ALLREDUCE, and MPI_REDUCE_SCATTER. The form of these reduction primitives is listed below:

C
int MPI_Reduce (void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int root, MPI_Comm comm)
int MPI_Allreduce (void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
int MPI_Reduce_scatter (void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
FORTRAN
MPI_REDUCE (sbuf, rbuf, count, stype, op, root, comm, ierr)
MPI_ALLREDUCE (sbuf, rbuf, count, stype, op, comm, ierr)
MPI_REDUCE_SCATTER (sbuf, rbuf, rcounts, stype, op, comm, ierr)

The differences among these three reduction routines are:

  • MPI_REDUCE returns results to a single process;
  • MPI_ALLREDUCE returns results to all processes in the group;
  • MPI_REDUCE_SCATTER combines a reduction with a scatter: the reduced result vector is split into disjoint segments and scattered across the processes, with process i receiving rcounts(i) elements.
sbuf is the address of send buffer,
rbuf is the address of receive buffer,
count is the number of elements in send buffer,
stype is the data type of elements of send buffer,
op is the reduce operation (which may be MPI predefined, or your own),
root is the rank of the root process, and
comm is the group communicator.
Notes:
* rbuf is significant only at the root process (for MPI_REDUCE), and
* the argument rcounts in MPI_REDUCE_SCATTER is an array.
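As a minimal illustration of the "all" variant, the following C sketch uses MPI_Allreduce to form a global sum that every process needs, for example in a convergence check; the local values are assumed purely for demonstration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local_err, global_err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend each process computed a local error norm for its subdomain. */
    local_err = 1.0 / (rank + 1);

    /* Every process receives the sum; no root argument is needed. */
    MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global error = %f\n", rank, global_err);

    MPI_Finalize();
    return 0;
}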

Scan

MPI_SCAN performs a prefix reduction across the group: the result returned at process k is the reduction of the values contributed by processes 0 through k. If d(i,j) is the jth data item in process i, then the jth result item at process k is:

D(k,j) = d(0,j) * d(1,j) * ... * d(k,j)

where * is the reduction function, which may be either an MPI predefined function or a user-defined function.

The syntax of the MPI scan routine is:

C
int MPI_Scan (void* sbuf, void* rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
FORTRAN
MPI_SCAN (sbuf, rbuf, count, datatype, op, comm, ierr)
sbuf is the starting address of send buffer,
rbuf is the starting address of receive buffer,
count is the number of elements in input buffer,
datatype is the data type of elements of input buffer,
op is the operation, and
comm is the group communicator.
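A minimal C sketch of MPI_Scan, assuming each process owns rank+1 items and wants to know its starting offset in a global ordering (a common use of a prefix sum); the counts are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, mycount, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend each process owns a different number of items. */
    mycount = rank + 1;

    /* Inclusive prefix sum: process k obtains mycount(0) + ... + mycount(k). */
    MPI_Scan(&mycount, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* prefix - mycount is then the starting offset of this process's items. */
    printf("rank %d: items = %d, inclusive prefix = %d, offset = %d\n",
           rank, mycount, prefix, prefix - mycount);

    MPI_Finalize();
    return 0;
}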

 

Predefined Reduce Operations

Examples of MPI predefined operations are summarized below. MPI also provides a mechanism for user-defined operations used in MPI_ALLREDUCE and MPI_REDUCE.

MPI Predefined Reduce Operations

  Name         Meaning                  C type                       FORTRAN type
  MPI_MAX      maximum                  integer, float               integer, real, complex
  MPI_MIN      minimum                  integer, float               integer, real, complex
  MPI_SUM      sum                      integer, float               integer, real, complex
  MPI_PROD     product                  integer, float               integer, real, complex
  MPI_LAND     logical and              integer                      logical
  MPI_BAND     bit-wise and             integer, MPI_BYTE            integer, MPI_BYTE
  MPI_LOR      logical or               integer                      logical
  MPI_BOR      bit-wise or              integer, MPI_BYTE            integer, MPI_BYTE
  MPI_LXOR     logical xor              integer                      logical
  MPI_BXOR     bit-wise xor             integer, MPI_BYTE            integer, MPI_BYTE
  MPI_MAXLOC   max value and location   combination of int, float,   combination of integer, real,
                                        double, and long double      complex, double precision
  MPI_MINLOC   min value and location   combination of int, float,   combination of integer, real,
                                        double, and long double      complex, double precision

Example

  • Each process in a forest dynamics simulation calculates the maximum tree height for its region.
  • Processor 0, which is writing output, must know the global maximum height.

  INTEGER maxht, globmx
      .
      .
      .  (calculations which determine maximum height)
      .
      .
  call MPI_REDUCE (maxht, globmx, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
  IF (taskid.eq.0) then
      .
      .  (Write output)
      .
  END IF

User-defined Operations

  • Users can define their own reduction operations.
  • This makes use of the MPI_OP_CREATE function, as sketched below.
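A minimal C sketch of MPI_Op_create, assuming an element-wise "largest absolute value" operation on doubles (the operation and the local data are illustrative):

#include <mpi.h>
#include <stdio.h>

/* User reduction function with the signature MPI requires:
   void fn(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype).
   It combines the two input vectors element-wise, leaving the result in inoutvec. */
static void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
    double *in    = (double *) invec;
    double *inout = (double *) inoutvec;
    (void) dtype;                          /* only one datatype is used in this sketch */
    for (int i = 0; i < *len; i++) {
        double a = (in[i]    < 0.0) ? -in[i]    : in[i];
        double b = (inout[i] < 0.0) ? -inout[i] : inout[i];
        inout[i] = (a > b) ? a : b;
    }
}

int main(int argc, char **argv)
{
    int rank;
    double x, result;
    MPI_Op myop;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = (rank % 2 == 0) ? -1.0 * rank : 1.0 * rank;   /* illustrative local data */

    /* Second argument = 1 tells MPI the operation is commutative. */
    MPI_Op_create(absmax, 1, &myop);

    MPI_Reduce(&x, &result, 1, MPI_DOUBLE, myop, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("largest |x| over all processes = %f\n", result);

    MPI_Op_free(&myop);
    MPI_Finalize();
    return 0;
}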

 

Performance Issues

  • A great deal of hidden communication takes place with collective communication.
  • Performance depends greatly on the particular implementation of MPI.
  • Because there may be forced synchronization, it is not always best to use collective communication.

Broadcast to 8 Processors

[Figure: broadcast to 8 processors. Solid line: data transfer; dotted line: carry-over from previous transfer.]

Amount of data transferred: (N-1)*p
Number of steps: log2(N)

    N = number of processors
    p = size of message

For example, broadcasting a message of size p to N = 8 processors moves 7p of data in 3 steps.

Scatter to 8 Processors

[Figure: scatter to 8 processors. Solid line: data transfer; dotted line: carry-over from previous transfer.]

Amount of data transferred: log2(N) * N * p/2
Number of steps: log2(N)

    N = number of processors
    p = size of message

 


3. New Information from the Vendors

On the IBM SP, the new release of PSSP, version 3.1.1, allows the processes of an MPI job running on the same node to take advantage of the node's shared-memory architecture for communication. To use this feature, set the environment variable MP_SHARED_MEMORY before launching the job:


                setenv MP_SHARED_MEMORY yes

 

IBM has started implementing its own non-blocking versions of the collective communication routines to improve performance. Their names are distinguished from the standard collective routines by an additional letter "I" after the "MPI_" prefix. For example, corresponding to MPI_Bcast(...) and MPI_Gather(...), the non-blocking routines are MPI_Ibcast(...) and MPI_Igather(...).

On the Origin 2000, SGI Message Passing Toolkit (MPT 1.3.0.3) has been installed with significant enhancement for the collective communication routines: MPI_Barrier, MPI_Bcast, and MPI_Alltoall.

MPT 1.3.0.3 provides both MPI and SHMEM interface modules that FORTRAN programmers can use without any changes to portable MPI codes. To activate compile-time interface checking, use the f90 compiler with the following command-line option:


        -auto_use mpi_interface, shmem_interface

 


4. References

The material in this tutorial borrows heavily from copyrighted material developed at the Cornell Theory Center and is used with their permission.