More Point to Point Communication with MPI

1. Overview

In many message-passing libraries, such as PVM or MPL, the method by which the system handles messages is chosen by the library developer. The chosen method must give acceptable reliability and performance for all possible communication scenarios, so it may hide programming problems or may not give the best performance in specialized circumstances. For MPI, this corresponds to standard mode communication, which was introduced in Basics of MPI Programming and MPI Point to Point Communication I.

In MPI, more control over how the system handles the message is given to the programmer, who selects a communication mode when selecting a send routine. In addition to standard mode, MPI provides synchronous, ready, and buffered modes. This module looks at the system behavior for each mode and discusses the advantages and disadvantages of each.

2. Communication Modes

2.1 Blocking Behavior

The communication mode is selected with the send routine. There are four blocking send routines and four non-blocking send routines, corresponding to the four communication modes. The receive routine does not specify communication mode -- it is simply blocking or non-blocking.

We'll start by examining the behavior of blocking communication for the four modes, beginning with synchronous mode. For compactness, we'll delay examination of non-blocking behavior until Section 3.

2.1.1 Blocking Synchronous Send

In the diagram below, time increases from left to right. The heavy horizontal line marked S represents execution time of the sending task (on one node), and the heavy dashed line marked R represents execution time of the receiving task (on a second node). Breaks in these lines represent interruptions due to the message-passing event.

When the blocking synchronous send MPI_Ssend is executed, the sending task sends the receiving task a "ready to send" message. When the receiver executes the receive call, it sends a "ready to receive" message. The data are then transferred.

There are two sources of overhead in message-passing. System overhead is incurred from copying the message data from the sender's message buffer onto the network, and from copying the message data from the network into the receiver's message buffer.

Synchronization overhead is the time spent waiting for an event to occur on another task. The sender must wait for the receive to be executed and for the handshake to arrive before the message can be transferred. The receiver also incurs some synchronization overhead in waiting for the handshake to complete. Synchronization overhead can be significant, not surprisingly, in synchronous mode. As we shall see, the other modes try different strategies for reducing this overhead.

Only one relative timing for the MPI_Ssend and MPI_Recv calls is shown, but they can come in either order. If the receive call precedes the send, most of the synchronization overhead will be incurred by the receiver.
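
As a minimal sketch of the call sequence (the message size, tag, and ranks are illustrative assumptions; run with at least two tasks), a synchronous send between two tasks might look like this:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, i;
        double data[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (i = 0; i < 100; i++)
                data[i] = (double) i;
            /* MPI_Ssend does not complete until the matching receive
               has been posted, so any time rank 1 spends before reaching
               MPI_Recv shows up here as synchronization overhead. */
            MPI_Ssend(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }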

One might hope that, if the workload is properly balanced, synchronization overhead would be minimal on both the sending and receiving tasks. This is not realistic on the SP. If nothing else causes a loss of synchronization, UNIX daemons running at unpredictable times on the various nodes will introduce unsynchronized delays. Calling MPI_Barrier frequently to keep the tasks in sync is not a remedy: the barrier call itself incurs synchronization overhead, and it does not ensure that the tasks will still be in sync a few seconds later. Frequent barrier calls therefore tend to add overhead without fixing the underlying problem.

2.1.2 Blocking Ready Send

The ready mode send MPI_Rsend simply sends the message out over the network. It requires that the "ready to receive" notification has arrived, indicating that the receiving task has posted the receive. If the "ready to receive" message has not arrived, the ready mode send will incur an error, and by default the code will exit. The programmer can associate a different error handler with a communicator to override this default behavior. The diagram shows the latest posting of the MPI_Recv that would not cause an error.

Ready mode aims to minimize system overhead and the synchronization overhead incurred by the sending task. In the blocking case, the only wait on the sending node is until all data have been transferred out of the sending task's message buffer. The receiving task can still incur substantial synchronization overhead, depending on how much earlier its receive is executed than the corresponding send.

This mode should not be used unless the user is certain that the corresponding receive has been posted.
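
One common way to establish that certainty is an explicit handshake: the receiver posts its receive, then notifies the sender. The sketch below is illustrative (the zero-byte "go-ahead" message and its tag are assumptions, not part of the original text):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double data[100];
        MPI_Status status;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Wait for rank 1's zero-byte "go-ahead" message; its arrival
               guarantees the matching receive has already been posted. */
            MPI_Recv(NULL, 0, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
            MPI_Rsend(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Irecv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);  /* post receive first */
            MPI_Send(NULL, 0, MPI_INT, 0, 1, MPI_COMM_WORLD);              /* then notify the sender */
            MPI_Wait(&req, &status);
        }

        MPI_Finalize();
        return 0;
    }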

2.1.3 Blocking Buffered Send

The blocking buffered send MPI_Bsend copies the data from the message buffer to a user-supplied buffer, and then returns. The sending task can then proceed with calculations that modify the original message buffer, knowing that these modifications will not be reflected in the data actually sent. The data will be copied from the user-supplied buffer over the network once the "ready to receive" notification has arrived.

Buffered mode incurs extra system overhead, because of the additional copy from the message buffer to the user-supplied buffer. Synchronization overhead is eliminated on the sending task -- the timing of the receive is now irrelevant to the sender. Synchronization overhead can still be incurred by the receiving task. Whenever the receive is executed before the send, it must wait for the message to arrive before it can return.

Another benefit is that the user can provide exactly the amount of buffer space for outgoing messages that the program needs. On the downside, the user is responsible for allocating, attaching, and managing this buffer space. A buffered mode send that requires more buffer space than is available will generate an error, and (by default) the program will exit.

For IBM's MPI implementation, buffered mode is actually a little more complicated than depicted in the diagram, as a receive-side system buffer may also be involved. This system buffer is discussed along with standard mode.

2.1.4 Buffer Management

For a buffered mode send, the user must provide the buffer: it can be a statically allocated array, or memory for the buffer can be dynamically allocated with malloc. The amount of memory allocated should exceed the total size of the message data, because message headers must also be stored. The constant MPI_BSEND_OVERHEAD gives the number of bytes needed per message for pointers and envelope information.

For Version 2 of IBM's MPI, adding MPI_BSEND_OVERHEAD bytes to the buffer for each message does not guarantee sufficient buffer space. If a user buffer is created to hold one message, and the size of the buffer equals the number of bytes in the message plus MPI_BSEND_OVERHEAD, the message runs out of buffer space when the message count is an odd number of elements, but succeeds when it is an even number of elements. Specifying the overhead as (MPI_BSEND_OVERHEAD + 7) succeeds regardless of the message count.

This space must be identified as the user-supplied buffer by a call to MPI_Buffer_attach. When it is no longer needed, it should be detached with MPI_Buffer_detach. Only one user-supplied message buffer can be active at a time; it will store multiple messages. The system keeps track of when buffered messages leave the buffer and will reuse the freed space, but for a program to be safe, it should not depend on this happening.
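
A minimal sketch of the attach/send/detach sequence (the message size is an illustrative assumption; the extra 7 bytes follow the IBM note above):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, bufsize;
        void *buf;
        double data[1000];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Space for one 1000-double message, plus per-message
               overhead, plus the extra margin noted above. */
            bufsize = 1000 * sizeof(double) + MPI_BSEND_OVERHEAD + 7;
            buf = malloc(bufsize);
            MPI_Buffer_attach(buf, bufsize);

            MPI_Bsend(data, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

            /* Detach blocks until buffered messages have been
               transmitted, then hands the buffer back for freeing. */
            MPI_Buffer_detach(&buf, &bufsize);
            free(buf);
        } else if (rank == 1) {
            MPI_Recv(data, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }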


 

2.1.5 Blocking Standard Send

For standard mode, the library implementor specifies the system behavior that will work best for most users on the target system. For IBM's MPI, there are two scenarios, depending on whether the message size is greater or smaller than a threshold value (called the eager limit). The eager limit depends on the number of tasks in the application:

 

Number of Tasks     Eager Limit = threshold (bytes)
1 - 16              4096
17 - 32             2048
33 - 64             1024
65 - 128            512

The behavior when the message size is less than or equal to the threshold is shown below:

In this case, the blocking standard send MPI_Send copies the message over the network into a system buffer on the receiving node. The standard send then returns, and the sending task can continue computation. The system buffer is attached when the program is started -- the user does not need to manage it in any way. There is one system buffer per task, and it will hold multiple messages. The message is copied from the system buffer to the receiving task's message buffer when the receive call is executed.

As with buffered mode, use of a buffer decreases the likelihood of synchronization overhead on the sending task at the price of increased system overhead resulting from the extra copy to the buffer. As always, synchronization overhead can be incurred by the receiving task if a receive is posted first.

Unlike buffered mode, the sending task will not incur an error if the buffer space is exceeded. Instead the sending task will block until the receiving task calls a receive that pulls data out of the system buffer. Thus, synchronization overhead can still be incurred by the sending task.


 

2.1.6 Blocking Standard Send, cont'd

When the message size is greater than the threshold, the behavior of the blocking standard send MPI_Send is essentially the same as for synchronous mode.

Why does standard mode behavior differ with message size? Small messages benefit from the decreased chance of synchronization overhead resulting from use of the system buffer. However, as message size increases, the cost of copying to the buffer increases, and it ultimately becomes impossible to provide enough system buffer space. Thus, standard mode tries to provide the best compromise.

You have now seen the system behavior for all four communication modes.



 

2.2 Deadlock

The concept of deadlock was introduced in MPI Point to Point Communication I. It is easier to understand the events that lead to deadlock once you understand the behavior of the various communication modes.

The following diagram represents two SPMD tasks: both call a blocking standard send at the same point in the program, and each task's matching receive occurs later in the other task's program.

 

If the message size is greater than the threshold, deadlock will occur because neither task can synchronize with its matching receive. If the message size is less than or equal to the threshold, deadlock can still occur if insufficient system buffer space is available. Both tasks will be waiting for a receive to draw message data out of the system buffer, but these receives cannot be executed because both tasks are blocked at the send.
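
The unsafe pattern looks like the following sketch (the message length is an illustrative assumption, chosen to exceed the eager limit; the program assumes exactly two tasks):

    #include <mpi.h>

    #define N 100000   /* large enough to exceed the eager limit */

    int main(int argc, char *argv[])
    {
        int rank, other;
        static double sendbuf[N], recvbuf[N];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;   /* the other of the two tasks */

        /* Both tasks block in MPI_Send, each waiting for a receive
           that the other task can never reach: deadlock. */
        MPI_Send(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);

        MPI_Finalize();
        return 0;
    }

Reversing the send/receive order on one of the tasks, using MPI_Sendrecv, or using non-blocking calls (Section 3) removes the deadlock.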


 

2.3 Conclusions: Modes

Synchronous mode is the "safest", and therefore also the most portable. "Safe" means that if a code runs under one set of conditions (e.g., message sizes), it will run under all conditions. Synchronous mode is safe because it does not depend on the order in which the send and receive are executed (unlike ready mode) or on the amount of buffer space (unlike buffered mode and standard mode). However, synchronous mode can incur substantial synchronization overhead.

Ready mode has the lowest total overhead. It does not require a handshake with the receiver (as synchronous mode does) or an extra copy to a buffer (as buffered and standard modes do). However, the receive must precede the send, so this mode will not be appropriate for all messages.

Buffered mode decouples the sender from the receiver. This eliminates synchronization overhead on the sending task and ensures that the order of execution of the send and receive does not matter (unlike ready mode). An additional advantage is that the programmer can control the size of messages to be buffered, and the total amount of buffer space. There is additional system overhead incurred by the copy to the buffer.

Standard mode behavior is implementation-specific. The library developer chooses a system behavior that provides good performance and reasonable safety. For IBM's MPI, small messages are buffered (to avoid synchronization overhead) and large messages are sent synchronously (to minimize system overhead and required buffer space).

3. Blocking vs. Non-blocking

3.1 Behavior of Non-blocking Calls

Blocking and non-blocking routines were introduced in MPI Point to Point Communication I.

A blocking send or receive call suspends execution of the program until the message buffer being sent/received is safe to use. Non-blocking calls return immediately after initiating the communication. The programmer does not know at this point whether data to be sent have been copied out of the send buffer, or whether data to be received have arrived. So, before using the message buffer, the programmer must check its status.
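
The status check is made with MPI_Wait, which blocks until the operation completes, or MPI_Test, which simply reports whether it has completed yet. As a sketch of the polling approach (the work function is a hypothetical placeholder), a task can keep computing until the buffer is safe:

    #include <mpi.h>

    static void do_some_work(void)
    {
        /* placeholder for useful computation that does not
           touch the message buffer */
    }

    int main(int argc, char *argv[])
    {
        int rank, done = 0;
        double data[100];
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Isend(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            while (!done) {
                MPI_Test(&req, &done, &status);  /* is the buffer safe yet? */
                if (!done)
                    do_some_work();
            }
            /* data may now be safely overwritten */
        } else if (rank == 1) {
            MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }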

We have seen the blocking behavior for each of the communication modes. We will now discuss the non-blocking behavior for standard mode. The behavior of the other modes can be inferred from this.

The relevant calls are the non-blocking standard send MPI_Isend and the non-blocking receive MPI_Irecv. As before, the standard mode send will proceed differently depending on the message size.

The sending task posts the non-blocking standard send when the message buffer contents are ready to be transmitted. It returns immediately without waiting for the copy to the remote system buffer to complete. MPI_Wait is called just before the sending task needs to overwrite the message buffer.

The receiving task calls a non-blocking receive as soon as a message buffer is available to hold the message. The non-blocking receive returns without waiting for the message to arrive. The receiving task calls MPI_Wait when it needs to use the incoming message data (i.e. needs to be certain that it has arrived).
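
Putting the two sides together, a minimal sketch of this pattern (buffer sizes and tags are illustrative assumptions) looks like:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double outbuf[100], inbuf[100];
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Isend(outbuf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... computation that does not modify outbuf ... */
            MPI_Wait(&req, &status);   /* now safe to overwrite outbuf */
        } else if (rank == 1) {
            MPI_Irecv(inbuf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            /* ... computation that does not read inbuf ... */
            MPI_Wait(&req, &status);   /* message data now valid in inbuf */
        }

        MPI_Finalize();
        return 0;
    }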

The system overhead will not differ substantially from that of the blocking send and receive calls unless data transfer and computation can occur simultaneously. Since the SP node CPU must perform both the data transfer and the computation, computation will be interrupted on both the sending and receiving nodes to pass the message. Exactly when the interruption occurs is of no particular consequence to the running program. Even for architectures that overlap computation and communication, the fact that this case applies only to small messages means that no great difference in performance would be expected.

The advantage of using the non-blocking send occurs when the system buffer is full. In this case, a blocking send would have to wait until the receiving task pulled some message data out of the buffer. If a non-blocking call is used, computation can be done during this interval.

The advantage of a non-blocking receive over a blocking one can be considerable if the receive is posted before the send. The task can continue computing until the Wait is posted, rather than sitting idle. This reduces the amount of synchronization overhead.

Non-blocking calls can ensure that deadlock will not result, provided the Wait is posted only after all the calls needed to complete the communication have been made.

3.2 Behavior of Non-blocking Calls, cont'd

The case of a non-blocking standard send MPI_Isend for a message larger than the threshold is more interesting:

For a blocking send, the synchronization overhead would be the period between the blocking call and the copy over the network. For a non-blocking call, the synchronization overhead is reduced by the amount of time between the non-blocking call and the MPI_Wait, during which useful computation is proceeding.

Again, the non-blocking receive MPI_Irecv will reduce synchronization overhead on the receiving task when the receive is posted first. There is also a benefit to using a non-blocking receive when the send is posted first. Typically, a blocking receive is posted immediately before the message data must be used, to allow the maximum amount of time for the communication to complete. If a blocking receive were used in place of the MPI_Wait, synchronization with the send call would be delayed until this later point in the program, increasing the synchronization overhead on the sending task.


 

3.3 Conclusions: Non-blocking Calls

On some machines, non-blocking calls can reduce the system overhead. This requires that the message transport can be handled in the background without having any impact on the computations in progress. This is not currently true for the SP.

The greater advantage of non-blocking communication is in reducing synchronization overhead and eliminating deadlock. Synchronization overhead for the sender can be reduced for synchronous mode and standard mode (ready and buffered mode don't accumulate send-side synchronization overhead). Non-blocking communication can reduce synchronization overhead for the receiver regardless of the communication mode.

It is best to post non-blocking sends and receives as early as possible, and call Wait as late as possible. "Early as possible" means that the data in the buffer to be sent must be valid, and likewise the buffer to be received into must not be in use. Keep in mind that tasks may still need to synchronize on the Wait, if the message has not yet been transferred. You want to maximize the time the task is computing in the hope that, by the time the task reaches the Wait, the communication will have completed.


 

4. Programming Recommendations

In general, it is reasonable to start programming with non-blocking calls and standard mode. Non-blocking calls can eliminate the possibility of deadlock and reduce synchronization overhead. Standard mode gives generally good performance.

Blocking calls may be required if the programmer wishes the tasks to synchronize. Also, if the program requires a non-blocking call to be immediately followed by a Wait, it is more efficient to use a blocking call. If using blocking calls, it may be advantageous to start in synchronous mode, and then switch to standard mode. Testing in synchronous mode will ensure that the program does not depend on the presence of sufficient system buffering space.

The next step is to analyze the code and evaluate its performance. If non-blocking receives are posted early, well in advance of the corresponding sends, it might be advantageous to use ready mode. In this case, the receiving task should probably notify the sender after the receives have been posted. After receiving the notification, the sender can proceed with the sends.

If there is too much synchronization overhead on the sending task, especially for large messages, buffered mode may be more efficient. For IBM's MPI, an alternative would be to use the run-time flag (the MP_EAGER_LIMIT environment variable) to change the threshold message size at which system behavior switches from buffered to synchronous.


 

5. References

This tutorial is based closely on copyrighted material developed at the Cornell Theory Center and is used here with their permission.