Point to Point Communications with MPI

    • Highlights

      In point to point communication, one process sends a message and a second process receives it. This is in contrast to collective communication routines, in which a pattern of communication is established amongst a group of processes. There are several programming options for point to point communication that relate to how the system handles the message, and this tutorial gives a brief, high-level overview of some of those available in MPI.

    • Communication Modes and Program Control

      There are two general aspects to point to point communication. One refers to the communication mode and the other to how program control is returned to the user when the send or receive is posted.

      Communication modes specify how the system handles the message. For instance, if a send has completed, does that tell us anything about the receive? Can we know if the receive has finished or even begun? Modes specify the underlying protocol for sending or receiving messages, and will be discussed in more detail in another tutorial.

    • Blocking versus Non-Blocking

      The other aspect of point to point communication focuses on the issue of program control, and is the primary topic we will discuss here. Program control with point to point communication is enforced by using blocking or nonblocking sends/receives. A blocking send "blocks" until the send buffer - the contents of the message in the send call - can be reclaimed. Similarly, a blocking receive "blocks" until the receive buffer contains the contents of the message that was sent.

      The send call is potentially non-local. It may copy the contents of its buffer directly to the receive buffer, or to a temporary system buffer. In the former case, the send does not complete until the receive is initiated; in the latter, the send can return ahead of the matching receive. Which behavior occurs is implementation dependent.

      Blocking calls contrast with nonblocking calls, which allow the overlap of message transmittal and computation. Nonblocking calls return control immediately to the user, while blocking calls do not return control until the message buffer is safe to use. The key ideas include:

      • Blocking calls can be matched to non-blocking calls

        A blocking or non-blocking send can be paired to a blocking or non-blocking receive.

      • Blocking suspends execution until message buffer is safe to use

        In both sending and receiving, the buffer used to contain the message is often a reused resource, and problems arise when it is touched before an ongoing transaction has completed; blocking communications ensure that this never happens -- when control returns from the blocking call, the buffer can safely be modified without any danger of corrupting some other part of the process.

      • Non-blocking separates communication from computation

        A non-blocking call initiates the transaction and returns control immediately, allowing the original thread to get back to computationally-oriented processing while the transfer proceeds in the background.

         

    • Deadlock

      Deadlock is an often-encountered situation in parallel processing, resulting when two or more processes are in contention for the same set of resources.

      • Causes

        In communications, a typical scenario involves two processes wishing to exchange messages, but both trying to give theirs to the other while neither is yet ready to accept one.

      • Avoidance

        A number of strategies will be described to help ensure that deadlock does not occur in the application.

    • Communications State

      Applications can use specific calls to control how they deal with incomplete transactions, or transactions whose state is not known, without necessarily having to complete the operation (this is analogous to wanting to know whether or not your aunt has written you, but not particularly caring what she had to say). The application can be programmed to be aware of the state of its communications, and thereby act intelligently in different situations:

      • Wait, Test, and Probe

        When involved in non-blocking transactions, a number of calls make it possible for the application to query the status of particular messages, or to check on any that meet a certain set of characteristics, without necessarily taking any completed transactions out of the queue.

      • Status

        Checking the information returned from a transaction allows the application to, for example, take corrective action if an error has occurred.

         

    • Using Special Parameters
      • Wildcards

        A generic term meaning "anything meeting a very general set of characteristics." MPI_ANY_SOURCE allows the receiver to get messages from any sender, and MPI_ANY_TAG allows the receiver to get any kind of message from a sender.

      • Null Processes

        Applications often deal with regular data structures and perform the same kind of communication everywhere within them, except at the edges, where special code must be written to avoid communicating when there is no valid "neighbor" to send to or receive from; special null parameters move this logic out of user code and into the system, simplifying the application.

    • Overlapping Computation and Communication

      The biggest advantage of non-blocking sends and receives is that computation may proceed concurrently with the communication. That is, computation may be done after the communication is posted and before it is completed, provided the buffer is not used in the computation.
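
      As a minimal sketch in C (the partner rank, the tag, and the routines do_other_work and use_data are illustrative stand-ins for application code, not MPI names):

        MPI_Request request;
        MPI_Status status;
        double inbuf[100];

        /* post the receive early */
        MPI_Irecv (inbuf, 100, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &request);

        /* computation that does not touch inbuf */
        do_other_work ();

        /* wait only when the message data are actually needed */
        MPI_Wait (&request, &status);
        use_data (inbuf);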

    • Checking the Communication Buffer

      A blocking send or receive call suspends execution of the program until the message buffer being sent/received is safe to use. In the case of a blocking send, this means that the data to be sent have been copied out of the send buffer, but they have not necessarily been received in the receiving task. The contents of the send buffer can be modified without affecting the message that was sent. Completion of a blocking receive implies that the data in the receive buffer are valid.

      Non-blocking calls return immediately after initiating the communication. The programmer does not know at this point whether the data to be sent have been copied out of the send buffer, or whether the data to be received have arrived. So, before using the message buffer, the programmer must check its status.

      • Blocking versus Non-blocking Checking

        The programmer can choose to block until the message buffer is safe to use (MPI_WAIT and variants) or to return immediately with the current status of the communication (MPI_TEST and variants).

      • Check for all, any, some, or one Message

        The different variants of the Wait and Test calls allow you to check the status of a specific message, or to check all, any, or some of a list of messages.

      • Summary

        It is fairly intuitive why you need to check the status of a non-blocking receive: you do not want to read the message buffer until you are sure that the message has arrived. It is less obvious why you would need to check the status of a non-blocking send. This is most necessary when you have a loop that repeatedly fills the message buffer and sends the message. You can't write anything new into that buffer until you know for sure that the preceding message has been successfully copied out of the buffer. Even if a send buffer is not re-used, it is advantageous to complete the communication, as this releases system resources (the request object, described on the next page).
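
        As a hedged sketch of the loop just described (fill_buffer, niters, and the other lowercase names are illustrative, not MPI routines):

          for (i = 0; i < niters; i++) {
              /* refill the buffer only after the previous send completes */
              fill_buffer (sendbuf, count, i);
              MPI_Isend (sendbuf, count, MPI_DOUBLE, dest, tag, comm, &request);
              /* ... computation that does not touch sendbuf ... */
              MPI_Wait (&request, &status);
          }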

         

    2.1 Syntax of Non-blocking Calls

    The nonblocking calls have the same syntax as the blocking ones, with two exceptions:

    1. Each call has an "I" immediately following the "_".
    2. The last argument is a handle to an opaque request object that contains information about the message, i.e., its status.
    • C:

      MPI_Isend (buf,count,dtype,dest,tag,comm,request)

      MPI_Irecv (buf,count,dtype,source,tag,comm,request)

       

    • Fortran:

      MPI_Isend (buf,count,dtype,dest,tag,comm,request,ierror)

      MPI_Irecv (buf,count,dtype,source,tag,comm,request,ierror)

       

    An example of a Wait (MPI_WAIT) call looks like this:
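
    • C:

      MPI_Wait (request,status)

    • Fortran:

      MPI_WAIT (request,status,ierror)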

     

    Note that the Wait and Test calls take one or more request handles as input and return one or more status objects (arrays in Fortran, structures in C). The request handle lets the user identify communication operations and links the posting operation (e.g., MPI_ISEND) with the completion operation (e.g., MPI_WAIT). The status object holds information about the success or failure of the communication and other miscellaneous information. Wait, Test, and status are discussed in detail later in this tutorial.

    2.2 Basic Deadlock

    • Causes of Deadlock

      Deadlock is a phenomenon most common with blocking communication. It occurs when all tasks are waiting for events that haven't been initiated yet. Consider two SPMD tasks (sketched below): both call blocking standard sends at the same point of the program, and each task's matching receive occurs later in the other task's program.
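
      A minimal sketch of that scenario (a, b, N, and tag are illustrative); with standard-mode sends, this deadlocks whenever the messages are too large for the system to buffer, since neither send can complete until the matching receive is posted:

        if (rank == 0) {
            MPI_Send (a, N, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
            MPI_Recv (b, N, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Send (a, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
            MPI_Recv (b, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        }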

    • Avoiding Deadlock

      There are four ways to avoid deadlock:

      • Different ordering of calls between tasks

        Arrange for one task to post its receive first and for the other to post its send first. That clearly establishes that the message in one direction will precede the other.

      • Non-blocking calls

        Have each task post a non-blocking receive before it does any other communication. This allows each message to be received, no matter what the task is working on when the message arrives or in what order the sends are posted.

      • MPI_Sendrecv and MPI_Sendrecv_replace

        Use MPI_Sendrecv. The send-receive combines in one call the sending of a message to a destination and the receiving of a message from a source, and is particularly useful when a node both sends and receives a message. It is an elegant solution to the deadlock problem because the "send half" of the send-receive on node i can complete before the "receive half" on node j is initiated (a sketch follows this list). In the _replace version, the same buffer is used for both the send and the receive, so the message that was sent is replaced by the received message. The system implements the send-receive-replace by allocating some buffer space (not subject to the threshold limit) to deal with the exchange of messages.

      • Buffered mode

        Use buffered sends so that computation can proceed after copying the message to the user-supplied buffer. This will allow the send to complete and the subsequent receive to be executed. Buffered sends are discussed in the follow-on to this module, More Point to Point Communication with MPI.
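
      As promised above, a sketch of the MPI_Sendrecv approach (partner, N, and tag are illustrative); each task sends to and receives from its partner in a single call, and the library manages the exchange so that deadlock cannot occur:

        MPI_Sendrecv (sendbuf, N, MPI_DOUBLE, partner, tag,
                      recvbuf, N, MPI_DOUBLE, partner, tag,
                      MPI_COMM_WORLD, &status);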

    2.3 Summary: Non-blocking Calls

     

      • Gains
        • Avoid Deadlock
        • Decrease Synchronization Overhead

          Non-blocking calls have the advantage that computation can continue almost immediately, even if the message can't be sent. This can eliminate deadlock and reduce synchronization overhead.

        • Can Reduce Systems Overhead

          On some machines, the system overhead can be reduced if the message transport can be handled in the background without having any impact on the computations in progress on both the sender and receiver. This is not currently true for the SP.

          A knowledgeable source has the following comments regarding whether or not non-blocking calls do in fact result in a reduction of system overhead:

          ...I have the strong suspicion that non-blocking calls can actually incur more overhead than blocking ones that result in immediate message transfer. This extra overhead comes from several sources: (a) the cost of allocating a request object; (b) the overhead of doing the interrupt when the data are transferred later; and (c) the cost of querying to determine whether the transfer has completed. Cost (a) can be eliminated by using persistent requests (definitely an advanced topic). All these can be small compared to the synchronization overhead that is avoided if useful computations are available to be done.

          Once you've gotten some experience with the basics of MPI, consult the standard or available texts for information on persistent requests; the rest of the comment simply points out that a blocking call carries with it much less system baggage than a non-blocking call if you can assume that both are going to be satisfied immediately, but if the transaction is not immediately satisfied, then the non-blocking call wins to the extent that "useful computation" can be accomplished prior to its conclusion.
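
          As a sketch of the persistent-request idea just mentioned (fill_buffer and niters are illustrative names), the request object is allocated once and reused across iterations:

            MPI_Send_init (buf, count, MPI_DOUBLE, dest, tag, comm, &request);
            for (i = 0; i < niters; i++) {
                fill_buffer (buf, count, i);
                MPI_Start (&request);          /* begin the send  */
                MPI_Wait (&request, &status);  /* complete it     */
            }
            MPI_Request_free (&request);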

      • Post non-blocking sends/receives early and do waits late

        Some additional programming is required with non-blocking calls, to test for completion of the communication. It is best to post sends and receives as early as possible, and to wait for completion as late as possible. "Early as possible" above means that the data in the buffer to be sent must be valid, and likewise the buffer to be received into must exist.

      • Be careful with reads and writes

        Do not write to the send buffer between MPI_Isend and MPI_WAIT, and avoid reading or writing the receive buffer between MPI_Irecv and MPI_WAIT.

        It should be possible to safely read the send buffer after the send is posted, but nothing should be written to that buffer until after status has been checked to give assurance that the original message has been sent. Otherwise, the original message contents could be overwritten. NO user reading or writing of the receive buffer should take place between posting a non-blocking receive and determining that the message has been received. The read might give either old data or new (incoming) message data. A write could overwrite the recently arrived message.

         

  3. Determining Information about Messages

    3.1 Wait, Test, and Probe

    Although it is often useful to decouple non-blocking communications from the computational thread, it is also useful to recouple them. By doing so, one can obtain information about the status of the communication transaction, and then take actions depending on this status. The three calls described here are among the most commonly useful for dealing with these kinds of situations.

    • MPI_WAIT

       

        • Used for non-blocking Sends and Receives

          Blocking communications include an implicit wait, so you will never see a non-trivial call to MPI_WAIT when such operations are used. With non-blocking operations, in both the send and receive cases, the calling process suspends its operation until the operation referenced by the wait has completed, at which time execution resumes in the calling process.

        • MPI_WAIT and non-blocking Receives

          On the receive side, the process has already posted a non-blocking receive, which will be completed regardless of what the calling process does. The programmer is therefore able to determine whether or not any useful computation can be accomplished before the information in the not-yet-received message is required. If there is useful work available, the application has clearly gained efficiency by doing that work while the message is still in transit. At some point, of course, the message will be needed, and a wait will be issued.

        • MPI_WAIT and non-blocking Sends

          On the sending side, the process is also freed from the transaction, except that it is constrained from doing any more communication with that particular buffer until its current use is completed. If the sender attempts to put some other message into that buffer before the preceding transmission completes, the results are indeterminate, depending on what part of the transfer was in progress when the overwriting occurred. Doing a wait on the message guarantees that re-use will not be destructive.

           

        • MPI_WAIT syntax

          C:       MPI_Wait (request,status)

          Fortran: MPI_WAIT (request,status,ierror)
    • MPI_TEST
      • Comparison to MPI_WAIT

        While MPI_WAIT suspends execution until an operation completes, MPI_TEST returns immediately with information about its current status.

      • MPI_TEST and non-blocking Receives

        For a non-blocking receive, MPI_TEST returns true only if the specific operation identified by the request has completed, that is, the matching message has arrived and been delivered; traffic from all other sources is ignored.

      • MPI_TEST and non-blocking Sends

        On the sender side, test is the non-blocking analog to wait, giving the application knowledge of the current state without requiring it to block until completion, thus allowing the application to do other work, if any exists.

      • MPI_TEST Syntax

        C:       MPI_Test (request,flag,status)

        Fortran: MPI_TEST (request,flag,status,ierror)
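
        For example, a common pattern polls with MPI_TEST and does other work until the operation completes (do_useful_work is an illustrative stand-in):

          int flag = 0;
          MPI_Test (&request, &flag, &status);
          while (!flag) {
              do_useful_work ();
              MPI_Test (&request, &flag, &status);
          }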

    • MPI_Probe
      • MPI_PROBE is blocking. MPI_IPROBE is non-blocking.
      • Both calls check whether there is a message that can be received that matches the message envelope specified by source, tag, and comm. MPI_IPROBE returns immediately with a flag indicating whether such a message is available; MPI_PROBE returns only when one is.
      • The previous calls all targeted specific messages and senders; the probe call can be tailored to return "deliverable" information regarding messages from any sender, as well as from specific ones.
      • MPI_PROBE Syntax

        C:       MPI_Probe (source,tag,comm,status)

        Fortran: MPI_PROBE (source,tag,comm,status,ierror)

      • MPI_IPROBE Syntax

        C:       MPI_Iprobe (source,tag,comm,flag,status)

        Fortran: MPI_IPROBE (source,tag,comm,flag,status,ierror)

      3.2 Status

      • status returns source, tag, error (standard)
      • Status is the object at which to look to determine information on the message source and tag, and any error incurred on the communication call.

        In Fortran these are returned as an array of integers: status(MPI_SOURCE), status(MPI_TAG), and status(MPI_ERROR) give the source, tag, and error condition of the incoming message, respectively.

        In C they are returned as a structure of type MPI_Status, in the fields status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR.

        The status array (Fortran) or structure (C) must be declared in the user's program.

      • IBM's MPI implementation returns nbytes (non-standard)

        IBM's MPI implementation varies from the MPI standard regarding the status information by making available the total number of bytes involved in the operation. The standard mechanism for obtaining this information is to call MPI_Get_count (see below). Our trusty unnamed knowledgeable source at CTC comments:

        Although the standard doesn't specify where the number of bytes in the message appears in the status object, it does specify that the status must contain enough information so that MPI_Get_count can do its thing. Thus, it effectively mandates that the number of bytes will be there, just not where. The IBM MPI does the obvious -- it puts the message size in bytes in the first optional field.

      • MPI_Get_count returns number of elements

        This can be useful if you've allocated a receive buffer that may be larger than the incoming message, or if you want to learn the length of a message identified with MPI_Probe.

      • Checking Status

        The information made available by status comes in very handy in the following situations:

        • blocking receive or wait, when MPI_ANY_TAG (accept a message with any tag value) or MPI_ANY_SOURCE (accept a message from any source) has been used
        • MPI_PROBE and MPI_IPROBE
        • MPI_TEST
        • Example: Suppose you're doing matrix-vector multiplication using a master/slave model. The master broadcasts the vector to the slaves and sends each slave a row of the matrix. Each slave computes the dot product and sends the result to the master. It is convenient to write one MPI_RECV in which the master can receive from any slave using MPI_ANY_SOURCE, but the master will need to know who sent it the dot product so that it knows who to farm the next row of the matrix out to.
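
          A sketch of the master's receive in this example (assuming each slave sends a single double):

            MPI_Recv (&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &status);
            who = status.MPI_SOURCE;                     /* which slave answered */
            MPI_Get_count (&status, MPI_DOUBLE, &count); /* number of elements   */
            /* ... send the next row of the matrix to rank who ... */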
  4. Using Special Parameters

    4.1 Wildcards

    Wildcard is a concept lifted from certain card games relating to an entity's ability to take on any legal value; in this message passing context, that ability comes in most useful when dealing with how a receiver determines which messages will be accepted for delivery:

    • MPI_ANY_SOURCE
      • The receiver is willing to accept messages from anyone.
      • When MPI_ANY_SOURCE is specified as the sender, the receiver is signaling a complete lack of concern for the originator of the transmission. In particular forms of concurrent processing, e.g., master/worker, the master is often in the position of knowing that one of the workers is going to be sending it a message, but the exact one is indeterminate; by specifying MPI_ANY_SOURCE the master indicates that a message from any of its contingent will be accepted.
    • MPI_ANY_TAG
      • The receiver is willing to accept any kind of message
      • A server may be called upon to handle a number of different kinds of requests, even from the same source. A separate receive could be posted for each different type of message, identified by the tag field, but it's often more efficient to issue a single call capable of pulling in messages of all kinds, and to figure out inside the application what particular processing is required by each.

    4.2 Null Parameters

    Many times it is useful to specify "dummy" source or destination values for communication.

    MPI_PROC_NULL can be used instead of rank whenever source or destination is used in a communication call.

    A communication with MPI_PROC_NULL has no effect. A send returns as soon as possible. A receive returns with no change to the input buffer. The source is set to MPI_PROC_NULL, the tag is set to MPI_ANY_TAG, and count equals 0.

    MPI_PROC_NULL is most useful for taking boundary-tests out of user code. For instance, if the action desired in such situations is simply to not send the message, then the programmer can have this accomplished automatically by specifying one of these null values for such circumstances, and, when the send or receive call determines that a null parameter has been given to it, the operation immediately returns as if it had completed successfully. This has the same effect as saying "In general, pass this to your neighbor", while tacitly knowing that you really mean "...as long as you have a neighbor", but such is a well-understood case, and you don't feel the need to explicitly mention it.
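
    As a sketch (a one-dimensional arrangement of size tasks, each shifting data to its right-hand neighbor; N and tag are illustrative), the edge tasks simply get MPI_PROC_NULL as a partner and the communication call needs no special-case logic:

      left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
      MPI_Sendrecv (sendbuf, N, MPI_DOUBLE, right, tag,
                    recvbuf, N, MPI_DOUBLE, left,  tag,
                    MPI_COMM_WORLD, &status);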

    • Start with non-blocking calls

      This will allow computation and communication to occur concurrently. This is important because sending data between processors is much slower than manipulating data on a processor, so, to keep a program from being "starved for data", non-blocking calls allow communication to be initiated while other operations are performed.

      The possibility of deadlock is eliminated and synchronization overhead is reduced.

    • Sometimes blocking calls are needed
      • Tasks may need to be synchronized.

        If your application requires that cooperating processes be kept in more-or-less strict lockstep, then non-blocking calls become less useful and more cumbersome. Besides, synchronization is what the blocking calls are intended to provide.

      • MPI_WAIT immediately follows communication call

        The action of these two paired operations is exactly that of a blocking call, so why not simply use it?

    • Avoid deadlock by intelligent arrangement of sends/receives, or by posting non-blocking receives early.

      If you choose to use blocking transactions, try to guarantee that deadlock will be avoided by carefully tailoring your communication strategy so that sends and receives are properly paired and in the necessary order; alternatively, post non-blocking receives as early as possible, so that the sends will stay in the system for as little time as is necessary.

    • Use the appropriate "operation-status" call ("wait", "test", or "probe") to control the operation of non-blocking communications calls.

      Correctly knowing the state of communications transactions allows the application to intelligently steer itself to better efficiency in its use of available cycles. Ultimately pending traffic must be accepted, but that action can be long in coming and much can possibly be accomplished while it is incomplete. The wait, test and probe calls allow the application to match the appropriate activity with the particular situation.

    • Check the value of the "status" fields for problem reports.

      Don't just assume that things are running smoothly -- make it a general rule that every transaction is checked for success, and that failures are promptly reported and as much related information as possible is developed and made available for debugging.

    • Intelligent use of wildcards can greatly simplify logic and code.

      Using general receives, receives capable of handling more than one kind of message traffic (in terms of either sender, or message-type, or both), can greatly simplify the structure of your application, and can potentially save on system resources (if you are in the habit of using a unique message buffer for each transaction).

    • Null processes can move tests out of user code and simplify the application.

       

  5. Acknowledgements and References

    This tutorial was based largely on copyrighted material from the Cornell Theory Center and was used with their permission.


    Franke, H. (1995) MPI-F, An MPI Implementation for IBM SP-1/SP-2. Version 1.41, 5/2/95. Available in postscript under http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/

    Franke, H., Wu, C.E., Riviere, M., Pattnaik, P. and Snir, M. (1995) MPI Programming Environment for IBM SP1/SP2. Proceedings of ICDCS 95, Vancouver, 1995. Available in postscript, under http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/

    Gropp, W., Lusk, E. and Skjellum, A. (1994) Using MPI. Portable Parallel Programming with the Message-Passing Interface. The MIT Press. Cambridge, Massachusetts.


    Message Passing Interface Forum (1995) MPI: A Message Passing Interface Standard. June 12, 1995. Available in postscript from http://www.epm.ornl.gov/~walker/mpi/

    Aoyama, Y. and Nakano, J. RS/6000 SP: Practical MPI Programming.