Supercomputing Institute Technical User Support

Introduction to Parallel Programming


1. Introduction


1.1 Serial vs Parallel

A large task can either be performed serially, one step following another, or can be decomposed into smaller tasks to be performed simultaneously, i.e., in parallel.

  • Sequential Programming: Traditionally, programs have been written for serial computers:

  • Parallel programming: Parallelism is done by

    1.2 The Need for Faster Machines

    There are several classes of problems that require faster processing:

    1.3 Parallel Programming

    1.4 Processor Communications and Memory architectures


    2. Processor Communications and Memory architectures


    2.1 Shared Memory

    • Multiple processors operate independently but share the same memory resources

    • Only one processor can access the shared memory location at a time

    • Synchronization achieved by controlling tasks reading from and writing to the shared memory

    • Advantages

      • Easy for user to use efficiently

      • Data sharing among tasks is fast (speed of memory access)

    • Disadvantages

      • Memory is bandwidth limited. Increase of processors without increase of bandwidth can cause severe bottlenecks

      • User is responsible for specifying synchronization, e.g., locks

    • Examples:
      • Cray Y-MP
      • Convex C-2
      • Cray C-90
      • Origin2000

    2.2 Distributed Memory

    2.3 Memory Hierarchies


    3. Parallel Programming Paradigms


    3.1 Message Passing

    The message passing model is defined as: Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.

    3.2 Data Parallel

    The data parallel model is defined as:


    Programming with data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.

    The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.

    3.3 Implementations


    4. Steps for Creating a Parallel Program


    1. If you are starting with an existing serial program, debug the serial code completely

    2. Identify the parts of the program that can be executed concurrently:

      • Requires a thorough understanding of the algorithm

      • Exploit any inherent parallelism which may exist.

      • May require restructuring of the program and/or algorithm. May require an entirely new algorithm.

    3. Decompose the program:

      • Functional Parallelism
      • Data Parallelism
      • Combination of both

    4. Code development

      • Code may be influenced/determined by machine architecture

      • Choose a programming paradigm

      • Determine communication

      • Add code to accomplish task control and communications

    5. Compile, Test, Debug

    6. Optimization

      • Measure Performance
      • Locate Problem Areas
      • Improve them

    4. Steps for Creating a Parallel Program

    4.1 Definitions

    4.2 Decomposing the Program

    There are three methods for decomposing a problem into smaller tasks to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both

    4.3 Communication

    Understanding the interprocessor communications of your program is essential. The types of communications for message passing and data parallel are exactly the same. In fact most data parallel compilers simply use one of the standard message passing libraries to achieve data movement.

    Communications on distributed memory computers:


    5. Design and Performance Considerations


    5.1 Single processor tuning

    Code optimization for single-processors is the first step in reducing the overall execution time whether for sequential or parallel code. A tutorial on Single Processor Tuning can be found at

    http://www.msi.umn.edu/tutorial/scicomp/sp/single-proc/sing_proc.html

    That covers array optimization, loop optimization, tuning for arithmetic operations, etc.

    5.2 Load Balancing

    • Load balancing refers to the distribution of tasks in such a way as to insure the most time efficient parallel execution

    • If tasks are not distributed in a balanced way, you may end up waiting for one task to complete a task while other tasks are idle

    • Performance can be increased if work can be more evenly distributed

      For example, if there are many tasks of varying sizes, it may be more efficient to maintain a task pool and distribute to processors as each finishes

    • More important in some environments than others

    • Consider a heterogeneous environment where there are machines of widely varying power and user load versus a homogeneous environment with identical processors running one job per processor

    5.3 Data Dependency


    • A data dependency exists when there is multiple use of the same storage location

    • Importance of dependencies: frequently inhibit parallel execution

    • Example 1:

              DO 500 J = MYSTART,MYEND
              A(J) = A(J-1) * 2.0
          500 CONTINUE
      
      
      If Task 2 has A(J) and Task 1 has A(J-1), the value of A(J) is dependent on:

      • Distributed memory

        Task 2 obtaining the value of A(J-1) from Task 1

      • Shared memory

        Whether Task 2 reads A(J-1) before or after Task 1 updates it

    • Example 2:
      
          task 1        task 2
          ------        ------
      
          X = 2         X = 4
            .             .
            .             .
      
          Y = X**2      Y = X**3
      

        The value of Y is dependent on:

      • Distributed memory

        If and/or when the value of X is communicated between the tasks.

      • Shared memory

        Which task last stores the value of X.

    • Types of data dependencies

      • Flow Dependent: Task 2 uses a variable computed by Task 1. Task 1 must store/send the variable before Task 2 fetches

      • Output Dependent: Task 1 and Task 2 both compute the same variable and Task 2's value must be stored/sent after Task 1's

      • Control Dependent: Task 2's execution depends upon a conditional statement in Task 1. Task 1 must complete before a decision can be made about executing Task 2.

    • How to handle data dependencies?

      • Distributed memory

        Communicate required data at synchronization points.

      • Shared memory

        Synchronize read/write operations between tasks.

    5.4 Deadlock

    • Deadlock describes a condition where two or more processes are waiting for an event or communication from one of the other processes.

    • A basic example of deadlock can happenned when two SPMD tasks are calling blocking standard sends at the same point of the program. Each task's matching receive occurs later in the other task's program.

    5.5 Debugging

    • Debugging parallel programs is significantly more of a challenge than debugging serial programs

    • Parallel debuggers are beginning to become available, but much work remains to be done

    • Use a modular approach to program development

    • Pay as close attention to communication details as to computation details

    5.6 Performance Monitoring and Analysis

    • As with debugging, monitoring and analyzing parallel program execution is significantly more of a challenge than for serial programs

    • Parallel tools for execution monitoring and program analysis are beginning to become available

    • Some are quite useful

    • Work remains to be done, particularly in the area of scalability.

    5.7 Others

      Other performance considerations include:

      • Communication patterns and bandwidth
      • Granularity
      • Machine configuration
      • I/O

      6 Example


      A small example of MPI code is given. Instructions on how it can be compiled and run on a parallel machine (example case here is the IBM SP) are also provided. Please refer to the MPI workshop for more details on MPI .

      6.1 Example of MPI parallel program: Hello.f

      C version

      
      
                program hello
                include 'mpif.h'
                integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
                character(12) message
      
                call MPI_INIT(ierror)
                call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
                call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
                tag = 100
                if(rank .eq. 0) then
      
                  message = 'Hello, world'
                  do i=1, size-1
      
                     call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
      
              &      MPI_COMM_WORLD, ierror)
                  enddo
                else
      
                     call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
      
              &      MPI_COMM_WORLD, status, ierror)
                endif
                print*, 'node', rank, ':', message
      
                call MPI_FINALIZE(ierror)
                end
      

      6.2 Compiling and Running the parallel program on the SGI Origin

      • TO COMPILE

        • During compilation and linking, add "-lmpi" for the MPI library, i.e.

          • f77 program.f -lmpi (for Fortran77)
          • f90 program.f -lmpi (for Fortran90)
          • cc program.c -lmpi (for C)
          • CC program.C -lmpi (for C++)

        • All the options that are available for the standard MipsPro compilers (f77,f90,cc,CC) can be used during compilation.

      • TO RUN MPI JOBS

        • To run an MPI job interactively

          mpirun -np 4 a.out
        • where in this case, the MPI executable a.out is to run on 4 processors.

        • To run an MPI job in batch mode, NQE is used as the queueing system. Details of how to use NQE is available at http://www.msi.umn.edu/origin/info/jobs.

        6.3 Compiling and Running the parallel program on the IBM SP

        • TO COMPILE

          • Three scripts mpxlf, mpcc, mpCC for compiling MPI programs are already installed. These scripts dynamically link the communication libraries at run time and call the native IBM Fortran, C, or C++ compilers. They are part of the IBM Parallel Environment (PE) Software.

            These scripts are: mpxlf - Fortran mpxlf options program.f mpcc - C mpcc options program.c mpCC - C++ mpCC options program.C

          • All the options that are available for the standard IBM compilers (xlf,cc,CC) can be used with these scripts.

        • TO RUN MPI JOBS

          • The IBM AIX parallel environment (PE) and its run time support system, the Parallel Operating Environment (POE) provide the user with information needed to run a parallel job interactively or using Loadlevler (batch jobs) on the SP-SMP computer system. i.e: to run a job interactively

            poe exec -rmpool 1 -procs 4 -euidevice css0 -euilib us

        • More information on compiling and running MPI jobs on the IBM SP

        6.3 Hands on Workshop

        Users will be provided with small parallel codes to run/modify and test, or they can get coached on writing a code of their choosing. The User support staff will assist. This workshop will be held at the Scientific Development and Visualization Laboratory at the Supercomputing Institute following this tutorial.

        7. References, Acknowledgements


        References and Acknowledgements

        • "IBM AIX Parallel Environment Application Development, Release 1.0", IBM Corporation.

        • Material from this tutorial was based on the tutorial at the Maui High Performance Computing Center (MHPCC)

        [an error occurred while processing this directive]