Abstracts - The 10th International Workshop on Languages and
Compilers for Parallel Computing
Lowering HPF Procedure Interface to a
Canonical Representation
Jan Borowiec
BMD FIRST Research Institute for Computer
Architecture and Software Technology
Rudower Chaussee 5
12489 Berlin
GERMANY
borowiec@first.gmd.de
Arthur Veen
Parallel Computing
Amsterdam
THE NETHERLANDS
arthur@parcom.nl
Handling the procedure interface in an HPF compiler is complex due to the many possible
combinations of Fortran 90/JPF properties of an actual array argument and its associated
dummy argument. This paper describes an algorithm that reduces this complexity of
mapping all the combinations of properties to a small set of canonical Internal Representations.
These internal representations as well as the necessary run-time decsriptors are
also described. The algorithm has been implemented in the commercial HPF compiler
produced by the PREPARE project.
Java as a Language for Scientific Parallel Programming
Bryan Carpenter, Yuh-Jye Chang, Geoffrey Fox, Xiaoming Li
Northeast Parallel Architectures Centre
Syracuse University
Syracuse, New York
dbc@npac.syr.edu, yjchang@npac.syr.edu,
gcf@npac.syr.edu, lxm@npac.syr.edu
Java may be a natural language for portable parallel programming. We discuss the
basis of this claim in general terms, then illustrate the use of Java for message-passing
and data-parallel programming through series of case studies. In the process we introduce
some proposals for a Java binding of MPI, and describe the use of a Java class-library
to implement HPF-style distributed data. Prospects for future Java-based parallel
programming environments are discussed.
A Comparison of Three Global Code Scheduling Algorithms
Pohua Chang
Intel Corporation
M/S RN6–18
2200 Mission College Blvd.
Santa Clara, CA 95052-8119
pohua@gomez.sc.intel.com
Telephone: (408) 765-6423
Instruction scheduling is an important compiler code transformation step for generating
code for pipelined and VLIW processors. Dependencies among instructions need to be
enforced. In this paper, we compare the effectiveness of three global code scheduling
algorithms (HS, EHS, GCM). Hyperblock scheduling (HS) is a path-based scheduling
algorithm. Enhanced Hyerblock Scheduling (EHS) is an improvement over HS by merging
path schedules. Global Code Motion (GCM) integrates some instruction-level parallelization
techniques with upward code motion to permit more code motion freedom. We apply the
three algorithms to the most time consuming loops in SPEC95 CINT programs. The experimental
results show that EHS performs as well as GCM. EHS is much simpler to implement than
GCM. Therefore, EHS is the most cost effective global code scheduling algorithm among
the three algorithms.
Definition of the F– – Extension to Fortran
90
Benoit Dupont de Dinechin
bc3@cs.mcgill.ca
Gary Elsesser, Greg Fischer, Brian H. Johnson, Tom MacDonald, Robert W. Numrich,
and Jon L. Steidel
Many computational scientists have adopted message-passing as their model for programming
distributed memory parallel computers. Moreover, they have discovered that the model
works equally well on shared memory machines or even across networks of workstations.
They have chosen this model, not because it is the easiest to program, but because
their physical problems and their numerical methods naturally yield to the method
of data decomposition. The programmer, immersed in the physical problem, things about
the work to be done at each local point and how it is effected by nearby points.
If the programmer has done a good job, the program advances using only local data
with occasional requirements for nonlocal data. Eventually nonlocal data begins to
influence local data and the programmer must be aware of the interaction. At appropriate
points in the program, the programmer fetches data from other regions of the physical
problem or sends data to other regions.
The message-passing programming model enforces a discipline on the programmer that
often pays for itself in the form of high parallel efficiency. A cumbersome expression
of the message-passing model through complicated libraries, however, reduces its
effectiveness and makes the programmer’s job unnecessarily burdensome. If the programmer
understands the physical problem, the parallelism inherent in it should be expressible
in a simple snytax compatible with the language being used, in our case Fortran 90.
Message-passing libraries, designed with portability and standardization as the overriding
requirement, often incur high overheads that cannot be avoided. High overhead translates
into high granularity. High granularity translates into hard programming.
Directive-based, pseudo-shared memory programming models, on the other hand, claim
to offer ease-of-use and even the ability to parallelize dusty-deck scalar codes.
So for, their success has been limited and their proposed mechanisms for handling
irregular or dynamically changing problems are awkward at best. Regular problems
involving dense linear algebra or fixed grids, have been solved. Irregular problems
involving sparse linear algebra or irregular changing grids are unsolved. An effective
approach to these problems must be found for scalable parallel computers to have
a future.
Writing good programs for parallel computers is easier than complicated programming
languages and arcane communication libraries suggest. The F- - extension to Fortran
90 described in this paper defines a simple Fortran-like snytax that is sufficient
to express all communication patterns encountered in parallel programs. The F- -
model was originally defined for Fortran 77 [4], in a substantially different form
that made sense for the language, and has recently been modified to a more complete
Fortran 90 version [5]. The syntax is architecture independent although its implementation,
as for any language syntax, is architecture dependent.
The F- - extension to Fortran 90 does not define a new language. Rather it builds
on the existing standard language. Whenever possible, to minimize the number of new
rules for the programmer to learn, the syntactic and semantic rules for F- - extensions
are the same as the rules for Fortran 90. When the rules must be modified, we specifically
state the changes.
An Array Data Flow Analysis Based Communication Optimizer
X. Yuan, R. Gupta, and R. Melhem
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
xyuan@cs.pitt.edu, gupta@cs.pitt.edu,
melhem@cs.pitt.edu
Global array data flow analysis computes precise data flow information required for
performing communication optimizations. However, to practically use this technique
on large programs, the analysis cost must be reduced. In this paper, we present an
efficient array data flow analysis based global communication optimizer. To manage
the analysis cost, our optimizer partitions the optimization data flow problems into
subproblems and solves the subproblems one at a time. By doing so, our scheme
offers some advantages over traditional data flow based optimization techniques.
First, the memory space requirement is greatly reduced. Second, set operations in
the analysis are simplified, which leads to the reduction of analysis time. Third,
the analysis time can be managed effectively. Our optimizer performs message vectorization,
global redundant communication elimination and global communication scheduling. We
are implementing the optimizer on top of the Stanford SUIF compiler. Our preliminary
experimental results suggest that array data flow analysis for communication optimization
can be efficient.
Integrating Automatic Data Alignment and
Array Operation Synthesis to Optimize Data Parallel Programs
Gwan-Hwan Hwang
Department of Computer Science
National Tsing-Hua University
Hsinchu, Taiwan
REPUBLIC OF CHINA
ghhwang@cs.nthu.edu.tw
Jenq Kuen Lee
jklee@cs.nthu.edu.tw
Fax: 886-3-5723694
Telephone: 886-3-5715131 Ext. 3519
Dz-Ching R. Ju
Hewlett-Packard Company
Cupertino, CA 95014
royju@cup.hp.com
Given a sequence of FORALL-LOOPs and array operations in data parallel languages
(such as Fortran 90), the automatic alignment problem is to find an alignment relationship
to align arrays so that incurred data communications will be minimized, when programs
are executed on distributed memory architectures. On the other hand, array operation
synthesis is to compose consecutive array operations or array expressions into a
composite access function of the source arrays at compile time to optimize data parallel
programs. Both automatic data alignment and array operations synthesis have been
shown to be very important and effective schemes to optimize data parallel programs.
However, they were considered separately so far by the research community. In this
paper, we address this open issue how to integrate the array operation synthesis
scheme into the automatic alignment process. We propose a new array alignment concept,
called segmented alignment, to help incorporate array operation synthesis scheme
into automatic alignment process. In addition, we derive a cost model and present
an optimal solution based on the cost model to integrate array synthesis into an
automatic alignment framework. We also show that the optimal problem is NP-hard.
Therefore, we develop a practical heuristic algorithm for compilers to incorporate
the synthesis strategy into an automatic alignment framework. Experiments done on
an 8-node DEC Farm show that the automatic alignment process with the help of array
operation synthesis and segmented alignment concepts can significantly outperform
the one without these mechanisms.
Automatic Data Decomposition for Message-Passing
Machines
Mirela Damian-Iordache and Sriram V. Pemmaraju
Department of Computer Science
The University of Iowa
Iowa City, IA 52242-1419
sriram@cs.uiowa.edu
The data distribution problem is very complex, because it involves
trade-off decisions between minimizing communication and maximizing
parallelism. A common approach towards solving this problem is to break the
data mapping into two stages: an alignment stage and a distribution stage.
The alignment stage attempts to increase parallelism, while the distribution
stage attempts to decrease communication overhead. As opposed to previous
approaches, we consider the alignment and distribution problems in a unified
framework, and attempt to simultaneously maximize parallelism and minimize
communication overhead. The problem becomes harder if dynamic remapping,
multi-dimensional distributions, array replications and control flow are
taken into account.
This paper formulates the full data decomposition problem that addresses
all these issues and presents a simple new algorithm to find the optimal
solution of the dynamic data distribution problem, given the number of
processors and a partitioning of the input program into phases. The
algorithm runs efficiently for small search spaces
(several hundreds of data distributions).
Table-Lookup Approach for Compiling Two-Level
Data–Processor Mappings in HPF
Kuei-Ping Shih and Jang-Ping Sheu
Department of Computer Science and Information Engineering
National Central University
Chung-Li 32054, Taiwan, ROC
steven@axpl.csie.ncu.edu.tw
sheujp@csie.ncu.edu.tw
Chua-Huang Huang
Department of Computer and Information Science
The Ohio State University
Columbus, OH 43210-1277
chh@cis.ohio-state.edu
This paper presents compilation techniques for block-cyclic data redistribution on
distributed-memory multicomputers. Since static data distribution is insufficient
in achieving high performance computing for many scientific and engineering applications,
dynamic data redistribution is the only alternative to get better performance. However,
data redistribution is costly at runtime. How to reduce the indexing and communication
overhead is very crucial for distributed-memory multicomputers. We propose compilation
techniques to efficiently generate communication sets for block-cyclic data redistribution,
assuming data-processor mapping is a two-level mapping. That is, related array objects
are first aligned with a template, an abstract index space; and the template is then
distributed onto the user-declared abstract processors. Since there are a repeated
pattern for two-level data-processor mapping, we use class table to record the global
indices of array elements in the first pattern. The global indices of array elements
on other repeated patterns can be obtained by adding some offset. Similarly, the
compressed local array also has a repeated pattern for a processor. We use compression
table to record the compressed local array for the first pattern. Other repeated
patterns can also be obtained accordingly. By using class table and compression table,
hole compression and communication set generation for performing data redistribution
can be easily and efficiently achieved. The proposed methods can save memory usage,
improve spatial locality and further increase system performance. The experimental
results do confirm the advantages of our proposed methods over existing methods.
Data Parallel Language Extensions for Exploiting
Locality in Irregular Problems
Guillermo P. Trabado and Emilio L. Zapata
Department of Computer Architecture
University of Málaga Complejo Tecnológico
P.O. Box 4114
E-29080 Málaga
SPAIN
guille@ac.uma.es
ezapata@ac.uma.es
Many large-scale computational applications contain irregular data access patterns
related to unstructured problem domains. Examples include finite element methods,
computational fluid dynamics, and molecular dynamics codes. Such codes are difficult
to parallelize efficiently with current HPF compilers. However, most of these problems
exhibit spatial locality. This property is exploited by our approach.
In the sequential program, unstructured domains are accessed via indirection arrays.
We introduce a new directive that serves to identify indirection arrays and the boundaries
of the associated domains. The data domains are distributed using Multiple Recursive
Decomposition (MRD), a pseudo-regular distribution, which combines efficient implementation
with good load balancing and communication behavior. Indirection arrays are aligned
with the data arrays. Using the information provided in the directive, the compiler
can produce a target program with significantly better performance than an approach
based on indirect distributions and the inspector/executor paradigm.
NAMD: A Case Study in Multilingual Parallel Programming
L.V. Kalé, M. Bhandarkar, R. Brunner, N. Krawetz, J. Phillips, and A. Shinozaki
Department of Computer Science and
Theoretical Biophysics Group
Beckman Institute
University of Illinois
Urbana, IL 61801
kale@cs.uiuc.edu
Parallel languages are tools for constructing efficient application programs, wile
reducing the required labor. In this light, multi-paradigm multilingual programming
is a natural idea: use the most appropriate tool for each component of a complex
system. The Converse system developed at Illinois addresses the Issues involved in
supporting multilingual applications. This paper describes the development of a large
parallel application in Computational Biophysics from the point of view of multilingual
programming. The application—a molecular dynamics program named NAMD 2—is implemented
using three different “paradigms” or languages: Charm++, a parallel message-driven
object language, PVM, and a portable threads package. The issues faced in implementing
such a system, and the advantages of multilingual approach are discussed. NAMD 2
is already operational on many parallel machines; some preliminary performance results
are presented and the lessons learned from this experience are discussed.
PCRC-based HPF Compilation
Guansong Zhang, Bryan Carpenter, Geoffrey Fox, Xiaoming Li, Xinying Li, Yuhong Wen
NPAC at Syracuse University
Syracuse, NY 13244
zgs@npac.syr.edu, dbc@npac.syr.edu,
gcf@npac.syr.edu, lxm@npac.syr.edu,
xli@npac.syr.edu, wen@npac.syr.edu
This paper describes an ongoing effort supported by ARPA PCRC (Parallel Compiler
Runtime Consortium) project. In particular, we introduce the design and implementation
of an HPF compilation system based on PCRC runtime. The approaches to issues such
as directive analysis, communication detection, and procedure interface processing
are discussed in detail. Since this is already an operational compilation system,
discussion is centered around a complete node program generated by the compiler.
Quantifying the Multi-Level Nature of Tiling
Interactions
Nicholas Mitchell, Larry Carter, Jeanne Ferrante, Karin Högstedt
Computer Science and Engineering Department
University of California at San Diego
La Jolla, CA 92093-0114
mitchell@cs.ucsd.edu
Optimizations, including tiling, often target a single level of memory or
parallelism, such as cache. These optimizations usually operate on a level-by-level
basis, guided by a cost function prameterized by features of that single level. The
benefit of optimizations guided by these one-level cost functions decreases as architectures
trend towards a hierarchy of memory and parallelism.
We look at three common architectural scenarios. For each, we quantify the improvement
a single tiling choice could realize by using information from multiple levels in
concert. To do so, we derive multi-level cost functions which guide the optimal
choice of tile size and shape. We give both analyses and simulation results to support
our points.
Exploiting Parallelism Through Directives
on the Nano-Threads Programming Model
Eduard Ayguadé, Xavier Martorell (tort@eio.upc.es),
Jesús Labarta, Marc Gonzàlez, and Nacho Navarro
Computer Architecture Department
Polytechnic University of Catalunya
cr Jordi Girona 1-3
Mòdul D6
08034 Barcelona
SPAIN
The ability of a application to efficiently use the resources shared with other applications
is a key feature needed to efficiently exploit the potential parallelism offered
by nowadays parallel architectures. In these multiprogrammed environments, the application
should be able to adapt the parallelism (kind and amount) that it is worth to be
exploited to the global utilization of system resources. In this paper we present
a programming model oriented to the hierarchical exploitation of unstructured parallelism
in multiprogrammed multiprocessor systems. The model offers a set of directives targeted
to be used by Fortran programmers that allow them to express the parallelism of the
application. The compiler is responsible for the generation of code that efficiently
exploits and manages this parallelism at run time. The code generated runs on top
of a user-level threads library that allow the program to decide and adapt himself
to parallelism that can be attained at any time.
“Optimal” Parallelism Through Integration
of Data and Control Parallelism: A Case Study in Complete Parallelization
Dwip Banerjee and J. C. Browne
Department of Computer Sciences
University of Texas
Austin, TX
http://www.cs.utexas.edu/users/
browne@cs.utexas.edu, dwip@cs.utexas.edu
This paper presents a detailed case study of programming in a parallel programming
system which targets complete and controlled parallelization of array-oriented computations.
The purpose is to demonstrate how coherent integration of control and data parallelism
enables both effective realization of the potential parallelism of applications and
matching of the degree of parallelism in a program to the resources of the execution
environment. (“Our ability to reason is constrained by the language in which we reason.”).
The programming system is based on an integrated graphical, declarative representation
of control parallelism and data partitioning parallelism. The computation used for
the example is even-odd reduction of block triangular matrices. This computation
has three phases each with a different parallel structure. We derive, implement and
measure the execution of a dynamic parallel computation structure which employs differing
levels of control and data parallelism in each phase of the computation to give load
balanced execution across a range of number of processors. The program formulated
in the integrated representation revealed parallelism not shown in the original algorithm
and has a constant level of actual parallelism throughout the computation where the
original algorithm had unbalanced levels of parallelism during different phases of
the computation. The resulting program shows near-linear speed-up across all phases
of the computation for number of processors ranging from 2 to 32.
A Unified Software Pipeline Construction Scheme
for Modulo Scheduled Loops
Benoît Dupont de Dinechin
(on leave from)
CEA CEL-V
94195 Villeneuve St. Georges cedex
FRANCE
bd3@cs.mcgill.ca
We present a software pipeline construction scheme for D0-loops, while-loops,
and loops with multiple exits, which unifies, simplifies, and generalizes, the separate
techniques previously required to build a complete software pipeline from a local
schedule computed by modulo scheduling. In the setting of this software pipeline
construction scheme, we demonstrate a simple way of implementing modulo expansion.
Then we introduce inductive relaxation, a technique that supplements modulo
expansion when the variable to expand is a simple induction. These techniques
do not require any architectural support from the target processor, and have been
extensively tested as part of the software pipeliner that comes with the 3.0 compiler
releases for the Cray
T3E™ massively parallel computer.
Reducing Synchronization Overhead
for Compiler-Parallelized Codes on Software DSMs
Hwansoo Han, Chau-Wen Tseng, and Pete Keleher
Department of Computer Science
University of Maryland
College Park, MD 20742
hshan@cs.umd.edu, tseng@cs.umd.edu,
keleher@cs.umd.edu
Software distributed-shared-memory (DSM) systems provide an appealing target for
parallelizing compilers due to their flexibility. Previous studies demonstrate such
systems can provide performance comparable to message-passing compilers for dense-matrix
kernels. However, synchronization and load imbalance are significant sources of overhead.
This paper investigates three compiler and run time techniques for reducing synchronization
overhead: 1) compile-time elimination of barriers, 2) nearest-neighbor synchronization,
and 3) increasing coherence unit size. Our compile-time barrier elimination algorithm
extends previous techniques by exploiting chunk iteration partitioning with local
subscript analysis and exploiting delayed updates in lazy-release-consistency DSMs
to eliminate barriers for cross-processor anti-dependences. Experiments on an IBM
SP-2 indicate these techniques can improve parallel performance by 10% on average
and by up to 40% for some applications.
Simplifying Control Flow in Compiler-Generated
Parallel Code
John Mellor-Crummey and Vikram Adve
Department of Computer Science
M/S 132
Rice University
6100 Main Street
Houston, TX 77005-1892
johnmc@rice.edu, adve@rice.edu
Optimizing compilers for high-level languages such as High Performance Fortran (HPF)
perform a sequence of complex code generation steps to synthesize an efficient parallel
program. It is possible to reduce compiler software complexity and to avoid repeated
global reanalysis by having each code generation step operate with only partial information
about the shape of the code that other code generation steps produce. As a consequence,
the initial generated code may contain excess conditional tests and control flow
that can be eliminated by exploiting contextual knowledge about the generated code.
Here, we describe a powerful algorithm that computes symbolic constraints on values
of integer variables. Constraints arise from loop bounds and strides, conditional
branches, and synthetic assertions. A second algorithm uses these constraints to
simplify control flow. These algorithms have been implemented in the Rice dHPF compiler.
We present a preliminary experimental evaluation which shows that these algorithm
are effective in reducing the number of conditionals, code size, and overall execution
time for code generated by dHPF. We also describe the synergy between control flow
simplification and other optimizations in the compiler that together achieve the
effects of specific optimizations for data-parallel programs such as vector message
pipelining and use of overlap areas.
Concurrent Static Single Assignment Form and
Constant Propagation for Explicitly Parallel Programs
Jaejin Lee, Samuel Midkiff, and David A. Padua
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61802
j-lee@cs.uiuc.edu, padua@cs.uiuc.edu
Static Single Assignment (SSA) form has shown its usefulness for powerful code optimization
techniques, such as constant propagation, of sequential programs. We introduce Concurrent
Static Single Assignment (CSSA) form and the transformation algorithm for the explicitly
parallel programs with interleaving semantics and post-wait synchronization. The
parallel construct considered in this paper is cobegin/coend. A new concept, p-assignment,
which summarizes the information of interleaving statements among threads, is introduced.
Concurrent Control Flow Graph, which contains the information of conflicting statements
in addition to control flow and synchronization information, is used as an intermediate
representation for the CSSA transformation. A parallel extension of the Sparse Conditional
Constant propagation algorithm based on the CSSA form makes it possible to apply
the constant propagation optimization to explicitly parallel programs.
Identifying DEF/USE Information of
Statements that Construct and Traverse Dynamic Recursive Data Structures
Yuan-Shin Hwang and Joel Saltz
Department of Computer Science
University of Maryland
College Park, MD 20742
shin@cs.umd.edu, saltz@cs.umd.edu
Pointer analysis is essential for optimizing and parallelizing compilers. It examines
pointer assignment statements and estimates pointer-induced aliases among pointer
variables or possible shapes of dynamic recursive data structures. However, previously
proposed techniques perform pointer analysis without the knowledge of traversing
patterns of dynamic recursive data structures to be constructed. This paper presents
an algorithm to identify the traversing patterns of recursive data structures and
propagate this information back to those statements that define the data structures.
This approach recognizes the DEF/USE relationships between the statements that define
and traverse dynamic recursive data structures. The outcome of this technique will
be useful for pointer analysis and parallelization.
Code Generation for Complex Subscripts in Data-Parallel
Programs
J. Ramanujam, Swaroop Dutta, and Arun Venkatachar
Department of Electrical and Computer Engineering
Louisiana State University
Baton Rouge, LA 70803-5901
swaroop@ee.lsu.edu, jxr@ee.lsu.edu,
arun@ee.lsu.edu
Data parallel languages like High Performance Fortran, demand efficient compile and
run-time techniques for tasks such as address generation. Array references with arbitrary
affine subscripts can make the task of compilers for such languages highly involved.
This paper deals with the efficient address generation in programs with array references
having two types of commonly encountered affine references, namely coupled subscripts
and subscripts containing multiple induction variables (MIVs). Methods discussed
in the paper utilize the repetitive pattern of the memory accesses. In the case of
MIV, we address this issue by presenting runtime techniques which enumerate the set
of addresses in lexicographic order. Our approach to the problem incorporates a general
approach of computing in O (k) time, the start element on a processor
for a given global start element. Several methods are proposed and evaluated here
for generating the access sequences for MIV based on problem parameters. With coupled
subscripts, we present two construction techniques, namely searching and hashing
which minimize the time needed to construct the tables. Extensive experiments were
conducted and the results were then compared to indicate the efficiency of our approach.
The SPNT Test: A New Technology for Run-Time Speculative
Parallelization of Loops
Tsung-Chuan Huang and Po-Hsueh Hsu
Department of Electrical Engineering
National Sun Yat-Sen University
Taiwan
R.O.C.
tch@mail.nsysu.edu.tw
Telephone: +886-7-5252000 Ext. 4140
Fax: +886-7-5254199
The only way for parallelizing compilers to exploit potential parallelism of loops
in which dependence information is inadequate statically is using run-time loop parallelization
technique. There are two approaches in this field: the inspector-executor method
and the speculative DOALL test. For the former approach, there always incurs heavy
preprocessing overhead during inspector phase and synchronization barrier burden
besides load imbalance impact in executor. In this paper, a new proposal for a highly
practicable speculative parallelization test, the SPNT test (Speculative Parallelization
with New Technology), is presented. Speculative parallel execution as DOALL actually
obtains the biggest speedup if the loop is in fact a DOALL loop, otherwise, it will
suffer rather extent penalty. The objective of SPNT test is twofold. The first is
to increase the success rate by ignoring avoidable dependence restrictions. The second
is to reduce the failure penalty by detecting the unavoidable data dependences and
quitting the speculative parallel execution as soon as possible. As the result, the
SPNT test can greatly improve the practicability of the speculative parallel execution.
A Systematic Approach to Branch Speculation
Gianfranco Bilardi
DEI
Università di Padova
ITALY
Alex Nicolau
University of California at Irvine
Irvine, CA
Joe Hummel
University of California at Irvine
Irvine, CA
A general theoretical framework is developed for the study of branch speculation.
The framework yields a systematic way to select the schedule in a given set that,
for any (estimated) bias of the branch, minimizes the expected execution time. Among
other things, it is shown that in some cases the optimal schedule is neither of those
resulting from aggressively speculating on any given outcome of the conditional.
Our results can be useful in either static or dynamic approaches. We propose a simple
run-time estimator for the bias and discuss how to combine it with schedule selection.
A number of examples motivate and illustrate the techniques, and show that our approach
yields better performance in the case of highly unpredictable branches.
Analysis and Optimization of Explicitly Parallel
Programs Using the Parallel Program Graph Representation
Vivek Sarkar
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139
vivek@lcs.mit.edu
Major changes in processor architecture over the last decade have created a demand
for new compiler optimization technologies. Optimizing compilers have risen to this
challenge by steadily increasing the uniprocessor performance gap between optimized
compiled and unoptimized compiled code to a level that already exceeds the performance
gap between two successive generations of processor hardware. These traditional optimizations
have been developed in the context of sequential programs—the assumption of sequential
control flow is intrinsic to the definition of basic optimization data structures
such as the control flow graph (CFG), and pervades all the optimization algorithms.
As more and more programs are written in explicitly parallel programming languages,
it becomes essential to extend the scope of sequential analysis and optimization
techniques to explicitly parallel programs. This extension is necessary for maintaining
single-processor performance in parallel programs and also for adapting the parallelism
in the program to the target parallel machine. By an explicitly parallel programming
language, we mean a programming language that contains primitives for creating, terminating,
and synchronizing concurrent (logical) threads of activity. The primitives may be
in the form of syntactic extensions (e.g., cobegin-coend [Han75]), directives (e.g.,
HPF [Hig92]), or library calls (e.g., start (), run(), wait (), etc. In Java). In
this paper, we assume that the parallel program is translated to the parallel
program graph (PPG) representation [Sar92, SS93] and focus our attention of performing
compiler analysis and optimization on PPGs.
Program Optimization for Concurrent Multithreaded
Architectures
Jenn-Yuan Tsai
Department of Computer Science
University of Illinois
Urbana, IL 61801
Zhenzhen Jiang and Pen-Chung Yew
Department of Computer Science
University of Minnesota
200 Union Street SE
Minneapolis, MN 55455
Telephone: (612) 625-7876
jiang015@maroon.tc.umn.edu
Pen-Chung Yew
Department of Computer Science
University of Minnesota
200 Union Street SE
Minneapolis, MN 55455
Telephone: (612) 625-7387
yewxx001@gold.tc.umn.edu
This paper studies some compiler and program transformation techniques for concurrent
multithreaded architectures, in particular the superthreaded architecture, which
adopts a thread pipelining execution model that allows threads with data dependencies
and control dependencies to be executed in parallel. In the study, we identify several
important program analysis and transformation techniques that allow the superthreaded
architecture to exploit more parallelism in programs with less run-time overhead.
We evaluate the performance to the superthreaded architecture and the effectiveness
of the program transformation techniques by manually compiling several benchmark
programs and running them through a trace-driven, cycle-by-cycle superthreaded processor
simulator. The simulation results show that a superthreaded processor can achieve
good speedups for most of the benchmark programs with the proposed program transformation
techniques applied.
Interactive Compilation and Performance
Analysis with URSA MINOR
Insung Park, Michael Voss, Brian Armstrong, and
Rudolf Eigenmann (eigneman@ecn.purdue.edu)
School of Electrical and Computer Engineering
Purdue University
1285 Electrical Engineering Bldg.
West Lafayette, IN 47907-1285
Telephone: (765) 494-1741
Fax: (765) 494-6440
eigneman@ecn.purdue.edu
This paper proposes solutions to two important problems with parallel programming
environments that were not previously addressed. The first issue is that current
compilers are typically black-box tools with which the user has little interaction.
Information gathered by the compiler, although potentially very meaningful for the
user, is often inaccessible or hard to decipher. Second, compilation and performance
analysis tools are not well integrated. While there are many advanced instruments
for gathering and browsing performance results of a program, it is difficult to relate
this information to the source program, to the applied program transformations, and
to the compiler’s reasoning.
The URSA MINOR tool addresses these issues. The tool is designed to help understand
the structure of a program and the information gathered by a compiler in an interactive
way. It facilitates the comparison of performance results under different environments
and the identification of potential parallelism, and it provides a repository for
this information. URSA MINOR is built using the Polaris compiler infrastructure.
We present case studies that show how programmers can use the tool to find additional
parallelism in a compiler-optimized program and to characterize the performance of
parallel applications. The tools are currently being used in several projects to
develop and study parallel applications and to evaluate parallelizing compilers.
These efforts provide feedback for improving the URSA MINOR tool.
A Compiler Abstraction for Machine Independent
Parallel Communication Generation
Bradford L. Chamberlain, Sung-Eun Choi, and Lawrence Snyder
Department of Computer Science and Engineering
Box 352350
University of Washington
Seattle, WA 98195-2350
Telephone: (206) 543-0374
brad@cs.washington.edu, sungeun@cs.washington.edu,
snyder@cs.washington.edu
In this paper, we consider the problem of generating efficient, portable communication
in compilers for parallel languages. We introduce the IRONMAN abstraction and approach,
which separates data transfer from its implementing communication paradigm. This
is done by annotating the compiler-generated code with legal ranges for data transfer
in the form of calls to the IRONMAN library. On each target platform, these library
calls are instantiated to perform the transfer using the machine’s optimal communication
paradigm. We confirm arguments against generating message passing calls in the compiler
based on our experiences using PVM and MPI-—specifically, the observation that these
interfaces do not perform well on machines that are not built with a message passing
communication paradigm. The overhead for using IRONMAN, as opposed to a machine-specific
back end, is demonstrated to be negligible. We give performance results for a number
of benchmarks running with PVM, MPI, and machine-specific implementations of the
IRONMAN abstraction, yielding performance improvements of 2-13% without any source
code or compiler modifications.
Static Analysis of Recursive Data Structures
D.K. Arvind and T.A. Lewis
Department of Computer Science
The University of Edinburgh
Mayfield Road
Edinburgh EH9 3JZ
SCOTLAND
The Static analysis of imperative languages is made difficult by the unrestricted
pointer structures which makes their static analysis difficult and imprecise. A static
analysis method is proposed for a class of graph structures with commuting and inverse
edges. The method hinges on a path normalization scheme which can be described as
a Markov algorithm. This provides a general framework for static analysis of other
structures with different relations over the edges. The method is illustrated with
an example.
Reuse-Driven Tiling for Data Locality
Jingling Xue
Department of Mathematics, Statistics,
and Computing Science
University of New England
Armidale 2351
AUSTRALIA
xue@turing.une.edu.au
Chua-Huang Huang
Department of Computer and
Information Sciences
The Ohio State University
Columbus, OH 43210-1277
chh@cis.ohio-state.edu
This paper studies the problem of applying unimodular transformations and tiling
to improve the cache locality of a loop nest. Due to dependence constraints and reuse
information, not all loops will and can be tiled. Therefore, the tiling approach
proposed consists of three phases, which are to determinate (a) a subspace of the
iteration space to be tiled, (b) the tile shape, and (c) the tile size. This paper
focuses on the first phase. By using cones to represent the dependences and by using
vectors to represent the reuse information in the program, a resue-driven approach
is presented to improve data locality of the program based on matrix transformations.
In the special case of a singly fully permutable loop nest, the data locality problem
is formulated as an optimization problem and solved optimally based on the time-cone
algorithm introduced in the paper. In the general case, an algorithm is presented
that constructs the tiled loop nest incrementally, loop by loop, starting from the
outermost one so that as much reuse as possible is carried in the innermost tiled
loops.
Experiences with Loop Parallelization in javar
(a Prototype Restructuring Compiler for Java)
Aart J.C. Bik, Juan E. Villacis, and Dennis B. Gannon
Computer Science Department
Indiana University
Lindley Hall 215
Bloomington, IN 47495-4101
ajcbik@cs.indiana.edu
In this paper we discuss some experiences with javar, which is a prototype Java restructuring
compiler that supports the automatic conversion of serial loops into constructs that
use the multi-threading mechanism of the language to exploit any parallelism in these
loops. Expressing parallelism in Java itself clearly has the advantage that the transformed
program remains portable. After compilation of the transformed Java program into
bytecode, speedup can be obtained on any platform on which the Java bytecode interpreter
supports the true parallel execution of threads, whereas we will see that the threads
only induce a slight overhead on uni-processors.
The Aggregate Function API: It’s Not Just For
PAPERS Anymore
H.G. Dietz, T.I Mattox, and G. Krishnamurthy
School of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907-1285
Telephone: (765) 494-3357
hankd@ecn.purdue.edu, tmattox@ecn.purdue.edu,
gayathri@ecn.purdue.edu
http://garage.ecn.purdue.edu/~papers/
The concept of “data parallelism” is a pervasive force throughout parallel processing.
Although a certain level of processing-element autonomy can help performance, the
fact is that many parallel algorithms, applications, and compiler analysis techniques
focus on identifying a set of data objects that can be processed using loosely synchronous
parallelism. Thus, it is not surprising that a large number of communication libraries
support at least a few synchronized aggregate operations on data. Over the past few
years, we have developed eleven different types of PAPERS (Purdue’s Adapter for Parallel
Execution and Rapid Synchronization) hardware specifically to efficiently implement
aggregate functions for clusters of PCs or workstations.
The Aggregate Function Application Program Interface (AFAPI) library was initially
designed to be a portable high-level interface to the various types of PAPERS cluster
hardware, so one would expect it to work well using this custom hardware, and it
does work well. In this paper, we show that the AFAPI is also an efficient programming
model for other types of parallel systems—especially shared memory multiprocessors.
For many operations, AFAPI can outperform threads libraries and other more traditional
shared memory programming models.
Program Analysis of Overlap Area Usage in Self-Similar
Parallel Programs
Aaron Sawdey
Electrical Engineering Department
University of Minnesota
4-174 EE/CSci
Minneapolis, MN 55455
Telephone: (612) 625-3583
sawdey@maroon.ec.umn.edu
Matthew O’Keefe
Electrical Engineering Department
University of Minnesota
4-174 EE/CSci
Minneapolis, MN 55455
Telephone: (612) 625-6306
okeefe@ee.umn.edu
Highly parallel computers have the memory capacity and potential speed to perform
very high-resolution time-dependent calculations. Parallel computers with hundreds
of fast processors will require highly scalable algorithms to avoid wasting expensive
resources. To fully exploit scalable algorithms on these highly parallel machines
careful attention must be paid to program design. We have proposed a programming
model that can express a class of scaleable algorithms. In this paper we show how
compiler analysis can ease the task of writing programs within this programming model.
We have used this programming model manually and have achieved good results in reducing
communication and synchronization overhead. We describe early results of a prototype
compiler transformation tool that applies this analysis to real programs.
![[Supercomputing Institute Homepage]](http://www.msi.umn.edu/general/Buttons/MSIhome.gif)
|
URL: http://www.msi.umn.edu/general/Symposia/Compiler/abstracts.html
|
|
webmaster@msi.umn.edu
Last modified: August 5, 1997 at 8:36 AM