Abstracts - The 10th International Workshop on Languages and Compilers for Parallel Computing


Lowering HPF Procedure Interface to a Canonical Representation

Jan Borowiec
BMD FIRST Research Institute for Computer
Architecture and Software Technology
Rudower Chaussee 5
12489 Berlin
GERMANY
borowiec@first.gmd.de

Arthur Veen
Parallel Computing
Amsterdam
THE NETHERLANDS
arthur@parcom.nl

Handling the procedure interface in an HPF compiler is complex due to the many possible combinations of Fortran 90/JPF properties of an actual array argument and its associated dummy argument. This paper describes an algorithm that reduces this complexity of mapping all the combinations of properties to a small set of canonical Internal Representations. These internal representations as well as the necessary run-time decsriptors are also described. The algorithm has been implemented in the commercial HPF compiler produced by the PREPARE project.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Java as a Language for Scientific Parallel Programming

Bryan Carpenter, Yuh-Jye Chang, Geoffrey Fox, Xiaoming Li
Northeast Parallel Architectures Centre
Syracuse University
Syracuse, New York
dbc@npac.syr.edu, yjchang@npac.syr.edu, gcf@npac.syr.edu, lxm@npac.syr.edu

Java may be a natural language for portable parallel programming. We discuss the basis of this claim in general terms, then illustrate the use of Java for message-passing and data-parallel programming through series of case studies. In the process we introduce some proposals for a Java binding of MPI, and describe the use of a Java class-library to implement HPF-style distributed data. Prospects for future Java-based parallel programming environments are discussed.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




A Comparison of Three Global Code Scheduling Algorithms

Pohua Chang
Intel Corporation
M/S RN6–18
2200 Mission College Blvd.
Santa Clara, CA 95052-8119
pohua@gomez.sc.intel.com
Telephone: (408) 765-6423

Instruction scheduling is an important compiler code transformation step for generating code for pipelined and VLIW processors. Dependencies among instructions need to be enforced. In this paper, we compare the effectiveness of three global code scheduling algorithms (HS, EHS, GCM). Hyperblock scheduling (HS) is a path-based scheduling algorithm. Enhanced Hyerblock Scheduling (EHS) is an improvement over HS by merging path schedules. Global Code Motion (GCM) integrates some instruction-level parallelization techniques with upward code motion to permit more code motion freedom. We apply the three algorithms to the most time consuming loops in SPEC95 CINT programs. The experimental results show that EHS performs as well as GCM. EHS is much simpler to implement than GCM. Therefore, EHS is the most cost effective global code scheduling algorithm among the three algorithms.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Definition of the F– – Extension to Fortran 90


Benoit Dupont de Dinechin
bc3@cs.mcgill.ca
Gary Elsesser, Greg Fischer, Brian H. Johnson, Tom MacDonald, Robert W. Numrich, and Jon L. Steidel

Many computational scientists have adopted message-passing as their model for programming distributed memory parallel computers. Moreover, they have discovered that the model works equally well on shared memory machines or even across networks of workstations. They have chosen this model, not because it is the easiest to program, but because their physical problems and their numerical methods naturally yield to the method of data decomposition. The programmer, immersed in the physical problem, things about the work to be done at each local point and how it is effected by nearby points. If the programmer has done a good job, the program advances using only local data with occasional requirements for nonlocal data. Eventually nonlocal data begins to influence local data and the programmer must be aware of the interaction. At appropriate points in the program, the programmer fetches data from other regions of the physical problem or sends data to other regions.

The message-passing programming model enforces a discipline on the programmer that often pays for itself in the form of high parallel efficiency. A cumbersome expression of the message-passing model through complicated libraries, however, reduces its effectiveness and makes the programmer’s job unnecessarily burdensome. If the programmer understands the physical problem, the parallelism inherent in it should be expressible in a simple snytax compatible with the language being used, in our case Fortran 90. Message-passing libraries, designed with portability and standardization as the overriding requirement, often incur high overheads that cannot be avoided. High overhead translates into high granularity. High granularity translates into hard programming.

Directive-based, pseudo-shared memory programming models, on the other hand, claim to offer ease-of-use and even the ability to parallelize dusty-deck scalar codes. So for, their success has been limited and their proposed mechanisms for handling irregular or dynamically changing problems are awkward at best. Regular problems involving dense linear algebra or fixed grids, have been solved. Irregular problems involving sparse linear algebra or irregular changing grids are unsolved. An effective approach to these problems must be found for scalable parallel computers to have a future.

Writing good programs for parallel computers is easier than complicated programming languages and arcane communication libraries suggest. The F- - extension to Fortran 90 described in this paper defines a simple Fortran-like snytax that is sufficient to express all communication patterns encountered in parallel programs. The F- - model was originally defined for Fortran 77 [4], in a substantially different form that made sense for the language, and has recently been modified to a more complete Fortran 90 version [5]. The syntax is architecture independent although its implementation, as for any language syntax, is architecture dependent.

The F- - extension to Fortran 90 does not define a new language. Rather it builds on the existing standard language. Whenever possible, to minimize the number of new rules for the programmer to learn, the syntactic and semantic rules for F- - extensions are the same as the rules for Fortran 90. When the rules must be modified, we specifically state the changes.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




An Array Data Flow Analysis Based Communication Optimizer

X. Yuan, R. Gupta, and R. Melhem
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
xyuan@cs.pitt.edu, gupta@cs.pitt.edu, melhem@cs.pitt.edu

Global array data flow analysis computes precise data flow information required for performing communication optimizations. However, to practically use this technique on large programs, the analysis cost must be reduced. In this paper, we present an efficient array data flow analysis based global communication optimizer. To manage the analysis cost, our optimizer partitions the optimization data flow problems into subproblems and solves the subproblems one at a time. By doing so, our scheme offers some advantages over traditional data flow based optimization techniques. First, the memory space requirement is greatly reduced. Second, set operations in the analysis are simplified, which leads to the reduction of analysis time. Third, the analysis time can be managed effectively. Our optimizer performs message vectorization, global redundant communication elimination and global communication scheduling. We are implementing the optimizer on top of the Stanford SUIF compiler. Our preliminary experimental results suggest that array data flow analysis for communication optimization can be efficient.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Integrating Automatic Data Alignment and Array Operation Synthesis to Optimize Data Parallel Programs


Gwan-Hwan Hwang
Department of Computer Science
National Tsing-Hua University
Hsinchu, Taiwan
REPUBLIC OF CHINA
ghhwang@cs.nthu.edu.tw

Jenq Kuen Lee
jklee@cs.nthu.edu.tw
Fax: 886-3-5723694
Telephone: 886-3-5715131 Ext. 3519

Dz-Ching R. Ju
Hewlett-Packard Company
Cupertino, CA 95014
royju@cup.hp.com

Given a sequence of FORALL-LOOPs and array operations in data parallel languages (such as Fortran 90), the automatic alignment problem is to find an alignment relationship to align arrays so that incurred data communications will be minimized, when programs are executed on distributed memory architectures. On the other hand, array operation synthesis is to compose consecutive array operations or array expressions into a composite access function of the source arrays at compile time to optimize data parallel programs. Both automatic data alignment and array operations synthesis have been shown to be very important and effective schemes to optimize data parallel programs. However, they were considered separately so far by the research community. In this paper, we address this open issue how to integrate the array operation synthesis scheme into the automatic alignment process. We propose a new array alignment concept, called segmented alignment, to help incorporate array operation synthesis scheme into automatic alignment process. In addition, we derive a cost model and present an optimal solution based on the cost model to integrate array synthesis into an automatic alignment framework. We also show that the optimal problem is NP-hard. Therefore, we develop a practical heuristic algorithm for compilers to incorporate the synthesis strategy into an automatic alignment framework. Experiments done on an 8-node DEC Farm show that the automatic alignment process with the help of array operation synthesis and segmented alignment concepts can significantly outperform the one without these mechanisms.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Automatic Data Decomposition for Message-Passing Machines

Mirela Damian-Iordache and Sriram V. Pemmaraju
Department of Computer Science
The University of Iowa
Iowa City, IA 52242-1419
sriram@cs.uiowa.edu

The data distribution problem is very complex, because it involves trade-off decisions between minimizing communication and maximizing parallelism. A common approach towards solving this problem is to break the data mapping into two stages: an alignment stage and a distribution stage. The alignment stage attempts to increase parallelism, while the distribution stage attempts to decrease communication overhead. As opposed to previous approaches, we consider the alignment and distribution problems in a unified framework, and attempt to simultaneously maximize parallelism and minimize communication overhead. The problem becomes harder if dynamic remapping, multi-dimensional distributions, array replications and control flow are taken into account.

This paper formulates the full data decomposition problem that addresses all these issues and presents a simple new algorithm to find the optimal solution of the dynamic data distribution problem, given the number of processors and a partitioning of the input program into phases. The algorithm runs efficiently for small search spaces (several hundreds of data distributions).


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Table-Lookup Approach for Compiling Two-Level Data–Processor Mappings in HPF

Kuei-Ping Shih and Jang-Ping Sheu
Department of Computer Science and Information Engineering
National Central University
Chung-Li 32054, Taiwan, ROC
steven@axpl.csie.ncu.edu.tw
sheujp@csie.ncu.edu.tw

Chua-Huang Huang
Department of Computer and Information Science
The Ohio State University
Columbus, OH 43210-1277
chh@cis.ohio-state.edu

This paper presents compilation techniques for block-cyclic data redistribution on distributed-memory multicomputers. Since static data distribution is insufficient in achieving high performance computing for many scientific and engineering applications, dynamic data redistribution is the only alternative to get better performance. However, data redistribution is costly at runtime. How to reduce the indexing and communication overhead is very crucial for distributed-memory multicomputers. We propose compilation techniques to efficiently generate communication sets for block-cyclic data redistribution, assuming data-processor mapping is a two-level mapping. That is, related array objects are first aligned with a template, an abstract index space; and the template is then distributed onto the user-declared abstract processors. Since there are a repeated pattern for two-level data-processor mapping, we use class table to record the global indices of array elements in the first pattern. The global indices of array elements on other repeated patterns can be obtained by adding some offset. Similarly, the compressed local array also has a repeated pattern for a processor. We use compression table to record the compressed local array for the first pattern. Other repeated patterns can also be obtained accordingly. By using class table and compression table, hole compression and communication set generation for performing data redistribution can be easily and efficiently achieved. The proposed methods can save memory usage, improve spatial locality and further increase system performance. The experimental results do confirm the advantages of our proposed methods over existing methods.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Data Parallel Language Extensions for Exploiting Locality in Irregular Problems

Guillermo P. Trabado and Emilio L. Zapata
Department of Computer Architecture
University of Málaga Complejo Tecnológico
P.O. Box 4114
E-29080 Málaga
SPAIN
guille@ac.uma.es
ezapata@ac.uma.es

Many large-scale computational applications contain irregular data access patterns related to unstructured problem domains. Examples include finite element methods, computational fluid dynamics, and molecular dynamics codes. Such codes are difficult to parallelize efficiently with current HPF compilers. However, most of these problems exhibit spatial locality. This property is exploited by our approach.

In the sequential program, unstructured domains are accessed via indirection arrays. We introduce a new directive that serves to identify indirection arrays and the boundaries of the associated domains. The data domains are distributed using Multiple Recursive Decomposition (MRD), a pseudo-regular distribution, which combines efficient implementation with good load balancing and communication behavior. Indirection arrays are aligned with the data arrays. Using the information provided in the directive, the compiler can produce a target program with significantly better performance than an approach based on indirect distributions and the inspector/executor paradigm.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




NAMD: A Case Study in Multilingual Parallel Programming

L.V. Kalé, M. Bhandarkar, R. Brunner, N. Krawetz, J. Phillips, and A. Shinozaki
Department of Computer Science and
Theoretical Biophysics Group
Beckman Institute
University of Illinois
Urbana, IL 61801
kale@cs.uiuc.edu

Parallel languages are tools for constructing efficient application programs, wile reducing the required labor. In this light, multi-paradigm multilingual programming is a natural idea: use the most appropriate tool for each component of a complex system. The Converse system developed at Illinois addresses the Issues involved in supporting multilingual applications. This paper describes the development of a large parallel application in Computational Biophysics from the point of view of multilingual programming. The application—a molecular dynamics program named NAMD 2—is implemented using three different “paradigms” or languages: Charm++, a parallel message-driven object language, PVM, and a portable threads package. The issues faced in implementing such a system, and the advantages of multilingual approach are discussed. NAMD 2 is already operational on many parallel machines; some preliminary performance results are presented and the lessons learned from this experience are discussed.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




PCRC-based HPF Compilation

Guansong Zhang, Bryan Carpenter, Geoffrey Fox, Xiaoming Li, Xinying Li, Yuhong Wen
NPAC at Syracuse University
Syracuse, NY 13244
zgs@npac.syr.edu, dbc@npac.syr.edu, gcf@npac.syr.edu, lxm@npac.syr.edu, xli@npac.syr.edu, wen@npac.syr.edu

This paper describes an ongoing effort supported by ARPA PCRC (Parallel Compiler Runtime Consortium) project. In particular, we introduce the design and implementation of an HPF compilation system based on PCRC runtime. The approaches to issues such as directive analysis, communication detection, and procedure interface processing are discussed in detail. Since this is already an operational compilation system, discussion is centered around a complete node program generated by the compiler.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Quantifying the Multi-Level Nature of Tiling Interactions

Nicholas Mitchell, Larry Carter, Jeanne Ferrante, Karin Högstedt
Computer Science and Engineering Department
University of California at San Diego
La Jolla, CA 92093-0114
mitchell@cs.ucsd.edu

Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a level-by-level basis, guided by a cost function prameterized by features of that single level. The benefit of optimizations guided by these one-level cost functions decreases as architectures trend towards a hierarchy of memory and parallelism.

We look at three common architectural scenarios. For each, we quantify the improvement a single tiling choice could realize by using information from multiple levels in concert. To do so, we derive multi-level cost functions which guide the optimal choice of tile size and shape. We give both analyses and simulation results to support our points.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Exploiting Parallelism Through Directives on the Nano-Threads Programming Model

Eduard Ayguadé, Xavier Martorell (tort@eio.upc.es), Jesús Labarta, Marc Gonzàlez, and Nacho Navarro
Computer Architecture Department
Polytechnic University of Catalunya
cr Jordi Girona 1-3
Mòdul D6
08034 Barcelona
SPAIN

The ability of a application to efficiently use the resources shared with other applications is a key feature needed to efficiently exploit the potential parallelism offered by nowadays parallel architectures. In these multiprogrammed environments, the application should be able to adapt the parallelism (kind and amount) that it is worth to be exploited to the global utilization of system resources. In this paper we present a programming model oriented to the hierarchical exploitation of unstructured parallelism in multiprogrammed multiprocessor systems. The model offers a set of directives targeted to be used by Fortran programmers that allow them to express the parallelism of the application. The compiler is responsible for the generation of code that efficiently exploits and manages this parallelism at run time. The code generated runs on top of a user-level threads library that allow the program to decide and adapt himself to parallelism that can be attained at any time.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




“Optimal” Parallelism Through Integration of Data and Control Parallelism: A Case Study in Complete Parallelization

Dwip Banerjee and J. C. Browne
Department of Computer Sciences
University of Texas
Austin, TX
http://www.cs.utexas.edu/users/
browne@cs.utexas.edu, dwip@cs.utexas.edu

This paper presents a detailed case study of programming in a parallel programming system which targets complete and controlled parallelization of array-oriented computations. The purpose is to demonstrate how coherent integration of control and data parallelism enables both effective realization of the potential parallelism of applications and matching of the degree of parallelism in a program to the resources of the execution environment. (“Our ability to reason is constrained by the language in which we reason.”). The programming system is based on an integrated graphical, declarative representation of control parallelism and data partitioning parallelism. The computation used for the example is even-odd reduction of block triangular matrices. This computation has three phases each with a different parallel structure. We derive, implement and measure the execution of a dynamic parallel computation structure which employs differing levels of control and data parallelism in each phase of the computation to give load balanced execution across a range of number of processors. The program formulated in the integrated representation revealed parallelism not shown in the original algorithm and has a constant level of actual parallelism throughout the computation where the original algorithm had unbalanced levels of parallelism during different phases of the computation. The resulting program shows near-linear speed-up across all phases of the computation for number of processors ranging from 2 to 32.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




A Unified Software Pipeline Construction Scheme for Modulo Scheduled Loops

Benoît Dupont de Dinechin
(on leave from)
CEA CEL-V
94195 Villeneuve St. Georges cedex
FRANCE
bd3@cs.mcgill.ca

We present a software pipeline construction scheme for D0-loops, while-loops, and loops with multiple exits, which unifies, simplifies, and generalizes, the separate techniques previously required to build a complete software pipeline from a local schedule computed by modulo scheduling. In the setting of this software pipeline construction scheme, we demonstrate a simple way of implementing modulo expansion. Then we introduce inductive relaxation, a technique that supplements modulo expansion when the variable to expand is a simple induction. These techniques do not require any architectural support from the target processor, and have been extensively tested as part of the software pipeliner that comes with the 3.0 compiler releases for the Cray
T3E™ massively parallel computer.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Reducing Synchronization Overhead for Compiler-Parallelized Codes on Software DSMs


Hwansoo Han, Chau-Wen Tseng, and Pete Keleher
Department of Computer Science
University of Maryland
College Park, MD 20742
hshan@cs.umd.edu, tseng@cs.umd.edu, keleher@cs.umd.edu

Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. This paper investigates three compiler and run time techniques for reducing synchronization overhead: 1) compile-time elimination of barriers, 2) nearest-neighbor synchronization, and 3) increasing coherence unit size. Our compile-time barrier elimination algorithm extends previous techniques by exploiting chunk iteration partitioning with local subscript analysis and exploiting delayed updates in lazy-release-consistency DSMs to eliminate barriers for cross-processor anti-dependences. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 10% on average and by up to 40% for some applications.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Simplifying Control Flow in Compiler-Generated Parallel Co
de

John Mellor-Crummey and Vikram Adve
Department of Computer Science
M/S 132
Rice University
6100 Main Street
Houston, TX 77005-1892
johnmc@rice.edu, adve@rice.edu

Optimizing compilers for high-level languages such as High Performance Fortran (HPF) perform a sequence of complex code generation steps to synthesize an efficient parallel program. It is possible to reduce compiler software complexity and to avoid repeated global reanalysis by having each code generation step operate with only partial information about the shape of the code that other code generation steps produce. As a consequence, the initial generated code may contain excess conditional tests and control flow that can be eliminated by exploiting contextual knowledge about the generated code. Here, we describe a powerful algorithm that computes symbolic constraints on values of integer variables. Constraints arise from loop bounds and strides, conditional branches, and synthetic assertions. A second algorithm uses these constraints to simplify control flow. These algorithms have been implemented in the Rice dHPF compiler. We present a preliminary experimental evaluation which shows that these algorithm are effective in reducing the number of conditionals, code size, and overall execution time for code generated by dHPF. We also describe the synergy between control flow simplification and other optimizations in the compiler that together achieve the effects of specific optimizations for data-parallel programs such as vector message pipelining and use of overlap areas.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Concurrent Static Single Assignment Form and Constant Propagation for Explicitly Parallel Programs


Jaejin Lee, Samuel Midkiff, and David A. Padua
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61802
j-lee@cs.uiuc.edu, padua@cs.uiuc.edu

Static Single Assignment (SSA) form has shown its usefulness for powerful code optimization techniques, such as constant propagation, of sequential programs. We introduce Concurrent Static Single Assignment (CSSA) form and the transformation algorithm for the explicitly parallel programs with interleaving semantics and post-wait synchronization. The parallel construct considered in this paper is cobegin/coend. A new concept, p-assignment, which summarizes the information of interleaving statements among threads, is introduced. Concurrent Control Flow Graph, which contains the information of conflicting statements in addition to control flow and synchronization information, is used as an intermediate representation for the CSSA transformation. A parallel extension of the Sparse Conditional Constant propagation algorithm based on the CSSA form makes it possible to apply the constant propagation optimization to explicitly parallel programs.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Identifying DEF/USE Information of Statements that Construct and Traverse Dynamic Recursive Data Structures


Yuan-Shin Hwang and Joel Saltz
Department of Computer Science
University of Maryland
College Park, MD 20742
shin@cs.umd.edu, saltz@cs.umd.edu

Pointer analysis is essential for optimizing and parallelizing compilers. It examines pointer assignment statements and estimates pointer-induced aliases among pointer variables or possible shapes of dynamic recursive data structures. However, previously proposed techniques perform pointer analysis without the knowledge of traversing patterns of dynamic recursive data structures to be constructed. This paper presents an algorithm to identify the traversing patterns of recursive data structures and propagate this information back to those statements that define the data structures. This approach recognizes the DEF/USE relationships between the statements that define and traverse dynamic recursive data structures. The outcome of this technique will be useful for pointer analysis and parallelization.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Code Generation for Complex Subscripts in Data-Parallel Programs


J. Ramanujam, Swaroop Dutta, and Arun Venkatachar
Department of Electrical and Computer Engineering
Louisiana State University
Baton Rouge, LA 70803-5901
swaroop@ee.lsu.edu, jxr@ee.lsu.edu, arun@ee.lsu.edu

Data parallel languages like High Performance Fortran, demand efficient compile and run-time techniques for tasks such as address generation. Array references with arbitrary affine subscripts can make the task of compilers for such languages highly involved. This paper deals with the efficient address generation in programs with array references having two types of commonly encountered affine references, namely coupled subscripts and subscripts containing multiple induction variables (MIVs). Methods discussed in the paper utilize the repetitive pattern of the memory accesses. In the case of MIV, we address this issue by presenting runtime techniques which enumerate the set of addresses in lexicographic order. Our approach to the problem incorporates a general approach of computing in O (k) time, the start element on a processor for a given global start element. Several methods are proposed and evaluated here for generating the access sequences for MIV based on problem parameters. With coupled subscripts, we present two construction techniques, namely searching and hashing which minimize the time needed to construct the tables. Extensive experiments were conducted and the results were then compared to indicate the efficiency of our approach.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




The SPNT Test: A New Technology for Run-Time Speculative Parallelization of Loops


Tsung-Chuan Huang and Po-Hsueh Hsu
Department of Electrical Engineering
National Sun Yat-Sen University
Taiwan
R.O.C.
tch@mail.nsysu.edu.tw
Telephone: +886-7-5252000 Ext. 4140
Fax: +886-7-5254199

The only way for parallelizing compilers to exploit potential parallelism of loops in which dependence information is inadequate statically is using run-time loop parallelization technique. There are two approaches in this field: the inspector-executor method and the speculative DOALL test. For the former approach, there always incurs heavy preprocessing overhead during inspector phase and synchronization barrier burden besides load imbalance impact in executor. In this paper, a new proposal for a highly practicable speculative parallelization test, the SPNT test (Speculative Parallelization with New Technology), is presented. Speculative parallel execution as DOALL actually obtains the biggest speedup if the loop is in fact a DOALL loop, otherwise, it will suffer rather extent penalty. The objective of SPNT test is twofold. The first is to increase the success rate by ignoring avoidable dependence restrictions. The second is to reduce the failure penalty by detecting the unavoidable data dependences and quitting the speculative parallel execution as soon as possible. As the result, the SPNT test can greatly improve the practicability of the speculative parallel execution.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




A Systematic Approach to Branch Speculation


Gianfranco Bilardi
DEI
Università di Padova
ITALY

Alex Nicolau
University of California at Irvine
Irvine, CA

Joe Hummel
University of California at Irvine
Irvine, CA

A general theoretical framework is developed for the study of branch speculation. The framework yields a systematic way to select the schedule in a given set that, for any (estimated) bias of the branch, minimizes the expected execution time. Among other things, it is shown that in some cases the optimal schedule is neither of those resulting from aggressively speculating on any given outcome of the conditional. Our results can be useful in either static or dynamic approaches. We propose a simple run-time estimator for the bias and discuss how to combine it with schedule selection. A number of examples motivate and illustrate the techniques, and show that our approach yields better performance in the case of highly unpredictable branches.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Analysis and Optimization of Explicitly Parallel Programs Using the Parallel Program Graph Representation


Vivek Sarkar
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139
vivek@lcs.mit.edu

Major changes in processor architecture over the last decade have created a demand for new compiler optimization technologies. Optimizing compilers have risen to this challenge by steadily increasing the uniprocessor performance gap between optimized compiled and unoptimized compiled code to a level that already exceeds the performance gap between two successive generations of processor hardware. These traditional optimizations have been developed in the context of sequential programs—the assumption of sequential control flow is intrinsic to the definition of basic optimization data structures such as the control flow graph (CFG), and pervades all the optimization algorithms.

As more and more programs are written in explicitly parallel programming languages, it becomes essential to extend the scope of sequential analysis and optimization techniques to explicitly parallel programs. This extension is necessary for maintaining single-processor performance in parallel programs and also for adapting the parallelism in the program to the target parallel machine. By an explicitly parallel programming language, we mean a programming language that contains primitives for creating, terminating, and synchronizing concurrent (logical) threads of activity. The primitives may be in the form of syntactic extensions (e.g., cobegin-coend [Han75]), directives (e.g., HPF [Hig92]), or library calls (e.g., start (), run(), wait (), etc. In Java). In this paper, we assume that the parallel program is translated to the parallel program graph (PPG) representation [Sar92, SS93] and focus our attention of performing compiler analysis and optimization on PPGs.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Program Optimization for Concurrent Multithreaded Architectures


Jenn-Yuan Tsai
Department of Computer Science
University of Illinois
Urbana, IL 61801

Zhenzhen Jiang and Pen-Chung Yew
Department of Computer Science
University of Minnesota
200 Union Street SE
Minneapolis, MN 55455
Telephone: (612) 625-7876
jiang015@maroon.tc.umn.edu

Pen-Chung Yew
Department of Computer Science
University of Minnesota
200 Union Street SE
Minneapolis, MN 55455
Telephone: (612) 625-7387
yewxx001@gold.tc.umn.edu

This paper studies some compiler and program transformation techniques for concurrent multithreaded architectures, in particular the superthreaded architecture, which adopts a thread pipelining execution model that allows threads with data dependencies and control dependencies to be executed in parallel. In the study, we identify several important program analysis and transformation techniques that allow the superthreaded architecture to exploit more parallelism in programs with less run-time overhead. We evaluate the performance to the superthreaded architecture and the effectiveness of the program transformation techniques by manually compiling several benchmark programs and running them through a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve good speedups for most of the benchmark programs with the proposed program transformation techniques applied.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Interactive Compilation and Performance Analysis with URSA MINOR


Insung Park, Michael Voss, Brian Armstrong, and
Rudolf Eigenmann (eigneman@ecn.purdue.edu)
School of Electrical and Computer Engineering
Purdue University
1285 Electrical Engineering Bldg.
West Lafayette, IN 47907-1285
Telephone: (765) 494-1741
Fax: (765) 494-6440
eigneman@ecn.purdue.edu

This paper proposes solutions to two important problems with parallel programming environments that were not previously addressed. The first issue is that current compilers are typically black-box tools with which the user has little interaction. Information gathered by the compiler, although potentially very meaningful for the user, is often inaccessible or hard to decipher. Second, compilation and performance analysis tools are not well integrated. While there are many advanced instruments for gathering and browsing performance results of a program, it is difficult to relate this information to the source program, to the applied program transformations, and to the compiler’s reasoning.

The URSA MINOR tool addresses these issues. The tool is designed to help understand the structure of a program and the information gathered by a compiler in an interactive way. It facilitates the comparison of performance results under different environments and the identification of potential parallelism, and it provides a repository for this information. URSA MINOR is built using the Polaris compiler infrastructure. We present case studies that show how programmers can use the tool to find additional parallelism in a compiler-optimized program and to characterize the performance of parallel applications. The tools are currently being used in several projects to develop and study parallel applications and to evaluate parallelizing compilers. These efforts provide feedback for improving the URSA MINOR tool.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




A Compiler Abstraction for Machine Independent Parallel Communication Generation


Bradford L. Chamberlain, Sung-Eun Choi, and Lawrence Snyder
Department of Computer Science and Engineering
Box 352350
University of Washington
Seattle, WA 98195-2350
Telephone: (206) 543-0374
brad@cs.washington.edu, sungeun@cs.washington.edu, snyder@cs.washington.edu

In this paper, we consider the problem of generating efficient, portable communication in compilers for parallel languages. We introduce the IRONMAN abstraction and approach, which separates data transfer from its implementing communication paradigm. This is done by annotating the compiler-generated code with legal ranges for data transfer in the form of calls to the IRONMAN library. On each target platform, these library calls are instantiated to perform the transfer using the machine’s optimal communication paradigm. We confirm arguments against generating message passing calls in the compiler based on our experiences using PVM and MPI-—specifically, the observation that these interfaces do not perform well on machines that are not built with a message passing communication paradigm. The overhead for using IRONMAN, as opposed to a machine-specific back end, is demonstrated to be negligible. We give performance results for a number of benchmarks running with PVM, MPI, and machine-specific implementations of the IRONMAN abstraction, yielding performance improvements of 2-13% without any source code or compiler modifications.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Static Analysis of Recursive Data Struc
tures

D.K. Arvind and T.A. Lewis
Department of Computer Science
The University of Edinburgh
Mayfield Road
Edinburgh EH9 3JZ
SCOTLAND

The Static analysis of imperative languages is made difficult by the unrestricted pointer structures which makes their static analysis difficult and imprecise. A static analysis method is proposed for a class of graph structures with commuting and inverse edges. The method hinges on a path normalization scheme which can be described as a Markov algorithm. This provides a general framework for static analysis of other structures with different relations over the edges. The method is illustrated with an example.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Reuse-Driven Tiling for Data Locality


Jingling Xue
Department of Mathematics, Statistics,
and Computing Science
University of New England
Armidale 2351
AUSTRALIA
xue@turing.une.edu.au

Chua-Huang Huang
Department of Computer and
Information Sciences
The Ohio State University
Columbus, OH 43210-1277
chh@cis.ohio-state.edu

This paper studies the problem of applying unimodular transformations and tiling to improve the cache locality of a loop nest. Due to dependence constraints and reuse information, not all loops will and can be tiled. Therefore, the tiling approach proposed consists of three phases, which are to determinate (a) a subspace of the iteration space to be tiled, (b) the tile shape, and (c) the tile size. This paper focuses on the first phase. By using cones to represent the dependences and by using vectors to represent the reuse information in the program, a resue-driven approach is presented to improve data locality of the program based on matrix transformations. In the special case of a singly fully permutable loop nest, the data locality problem is formulated as an optimization problem and solved optimally based on the time-cone algorithm introduced in the paper. In the general case, an algorithm is presented that constructs the tiled loop nest incrementally, loop by loop, starting from the outermost one so that as much reuse as possible is carried in the innermost tiled loops.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Experiences with Loop Parallelization in javar (a Prototype Restructuring Compiler for Java)


Aart J.C. Bik, Juan E. Villacis, and Dennis B. Gannon
Computer Science Department
Indiana University
Lindley Hall 215
Bloomington, IN 47495-4101
ajcbik@cs.indiana.edu

In this paper we discuss some experiences with javar, which is a prototype Java restructuring compiler that supports the automatic conversion of serial loops into constructs that use the multi-threading mechanism of the language to exploit any parallelism in these loops. Expressing parallelism in Java itself clearly has the advantage that the transformed program remains portable. After compilation of the transformed Java program into bytecode, speedup can be obtained on any platform on which the Java bytecode interpreter supports the true parallel execution of threads, whereas we will see that the threads only induce a slight overhead on uni-processors.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




The Aggregate Function API: It’s Not Just For PAPERS Anymore


H.G. Dietz, T.I Mattox, and G. Krishnamurthy
School of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907-1285
Telephone: (765) 494-3357
hankd@ecn.purdue.edu, tmattox@ecn.purdue.edu, gayathri@ecn.purdue.edu
http://garage.ecn.purdue.edu/~papers/

The concept of “data parallelism” is a pervasive force throughout parallel processing. Although a certain level of processing-element autonomy can help performance, the fact is that many parallel algorithms, applications, and compiler analysis techniques focus on identifying a set of data objects that can be processed using loosely synchronous parallelism. Thus, it is not surprising that a large number of communication libraries support at least a few synchronized aggregate operations on data. Over the past few years, we have developed eleven different types of PAPERS (Purdue’s Adapter for Parallel Execution and Rapid Synchronization) hardware specifically to efficiently implement aggregate functions for clusters of PCs or workstations.

The Aggregate Function Application Program Interface (AFAPI) library was initially designed to be a portable high-level interface to the various types of PAPERS cluster hardware, so one would expect it to work well using this custom hardware, and it does work well. In this paper, we show that the AFAPI is also an efficient programming model for other types of parallel systems—especially shared memory multiprocessors. For many operations, AFAPI can outperform threads libraries and other more traditional shared memory programming models.


Schedule || Abstracts || Forms || Addresses || Contacts || Overview




Program Analysis of Overlap Area Usage in Self-Similar Parallel Programs


Aaron Sawdey
Electrical Engineering Department
University of Minnesota
4-174 EE/CSci
Minneapolis, MN 55455
Telephone: (612) 625-3583
sawdey@maroon.ec.umn.edu

Matthew O’Keefe
Electrical Engineering Department
University of Minnesota
4-174 EE/CSci
Minneapolis, MN 55455
Telephone: (612) 625-6306
okeefe@ee.umn.edu

Highly parallel computers have the memory capacity and potential speed to perform very high-resolution time-dependent calculations. Parallel computers with hundreds of fast processors will require highly scalable algorithms to avoid wasting expensive resources. To fully exploit scalable algorithms on these highly parallel machines careful attention must be paid to program design. We have proposed a programming model that can express a class of scaleable algorithms. In this paper we show how compiler analysis can ease the task of writing programs within this programming model. We have used this programming model manually and have achieved good results in reducing communication and synchronization overhead. We describe early results of a prototype compiler transformation tool that applies this analysis to real programs.

Schedule || Abstracts || Forms || Addresses || Contacts || Overview


[Future Symposia and Workshops] [Past Symposia and Workshops] [Supercomputing Institute Homepage]


URL: http://www.msi.umn.edu/general/Symposia/Compiler/abstracts.html


webmaster@msi.umn.edu
Last modified: August 5, 1997 at 8:36 AM