ABSTRACT

Title of thesis: AN INTEGRATED PROGRAM REPRESENTATION FOR LOOP OPTIMIZATIONS

Greeshma Yellareddy, Master of Science, 2011

Thesis directed by: Professor Rajeev Barua, Department of Electrical and Computer Engineering

In spite of all the advances, automatic parallelization has not entered the general-purpose compiling environment, for several reasons. There have been two distinct schools of thought in the parallelization domain, namely affine and non-affine, which have remained incompatible with each other over the years. Thus, a good practical compiler will have to be able to analyze and parallelize any type of code: affine, non-affine, or a mix of both. To achieve the best performance, compilers will have to derive the order of transformations best suited to a given program on a given system. This problem, known as "phase ordering", is a crucial impediment for practical compilers, more so for parallelizing compilers. The ideal compiler should be able to consider various orders of transformations and reason about their performance benefits. Towards such a compiler, in this paper we propose a unified program representation with the following characteristics:

- Modular in nature.
- Able to represent both affine and non-affine transformations.
- Able to use detailed static run-time estimators directly on the representation.

AN INTEGRATED PROGRAM REPRESENTATION FOR LOOP OPTIMIZATIONS

by Greeshma Yellareddy

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Master of Science, 2011

Advisory Committee:
Professor Rajeev Barua, Chair/Advisor
Professor Bruce Jacob
Professor Shuvra Bhattacharyya

© Copyright by Greeshma Yellareddy, 2011

Table of Contents

List of Tables
List of Figures
1 Introduction
2 Related Work
3 Program Representation
   3.1 Overview
   3.2 PDG
   3.3 LNG
4 Representation of Transformations
   4.1 Loop Interchange
   4.2 Loop Fusion
   4.3 Reduction
   4.4 DSWP
5 Results
6 Conclusion
Bibliography

List of Tables

5.1 Speedup on x86 24-core machine for the Polybench benchmarks
5.2 Speedup for Affine and Non-affine Transformations

List of Figures

3.1 Example program with the PDG and LNG
4.1 Example for Interchange
4.2 Example for Fusion
4.3 Example for DSWP
4.4 PDG and LNG for DSWP
5.1 Example for Affine and Non-affine Transformations

Chapter 1
Introduction

In the multi-core era, most of the world's computers are parallel, but most software remains serial.
Given the huge investment in existing serial code worldwide, and that rewriting serial code into parallel code is time-consuming, error-prone, and expensive, there is a great need for automatic parallelization and cache optimization of serial code. Some success has been seen for programs with affine array references [18][5]; an array reference is said to be affine if its indices are linear combinations of loop induction variables. For example, A[2i + 3j + 7, i - 2] is affine, but A[i^2] is not. Affine accesses are particularly well suited for parallelism and cache optimizations. However, only scientific and media programs are predominantly affine; and even those sometimes have small amounts of non-affine code in otherwise affine loops, ruining performance. Non-affine methods have also seen some work [6][12], primarily using graph-based methods.

Despite decades of research, the results of automatic parallelization remain somewhat disappointing. We identify two main reasons. First, existing affine and non-affine methods remain fundamentally incompatible because of different internal representations of candidate transformed code. Since most real-world code is a mix of affine and non-affine code, neither class of parallelizers has been able to conquer general-purpose code. Second, and just as important, existing compiled parallel code finds it hard to attain even a reasonable fraction of the peak performance of a parallel computer. Parallel code that is manually tuned by a highly trained computer engineer almost always yields code that is superior, often by a large margin, to automatically parallelized code.

The root cause of the superior performance of manually tuned code is that a human programmer is often able to deduce program optimization transformations that a compiler is unable to find. In theory, the compiler should be able to find the transformation order chosen by the human; in most cases, both use the same toolkit of program transformations to achieve lower run-time. Such human-applied or compiler affine transformations include tiling, loop interchange, fission, fusion, reversal, skewing, interleaving, peeling, and strip-mining. Non-affine transformations [12][13] include Decoupled Software Pipelining (DSWP), Parallel-Stage DSWP (PS-DSWP), Cyclic Multi-Threading (CMT+), and Control Speculation. These transformations are crucial to performance: they increase parallelism by breaking loop-carried dependencies, improve cache locality, or both.

Unfortunately, finding a good transformation order and good transformation parameters (e.g., tile sizes) is a challenging problem. This is the well-known "phase ordering" problem in compilers: a sequence of transformations must be applied in precisely the right order, with the right parameter values if applicable, to exactly the same loop dimensions to achieve the best run-time. Although this is a somewhat important problem for serial program transformations, it is crucial for transformations used for parallelization. For example, we have encountered a loop in the gemver benchmark in the Polybench benchmark suite for which the best transformation order is a loop interchange, followed by a fusion, followed by a strip-mining, followed by another loop interchange on different dimensions. No shorter transformation order gives a run-time that is even close to the best order.
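To make the affine/non-affine distinction above concrete, the following small C fragment (ours, purely illustrative and not from any benchmark; array sizes are chosen only so the accesses stay in bounds) contains both kinds of references:

    #define N 10
    #define M 10

    double A[2*N + 3*M + 8][N];  /* sized so the affine access below is in bounds */
    double B[N * N];

    void affine_vs_nonaffine(void) {
        /* Affine reference: both indices, 2*i + 3*j + 7 and i - 2, are
         * linear combinations of the induction variables i and j. */
        for (int i = 2; i < N; i++)
            for (int j = 0; j < M; j++)
                A[2*i + 3*j + 7][i - 2] = 0.0;

        /* Non-affine reference: i*i is quadratic in i, so affine analyses
         * cannot summarize its dependence pattern. */
        for (int i = 0; i < N; i++)
            B[i*i] = 1.0;
    }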
We have found several benchmarks where the run-time of a basic parallelizer without transformations is improved by a factor of 5-10X by a carefully constructed, manually discovered sequence of transformations.

To effectively solve the phase-ordering problem, a compiler must be able to apply mixed sequences of affine and non-affine transformations together. This is because real-world programs often belie easy categorization as just "non-affine" or just "affine". In reality, most programs are a mix of affine and non-affine code, with the proportion of affine code higher in scientific domains and lower in the general-purpose domain. For example, scientific codes are often mostly affine, but may have small amounts of non-affine code that may introduce loop-carried dependencies, such as a printf() or other calls with side effects, pointer accesses, or heap data structure accesses. Existing affine compilers today tend to be very limited in the scope of programs they can transform. They are infamously "brittle", breaking on codes that were not written to perfectly match what they can handle. The result is that affine parallelizers fail on such codes.

Non-affine parallelizers are also hobbled: they can handle any code, but cannot apply affine transformations. This limits them, since affine methods can exploit such codes in a much more scalable fashion. The result is that non-affine methods yield lower performance than is possible with a truly integrated affine + non-affine method.

The above two shortcomings make it clear that combined compiler methods that handle affine and non-affine transformations together are desirable, since they are likely to achieve higher performance in the compiled code. However, composing both types of transformations in a single compiler, to be used simultaneously on the same program loop, is very challenging. Existing compilers do not do this.

The fundamental difficulty in integrating affine and non-affine transformations is that they use different and incompatible program representations. Good program representations are essential to the success of compilers. For example, the control-flow graph and dataflow representations such as SSA are invaluable for optimization even though they can be derived from the compiler's intermediate representation (IR). This is because they make the underlying control- and data-flow information explicit, thus decoupling the discovery of such information from its use. Such decoupling allows for the modular design of a compiler, and avoids unnecessary re-computation of analysis results in every optimization pass. Similarly, the well-known Program Dependence Graph (PDG) is widely used for parallelization since it makes both types of dependences, control and data, explicit. This is useful even though everything in the PDG can be derived from the control-flow graph and dataflow representations.

In a similar vein, there is a need for a unified program representation that can make explicit the information needed for both affine and non-affine transformations. Unfortunately, existing program representations for each are unusable by the other. As a result, affine and non-affine transformations remain hard to combine. To understand why, we consider both types of representations below.

Affine mathematical representations represent affine array references and the loop-carried dependencies between them using custom mathematical representations for affine code, such as matrices in traditional methods [18, 7] and systems of linear inequalities in polyhedral methods [5].
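For instance (a small worked example in our notation, following the unimodular treatment used later in Chapter 4), a loop interchange on a two-deep nest is the unimodular matrix that swaps the two index dimensions; it transforms each distance vector by multiplication, turning (1, 0) into (0, 1):

    \[
    T_{\mathrm{interchange}} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
    \qquad
    T_{\mathrm{interchange}} \begin{pmatrix} 1 \\ 0 \end{pmatrix}
      = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.
    \]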
These representations have the advantage that they contain the detailed information needed to apply affine transformations. Further, since they contain little other information, they are fast to update when the compiler is searching through a very large number of possible transformation sequences. The loop's IR is converted to the mathematical representation before the transformation search commences. Thereafter, the IR is not used during the search. Only when the search has found the best transformation sequence is the IR regenerated, so that code generation can commence from it.

Unfortunately, after an affine transformation is applied, existing affine representations do not represent the program explicitly at all. Indeed, the result of applying an affine matrix transformation is not represented in any general-purpose representation that non-affine transformations can reason about. Instead, affine-transformed code exists only in the mathematical representation, with no concrete representation until final code generation. As a result, if a non-affine transformation is to analyze the result of a set of affine transformations, it has to derive the resultant code by applying these transformations in the IR and then building the relevant program representation it depends on from scratch. This is clumsy, non-intuitive, unnatural, and defeats modularity. To build a truly integrated compiler, it would be desirable if both affine and non-affine transformations explicitly represented code in a way that is useful for either type of subsequent transformation, so that a seamless transformation search is possible. This is the goal of our unified representation.

Another drawback of affine mathematical representations is that they are not complete program representations: they represent only affine references and their loop-carried dependencies, not the rest of the loop. Hence a general-purpose tool to estimate the run-time of a code fragment at compile-time cannot be applied after an affine transform. Such static run-time estimators (SREs) can be a useful tool for comparing candidate transformation sequences to choose the best.

Non-affine representations There is less standardization in the literature on representations for non-affine transformations. Methods such as DSWP [12] use the well-known Program Dependence Graph (PDG) representation. Earlier methods, such as those in the Parafrase compiler, use the Closure of Data and control Dependencies Graph (CDDG) [11, 10]. Both the PDG and the CDDG represent the instructions in the program along with the data and control dependencies between them. They are suitable for non-affine transformations, which often attempt to break (or speculate on) dependencies; applying transformations is therefore quick at compile-time, since they usually only need to modify the affected dependencies. Hence transformation orders can be searched quickly.

Unfortunately, graph-based representations such as the PDG have their own drawback: they do not represent affine array indices or affine distance vectors, which are crucial for performing affine transformations. Further, there is no easy means of representing affine transformations such as loop interchange, which changes the way a loop nest is structured, or loop reversal, which changes the direction in which a loop is traversed, because such information is not explicit in the PDG. The PDG has no special way of representing structured loops such as for loops with bounds, induction variables, and increments.
Whereas this kind of information may not be required by purely non-affine techniques, it is essential for affine analysis.

Hence graph-based representations are not suitable for applying mixed sequences containing both affine and non-affine transformations either. They have no quick way of deriving the affine information required for affine analysis, and no quick way of updating affine information after a non-affine transformation. Of course, the IR can be rewritten to regenerate affine information after each transformation in the search, but that would be too slow to be feasible given the large number of transformations a search might attempt.

Discussion The end result of the drawbacks of the existing representations above is that there is no unified explicit program representation today that is suitable for applying both affine and non-affine transformations in a unified manner. There is a need for a single integrated compiler framework that enables building parallelizing and cache-optimizing compilers for any type of code: affine code, non-affine code, or any mix of the two. It should robustly apply the best possible transformation order for each program by quickly searching through a large number of transformation sequences. A unified explicit program representation would greatly facilitate this goal.

We propose a new representation for programs that maintains the results of both affine and non-affine transforms explicitly, so that either type of transform can use it for subsequent transforms. Based on the combination of a new loop representation called the Loop Nest Graph (LNG) and the well-known Program Dependence Graph (PDG), it has several features that are desirable when searching for program transformations in any compiler, but particularly in parallelizing or cache-optimizing compilers. While the PDG maintains information about the instructions within a loop and the non-affine dependencies between those instructions, the LNG (which is built on top of the PDG) maintains information about the loops in a loop nest and the affine dependencies across them. The advantages of this representation include:

Modular One of the biggest advantages of using the LNG + PDG representation is the decoupling of the representation-building phase from the analysis phase. Intuitively, this is similar to how a PDG decouples parallelization methods from methods to construct the PDG. It is also similar to how an SSA representation decouples dataflow optimizations from SSA construction. As a result, the analysis phase can focus on algorithms that use this information directly, without having to build the individual representations itself.

Recording the effects of transformations directly Another advantage of our representation is the ability to reflect the effects of both affine and non-affine transformations in a unified framework, such that any future analysis can use the information directly instead of going back to the IR. We do this by recording the transformations directly in the LNG + PDG representation. The updates, however, are not intended to maintain complete information about the transformations, and as a result they make minimal changes, often updating only one of the two structures.

Use in detailed SREs The LNG + PDG forms a natural representation for use by a detailed static run-time estimator (SRE) to estimate run-time, taking into account detailed system characteristics such as cache, memory, network, synchronization, and pipeline characteristics.
Use of an SRE helps the compiler evaluate and choose between different compositions of transformations. Using a compiler framework capable of deriving candidate orders of transformations, and a good SRE to choose between them, we can foresee automatic compilation results approaching the performance of manually tuned code by using the characteristics of the target machine in the transformation search.

Current program representations have limitations that inhibit their use by SREs. Affine mathematical representations, such as distance vectors and polyhedral representations, are not complete program representations: they represent only affine references and their loop-carried dependencies, not the rest of the loop. As a result, SREs cannot be applied to them. On the other hand, although the PDG maintains complete enough information about the instructions in a loop, the information about structured for loops needed for affine run-time estimation, such as loop-nest structure and bounds, is not explicit in the PDG. Instead, the SRE would need to recover it, violating the modular design of SREs and PDGs. In contrast, the LNG and PDG combination has explicit information for both types of parallelizers, enabling modular SREs.

Although the focus of this paper is not the transformation search but the program representation used for it, a prototype search method using our representation has been built. Results show that it finds transformation orders among a selection of both affine and non-affine transformations without going back to the IR.

Chapter 2
Related Work

Compilers rely on low-overhead program representations when they need to evaluate different orders of transformations. Several program representations have been proposed for parallel compilers to aid the analysis and representation of transformations. These representations are motivated by the kinds of analyses and the kinds of loop optimizations considered by the compilers.

An interesting variant of this approach of using program transformations is iterative compilation [17, 1]. Instead of relying on a common intermediate program representation to derive and evaluate different compositions of transformations, the iterative compilation strategy involves generating code for various orders of transformations and executing it to compare performance. Since it involves going back to the IR for every candidate order of transformations, iterative compilation can have a significant compile-time overhead. Although some methods reduce the compile time by bringing down the number of choices considered, through a limited set of heuristics [17] or machine-learning algorithms [1], the compile time is still significant enough to restrict its use in practical compilers. Our use of a low-overhead means of representing just the essential information in the program helps us overcome this high compile-time problem. With an accurate SRE, our representation will provide the same benefits as iterative compilation with orders of magnitude less compile time.

For affine programs, distance vectors [20, 3] have been used by the SUIF compiler [18] to analyze and parallelize affine code. A distance vector represents a memory dependence between two array accesses across the iteration space of their common loop nests. For example, the distance vector (2, 0) means that each outer-loop iteration depends on the iteration that is two before it, and the inner-loop iterations are independent.
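As an illustrative sketch (ours, not taken from the SUIF work), a loop nest whose only memory dependence has distance vector (2, 0) looks like this in C:

    #define N 100
    #define M 100
    double A[N][M];

    void dist_2_0(void) {
        /* A[i][j] is computed from A[i-2][j]: the dependence distance is
         * 2 in the outer (i) dimension and 0 in the inner (j) dimension,
         * i.e., distance vector (2, 0). The inner (j) iterations carry no
         * dependence and could run in parallel. */
        for (int i = 2; i < N; i++)
            for (int j = 0; j < M; j++)
                A[i][j] = A[i-2][j] + 1.0;
    }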
The collective set of such distance vectors, one for every memory dependence in the loop, forms the basis for analyzing the loop for various affine loop transformations such as loop interchange, loop reversal, and loop skewing. One of the biggest advantages of this representation is that the analysis and representation of these affine loop transformations can be abstracted into simple matrix transformations.

Unfortunately, one limitation of the representation is that merely updating distance vectors cannot represent the effect of non-unimodular loop transformations such as loop fusion, loop fission, and loop peeling. As a result, heuristics that are based solely on distance vectors for evaluating different orders of transformations cannot use these transformations directly. Wolf et al. [19] propose an algorithm where fission and fusion are considered in the initial and final phases, while the middle phase of the algorithm uses the distance vectors to compute and evaluate various orders of transformations. Thus, these transformations are never truly part of the evaluation of the various orders. For example, this model cannot come up with a transformation order such as interchange followed by fusion followed by interchange. In our representation, we use distance vectors in combination with the LNG and the PDG to create a hierarchical loop dependence graph, through which we can represent all these transformations in any order.

The polyhedral representation [4] is the other significant program representation for affine code; it has been used by the PLUTO compiler [5]. Polyhedral methods [8] represent each statement in an affine loop separately as a point in an iteration domain. The iteration space thus defined is a polyhedron in a d-dimensional space, where d is the nesting depth of the loop in question. The polyhedral representation represents complex compositions of affine transformations as a scheduling function. Although the polyhedral model is very powerful, it has a serious drawback: since it translates affine code into a series of linear inequalities, this mathematical representation has no way of representing non-affine code. As a result, it is inapplicable to mostly-affine code that contains small amounts of non-affine code, which is common in real-world codes. Further, we are able to represent every transformation that can be represented in the polyhedral framework by means of the LNG + PDG combination. Indeed, many of the program transformations that polyhedral papers describe as not handled by the traditional model can be found using the mixed series of unimodular and non-unimodular transformations we considered with our representation.

Dependence-graph representations such as the PDG [9], the CDDG [11, 10], and the parallel program graph [16] are often used for non-affine transformations. The Parafrase-2 compiler [15] uses the CDDG to partition and schedule a given program. The CDDG is a hierarchical structure that enables detecting parallel tasks at various levels; however, it is an immensely complicated structure to make changes to, and thus is not suitable for representing transformations directly in the program representation, which is an essential feature if we want to apply several transformations. The PDG is another dependence-graph representation, one that is particularly conducive to incremental optimizations. The PDG encapsulates useful information that helps detect opportunities for optimization, vectorization, and parallelization.
The main drawback of all graph-based representations is that they are restricted to non-affine transformations; for example, the PDG has been used for transformations such as DSWP, PS-DSWP, and Control Speculation [12, 13]. We extend the PDG for affine analysis by combining it with the LNG.

A related representation is the PPG [16], which builds on the PDG to represent already-parallel programs. The PPG extends the serial PDG representation by adding parallel control-flow edges and synchronization edges for parallel programs. Unlike the PPG, our aim is to detect parallelism opportunities rather than represent parallel programs; hence our representation is catered to representing the serial program. Through the LNG + PDG representation, we seek to extend the PDG to encompass different kinds of transformations.

Chapter 3
Program Representation

3.1 Overview

The most important and challenging factor motivating the development of our program representation is the ability to analyze and represent both affine and non-affine transformations. Affine and non-affine methods have distinctly different characteristics. Whereas affine analyses use mathematical representations such as distance vectors to analyze and represent transformations, non-affine analyses rely on dependence graphs such as the CDDG and the PDG. The lack of any obvious commonality between the two methods, as reflected in existing technologies, poses a major challenge.

Understanding the commonalities and differences in the features that affine and non-affine transformations look for is the key to a unified representation. A good representation relies on the ability to make explicit the information needed by both these domains. Though the kinds of analyses and transformations considered by the two domains are seemingly different, they share a common requirement: the ability to understand the inherent dependencies in the program, particularly within a loop. The difference lies in the way the dependencies are studied and interpreted by the different transformations. Non-affine transformations analyze a dependence graph to partition it into several independent or less-dependent tasks. Affine transformations, on the other hand, aim to run loop iterations in parallel with other iterations by exploiting the regular dependence patterns encapsulated in distance vectors. To do so, they use loop restructuring, iteration-space reordering, and loop-nest reordering. As a result, non-affine transformations require the ability to understand all the dependencies within a loop, whereas affine transformations require the ability to understand and represent the characteristics relevant to the loop nest, such as the loop-nest structure and the loop-carried dependencies across all the different levels of the loop nest. Hence the difference in the representations needed.

There is a need for a representation that can act as a bridge between these two distinct sets of needs, helping realize a truly integrated program representation. The integrated program representation should be capable not only of maintaining the information relevant to both the affine and non-affine methods, but also of representing the effects of transformations without going back to the IR. Current representations cater exclusively to either affine or non-affine methods, and therefore maintain information relevant only to those methods. Combining the two representations, the PDG and distance vectors, though it seems intuitive, is also insufficient for our purpose.
Neither the PDG nor the distance vectors can directly reflect the changes made by the other's set of transformations; the two representations are inherently incompatible with each other.

We propose a new loop representation, the Loop Nest Graph (LNG), which in combination with the PDG and distance vectors has all the above features. The LNG acts as the bridge between the affine-catering mathematical representation (distance vectors) and the dependence-graph representation (PDG). By representing every loop as an individual node, the LNG represents a graph of dependencies between the loops in a loop nest, thus naturally representing the nesting structure in an easily accessible way. Because it treats loops as individual nodes, the LNG also forms a natural place to maintain affine information, such as the loop characteristics of structured loops, and distance vectors as annotations on loop-carried affine dependencies. This representation therefore serves the dual purpose of representing the information required by affine analysis, and of representing the effects of affine transformations in a way that non-affine analyses can understand and update.

This representation forms a two-tiered structure. The PDG is at the lower tier, representing dependence information between individual instructions in a loop, which is useful for non-affine analysis. The LNG, in combination with distance vectors, is at the top tier, maintaining loop and dependence information for a loop nest, which affine analysis can use effectively. Combining the LNG with the PDG and distance vectors thus gives us the flexibility to take advantage of existing research with little or no modification.

3.2 PDG

The PDG [9] is a well-known and powerful representation which is suited to several loop optimizations and parallelization techniques. The PDG has been used for several loop transformations, such as loop peeling and loop unrolling. It is also the basis for parallelization techniques such as DSWP [12].

[Figure 3.1: Example program with the PDG and LNG. Panels (a)-(d) show an example loop nest, its IR (statements S0-S9), the corresponding PDG, and the LNG (loops L1 and L2, with the distance vector (1,0) annotating an affine memory dependence). The legend distinguishes PDG control, data, and loop-carried data dependence edges from LNG control, loop structure, and affine memory dependence edges.]

An important advantage of the PDG is that it can be used to incrementally apply several of these optimizations without having to go back to the IR.

The PDG has several characteristics that make it conducive to the optimizations we are interested in. One of the most important reasons the PDG is particularly relevant is its reliance on dependence information. In basic terms, the presence of a dependence between two instructions implies that there exist constraints on the execution of the two instructions, and the absence of a dependence implies that the two instructions can execute in parallel. As a result, dependencies play a crucial role in exposing parallelism. Since the program dependence graph captures the dependencies in a program instead of emphasizing the flow of control or data, it becomes a natural choice for a compiler interested in parallelization.

Another useful feature of the PDG is its ability to represent dependencies at any level we are interested in: program level, function level, and loop level. Every level captures dependencies relevant only to that level's view. For example, a function-level view of the PDG can capture dependencies between two loops, but may not capture dependencies such as loop-carried dependencies within each of the loops.
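Anticipating the loop-level PDGs defined next, a minimal data-structure sketch (ours; the thesis does not prescribe a concrete implementation, so all type and field names here are hypothetical) of such a graph might be:

    #include <stddef.h>

    /* Kinds of PDG dependence edges described in this section. */
    enum pdg_edge_kind {
        PDG_CONTROL_DEP,           /* u decides whether v executes            */
        PDG_DATA_DEP_REGISTER,     /* u writes a (virtual) register v reads   */
        PDG_DATA_DEP_MEMORY,       /* u and v may access the same location    */
        PDG_LOOP_CARRIED_DATA_DEP  /* data dependence carried across iterations */
    };

    struct pdg_node {              /* one node per instruction in the loop */
        int   id;                  /* e.g., S0..S9 in Figure 3.1           */
        void *ir_instruction;      /* back-pointer into the compiler IR    */
    };

    struct pdg_edge {              /* edge (u, v): v depends on u */
        struct pdg_node   *src, *dst;
        enum pdg_edge_kind kind;
    };

    struct loop_pdg {              /* one PDG per loop in the function */
        struct pdg_node *nodes;  size_t num_nodes;
        struct pdg_edge *edges;  size_t num_edges;
    };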
In our representation, we use the PDG to represent dependencies for every loop in the function (including each level within a loop nest). The loop PDG is the base layer of our representation, maintaining the detailed dependence information of the loop which may be required by certain analyses. Every loop in the function is associated with a PDG representation that maintains dependence information between every pair of instructions in the loop. The loop PDG represents the loop as a graph where every instruction in the loop is a unique node and the edges represent the dependencies between the nodes.

The PDG is a directed graph G_PDG = (V_PDG, E_PDG), where V_PDG = {S0, S1, ..., Sn}, with Si an instruction in the loop, and E_PDG = {(u, v) | u, v ∈ V_PDG and v is dependent on u}.

Figure 3.1(c) shows the PDG for the loop in Figure 3.1(a). Each instruction in the IR shown in Figure 3.1(b) is a node in the PDG. In this example, {S0, S1, ..., S9} is the set of nodes in the PDG. (S1, S2) is an example edge, representing the dependence of instruction S2 on S1.

Different kinds of dependencies are represented in the PDG. Each edge (u, v) in the PDG is associated with one of these dependence types.

PDG Control Dependence Edge (E_PDG^CD): A node v is control dependent on node u if node u determines whether node v executes or not. Using control dependencies instead of control flow helps avoid the fixed sequencing forced by the latter. The control dependence property is derived from the control-flow graph of the loop [9]. For the control-flow graph G_CFG of a loop, (u, v) ∈ E_PDG^CD is a PDG control dependence edge iff (a) there exists a path P from u to v in G_CFG with every w ∈ P post-dominated by v, and (b) u is not post-dominated by v. In Figure 3.1(c), {(S1, S2), (S3, S5)} are examples of control dependence edges in the PDG.

PDG Data Dependence Edge (E_PDG^DD): A data dependence edge denotes the possibility of the two instructions (nodes) accessing the same location. This automatically imposes execution constraints, thus restricting parallelism. Hence data dependencies form important constraints in parallelism detection. Since data dependencies, especially memory dependencies, are difficult to disambiguate, they are often the most crucial parallelism-inhibiting dependencies. Several transformations specifically target breaking or moving these dependencies to enhance parallelism. (u, v) ∈ E_PDG^DD is a PDG data dependence edge iff (a) there exists a path from u to v in G_CFG, (b) u and v can access the same location, and (c) either u or v or both are a write or store operation.

Data dependence edges are further classified into register dependence and memory dependence edges. (u, v) is a register dependence edge if u writes to a register that v reads from (here registers mean virtual registers before register allocation, or physical registers after). In Figure 3.1(c), (S4, S6) is an example register dependence edge. (u, v) is a memory dependence edge if u writes to a memory location that could be read by v. In Figure 3.1(c), (S7, S5) is a memory anti-dependence edge. Data dependence edges are determined using data-flow analysis techniques and the control-flow graph. Eliminating memory dependencies further requires several kinds of memory dependence analyses, alias analyses, and memory disambiguation techniques to prove that a pair of memory operations do not alias.

PDG Loop-carried Data Dependence Edge (E_PDG^LCDD): A special kind of data dependence edge in the loop PDG which has the additional property of being loop-carried in nature.
(S9, S4) is an example of a loop-carried register dependence edge, and (S7, S5) of a loop-carried memory dependence edge, for the memory location B[j].

3.3 LNG

The LNG is our proposed top-level graph, which defines the loop structure of a function. It forms a layer over the loop PDGs, providing information from a loop-nest perspective. The LNG has the following three features:

- It represents every loop as an individual node, making loops top-level structures that are directly accessible for any loop-related information. Representing the loop as a node also helps define the loop-nest structure, which is a useful representation for several affine transformations.

- It makes structured loops explicit. Structured loops such as for loops are the prime candidates for affine analyses. Several characteristics of structured loops, such as induction variables and loop bounds, can be derived precisely and are hence represented directly.

- It encapsulates affine dependencies. For every PDG loop-carried memory dependence edge that is affine, we copy that edge to the LNG and annotate it with its distance vector. This is convenient for affine transformations, since unimodular transformations can be defined solely in terms of the distance vectors of the loop. In addition, non-unimodular transformations can be represented by updating other parts of the LNG + PDG in addition to the distance vectors.

The LNG thus forms a generic program representation representing dependence information from a loop-nest perspective, which is particularly relevant to affine analysis. The biggest advantage of the LNG is its easy compatibility with other dependence-graph structures such as the PDG. Any analysis that needs loop-nest-related information and specific loop attributes can use the LNG directly, without having to derive that information from the IR or the PDG.

The LNG represents the loop nest as a graph where every loop and every instruction is a unique node, with edges that define the loop-nesting relationship and the affine memory dependence relationship between two nodes.

The LNG is a directed graph G_LNG = (V_LNG, E_LNG), where V_LNG is a set of nodes, one for each instruction and loop in the program, and E_LNG is a set of edges (u, v) which have either (a) a loop-nest relationship as defined below or (b) an affine memory dependence.

Figure 3.1(d) shows the LNG for the loop in Figure 3.1(a). {L1, L2, S0, ..., S9} is the set of LNG nodes; {(L1, L2), (L2, S5)} are examples of LNG edges.

There are two kinds of nodes in the LNG:

LNG Loop Node (V_LNG^Loop): Every loop in the function is represented as an LNG loop node. Each loop node further maintains important loop characteristics, such as loop bounds and induction variable characteristics for structured loops. In Figure 3.1(d), V_LNG^Loop = {L1, L2}.

LNG Instruction Node (V_LNG^Inst): Every instruction is represented as an LNG instruction node. In Figure 3.1(d), V_LNG^Inst = {S0, S1, ..., S9}. There is a one-to-one correspondence with the instructions represented in the PDGs. (During implementation, the LNG does not replicate all the information about the instruction in the PDG; instead it contains a structure with a pointer to the PDG instruction. Having the structure allows the LNG to add information, such as extra edges, to the instructions.)

Similar to the PDG, the LNG maintains five types of edges, listed below.

LNG Loop Control Edge (E_LNG^LCE): This type of edge denotes the loop-nest relationship between nodes. There exists an LNG loop control edge from node u to node v if u is the immediate parent loop of v. {(L1, L2), (L2, S5)} are examples of LNG control edges.
An LNG control edge from one loop node u to another loop node v defines the loop-nesting information that v is a child loop of u. An edge from an LNG loop node to an LNG instruction node defines the association of the instruction with its immediate parent loop. Determining the LNG loop control edges involves the well-known dominator analysis used to recognize loops in a program. There exists an LNG control edge (u, v) ∈ E_LNG^LCE iff (a) u ∈ V_LNG^Loop and v ∈ V_LNG^Loop ∪ V_LNG^Inst, and (b) u is the immediate dominator of v.

LNG Affine Memory Dependence Edge (E_LNG^AMDE): Each loop-carried memory dependence between two affine memory access instructions is represented in the LNG with an LNG affine memory dependence edge. These edges are further annotated with distance vectors. Non-affine loop-carried dependencies are not included in this set. In Figure 3.1(d), (S7, S5) is an LNG affine memory dependence edge. The edge is annotated with the distance vector (1, 0) associated with the memory accesses.

In addition to the above two edge types, the LNG maintains three special edges associated with every structured for loop, which we together refer to as the LNG Loop Structure Edges (E_LNG^LSE). We follow the usual definition of for loops: a for loop is a loop whose exit condition is a relational operator comparing an induction variable with a loop-invariant quantity. The three defining characteristics of a for loop are the incoming definition, the update, and the exit condition of the loop's induction variable (IV). These are represented in the LNG by the following three types of edges:

LNG Loop IV Definition Edge (E_LNG^IVDef): A special kind of LNG loop structure edge from a for loop to its induction variable definition before the loop. The set of such induction variable definitions is called V_LNG^IVDef. In Figure 3.1(d), E_LNG^IVDef = {(L2, S2), (L1, S0)} and V_LNG^IVDef = {S0, S2}.

LNG Loop IV Update Edge (E_LNG^IVUp): A special kind of LNG loop structure edge from a for loop to the instruction in that loop that updates its induction variable. By the definition of an induction variable, this updating instruction is unique. The set of such induction variable updates is called V_LNG^IVUp. In Figure 3.1(d), E_LNG^IVUp = {(L2, S8), (L1, S9)} and V_LNG^IVUp = {S8, S9}.

LNG Loop IV Exit Condition Edge (E_LNG^IVECE): A special kind of LNG loop structure edge from a for loop to the instruction in the loop that calculates its exit condition. The set of such induction variable exit condition instructions is called V_LNG^IVECE. In Figure 3.1(d), E_LNG^IVECE = {(L2, S3), (L1, S1)} and V_LNG^IVECE = {S1, S3}.

For convenience's sake, we define V_LNG^LSE = V_LNG^IVDef ∪ V_LNG^IVUp ∪ V_LNG^IVECE and E_LNG^LSE = E_LNG^IVDef ∪ E_LNG^IVUp ∪ E_LNG^IVECE.

Chapter 4
Representation of Transformations

To illustrate how the LNG and the PDG are updated for transformations, we consider four transformations as examples below.

4.1 Loop Interchange

Loop interchange is an important affine loop transformation. Loop interchange is desirable for a number of reasons, such as improving cache performance or increasing the granularity of parallelism. Loop interchange can also act as an enabler transform, enabling other transforms such as fusion to become applicable.
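As a minimal before/after sketch (ours, not the thesis's running example), interchange swaps only the two loop headers while leaving the body untouched; here it turns a cache-unfriendly column-major traversal of a row-major C array into a row-major one:

    /* Before interchange: the inner loop strides down a column of the
     * row-major array A, touching a new cache line on every iteration. */
    void scale(int N, int M, double A[N][M]) {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < N; i++)
                A[i][j] = 2.0 * A[i][j];
    }

    /* After interchange: only the loop headers are swapped; the body is
     * unchanged, and A is now traversed row by row. */
    void scale_interchanged(int N, int M, double A[N][M]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                A[i][j] = 2.0 * A[i][j];
    }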
Traditional analysis for the legality of loop interchange relies on analyzing the distance vectors of the loop nest. The effects of loop interchange are captured in the two representations in the following manner:

Loop interchange is a reordering transformation. It affects only the order in which the loop is executed, without changing the code inside the loop. As a result, the most important effect of interchange is on the induction variable behavior of the two loops. This is reflected in the LNG by interchanging the two loop nodes and the associated LNG loop structure edges. For a given LNG G_LNG, a new G'_LNG is derived for the interchange of Loop1 and Loop2 by interchanging the position of V_LNG1^Loop with V_LNG2^Loop, along with their associated loop structure edges in the LNG. In the example, this would involve interchanging L1 with L2, along with their associated loop structure edges.

To reflect the corresponding changes in the PDG, we use the special loop structure nodes that are maintained in the LNG (V_LNG^LSE). This involves interchanging the control edges associated with the loop structure nodes of the two loops, which means:

- swap the control dependence edges associated with V_LNG1^IVDef and V_LNG2^IVDef;
- similarly, swap the control dependence edges associated with V_LNG1^IVUp and V_LNG2^IVUp;
- similarly, swap the control dependence edges associated with V_LNG1^IVECE and V_LNG2^IVECE.

Because of the change in the order of execution, loop interchange affects the data dependencies in both structures. It changes the order in which memory is accessed, thus changing the dependence patterns (distance vectors) associated with the affine memory dependencies (E_LNG^AMDE). This is reflected in the LNG by performing an interchange on the columns of the distance vectors associated with the loop nest, as described by unimodular theory [2]. In the example of Figure 3.1(d), the two columns of the distance vector (1, 0) would be interchanged to yield (0, 1).

    ∀ edge e = (u, v) ∈ E_LNG^AMDE:
        interchange_distance_vectors(e)

Since each LNG affine memory dependence is a copy of the unannotated dependence in the PDG, when the former changes the latter must be changed as well. When a distance vector component is changed to zero by interchange, the dependence can be removed from the PDG. Thus the dependence edge (S7, S5) is removed after the interchange. If d1, d2 are the depths of the loops interchanged:

    for d ∈ {d1, d2}:
        ∀ edge e = (u, v) ∈ E_LNG^AMDE:
            if distance_vector(e)[d] changes to 0:
                remove_edge(pdg_edge(u, v))
            else if distance_vector(e)[d] changes to non-zero:
                add_loop_carried_edge(pdg_edge(u, v))

where pdg_edge(u, v) is the edge from node u to node v in the PDG associated with the loop at depth d in the loop nest.

Loop interchange does not affect the register dependencies in the PDG, because intra-loop dependencies are not affected by interchanging the loops.

Figure 4.1 shows the effect of interchanging the two loops of the example in Figure 3.1(b). The highlighted nodes in both the LNG and the PDG show the swapped nodes, and the highlighted edge shows the change in the loop-carried memory dependence due to the change in its distance vector.

[Figure 4.1: Example for Interchange. (a) Before Interchange; (b) After Interchange.]

4.2 Loop Fusion

Loop fusion is another well-understood and often-used loop transformation. Loop fusion affects the run-time in two ways. First, fusing two loops brings down the number of loops, thus reducing the synchronization points during parallelization.
Fusion also has a number of cache benefits: it can promote cache and data reuse, thus improving the run-time. Loop fusion is advantageous and desirable for both affine and non-affine methods.

Loop fusion affects both the LNG and the PDG in the loop control structure.

[Figure 4.2: Example for Fusion. The loops "for(i=0; i<N; ++i) C[i] = A[i];" and "for(i=0; i<N; ++i) D[i] = B[i];" are fused into "for(i=0; i<N; ++i) { C[i] = A[i]; D[i] = B[i]; }"; the figure shows the corresponding PDG and LNG structures before and after fusion.]

As with interchange, changes in the LNG guide the changes to the PDG. Fusing two loops can be visualized as fusing the two corresponding LNG loop nodes into one LNG loop node, such that all the dependencies associated with the two loop nodes are now associated with the fused node. Let L1 and L2 represent the two LNG loop nodes and Lfused the fused loop node. For fusion to be legal, the loop structures must be the same for L1 and L2, so we can inherit the loop structure for Lfused from either (say L1). Then we inherit the LNG + PDG nodes and dependences for Lfused by unioning those of L1 and L2. Thus, if Lfused = L1, we inherit the edges from L2 by:

    for e ∈ outedges(L2):
        if e ∉ E_LNG2^LSE:
            change the source of edge e to L1

Representing the fusion of the two loops in the PDG involves fusing the loop structures of the corresponding loop PDGs into one fused PDG. The control dependencies in the PDG that define the loop structure are associated with the exit condition node (V_LNG^IVECE). Instead of replicating the PDG for a fused loop, we use one of the candidate PDGs as our base (say L1's) and add the control dependencies of loop L2 to it. Thus, fusion of the PDGs involves associating the control dependencies that define the loop structure of loop L2 with the V_LNG1^IVECE of L1:

    for e ∈ outedges(V_LNG2^IVECE):
        if e ∈ E_PDG^CD:
            change the source of edge e to V_LNG1^IVECE

Whereas the data dependencies within the loops are not affected, new data dependencies can result from the fusion of loops, because pre-fusion inter-loop dependencies convert into intra-loop dependencies after fusion. Since the legality analysis for fusion involves analyzing such inter-loop dependencies, we use this analysis phase to save the set of possible resultant dependence edges. Once the fusion is determined to be legal, we use the saved information to update the dependencies. Thus the changes to data dependencies due to fusion follow directly from the analysis.

Figure 4.2 shows an example of the fusion of two loops and the resultant PDG + LNG structure.

4.3 Reduction

Reduction is a well-known loop transformation useful in the parallelization of both affine and non-affine programs. Reduction is possible on any statement of the form v = v ⊕ expr in a loop, where ⊕ is a commutative and associative operation such as sum, product, min, or max, and v is loop-invariant in that dimension of the loop. Such a statement creates a loop-carried dependence in the loop, inhibiting parallelism. This reduction dependence can be broken by creating a private copy of the variable v for each parallel thread and finally accumulating all the copies into the original after the execution of the entire loop.

A reduction appears in the PDG as a loop-carried data dependence edge. The effect of performing a reduction operation is equivalent to removing the loop-carried dependence edge in E_PDG^LCDD associated with the operation v = v ⊕ expr. Reduction does not affect the LNG, because we do not represent any register dependencies in the LNG.
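A minimal sketch of this privatization scheme (ours; the thread count T and all names are illustrative), using a sum reduction:

    /* Serial form: sum = sum + A[i] carries a dependence across iterations. */
    double serial_sum(const double *A, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum = sum + A[i];
        return sum;
    }

    /* Privatized form: each of T hypothetical threads accumulates into its
     * own copy; the copies are combined once after the loop, so the
     * loop-carried dependence on sum is broken. */
    #define T 4
    double parallel_sum(const double *A, int n) {
        double priv[T] = {0.0};
        for (int t = 0; t < T; t++)      /* each t is one thread's independent work */
            for (int i = t; i < n; i += T)
                priv[t] = priv[t] + A[i];
        double sum = 0.0;
        for (int t = 0; t < T; t++)      /* final accumulation of private copies */
            sum = sum + priv[t];
        return sum;
    }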
Reduction is a constant-time update, involving only the removal of the edge corresponding to the reduction operation.

4.4 DSWP

Decoupled Software Pipelining [12] is a cyclic multi-threading technique which has been used to extract parallelism from general-purpose non-affine code. DSWP uses the PDG to analyze the dependencies and splits the loop into several pipeline stages. It has been shown that DSWP can also be scaled to a higher number of threads when coupled with other parallelization techniques such as DOALL, as in Parallel-Stage DSWP [13]. This makes DSWP a very useful transformation to combine with affine transformations.

In contrast to the two affine transformations discussed previously, for DSWP the PDG guides the changes in the LNG. The changes to the PDG due to the DSWP technique follow directly from the analysis [12]. Since DSWP partitions the body of the loop into threads that execute in different pipeline stages, its only effect on the LNG is on the association of instructions with the threads. We represent this information by splitting the LNG loop node in question into several

    S0: i=0;
    S1: y=0;
    S2: while(i