ABSTRACT

Title of thesis: AN INTEGRATED PROGRAM REPRESENTATION FOR LOOP OPTIMIZATIONS

Greeshma Yellareddy, Master of Science, 2011

Thesis directed by: Professor Rajeev Barua, Department of Electrical and Computer Engineering

In spite of all the advances, automatic parallelization has not entered the general-purpose compiling environment, for several reasons. There have been two distinct schools of thought in the parallelization domain, namely affine and non-affine, which have remained incompatible with each other over the years. Thus, a good practical compiler will have to be able to analyze and parallelize any type of code: affine, non-affine, or a mix of both. To achieve the best performance, compilers will have to derive the order of transformations best suited to a given program on a given system. This problem, known as "phase ordering", is a crucial impediment for practical compilers, more so for parallelizing compilers. The ideal compiler should be able to consider various orders of transformations and reason about their performance benefits. Towards such a compiler, in this paper we propose a unified program representation with the following characteristics:

- Modular in nature.
- Able to represent both affine and non-affine transformations.
- Able to use detailed static run-time estimators directly on the representation.

AN INTEGRATED PROGRAM REPRESENTATION FOR LOOP OPTIMIZATIONS

by Greeshma Yellareddy

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Master of Science, 2011

Advisory Committee:
Professor Rajeev Barua, Chair/Advisor
Professor Bruce Jacob
Professor Shuvra Bhattacharyya

© Copyright by Greeshma Yellareddy, 2011

Table of Contents

List of Tables
List of Figures
1 Introduction
2 Related Work
3 Program Representation
   3.1 Overview
   3.2 PDG
   3.3 LNG
4 Representation of Transformations
   4.1 Loop Interchange
   4.2 Loop Fusion
   4.3 Reduction
   4.4 DSWP
5 Results
6 Conclusion
Bibliography

List of Tables

5.1 Speedup on x86 24-core machine for the Polybench benchmarks
5.2 Speedup for Affine and Non-affine Transformations

List of Figures

3.1 Example program with the PDG and LNG
4.1 Example for Interchange
4.2 Example for Fusion
4.3 Example for DSWP
4.4 PDG and LNG for DSWP
5.1 Example for Affine and Non-affine Transformations

Chapter 1
Introduction

In the multi-core era, most of the world's computers are parallel, but most software remains serial.
Given the huge investment in existing serial code worldwide, and that rewriting serial code into parallel code is time-consuming, error-prone, and expensive, there is a great need for automatic parallelization and cache optimization of serial code. Some success has been seen for programs with affine array references [18][5]; an array reference is said to be affine if its indices are linear combinations of loop induction variables. For example, A[2i + 3j + 7, i - 2] is affine, but A[i^2] is not. Affine accesses are particularly well suited for parallelism and cache optimizations. However, only scientific and media programs are predominantly affine; and even those sometimes have small amounts of non-affine code in otherwise affine loops, ruining performance. Non-affine methods have also seen some work [6][12], primarily using graph-based methods.

Despite decades of research, the results of automatic parallelization remain somewhat disappointing. We identify two main reasons. First, existing affine and non-affine methods remain fundamentally incompatible because of different internal representations of candidate transformed code. Since most real-world code is a mix of affine and non-affine code, neither class of parallelizers has been able to conquer general-purpose code. Second, and just as important, existing compiled parallel code finds it hard to attain even a reasonable fraction of the peak performance of a parallel computer. Parallel code that is manually tuned by a highly trained computer engineer almost always yields code that is superior, often by a large margin, to automatically parallelized code.

The root cause of the superior performance of manually tuned code is that a human programmer is often able to deduce program optimization transformations that a compiler is unable to find. In theory, the compiler should be able to find the transformation order chosen by the human; in most cases, both use the same toolkit of program transformations to achieve lower run-time. Such human-applied or compiler affine transformations include tiling, loop interchange, fission, fusion, reversal, skewing, interleaving, peeling, and strip-mining. Non-affine transformations [12][13] include Decoupled Software Pipelining (DSWP), Parallel-Stage DSWP (PS-DSWP), Cyclic Multi-Threading (CMT+), and Control Speculation. These transformations are crucial to performance: they increase parallelism by breaking loop-carried dependencies, improve cache locality, or both.

Unfortunately, finding a good transformation order and good transformation parameters (e.g., tile sizes) is a challenging problem. This is the well-known "phase ordering" problem in compilers: a sequence of transformations must be applied in precisely the right order, with the right parameter values if applicable, to exactly the same loop dimensions to achieve the best run-time. Although this is a somewhat important problem for serial program transformations, it is crucial for transformations used for parallelization. For example, we have encountered a loop in the gemver benchmark in the Polybench benchmark suite for which the best transformation order is a loop interchange, followed by a fusion, followed by a strip-mining, followed by another loop interchange on different dimensions. No shorter transformation order gives a run-time that is even close to the best order.
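To make the affine/non-affine distinction above concrete, the following small C fragment (ours, purely illustrative and not from any benchmark; array sizes are chosen only so the accesses stay in bounds) contains both kinds of references:

    #define N 10
    #define M 10

    double A[2*N + 3*M + 8][N];  /* sized so the affine access below is in bounds */
    double B[N * N];

    void affine_vs_nonaffine(void) {
        /* Affine reference: both indices, 2*i + 3*j + 7 and i - 2, are
         * linear combinations of the induction variables i and j. */
        for (int i = 2; i < N; i++)
            for (int j = 0; j < M; j++)
                A[2*i + 3*j + 7][i - 2] = 0.0;

        /* Non-affine reference: i*i is quadratic in i, so affine analyses
         * cannot summarize its dependence pattern. */
        for (int i = 0; i < N; i++)
            B[i*i] = 1.0;
    }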
We have found several benchmarks where the run-time of a basic parallelizer without transformations is improved by a factor of 5-10X by a carefully constructed, manually discovered sequence of transformations.

To effectively solve the phase-ordering problem, a compiler must be able to apply mixed sequences of affine and non-affine transformations together. This is because real-world programs often belie easy categorization as just "non-affine" or just "affine". In reality, most programs are a mix of affine and non-affine code, with the proportion of affine code higher in scientific domains and lower in the general-purpose domain. For example, scientific codes are often mostly affine, but may have small amounts of non-affine code that may introduce loop-carried dependencies, such as a printf() or other calls with side effects, pointer accesses, or heap data structure accesses. Existing affine compilers today tend to be very limited in the scope of programs they can transform. They are infamously "brittle", breaking on codes that were not written to perfectly match what they can handle. The result is that affine parallelizers fail on such codes.

Non-affine parallelizers are also hobbled: they can handle any code, but cannot apply affine transformations. This limits them, since affine methods can exploit such codes in a much more scalable fashion. The result is that non-affine methods yield lower performance than is possible with a truly integrated affine + non-affine method.

The above two shortcomings make it clear that combined compiler methods that handle affine and non-affine transformations together are desirable, since they are likely to achieve higher performance in the compiled code. However, composing both types of transformations in a single compiler, to be used simultaneously on the same program loop, is very challenging. Existing compilers do not do this.

The fundamental difficulty in integrating affine and non-affine transformations is that they use different and incompatible program representations. Good program representations are essential to the success of compilers. For example, the control-flow graph and dataflow representations such as SSA are invaluable for optimization even though they can be derived from the compiler's intermediate representation (IR). This is because they make the underlying control- and data-flow information explicit, thus decoupling the discovery of such information from its use. Such decoupling allows for the modular design of a compiler, and avoids unnecessary re-computation of analysis results in every optimization pass. Similarly, the well-known Program Dependence Graph (PDG) is widely used for parallelization since it makes both types of dependences, control and data, explicit. This is useful even though everything in the PDG can be derived from the control-flow graph and dataflow representations.

In a similar vein, there is a need for a unified program representation that can make explicit the information needed for both affine and non-affine transformations. Unfortunately, existing program representations for each are unusable by the other. As a result, affine and non-affine transformations remain hard to combine. To understand why, we consider both types of representations below.

Affine mathematical representations represent affine array references and the loop-carried dependencies between them using custom mathematical representations for affine code, such as matrices in traditional methods [18, 7] and systems of linear inequalities in polyhedral methods [5].
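For instance (a small worked example in our notation, following the unimodular treatment used later in Chapter 4), a loop interchange on a two-deep nest is the unimodular matrix that swaps the two index dimensions; it transforms each distance vector by multiplication, turning (1, 0) into (0, 1):

    \[
    T_{\mathrm{interchange}} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
    \qquad
    T_{\mathrm{interchange}} \begin{pmatrix} 1 \\ 0 \end{pmatrix}
      = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.
    \]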
These representations have the advantage that they contain the detailed information needed to apply affine transformations. Further, since they contain little other information, they are fast to update when the compiler is searching through a very large number of possible transformation sequences. The loop's IR is converted to the mathematical representation before the transformation search commences. Thereafter, the IR is not used during the search. Only when the search has found the best transformation sequence is the IR regenerated, so that code generation can commence from it.

Unfortunately, after an affine transformation is applied, existing affine representations do not represent the program explicitly at all. Indeed, the result of applying an affine matrix transformation is not represented in any general-purpose representation that non-affine transformations can reason about. Instead, affine-transformed code exists only in the mathematical representation, with no concrete representation until final code generation. As a result, if a non-affine transformation is to analyze the result of a set of affine transformations, it has to derive the resultant code by applying these transformations in the IR and then building the relevant program representation it depends on from scratch. This is clumsy, non-intuitive, unnatural, and defeats modularity. To build a truly integrated compiler, it would be desirable if both affine and non-affine transformations explicitly represented code in a way that is useful for either type of subsequent transformation, so that a seamless transformation search is possible. This is the goal of our unified representation.

Another drawback of affine mathematical representations is that they are not complete program representations: they represent only affine references and their loop-carried dependencies, not the rest of the loop. Hence a general-purpose tool to estimate the run-time of a code fragment at compile-time cannot be applied after an affine transform. Such static run-time estimators (SREs) can be a useful tool for comparing candidate transformation sequences to choose the best.

Non-affine representations There is less standardization in the literature on representations for non-affine transformations. Methods such as DSWP [12] use the well-known Program Dependence Graph (PDG) representation. Earlier methods, such as those in the Parafrase compiler, use the Closure of Data and control Dependencies Graph (CDDG) [11, 10]. Both the PDG and the CDDG represent the instructions in the program along with the data and control dependencies between them. They are suitable for non-affine transformations, which often attempt to break (or speculate on) dependencies; applying transformations is therefore quick at compile-time, since they usually only need to modify the affected dependencies. Hence transformation orders can be searched quickly.

Unfortunately, graph-based representations such as the PDG have their own drawback: they do not represent affine array indices or affine distance vectors, which are crucial for performing affine transformations. Further, there is no easy means of representing affine transformations such as loop interchange, which changes the way a loop nest is structured, or loop reversal, which changes the direction in which a loop is traversed, because such information is not explicit in the PDG. The PDG has no special way of representing structured loops such as for loops with bounds, induction variables, and increments.
Whereas this kind of information may not be required by purely non-affine techniques, it is essential for affine analysis.

Hence graph-based representations are not suitable for applying mixed sequences containing both affine and non-affine transformations either. They have no quick way of deriving the affine information required for affine analysis, and no quick way of updating affine information after a non-affine transformation. Of course, the IR can be rewritten to regenerate affine information after each transformation in the search, but that would be too slow to be feasible given the large number of transformations a search might attempt.

Discussion The end result of the drawbacks of the existing representations above is that there is no unified explicit program representation today that is suitable for applying both affine and non-affine transformations in a unified manner. There is a need for a single integrated compiler framework that enables building parallelizing and cache-optimizing compilers for any type of code: affine code, non-affine code, or any mix of the two. It should robustly apply the best possible transformation order for each program by quickly searching through a large number of transformation sequences. A unified explicit program representation would greatly facilitate this goal.

We propose a new representation for programs that maintains the results of both affine and non-affine transforms explicitly, so that either type of transform can use it for subsequent transforms. Based on the combination of a new loop representation called the Loop Nest Graph (LNG) and the well-known Program Dependence Graph (PDG), it has several features that are desirable when searching for program transformations in any compiler, but particularly in parallelizing or cache-optimizing compilers. While the PDG maintains information about the instructions within a loop and the non-affine dependencies between those instructions, the LNG (which is built on top of the PDG) maintains information about the loops in a loop nest and the affine dependencies across them. The advantages of this representation include:

Modular One of the biggest advantages of using the LNG + PDG representation is the decoupling of the representation-building phase from the analysis phase. Intuitively, this is similar to how a PDG decouples parallelization methods from methods to construct the PDG. It is also similar to how an SSA representation decouples dataflow optimizations from SSA construction. As a result, the analysis phase can focus on algorithms that use this information directly, without having to build the individual representations itself.

Recording the effects of transformations directly Another advantage of our representation is the ability to reflect the effects of both affine and non-affine transformations in a unified framework, such that any future analysis can use the information directly instead of going back to the IR. We do this by recording the transformations directly in the LNG + PDG representation. The updates, however, are not intended to maintain complete information about the transformations, and as a result they make minimal changes, often updating only one of the two structures.

Use in detailed SREs The LNG + PDG forms a natural representation for use by a detailed static run-time estimator (SRE) to estimate run-time, taking into account detailed system characteristics such as cache, memory, network, synchronization, and pipeline characteristics.
Use of an SRE helps the compiler evaluate and choose between different compositions of transformations. Using a compiler framework capable of deriving candidate orders of transformations, and a good SRE to choose between them, we can foresee automatic compilation results approaching the performance of manually tuned code by using the characteristics of the target machine in the transformation search.

Current program representations have limitations that inhibit their use by SREs. Affine mathematical representations, such as distance vectors and polyhedral representations, are not complete program representations: they represent only affine references and their loop-carried dependencies, not the rest of the loop. As a result, SREs cannot be applied to them. On the other hand, although the PDG maintains complete enough information about the instructions in a loop, the information about structured for loops needed for affine run-time estimation, such as loop-nest structure and bounds, is not explicit in the PDG. Instead, the SRE would need to recover it, violating the modular design of SREs and PDGs. In contrast, the LNG and PDG combination has explicit information for both types of parallelizers, enabling modular SREs.

Although the focus of this paper is not the transformation search but the program representation used for it, a prototype search method using our representation has been built. Results show that it finds transformation orders among a selection of both affine and non-affine transformations without going back to the IR.

Chapter 2
Related Work

Compilers rely on low-overhead program representations when they need to evaluate different orders of transformations. Several program representations have been proposed for parallel compilers to aid the analysis and representation of transformations. These representations are motivated by the kinds of analyses and the kinds of loop optimizations considered by the compilers.

An interesting variant of this approach of using program transformations is iterative compilation [17, 1]. Instead of relying on a common intermediate program representation to derive and evaluate different compositions of transformations, the iterative compilation strategy involves generating code for various orders of transformations and executing it to compare performance. Since it involves going back to the IR for every candidate order of transformations, iterative compilation can have a significant compile-time overhead. Although some methods reduce the compile time by bringing down the number of choices considered, through a limited set of heuristics [17] or machine-learning algorithms [1], the compile time is still significant enough to restrict its use in practical compilers. Our use of a low-overhead means of representing just the essential information in the program helps us overcome this high compile-time problem. With an accurate SRE, our representation will provide the same benefits as iterative compilation with orders of magnitude less compile time.

For affine programs, distance vectors [20, 3] have been used by the SUIF compiler [18] to analyze and parallelize affine code. A distance vector represents a memory dependence between two array accesses across the iteration space of their common loop nests. For example, the distance vector (2, 0) means that each outer-loop iteration depends on the iteration that is two before it, and the inner-loop iterations are independent.
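As an illustrative sketch (ours, not taken from the SUIF work), a loop nest whose only memory dependence has distance vector (2, 0) looks like this in C:

    #define N 100
    #define M 100
    double A[N][M];

    void dist_2_0(void) {
        /* A[i][j] is computed from A[i-2][j]: the dependence distance is
         * 2 in the outer (i) dimension and 0 in the inner (j) dimension,
         * i.e., distance vector (2, 0). The inner (j) iterations carry no
         * dependence and could run in parallel. */
        for (int i = 2; i < N; i++)
            for (int j = 0; j < M; j++)
                A[i][j] = A[i-2][j] + 1.0;
    }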
The collective set of such distance vectors, one for every memory dependence in the loop, forms the basis for analyzing the loop for various affine loop transformations such as loop interchange, loop reversal, and loop skewing. One of the biggest advantages of this representation is that the analysis and representation of these affine loop transformations can be abstracted into simple matrix transformations.

Unfortunately, one limitation of the representation is that merely updating distance vectors cannot represent the effect of non-unimodular loop transformations such as loop fusion, loop fission, and loop peeling. As a result, heuristics that are based solely on distance vectors for evaluating different orders of transformations cannot use these transformations directly. Wolf et al. [19] propose an algorithm where fission and fusion are considered in the initial and final phases, while the middle phase of the algorithm uses the distance vectors to compute and evaluate various orders of transformations. Thus, these transformations are never truly part of the evaluation of the various orders. For example, this model cannot come up with a transformation order such as interchange followed by fusion followed by interchange. In our representation, we use distance vectors in combination with the LNG and the PDG to create a hierarchical loop dependence graph, through which we can represent all these transformations in any order.

The polyhedral representation [4] is the other significant program representation for affine code; it has been used by the PLUTO compiler [5]. Polyhedral methods [8] represent each statement in an affine loop separately as a point in an iteration domain. The iteration space thus defined is a polyhedron in a d-dimensional space, where d is the nesting depth of the loop in question. The polyhedral representation represents complex compositions of affine transformations as a scheduling function. Although the polyhedral model is very powerful, it has a serious drawback: since it translates affine code into a series of linear inequalities, this mathematical representation has no way of representing non-affine code. As a result, it is inapplicable to mostly-affine code that contains small amounts of non-affine code, which is common in real-world codes. Further, we are able to represent every transformation that can be represented in the polyhedral framework by means of the LNG + PDG combination. Indeed, many of the program transformations that polyhedral papers describe as not handled by the traditional model can be found using the mixed series of unimodular and non-unimodular transformations we considered with our representation.

Dependence-graph representations such as the PDG [9], the CDDG [11, 10], and the parallel program graph [16] are often used for non-affine transformations. The Parafrase-2 compiler [15] uses the CDDG to partition and schedule a given program. The CDDG is a hierarchical structure that enables detecting parallel tasks at various levels; however, it is an immensely complicated structure to make changes to, and thus is not suitable for representing transformations directly in the program representation, which is an essential feature if we want to apply several transformations. The PDG is another dependence-graph representation, one that is particularly conducive to incremental optimizations. The PDG encapsulates useful information that helps detect opportunities for optimization, vectorization, and parallelization.
The main drawback of all graph-based representations is that they are restricted to non-affine transformations; for example, the PDG has been used for transformations such as DSWP, PS-DSWP, and Control Speculation [12, 13]. We extend the PDG for affine analysis by combining it with the LNG.

A related representation is the PPG [16], which builds on the PDG to represent already-parallel programs. The PPG extends the serial PDG representation by adding parallel control-flow edges and synchronization edges for parallel programs. Unlike the PPG, our aim is to detect parallelism opportunities rather than represent parallel programs; hence our representation is catered to representing the serial program. Through the LNG + PDG representation, we seek to extend the PDG to encompass different kinds of transformations.

Chapter 3
Program Representation

3.1 Overview

The most important and challenging factor motivating the development of our program representation is the ability to analyze and represent both affine and non-affine transformations. Affine and non-affine methods have distinctly different characteristics. Whereas affine analyses use mathematical representations such as distance vectors to analyze and represent transformations, non-affine analyses rely on dependence graphs such as the CDDG and the PDG. The lack of any obvious commonality between the two methods, as reflected in existing technologies, poses a major challenge.

Understanding the commonalities and differences in the features that affine and non-affine transformations look for is the key to a unified representation. A good representation relies on the ability to make explicit the information needed by both these domains. Though the kinds of analyses and transformations considered by the two domains are seemingly different, they share a common requirement: the ability to understand the inherent dependencies in the program, particularly within a loop. The difference lies in the way the dependencies are studied and interpreted by the different transformations. Non-affine transformations analyze a dependence graph to partition it into several independent or less-dependent tasks. Affine transformations, on the other hand, aim to run loop iterations in parallel with other iterations by exploiting the regular dependence patterns encapsulated in distance vectors. To do so, they use loop restructuring, iteration-space reordering, and loop-nest reordering. As a result, non-affine transformations require the ability to understand all the dependencies within a loop, whereas affine transformations require the ability to understand and represent the characteristics relevant to the loop nest, such as the loop-nest structure and the loop-carried dependencies across all the different levels of the loop nest. Hence the difference in the representations needed.

There is a need for a representation that can act as a bridge between these two distinct sets of needs, helping realize a truly integrated program representation. The integrated program representation should be capable not only of maintaining the information relevant to both the affine and non-affine methods, but also of representing the effects of transformations without going back to the IR. Current representations cater exclusively to either affine or non-affine methods, and therefore maintain information relevant only to those methods. Combining the two representations, the PDG and distance vectors, though it seems intuitive, is also insufficient for our purpose.
Neither the PDG nor the distance vectors can directly reflect the changes made by the other's set of transformations; the two representations are inherently incompatible with each other.

We propose a new loop representation, the Loop Nest Graph (LNG), which in combination with the PDG and distance vectors has all the above features. The LNG acts as the bridge between the affine-catering mathematical representation (distance vectors) and the dependence-graph representation (PDG). By representing every loop as an individual node, the LNG represents a graph of dependencies between the loops in a loop nest, thus naturally representing the nesting structure in an easily accessible way. Because it treats loops as individual nodes, the LNG also forms a natural place to maintain affine information, such as the loop characteristics of structured loops, and distance vectors as annotations on loop-carried affine dependencies. This representation therefore serves the dual purpose of representing the information required by affine analysis, and of representing the effects of affine transformations in a way that non-affine analyses can understand and update.

This representation forms a two-tiered structure. The PDG is at the lower tier, representing dependence information between individual instructions in a loop, which is useful for non-affine analysis. The LNG, in combination with distance vectors, is at the top tier, maintaining loop and dependence information for a loop nest, which affine analysis can use effectively. Combining the LNG with the PDG and distance vectors thus gives us the flexibility to take advantage of existing research with little or no modification.

3.2 PDG

The PDG [9] is a well-known and powerful representation which is suited to several loop optimizations and parallelization techniques. The PDG has been used for several loop transformations, such as loop peeling and loop unrolling. It is also the basis for parallelization techniques such as DSWP [12].

[Figure 3.1: Example program with the PDG and LNG. Panels (a)-(d) show an example loop nest, its IR (statements S0-S9), the corresponding PDG, and the LNG (loops L1 and L2, with the distance vector (1,0) annotating an affine memory dependence). The legend distinguishes PDG control, data, and loop-carried data dependence edges from LNG control, loop structure, and affine memory dependence edges.]

An important advantage of the PDG is that it can be used to incrementally apply several of these optimizations without having to go back to the IR.

The PDG has several characteristics that make it conducive to the optimizations we are interested in. One of the most important reasons the PDG is particularly relevant is its reliance on dependence information. In basic terms, the presence of a dependence between two instructions implies that there exist constraints on the execution of the two instructions, and the absence of a dependence implies that the two instructions can execute in parallel. As a result, dependencies play a crucial role in exposing parallelism. Since the program dependence graph captures the dependencies in a program instead of emphasizing the flow of control or data, it becomes a natural choice for a compiler interested in parallelization.

Another useful feature of the PDG is its ability to represent dependencies at any level we are interested in: program level, function level, and loop level. Every level captures dependencies relevant only to that level's view. For example, a function-level view of the PDG can capture dependencies between two loops, but may not capture dependencies such as loop-carried dependencies within each of the loops.
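Anticipating the loop-level PDGs defined next, a minimal data-structure sketch (ours; the thesis does not prescribe a concrete implementation, so all type and field names here are hypothetical) of such a graph might be:

    #include <stddef.h>

    /* Kinds of PDG dependence edges described in this section. */
    enum pdg_edge_kind {
        PDG_CONTROL_DEP,           /* u decides whether v executes            */
        PDG_DATA_DEP_REGISTER,     /* u writes a (virtual) register v reads   */
        PDG_DATA_DEP_MEMORY,       /* u and v may access the same location    */
        PDG_LOOP_CARRIED_DATA_DEP  /* data dependence carried across iterations */
    };

    struct pdg_node {              /* one node per instruction in the loop */
        int   id;                  /* e.g., S0..S9 in Figure 3.1           */
        void *ir_instruction;      /* back-pointer into the compiler IR    */
    };

    struct pdg_edge {              /* edge (u, v): v depends on u */
        struct pdg_node   *src, *dst;
        enum pdg_edge_kind kind;
    };

    struct loop_pdg {              /* one PDG per loop in the function */
        struct pdg_node *nodes;  size_t num_nodes;
        struct pdg_edge *edges;  size_t num_edges;
    };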
In our representation, we use the PDG to represent dependencies for every loop in the function (including each level within a loop nest). The loop PDG is the base layer of our representation, maintaining the detailed dependence information of the loop which may be required by certain analyses. Every loop in the function is associated with a PDG representation that maintains dependence information between every pair of instructions in the loop. The loop PDG represents the loop as a graph where every instruction in the loop is a unique node and the edges represent the dependencies between the nodes.

The PDG is a directed graph G_PDG = (V_PDG, E_PDG), where V_PDG = {S0, S1, ..., Sn}, with Si an instruction in the loop, and E_PDG = {(u, v) | u, v ∈ V_PDG and v is dependent on u}.

Figure 3.1(c) shows the PDG for the loop in Figure 3.1(a). Each instruction in the IR shown in Figure 3.1(b) is a node in the PDG. In this example, {S0, S1, ..., S9} is the set of nodes in the PDG. (S1, S2) is an example edge, representing the dependence of instruction S2 on S1.

Different kinds of dependencies are represented in the PDG. Each edge (u, v) in the PDG is associated with one of these dependence types.

PDG Control Dependence Edge (E_PDG^CD): A node v is control dependent on node u if node u determines whether node v executes or not. Using control dependencies instead of control flow helps avoid the fixed sequencing forced by the latter. The control dependence property is derived from the control-flow graph of the loop [9]. For the control-flow graph G_CFG of a loop, (u, v) ∈ E_PDG^CD is a PDG control dependence edge iff (a) there exists a path P from u to v in G_CFG with every w ∈ P post-dominated by v, and (b) u is not post-dominated by v. In Figure 3.1(c), {(S1, S2), (S3, S5)} are examples of control dependence edges in the PDG.

PDG Data Dependence Edge (E_PDG^DD): A data dependence edge denotes the possibility of the two instructions (nodes) accessing the same location. This automatically imposes execution constraints, thus restricting parallelism. Hence data dependencies form important constraints in parallelism detection. Since data dependencies, especially memory dependencies, are difficult to disambiguate, they are often the most crucial parallelism-inhibiting dependencies. Several transformations specifically target breaking or moving these dependencies to enhance parallelism. (u, v) ∈ E_PDG^DD is a PDG data dependence edge iff (a) there exists a path from u to v in G_CFG, (b) u and v can access the same location, and (c) either u or v or both are a write or store operation.

Data dependence edges are further classified into register dependence and memory dependence edges. (u, v) is a register dependence edge if u writes to a register that v reads from (here registers mean virtual registers before register allocation, or physical registers after). In Figure 3.1(c), (S4, S6) is an example register dependence edge. (u, v) is a memory dependence edge if u writes to a memory location that could be read by v. In Figure 3.1(c), (S7, S5) is a memory anti-dependence edge. Data dependence edges are determined using data-flow analysis techniques and the control-flow graph. Eliminating memory dependencies further requires several kinds of memory dependence analyses, alias analyses, and memory disambiguation techniques to prove that a pair of memory operations do not alias.

PDG Loop-carried Data Dependence Edge (E_PDG^LCDD): A special kind of data dependence edge in the loop PDG which has the additional property of being loop-carried in nature.
(S9, S4) is an example of a loop-carried register dependence edge, and (S7, S5) of a loop-carried memory dependence edge, for the memory location B[j].

3.3 LNG

The LNG is our proposed top-level graph, which defines the loop structure of a function. It forms a layer over the loop PDGs, providing information from a loop-nest perspective. The LNG has the following three features:

- It represents every loop as an individual node, making loops top-level structures that are directly accessible for any loop-related information. Representing the loop as a node also helps define the loop-nest structure, which is a useful representation for several affine transformations.

- It makes structured loops explicit. Structured loops such as for loops are the prime candidates for affine analyses. Several characteristics of structured loops, such as induction variables and loop bounds, can be derived precisely and are hence represented directly.

- It encapsulates affine dependencies. For every PDG loop-carried memory dependence edge that is affine, we copy that edge to the LNG and annotate it with its distance vector. This is convenient for affine transformations, since unimodular transformations can be defined solely in terms of the distance vectors of the loop. In addition, non-unimodular transformations can be represented by updating other parts of the LNG + PDG in addition to the distance vectors.

The LNG thus forms a generic program representation representing dependence information from a loop-nest perspective, which is particularly relevant to affine analysis. The biggest advantage of the LNG is its easy compatibility with other dependence-graph structures such as the PDG. Any analysis that needs loop-nest-related information and specific loop attributes can use the LNG directly, without having to derive that information from the IR or the PDG.

The LNG represents the loop nest as a graph where every loop and every instruction is a unique node, with edges that define the loop-nesting relationship and the affine memory dependence relationship between two nodes.

The LNG is a directed graph G_LNG = (V_LNG, E_LNG), where V_LNG is a set of nodes, one for each instruction and loop in the program, and E_LNG is a set of edges (u, v) which have either (a) a loop-nest relationship as defined below or (b) an affine memory dependence.

Figure 3.1(d) shows the LNG for the loop in Figure 3.1(a). {L1, L2, S0, ..., S9} is the set of LNG nodes; {(L1, L2), (L2, S5)} are examples of LNG edges.

There are two kinds of nodes in the LNG:

LNG Loop Node (V_LNG^Loop): Every loop in the function is represented as an LNG loop node. Each loop node further maintains important loop characteristics, such as loop bounds and induction variable characteristics for structured loops. In Figure 3.1(d), V_LNG^Loop = {L1, L2}.

LNG Instruction Node (V_LNG^Inst): Every instruction is represented as an LNG instruction node. In Figure 3.1(d), V_LNG^Inst = {S0, S1, ..., S9}. There is a one-to-one correspondence with the instructions represented in the PDGs. (During implementation, the LNG does not replicate all the information about the instruction in the PDG; instead it contains a structure with a pointer to the PDG instruction. Having the structure allows the LNG to add information, such as extra edges, to the instructions.)

Similar to the PDG, the LNG maintains five types of edges, listed below.

LNG Loop Control Edge (E_LNG^LCE): This type of edge denotes the loop-nest relationship between nodes. There exists an LNG loop control edge from node u to node v if u is the immediate parent loop of v. {(L1, L2), (L2, S5)} are examples of LNG control edges.
An LNG control edge from one loop node u to another loop node v defines the loop-nesting information that v is a child loop of u. An edge from an LNG loop node to an LNG instruction node defines the association of the instruction with its immediate parent loop. Determining the LNG loop control edges involves the well-known dominator analysis used to recognize loops in a program. There exists an LNG control edge (u, v) ∈ E_LNG^LCE iff (a) u ∈ V_LNG^Loop and v ∈ V_LNG^Loop ∪ V_LNG^Inst, and (b) u is the immediate dominator of v.

LNG Affine Memory Dependence Edge (E_LNG^AMDE): Each loop-carried memory dependence between two affine memory access instructions is represented in the LNG with an LNG affine memory dependence edge. These edges are further annotated with distance vectors. Non-affine loop-carried dependencies are not included in this set. In Figure 3.1(d), (S7, S5) is an LNG affine memory dependence edge. The edge is annotated with the distance vector (1, 0) associated with the memory accesses.

In addition to the above two edge types, the LNG maintains three special edges associated with every structured for loop, which we together refer to as the LNG Loop Structure Edges (E_LNG^LSE). We follow the usual definition of for loops: a for loop is a loop whose exit condition is a relational operator comparing an induction variable with a loop-invariant quantity. The three defining characteristics of a for loop are the incoming definition, the update, and the exit condition of the loop's induction variable (IV). These are represented in the LNG by the following three types of edges:

LNG Loop IV Definition Edge (E_LNG^IVDef): A special kind of LNG loop structure edge from a for loop to its induction variable definition before the loop. The set of such induction variable definitions is called V_LNG^IVDef. In Figure 3.1(d), E_LNG^IVDef = {(L2, S2), (L1, S0)} and V_LNG^IVDef = {S0, S2}.

LNG Loop IV Update Edge (E_LNG^IVUp): A special kind of LNG loop structure edge from a for loop to the instruction in that loop that updates its induction variable. By the definition of an induction variable, this updating instruction is unique. The set of such induction variable updates is called V_LNG^IVUp. In Figure 3.1(d), E_LNG^IVUp = {(L2, S8), (L1, S9)} and V_LNG^IVUp = {S8, S9}.

LNG Loop IV Exit Condition Edge (E_LNG^IVECE): A special kind of LNG loop structure edge from a for loop to the instruction in the loop that calculates its exit condition. The set of such induction variable exit condition instructions is called V_LNG^IVECE. In Figure 3.1(d), E_LNG^IVECE = {(L2, S3), (L1, S1)} and V_LNG^IVECE = {S1, S3}.

For convenience's sake, we define V_LNG^LSE = V_LNG^IVDef ∪ V_LNG^IVUp ∪ V_LNG^IVECE and E_LNG^LSE = E_LNG^IVDef ∪ E_LNG^IVUp ∪ E_LNG^IVECE.

Chapter 4
Representation of Transformations

To illustrate how the LNG and the PDG are updated for transformations, we consider four transformations as examples below.

4.1 Loop Interchange

Loop interchange is an important affine loop transformation. Loop interchange is desirable for a number of reasons, such as improving cache performance or increasing the granularity of parallelism. Loop interchange can also act as an enabler transform, enabling other transforms such as fusion to become applicable.
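As a minimal before/after sketch (ours, not the thesis's running example), interchange swaps only the two loop headers while leaving the body untouched; here it turns a cache-unfriendly column-major traversal of a row-major C array into a row-major one:

    /* Before interchange: the inner loop strides down a column of the
     * row-major array A, touching a new cache line on every iteration. */
    void scale(int N, int M, double A[N][M]) {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < N; i++)
                A[i][j] = 2.0 * A[i][j];
    }

    /* After interchange: only the loop headers are swapped; the body is
     * unchanged, and A is now traversed row by row. */
    void scale_interchanged(int N, int M, double A[N][M]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                A[i][j] = 2.0 * A[i][j];
    }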
Traditional analysis for the legality of loop interchange relies on analyzing the distance vectors of the loop nest. The effects of loop interchange are captured in the two representations in the following manner:

Loop interchange is a reordering transformation. It affects only the order in which the loop is executed, without changing the code inside the loop. As a result, the most important effect of interchange is on the induction variable behavior of the two loops. This is reflected in the LNG by interchanging the two loop nodes and the associated LNG loop structure edges. For a given LNG G_LNG, a new G'_LNG is derived for the interchange of Loop1 and Loop2 by interchanging the position of V_LNG1^Loop with V_LNG2^Loop, along with their associated loop structure edges in the LNG. In the example, this would involve interchanging L1 with L2, along with their associated loop structure edges.

To reflect the corresponding changes in the PDG, we use the special loop structure nodes that are maintained in the LNG (V_LNG^LSE). This involves interchanging the control edges associated with the loop structure nodes of the two loops, which means:

- swap the control dependence edges associated with V_LNG1^IVDef and V_LNG2^IVDef;
- similarly, swap the control dependence edges associated with V_LNG1^IVUp and V_LNG2^IVUp;
- similarly, swap the control dependence edges associated with V_LNG1^IVECE and V_LNG2^IVECE.

Because of the change in the order of execution, loop interchange affects the data dependencies in both structures. It changes the order in which memory is accessed, thus changing the dependence patterns (distance vectors) associated with the affine memory dependencies (E_LNG^AMDE). This is reflected in the LNG by performing an interchange on the columns of the distance vectors associated with the loop nest, as described by unimodular theory [2]. In the example of Figure 3.1(d), the two columns of the distance vector (1, 0) would be interchanged to yield (0, 1).

    ∀ edge e = (u, v) ∈ E_LNG^AMDE:
        interchange_distance_vectors(e)

Since each LNG affine memory dependence is a copy of the unannotated dependence in the PDG, when the former changes the latter must be changed as well. When a distance vector component is changed to zero by interchange, the dependence can be removed from the PDG. Thus the dependence edge (S7, S5) is removed after the interchange. If d1, d2 are the depths of the loops interchanged:

    for d ∈ {d1, d2}:
        ∀ edge e = (u, v) ∈ E_LNG^AMDE:
            if distance_vector(e)[d] changes to 0:
                remove_edge(pdg_edge(u, v))
            else if distance_vector(e)[d] changes to non-zero:
                add_loop_carried_edge(pdg_edge(u, v))

where pdg_edge(u, v) is the edge from node u to node v in the PDG associated with the loop at depth d in the loop nest.

Loop interchange does not affect the register dependencies in the PDG, because intra-loop dependencies are not affected by interchanging the loops.

Figure 4.1 shows the effect of interchanging the two loops of the example in Figure 3.1(b). The highlighted nodes in both the LNG and the PDG show the swapped nodes, and the highlighted edge shows the change in the loop-carried memory dependence due to the change in its distance vector.

[Figure 4.1: Example for Interchange. (a) Before Interchange; (b) After Interchange.]

4.2 Loop Fusion

Loop fusion is another well-understood and often-used loop transformation. Loop fusion affects the run-time in two ways. First, fusing two loops brings down the number of loops, thus reducing the synchronization points during parallelization.
Fusion also has a number of cache benefits: it can promote cache and data reuse, thus improving the run-time. Loop fusion is advantageous and desirable for both affine and non-affine methods.

Loop fusion affects both the LNG and the PDG in the loop control structure.

[Figure 4.2: Example for Fusion. The loops "for(i=0; i<N; ++i) C[i] = A[i];" and "for(i=0; i<N; ++i) D[i] = B[i];" are fused into "for(i=0; i<N; ++i) { C[i] = A[i]; D[i] = B[i]; }"; the figure shows the corresponding PDG and LNG structures before and after fusion.]

As with interchange, changes in the LNG guide the changes to the PDG. Fusing two loops can be visualized as fusing the two corresponding LNG loop nodes into one LNG loop node, such that all the dependencies associated with the two loop nodes are now associated with the fused node. Let L1 and L2 represent the two LNG loop nodes and Lfused the fused loop node. For fusion to be legal, the loop structures must be the same for L1 and L2, so we can inherit the loop structure for Lfused from either (say L1). Then we inherit the LNG + PDG nodes and dependences for Lfused by unioning those of L1 and L2. Thus, if Lfused = L1, we inherit the edges from L2 by:

    for e ∈ outedges(L2):
        if e ∉ E_LNG2^LSE:
            change the source of edge e to L1

Representing the fusion of the two loops in the PDG involves fusing the loop structures of the corresponding loop PDGs into one fused PDG. The control dependencies in the PDG that define the loop structure are associated with the exit condition node (V_LNG^IVECE). Instead of replicating the PDG for a fused loop, we use one of the candidate PDGs as our base (say L1's) and add the control dependencies of loop L2 to it. Thus, fusion of the PDGs involves associating the control dependencies that define the loop structure of loop L2 with the V_LNG1^IVECE of L1:

    for e ∈ outedges(V_LNG2^IVECE):
        if e ∈ E_PDG^CD:
            change the source of edge e to V_LNG1^IVECE

Whereas the data dependencies within the loops are not affected, new data dependencies can result from the fusion of loops, because pre-fusion inter-loop dependencies convert into intra-loop dependencies after fusion. Since the legality analysis for fusion involves analyzing such inter-loop dependencies, we use this analysis phase to save the set of possible resultant dependence edges. Once the fusion is determined to be legal, we use the saved information to update the dependencies. Thus the changes to data dependencies due to fusion follow directly from the analysis.

Figure 4.2 shows an example of the fusion of two loops and the resultant PDG + LNG structure.

4.3 Reduction

Reduction is a well-known loop transformation useful in the parallelization of both affine and non-affine programs. Reduction is possible on any statement of the form v = v ⊕ expr in a loop, where ⊕ is a commutative and associative operation such as sum, product, min, or max, and v is loop-invariant in that dimension of the loop. Such a statement creates a loop-carried dependence in the loop, inhibiting parallelism. This reduction dependence can be broken by creating a private copy of the variable v for each parallel thread and finally accumulating all the copies into the original after the execution of the entire loop.

A reduction appears in the PDG as a loop-carried data dependence edge. The effect of performing a reduction operation is equivalent to removing the loop-carried dependence edge in E_PDG^LCDD associated with the operation v = v ⊕ expr. Reduction does not affect the LNG, because we do not represent any register dependencies in the LNG.
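A minimal sketch of this privatization scheme (ours; the thread count T and all names are illustrative), using a sum reduction:

    /* Serial form: sum = sum + A[i] carries a dependence across iterations. */
    double serial_sum(const double *A, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum = sum + A[i];
        return sum;
    }

    /* Privatized form: each of T hypothetical threads accumulates into its
     * own copy; the copies are combined once after the loop, so the
     * loop-carried dependence on sum is broken. */
    #define T 4
    double parallel_sum(const double *A, int n) {
        double priv[T] = {0.0};
        for (int t = 0; t < T; t++)      /* each t is one thread's independent work */
            for (int i = t; i < n; i += T)
                priv[t] = priv[t] + A[i];
        double sum = 0.0;
        for (int t = 0; t < T; t++)      /* final accumulation of private copies */
            sum = sum + priv[t];
        return sum;
    }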
Reduction is a constant-time update, involving only the removal of the edge corresponding to the reduction operation.

4.4 DSWP

Decoupled Software Pipelining [12] is a cyclic multi-threading technique which has been used to extract parallelism from general-purpose non-affine code. DSWP uses the PDG to analyze the dependencies and splits the loop into several pipeline stages. It has been shown that DSWP can also be scaled to a higher number of threads when coupled with other parallelization techniques such as DOALL, as in Parallel-Stage DSWP [13]. This makes DSWP a very useful transformation to combine with affine transformations.

In contrast to the two affine transformations discussed previously, for DSWP the PDG guides the changes in the LNG. The changes to the PDG due to the DSWP technique follow directly from the analysis [12]. Since DSWP partitions the body of the loop into threads that execute in different pipeline stages, its only effect on the LNG is on the association of instructions with the threads. We represent this information by splitting the LNG loop node in question into several

    S0: i=0;
    S1: y=0;
    S2: while(i