ABSTRACT

Title of Dissertation: The Java Memory Model
Jeremy Manson, Doctor of Philosophy, 2004
Dissertation directed by: William Pugh, Department of Computer Science

After many years, support for multithreading has been integrated into mainstream programming languages. Inclusion of this feature brings with it a need for a clear and direct explanation of how threads interact through memory. Programmers need to be told, simply and clearly, what might happen when their programs execute. Compiler writers need to be able to work their magic without interfering with the promises that are made to programmers.

Java's original threading specification, its memory model, was fundamentally flawed. Some language features, like volatile fields, were under-specified: their treatment was so weak as to render them useless. Other features, including fields without access modifiers, were over-specified: the memory model prevents almost all optimizations of code containing these "normal" fields. Finally, some features, like final fields, had no specification at all beyond that of normal fields; no additional guarantees were provided about what will happen when they are used.

This work has attempted to remedy these limitations. We provide a clear and concise definition of thread interaction. It is sufficiently simple for programmers to work with, and flexible enough to take advantage of compiler and processor-level optimizations. We also provide formal and informal techniques for verifying that the model provides this balance. These issues had never been addressed for any programming language: in addressing them for Java, this dissertation provides a framework for all multithreaded languages. The work described in this dissertation has been incorporated into version 5.0 of the Java programming language.

The Java Memory Model

by
Jeremy Manson

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2004

Advisory Committee:
Michael Hicks
Pete Keleher
Edgar G. K. Lopez-Escobar
Adam Porter
William Pugh, Chair/Advisor

© Copyright by Jeremy Manson, 2004

ACKNOWLEDGEMENTS

The work in this thesis could not have even begun to speculate about the merest possibility of crossing my mind without the input of many, many people. First and foremost is my advisor, Bill Pugh. However, the contributions of the many members of the Java memory model mailing list were also crucial, especially Doug Lea and Sarita Adve. Although the others are too numerous to mention, I shall mention some of them anyway: Victor Luchangco, Hans Boehm, Joseph Bowbeer and David Holmes all had a significant impact on the content of this dissertation.

I would like to thank David Hovemeyer and Jaime Spacco for sitting through innumerable conversations and lectures on the Java memory model. They truly have the patience of saints.

Without Penelope Asay, I would have been even more of an emotional wreck than the typical graduate student in the throes of dissertation writing. I might have gotten it done, but I wouldn't have been very happy about it.

Special thanks to my Mother, Father, Josh and Matthew for not making me explain what I am doing. I would still be on the phone. And, of course, I wouldn't be anywhere without you.

TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
  1.1 Why Solve This Problem?
  1.2 Approach
  1.3 Development
2 Building Blocks
  2.1 Code is Reordered
  2.2 Synchronization and Happens-Before
  2.3 Atomicity
  2.4 Visibility
  2.5 Ordering
  2.6 Discussion
3 In Which Some Motivations Are Given
  3.1 Guarantees for Correctly Synchronized Programs
  3.2 Simple Reordering
  3.3 Transformations that Involve Dependencies
    3.3.1 Reordering Not Visible to Current Thread
  3.4 Synchronization
  3.5 Additional Synchronization Issues
    3.5.1 Optimizations Based on Happens-Before
    3.5.2 Additional Guarantees for Volatiles
    3.5.3 Optimizers Must Be Careful
  3.6 Infinite Executions, Fairness and Observable Behavior
    3.6.1 Control Dependence
4 Causality - Approaching a Java Memory Model
  4.1 Sequential Consistency Memory Model
  4.2 Happens-Before Memory Model
  4.3 Causality
    4.3.1 When Actions Can Occur
  4.4 Some Things That Are Potentially Surprising about This Memory Model
    4.4.1 Isolation
    4.4.2 Thread Inlining
5 The Java Memory Model
  5.1 Actions and Executions
  5.2 Definitions
  5.3 Well-Formed Executions
  5.4 Causality Requirements for Executions
  5.5 Observable Behavior and Nonterminating Executions
  5.6 Formalizing Observable Behavior
  5.7 Examples
    5.7.1 Simple Reordering
    5.7.2 Correctly Synchronized Programs
    5.7.3 Observable Actions
6 Simulation and Verification
  6.1 The Simulator
    6.1.1 Simulating Causality
    6.1.2 Simulator Performance
  6.2 Proofs
    6.2.1 Semantics Allows Reordering
    6.2.2 Considerations for Programmers
7 Immutable Objects
  7.1 Informal Semantics
    7.1.1 Complications
  7.2 Motivating Examples
    7.2.1 A Simple Example
    7.2.2 Informal Guarantees for Objects Reachable from Final Fields
    7.2.3 Additional Freeze Passing
    7.2.4 Reads and Writes of Final Fields in the Same Thread
    7.2.5 Guarantees Made by Enclosing Objects
  7.3 Full Semantics
  7.4 Illustrative Test Cases and Behaviors of Final Fields
  7.5 Permissible Optimizations
    7.5.1 Prohibited Reorderings
    7.5.2 Enabled Reorderings
  7.6 Implementation on Weak Memory Orders
    7.6.1 Weak Processor Architectures
    7.6.2 Distributed Shared Memory Based Systems
8 Related Work
  8.1 Architectural Memory Models
  8.2 Processor Memory Models
  8.3 Programming Language Memory Models
    8.3.1 Single-Threaded Languages
    8.3.2 Multi-Threaded Languages
  8.4 Final Fields
9 Future Work
  9.1 Performance Testing and Optimization
  9.2 Better Programming Models
  9.3 Other Programming Languages
10 Conclusion
A Finalization
  A.1 Implementing Finalization
  A.2 Interaction with the Memory Model

LIST OF TABLES

6.1 Simulator Timing Results
6.2 Table of Requirements, Part 1
6.3 Table of Requirements, Part 2

LIST OF FIGURES

1.1 Surprising results caused by forward substitution
1.2 Motivation for disallowing some cycles
1.3 Compilers Can Think Hard about when Actions are Guaranteed to Occur
2.1 Behaves Surprisingly
2.2 Figure 2.1, after reordering
2.3 Atomicity example
2.4 Visibility example
2.5 Ordering example
3.1 Surprising Correctly Synchronized Program
3.2 Behaves Surprisingly
3.3 Effects of Redundant Read Elimination
3.4 An Example of a More Sophisticated Analysis
3.5 Sometimes Dependencies are not Obvious
3.6 An Unexpected Reordering
3.7 Simple Use of Volatile Variables
3.8 Example of Lock Coarsening
3.9 Volatiles Must Occur In A Total Order
3.10 Strong or Weak Volatiles?
3.11 Another Surprising Correctly Synchronized Program
3.12 Lack of fairness allows Thread 1 to never surrender the CPU
3.13 If we observe print message, Thread 1 must see write to v and terminate
3.14 Correctly Synchronized Program
4.1 Behaves Surprisingly
4.2 Surprising Correctly Synchronized Program
4.3 An Out Of Thin Air Result
4.4 An Unexpected Reordering
4.5 Can Threads 1 and 2 see 42, if Thread 4 didn't write 42?
4.6 Can Threads 1 and 2 see 42, if Thread 4 didn't write to x?
4.7 A Complicated Inference
4.8 Another Out Of Thin Air Example
4.9 Behavior disallowed by semantics
4.10 Result of thread inlining of Figure 4.9; behavior allowed by semantics
4.11 A variant "bait-and-switch" behavior
4.12 Behavior that must be allowed
5.1 A Standard Reordering
5.2 Table of commit sets for Figure 5.1
5.3 A Correctly Synchronized Program
5.4 Must Disallow Behavior by Ensuring Happens-Before Order is Stable
5.5 If we observe print message, Thread 1 must see write to v and terminate
7.1 Example Illustrating Final Field Semantics
7.2 Without final fields or synchronization, it is possible for this code to print /usr
7.3 Example of Simple Final Semantics
7.4 Example of Transitive Final Semantics
7.5 Example of Reachability
7.6 Freezes are Passed Between Threads
7.7 Example of Happens-Before Interaction
7.8 Four Examples of Final Field Optimization
7.9 Guarantees Should be made via the Enclosing Object
7.10 Cyclic Definition Causes Problems
7.11 Final field example where reference to object is read twice
7.12 Transitive guarantees from final fields
7.13 Yet Another Final Field Example
7.14 Happens-Before Does Matter
7.15 Number of Required Memory Barriers on Weak Memory Orders
8.1 Can r2 = 1, r3 = 0?
8.2 CRF does not allow i == 2 and j == 1
8.3 A Purely Operational Approach Does Not Work
8.4 Compilers Can Think Hard About When Actions Are Guaranteed to Occur

Chapter 1

Introduction

A facility for quotation covers the absence of original thought.
- Dorothy L. Sayers, "Gaudy Night"

Much of the work done in modern computer science focuses on one of two goals.
A tremendous amount of effort has been funneled into ensuring that these two goals are met. Unfortunately, in modern programming environments, these goals can conflict with each other in surprising ways. First, we want to make sure that programs run quickly. At the high level, this involves data structure and algorithm design. At a lower level, this involves investing a great deal of time and effort reordering code to ensure that it is run in the most efficient way possible. For modern compilers and processors, this lower level is crucial. Specula- tive execution and instruction reordering are two tools that have been integral to the explosion in computer power over the last several decades. Most com- piler optimizations, including instruction scheduling, register allocation, common subexpression elimination and redundant read elimination, can be thought of as relying on instruction reordering to some extent. For example, when a redundant read is removed, the compiler is, in effect, moving the redundant read to where 1 the first read is performed. In similar ways, processor optimizations such as the use of write buffers and out-of-order completion / issue reflect reorderings. Our second goal is to ensure that both programmers and computers under- stand what we are telling them to do. Programmers make dozens of assumptions about the way code is executed. Sometimes those assumptions are right, and sometimes they are wrong. In the background, at a lower level than our pro- gramming efforts, compilers and architectures are spending a lot of time creating an environment where our code is transformed; it is optimized to run on the system. Our assumptions about the way code is executing can be easily violated. It is fairly easy (by comparison) for a processor, dealing with a single thread of instructions, to avoid messing around too much with our notions of how in- structions are scheduled. It simply needs to ensure that when an instruction is performed early, that instruction doesn?t affect any of the instructions past which it was moved. Programmers will generally not need to reason about potential re- orderings when they write single-threaded programs. This model actually allows for a wide variety of program transformations. The real difficulty comes when there is more than one thread of instructions executing at the same time, and those threads are interacting. The processor can ensure that the actions in each thread appear to happen in the correct order in isolation. However, if more than one thread is executing, the limiting factors we impose for single-threaded execution are not enough ? because of the need for code optimization, we can start to see bizarre side effects. This is not necessarily a problem. For most modern multithreaded program- ming patterns, the programmer ensures that there are ordering constraints by explicitly communicating between threads. Within these specified constraints, 2 Original code Valid compiler transformation Initially: p == q, p.x == 0 Thread 1 Thread 2 r1 = p; r6 = p; r2 = r1.x; r6.x = 3 r3 = q; r4 = r3.x; r5 = r1.x; Result: r2 == r5 == 0, r4 == 3 Initially: p == q, p.x == 0 Thread 1 Thread 2 r1 = p; r6 = p; r2 = r1.x; r6.x = 3 r3 = q; r4 = r3.x; r5 = r2; Result: r2 == r5 == 0, r4 == 3 Figure 1.1: Surprising results caused by forward substitution it is possible to reorder code with a great deal of freedom. However, this begs the question of how that communication works, and what happens in a program when it is missing. 
Obviously, we have to spend some time specifying these issues. We need a lingua franca, which we usually call a memory model, because it describes how programs interact with memory. The memory model for whatever language the programmer is using defines what kinds of reorderings may be perceived, down to the level of machine code. In a high-level interpreted language, like Java or C#, it defines the reorderings the compiler can perform when translating to bytecode, the virtual machine can perform when translating to machine code, and the processor can perform when executing that machine code.

A simple example of how reorderings can be perceived can be seen in Figure 1.1 [Pug99]. One common compiler optimization involves having the value read for r2 reused for r5: they are both reads of r1.x with no intervening write. Now consider the case where the assignment to r6.x in Thread 2 happens between the first read of r1.x and the read of r3.x in Thread 1. If the compiler decides to reuse the value of r2 for r5, then r2 and r5 will have the value 0, and r4 will have the value 3. From the perspective of the programmer, the value stored at p.x has changed from 0 to 3 and then changed back. It was this surprising example that first motivated this work.

This example is a relatively simple one; if the examples were all so simple, a memory model could be defined in terms of allowable reorderings. Not a great deal of subtlety would be required. However, this is not the case; we have to deal with the notion of a causal loop.

Initially, x == y == 0
Thread 1:          Thread 2:
r1 = x;            r2 = y;
y = r1;            x = r2;
Can r1 == r2 == 42?

Figure 1.2: Motivation for disallowing some cycles

Consider Figure 1.2. We have two threads of execution, both of which read from one variable and then write to another. If these actions are executed out of their program order, we could imagine that, for example, the number 42 was written to x in Thread 2, then read from x in Thread 1, written to y in Thread 1, then read from y in Thread 2, justifying the initial write to x. Where did the value 42 come from?

This example may seem glib and unrealistic to some. However, a complete semantics for a programming language memory model would have to make a decision about whether or not to allow it. In this case, the causal loop leads to an undesirable behavior: one which we wish to disallow.

Initially, x == y == 0
Thread 1:          Thread 2:
r1 = x;            r3 = y;
r2 = r1 | 1;       x = r3;
y = r2;
r1 == r2 == r3 == 1 is legal behavior

Figure 1.3: Compilers Can Think Hard about when Actions are Guaranteed to Occur

Another example of a causal loop, this time one that describes an acceptable behavior, can be seen in Figure 1.3. In order to see the result r1 == r2 == r3 == 1, it would seem as if Thread 1 would need to write 1 to y before reading x. It also seems as if Thread 1 can't know what value r2 will be until after x is read. In fact, a compiler could perform an inter-thread analysis that shows that only the values 0 and 1 will be written to x. Knowing that, the compiler can determine that the statement containing the logical or operator (|) will result in Thread 1's always writing 1 to y. Thread 1 may, therefore, write 1 to y before reading x. The write to y is not dependent on the values seen for x.

It is clear, therefore, that sometimes causal loops are acceptable, and other times, they are not. The fundamental problem with these sorts of examples is that there is no "first cause" from which we can reason. We have to think some more about cause and effect.
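To make the Figure 1.3 analysis concrete, the dependence-breaking transformation might, in effect, rewrite Thread 1 as follows. This is only a sketch; the class, field and method names are illustrative and not from the dissertation.

// Thread 1 of Figure 1.3, before and after a hypothetical dependence-breaking
// transformation. Assume an inter-thread analysis has shown that only 0 and 1
// are ever written to x, so (r1 | 1) always evaluates to 1.
class Figure13Sketch {
    int x, y;   // shared, non-volatile fields

    void thread1Original() {
        int r1 = x;
        int r2 = r1 | 1;   // superficially depends on r1, but the result is always 1
        y = r2;
    }

    void thread1Transformed() {
        y = 1;             // write hoisted above the read; no real dependency remains
        int r1 = x;
        int r2 = 1;
    }
}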
The progress we have made in this area ? the deep thinking about what sort of behaviors are desirable in a multithreaded program ? is one of the most important contributions of this dissertation. 5 1.1 Why Solve This Problem? In the past, multithreaded languages have not defined a full semantics for mul- tithreaded code. Ada, for example, simply defines unsynchronized code as ?er- roneous? [Ada95]. The reasoning behind this is that since such code is incorrect (on some level), no guarantees should be made for that code. What it means for code to be correctly synchronized should be fully defined; after that, nothing. This is the same strategy that some languages take with array bounds overflow ? unpredictable results may occur, and it is the programmer?s responsibility to avoid these scenarios. The problem with this strategy is one of security and safety. In an ideal world, all programmers would write correct code all of the time. However, this rarely happens. Programs frequently contain errors; not only does this cause code to misbehave, but it also allows attackers an easy way into a program. Buffer overflows, in particular, are frequently used to compromise a program?s security. Program semantics must be carefully defined: otherwise, it becomes harder to track down errors, and easier for attackers to take advantage of those errors. If programmers don?t know what their code is doing, programmers won?t be able to know what their code is doing wrong. 1.2 Approach The approach taken in this dissertation was done in two overlapping stages. First, we gathered requirements from an experienced community of professional soft- ware engineers, hardware architects and compiler / virtual machine designers. In the coming chapters, you may notice references to guarantees and optimizations that are ?reasonable? or ?unreasonable?; these notions came from long discus- 6 sions with this group. The requirements that were gathered are largely described in Chapters 2 and 3. When the requirements were gathered, we designed a model that reflected those requirements (Chapters 4 and 5), and verified that the final specification met the requirements (Chapter 6). The requirements largely centered around two goals: ? We needed to define, in a careful way, how programmers could ensure that actions taken by one thread would be seen in a reasonable way by other threads. ? We needed to define, in a careful way, what could happen if programmers did not take care to ensure that threads were communicating in the pro- scribed manners. The first bullet (arguably the more simple of the two goals) required a careful and complete definition of what constitutes synchronization, which allows threads to communicate. Locking, described in Chapter 2, is the most obvious way for two threads to communicate. However, other mechanisms are necessary. For example, it is often the case that we want threads to communicate without the overhead of mutual exclusion. To address this, we introduced the notion of a volatile field (Chapter 3) : this supplants the more traditional memory barrier as a basic memory coherence operation. Finally, we wanted to present a way that a programmer could ensure object immutability regardless of data races: the resulting semantics for final fields are outlined in Chapter 7. The second bullet is somewhat more complex. The fundamental question is described above: what kind of causal loops are acceptable, and what kind must be disallowed? This question frames the discussion of causality (Chapter 4). 
7 1.3 Development A final note: this work has been done in the context of the Java programming language [GJS96]; Java is used as a running example. One of the goals of the designers of the Java programming language was that multithreaded programs written in Java would have consistent and well-defined behavior. This would allow Java programmers to understand how their programs might behave; it would also allow Java platform architects to develop their platforms in a flexible and efficient way, while still ensuring that Java programs ran on them correctly. Unfortunately, Java?s original memory model was not defined in a way that allowed programmers and architects to understand the requirements for a Java system. For example, it disallowed the program transformation seen in Figure 1.1. For more details on this, see Chapter 8. The work described in this dissertation has been adapted to become part of Java Specification Request (JSR) 133 [Jav04], a new memory model for Java, which is included in the 5.0 release of the Java programming language. 8 Chapter 2 Building Blocks Time is an illusion. Lunchtime doubly so. ? Douglas Adams, The Hitchhiker?s Guide to the Galaxy In order to reason about cause and effect in a programming language, we have to do two things: 1. Provide a clear model for concurrent programmers to write correct code. 2. Provide a clear model for platform architects to optimize code. There has been a lot of work in defining, conceptually, how threads interact through memory. In order to develop a memory model that fulfills out require- ments, it is necessary to review some concepts and vocabulary first. This chapter provides that review. 2.1 Code is Reordered The most commonly assumed memory model is sequential consistency [Lam79]. Sequential consistency specifies that memory actions will appear to execute one at a time in a single total order; actions of a given thread must appear in this total order in the same order in which they appear in the program prior to any 9 optimizations (the program order). This model basically reflects an interleaving of the actions in each thread; under sequential consistency, a read must return the value written most recently to the location being read in the total order. While this is an intuitive extension of the single threaded model, sequen- tial consistency restricts the use of many system optimizations. In general, a sequentially consistent system will not have much freedom to reorder memory statements within a thread, even if there are no conventional data or control dependences between the two statements. Many of the important optimizations that can be performed on a program involve reordering program statements. For example, superscalar architectures frequently reorder instructions to ensure that the execution units are all in use as much as possible. Even optimizations as ubiquitous as common subexpression elimination and redundant read elimination can be seen as reorderings: each evaluation of the common expression is conceptually ?moved? to the point at which it is evaluated for the first time. Many different aspects of a system may affect reorderings. For example, the just-in-time compiler and the processor may perform reorderings. The memory hierarchy of the architecture on which a virtual machine is run may make it appear as if code is being reordered. Source code to bytecode transformation can also reorder and transform programs. 
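As a small illustration of why the redundant read elimination mentioned above is a reordering, consider the following sketch. The class and field names are illustrative, not from the dissertation.

// Redundant read elimination viewed as a reordering: the second read of p.x
// is replaced by a reuse of the first, which is equivalent to moving the
// second read back to the point where the first read was performed.
class ReadEliminationSketch {
    static class Point { int x; }
    Point p = new Point();
    int other;

    int before() {
        int r1 = p.x;
        int r2 = other;
        int r3 = p.x;      // redundant: no intervening write to p.x in this thread
        return r1 + r2 + r3;
    }

    int after() {
        int r1 = p.x;
        int r2 = other;
        int r3 = r1;       // the "read" now effectively happens at the first read
        return r1 + r2 + r3;
    }
}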
For the purposes of simplicity, we shall simply refer to anything that can reorder code as being a compiler. In a single threaded program, a compiler can (and, indeed, must) be careful that these program transformations not interfere with the possible results of the program. We refer to this as a compiler?s maintaining of the intra-thread semantics of the program ? a thread in isolation has to behave as if no code 10 Initially, x == y == 0 Thread 1 Thread 2 1: r2 = x; 3: r1 = y 2: y = 1; 4: x = 2 May return r2 == 2, r1 == 1 Figure 2.1: Behaves Surprisingly Initially, x == y == 0 Thread 1 Thread 2 2: y = 1; 4: x = 2 1: r2 = x; 3: r1 = y May return r2 == 2, r1 == 1 Figure 2.2: Figure 2.1, after reordering transformations occurred at all. However, it is much more difficult to maintain a sequentially consistent seman- tics while optimizing multithreaded code. Consider Figure 2.1. It may appear that the result r2 == 2, r1 == 1 is impossible. Intuitively, if r2 is 2, then in- struction 4 came before instruction 1. Further, if r1 is 1, then instruction 2 came before instruction 3. So, if r2 == 2 and r1 == 1, then instruction 4 came before instruction 1, which comes before instruction 2, which came before instruction 3, which comes before instruction 4. This is a cyclic execution, which is, on the face of it, absurd. On the other hand, we must consider the fact that a compiler can reorder the instructions in each thread (as seen in Figure 2.2. If instruction 3 does not come before instruction 4, and instruction 1 does not come before instruction 2, then the result r2 == 2 and r1 == 1 is perfectly reasonable. 11 2.2 Synchronization and Happens-Before To some programmers, the behavior demonstrated in Figure 2.1 may seem ?bro- ken?. However, multithreaded programming languages provide built-in mecha- nisms to provide constraints when two threads interact: such mechanisms are referred to as synchronization. The code in Figure 2.1 does not use these mech- anisms. The Java programming language provides multiple mechanisms for synchro- nizing threads. The most common of these methods is locking, which is imple- mented using monitors. Each object in Java is associated with a monitor, which a thread can lock or unlock. Only one thread at a time may hold a lock on a monitor. Any other threads attempting to lock that monitor are blocked until they can obtain a lock on that monitor. A thread may lock a particular monitor multiple times; each unlock reverses the effect of one lock operation. There is a total order over all synchronization operations, called the synchro- nization order. We say that an unlock action on monitor m synchronizes-with all lock actions on m that come after it (or are subsequent to it) in the synchroniza- tion order. We say that one action happens-before another [Lam78] in three cases: ? if the first action comes before the second in program order, or ? if the first action synchronizes-with the second, or ? if you can reach the second by following happens-before edges from the first (in other words, happens-before is transitive). Note that all of this means that happens-before is a partial order: it is re- flexive, transitive and anti-symmetric. There are more happens-before edges, as 12 described in Chapter 5 ? we will just focus on these for now. We now have the key to understanding why the behavior in Figure 2.1 is legal. Here is what we observe: ? there is a write in one thread, ? there is a read of the same variable by another thread and ? 
the write and read are not ordered by happens-before. In general, when there are two accesses (reads of or writes to) the same shared field or array element, and at least one of the accesses is a write, we say that the accesses conflict (regardless of whether there is a data race on those accesses) [SS88]. When two conflicting accesses are not ordered by happens-before, they are said to be in a data race. When code contains a data race, counterintuitive results are often possible. We use the discussion of data races to define what it means for a program to be correctly synchronized. A program is correctly synchronized if and only if all sequentially consistent executions are free of data races. The code in Figure 2.1, for example, is incorrectly synchronized. When a program is not correctly synchronized, errors tend to crop up in three intertwined ways. Atomicity deals with which actions and sets of actions have indivisible effects. This is the aspect of concurrency most familiar to program- mers: it is usually thought of in terms of mutual exclusion. Visibility determines when the effects of one thread can be seen by another. Ordering determines when actions in one thread can be seen to occur out of order with respect to another. In the rest of this chapter, we discuss how those problems can arise. 13 2.3 Atomicity If an action is (or a set of actions are) atomic, its result must be seen to happen ?all at once?, or indivisibly. Atomicity is the traditional bugbear of concurrent programming. Enforcing it means using locking to enforce mutual exclusion. The Java programming language has some minor issues related to the atom- icity of individual actions. Specifically, writes of 64-bit scalar values (longs and doubles) are allowed to take place in two separate atomic 32 bit ?chunks?. If a thread is scheduled between the separate writes, it may see the first half of the 64-bit value, but not the second. In other circumstances, we assume that all individual read and write actions take place atomically. This is not an aspect of the memory model that we shall address in any greater depth. Atomicity can also be enforced on a sequence of actions. A program can be free from data races without having this form of atomicity. However, enforcing this kind of atomicity is frequently as important to program correctness as enforcing freedom from data races. Consider the Java code in Figure 2.3. Since all accesses to the shared variable balance are guarded by synchronization, the code is free of data races. Now assume that one thread calls deposit(5) and another calls withdraw(5); there is an initial balance of 10. Ideally, at the end of these two calls, there would still be a balance of 10. However, consider what would happen if: ? The deposit() method sees a value of 10 for the balance, then ? The withdraw() method sees a value of 10 for the balance and withdraws 5, leaving a balance of 5, and finally ? The deposit() method uses the balance it originally saw (10) to calculate 14 class BrokenBankAccount { private int balance; synchronized int getBalance() { return balance; } synchronized void setBalance(int x) throws IllegalStateException { balance = x; if (balance < 0) { throw new IllegalStateException("Negative Balance"); } } void deposit(int x) { int b = getBalance(); setBalance(b + x); } void withdraw(int x) { int b = getBalance(); setBalance(b - x); } } Figure 2.3: Atomicity example 15 a new balance of 15. As a result of this lack of atomicity, the balance is 15 instead of 10. 
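As the following paragraphs discuss, one way to repair the example is to hold the lock across the whole read-modify-write. A minimal sketch (this rewritten class is illustrative, not part of the dissertation's figures):

// The Figure 2.3 account with the compound operations made atomic. Because
// deposit() and withdraw() are now synchronized, the monitor is held across
// the read of the balance and the subsequent write, so the interleaving
// described above is no longer possible. Java monitors are reentrant, so the
// nested calls to the synchronized accessors are fine.
class FixedBankAccount {
    private int balance;

    synchronized int getBalance() {
        return balance;
    }

    synchronized void setBalance(int x) throws IllegalStateException {
        balance = x;
        if (balance < 0) {
            throw new IllegalStateException("Negative Balance");
        }
    }

    synchronized void deposit(int x) {
        int b = getBalance();
        setBalance(b + x);
    }

    synchronized void withdraw(int x) {
        int b = getBalance();
        setBalance(b - x);
    }
}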
This effect is often referred to as a lost update because the withdrawal is lost. A programmer writing multi-threaded code must use synchronization carefully to avoid this sort of error. In the Java programming language, if the deposit() and withdraw() methods are declared synchronized, it will ensure that locks are held for their duration: the actions of those methods will be seen to take place atomically. Note that atomicity guarantees from synchronization are relative. By synchro- nizing on an object, you can guarantee that a sequence of actions is perceived to happen atomically by other threads that synchronize on the same object. But threads that do not synchronize on that object may not see the actions occur atomically. Atomicity is the most well-studied problem when using synchronization. It is a common mistake to think that it is the only problem; it is not. In the rest of this chapter, we look at two other important issues that may crop up when writing multithreaded code. 2.4 Visibility If an action in one thread is visible to another thread, then the result of that action can be observed by the second thread. In order to guarantee that the results of one action are observable to a second action, then the first must happen before the second. Consider the code in Figure 2.4. Now imagine that two threads are created; one thread calls work, and at some point, the other thread calls stopWork on the same object. Because there is no happens-before relationship between the two 16 class LoopMayNeverEnd { boolean done = false; void work() { while (!done) { // do work } } void stopWork() { done = true; } } Figure 2.4: Visibility example 17 threads, the thread in the loop may never see the update to done performed by the other thread. In practice, this may happen if the compiler detects that no writes are performed to done in the first thread; the compiler may hoist the read of done out of the loop, transforming it into an infinite loop. To ensure that this does not happen, there must be a happens-before rela- tionship between the two threads. In LoopMayNeverEnd, this can be achieved by declaring done to be volatile. Conceptually, all actions on volatiles happen in a single total order, and each write to a volatile field happens-before any read of that volatile that occurs later in that order. There is a side issue here; some architectures and virtual machines may exe- cute this program without providing a guarantee that the thread that executes work will ever give up the CPU and allow other threads to execute. This would prevent the loop from ever terminating because of scheduling guarantees, not because of a lack of visibility guarantees. For more discussion of this issue, see Section 3.6. 2.5 Ordering Ordering constraints govern the order in which multiple actions are seen to have happened. The ability to perceive ordering constraints among actions is only guaranteed to actions that share a happens-before relationship with them. We have already seen a violation of ordering in Figure 2.1; here is another. The code in Figure 2.5 shows an example of where the lack of ordering con- straints can produce surprising results. Consider what happens if threadOne() gets invoked in one thread and threadTwo() gets invoked on the same object in another. Would it be possible for threadTwo() to return the value true? 
18 class BadlyOrdered { boolean a = false; boolean b = false; void threadOne() { a = true; b = true; } boolean threadTwo() { boolean r1 = b; // sees true boolean r2 = a; // sees false boolean r3 = a; // sees true return (r1 && !r2) && r3; // returns true } } Figure 2.5: Ordering example 19 If threadTwo() returns true, it means that the thread saw both updates by threadOne, but that it saw the change to b before the change to a. The Java memory model allows this result, illustrating a violation of the or- dering that a user might have expected. This code fragment is not correctly synchronized (the conflicting accesses are not ordered by a happens-before order- ing). If ordering is not guaranteed, then the assignments to a and b in threadOne() can be seen to be performed out of order. Compilers have substantial freedom to reorder code in the absence of synchronization. For example, the compiler is allowed to freely reorder the writes in threadOne or the reads in threadTwo. To avoid this behavior, programmers must ensure that their code is correctly synchronized. 2.6 Discussion The discussion in this chapter centers around some of the basic issues of memory semantics. We have given some form to the problem: we have outlined the basic ideas behind memory models, synchronization and happens-before, and we have seen how misinterpretation of how multithreading works can lead to problems with atomicity, visibility and ordering. However, this skirts the major issue: how can we define multithreaded semantics in a way that allows programmers to avoid these pitfalls, but still allows compiler designers and computer architects to optimize run time? To answer this question, we still need to gather more requirements. What kinds of compiler optimizations do we need? What guarantees are sufficient for correctly synchronized code? What guarantees are sufficient for incorrectly 20 synchronized code? In the chapters ahead, we address these questions. 21 Chapter 3 In Which Some Motivations Are Given The part can never be well unless the whole is well. - Plato, Charmides The last chapter outlined some informal properties that all concurrent pro- grams have, and some ways in which those properties could be used incorrectly, leading to broken code. This chapter outlines some of the informal requirements that the memory model has ? in other words, ways in which these broken pro- grams can be tamed. Many compiler writers and programmers contributed to the discussions that led to the conclusions in this chapter. These conclusions were only reached after a great deal of thinking, staring at white boards, and spirited debate. A careful balance had to be maintained. On one hand, it was necessary for the model to allow programmers to be able to reason carefully and correctly about their multithreaded code. On the other, it was necessary for the model to allow com- piler writers, virtual machine designers and hardware architects to optimize code ruthlessly, which makes predicting the results of a multithreaded program less straightforward. At the end of this process, a consensus emerged as to what the informal re- 22 quirements for a programming language memory model are. As the requirements emerged, a memory model took shape. Most of these requirements have rela- tively simple to understand motivations ? obvious guarantees that need to be made, optimizations that need to be allowed, and so on. These are documented in this chapter. The other requirements ? those relating to causality ? are more complicated. 
The causality requirements, together with the memory model, are described in full in Chapter 4. 3.1 Guarantees for Correctly Synchronized Programs It is very difficult for programmers to reason about the kinds of transformations that compilers perform. One of the goals of the Java memory model is to provide programmers a mechanism that allows them to avoid having to do this sort of reasoning. For example, in the code in Figure 3.2, the programmer can only see the result of the reordering because the code is improperly synchronized. Our first goal is to ensure that this is the only reason that a programmer can see the result of a reordering. Prior work [Adv93, Gha96, AH90] has shown that one of the best guaran- tees that can be provided is that reorderings should only be visible when code is incorrectly synchronized. This is a strong guarantee for programmers, whose as- sumptions about how multithreaded code behaves almost always include sequen- tial consistency (as defined in Section 2.1). Our first guarantee for programmers (which we call DRF), therefore, applies to correctly synchronized programs (as defined in Section 2.2): DRF Correctly synchronized programs have sequentially consistent semantics. 23 Initially, x == y == 0 Thread 1 Thread 2 r1 = x; r2 = y; if (r1 != 0) if (r2 != 0) y = 42; x = 42; Correctly synchronized, so r1 == r2 == 0 is the only legal behavior Figure 3.1: Surprising Correctly Synchronized Program Given this requirement, programmers need only worry about code transfor- mations having an impact on their programs? results if those program contain data races. This requirement leads to some interesting corner cases. For example, the code shown in Figure 3.1 ( [Adv93]) is correctly synchronized. This may seem surprising, since it doesn?t perform any synchronization actions. Remember that a program is correctly synchronized if, when it is executed in a sequentially con- sistent manner, there are no data races. If this code is executed in a sequentially consistent way, each action will occur in program order, and neither of the writes to x or y will occur. Since no writes occur, there can be no data races: the program is correctly synchronized. This disallows some subtle program transformations. For example, an aggres- sive write speculation could predict that the write to y of 42 was going to occur, allowing Thread 2?s read of y to see that write. This would cause the write of 42 to x to occur, allowing the read of x to see 42; this would cause the write to y to occur, justifying the speculation! This sort of transformation is not legal; as a correctly synchronized program, only sequentially consistent results should be allowed. 24 Initially, x == y == 0 Thread 1 Thread 2 1: r2 = x; 3: r1 = y 2: y = 1; 4: x = 2 May return r2 == 2, r1 == 1 Figure 3.2: Behaves Surprisingly 3.2 Simple Reordering In earlier chapters, we outlined how important it is for compilers to reorder program actions ? it is the engine that drives most optimizations. Our first requirement, therefore, is that we always allow statements that are not control or data dependent on each other in a program to be reordered: Reorder1 Independent adjacent statements can be reordered. Note that Reorder1 actually allows the reordering of statements that were not adjacent in the original program: it is simply necessary to perform repeated reorderings until the statement appears in the desired location. 
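A small sketch of that observation, with illustrative variable names: each step swaps two adjacent, independent statements, and the net effect is to move one statement past several others.

// Moving "int r3 = c;" to the top by repeated adjacent swaps. None of the
// three statements reads or writes a variable used by the others, so each
// individual swap is covered by Reorder1.
class RepeatedSwapSketch {
    int a, b, c;

    void original()   { int r1 = a; int r2 = b; int r3 = c; }
    void afterSwap1() { int r1 = a; int r3 = c; int r2 = b; }  // swap the last two
    void afterSwap2() { int r3 = c; int r1 = a; int r2 = b; }  // swap the first two
}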
In a multithreaded context, doing this may lead to counter-intuitive results, like the one in Figure 2.1 (reprinted here as Figure 3.2). Remember that the key notion behind this figure is that the actions in each thread are reordered; once this happens, even a machine that executes code in a sequentially consistent way can result in the behavior shown. However, it should be noted again that that code is improperly synchronized: there is no ordering of the accesses by synchronization. When synchronization is missing, weird and bizarre results are allowed.

The Reorder1 guarantee ensures that independent actions can be reordered regardless of the order in which they appear in the program. It does not guarantee that two independent actions can always be reordered. For example, actions cannot generally be reordered out of locking regions.

The reordering in Figure 3.2 does not interfere too heavily with our notion of cause and effect. Specifically, there is no reason to think that the first action in either of these threads has an effect on the second: causality is still served.

3.3 Transformations that Involve Dependencies

In Section 3.2, we gave Reorder1, which is a guarantee that independent actions can be reordered. Reorder1 is a strong guarantee, but not quite strong enough. Sometimes, compilers can perform transformations that have the effect of removing dependencies.

Before compiler transformation:
Initially, a = 0, b = 1
Thread 1:              Thread 2:
1: r1 = a;             5: r3 = b;
2: r2 = a;             6: a = r3;
3: if (r1 == r2)
4:   b = 2;
Is r1 == r2 == r3 == 2 possible?

After compiler transformation:
Initially, a = 0, b = 1
Thread 1:              Thread 2:
4: b = 2;              5: r3 = b;
1: r1 = a;             6: a = r3;
2: r2 = r1;
3: if (true) ;
r1 == r2 == r3 == 2 is sequentially consistent

Figure 3.3: Effects of Redundant Read Elimination

For example, the behavior shown in Figure 3.3 is allowed. This behavior may seem cyclic, as the write to b is control dependent on the reads of a, which see a write to a that is data dependent on a read of b which must see the aforementioned write to b. However, the compiler should be allowed to

- eliminate the redundant read of a, replacing r2 = a with r2 = r1, then
- determine that the expression r1 == r2 is always true, eliminating the conditional branch 3, and finally
- move the write 4: b = 2 early.

After the compiler does the redundant read elimination, the assignment 4: b = 2 is guaranteed to happen; the second read of a will always return the same value as the first. Without this information, the assignment seems to cause itself to happen. With this information, there is no dependency between the reads and the write. Thus, dependence-breaking optimizations can also lead to apparent cyclic executions. Note that intra-thread semantics guarantee that if r1 != r2, then Thread 1 will not write to b and r3 == 1. In that case, either r1 == 0 and r2 == 1, or r1 == 1 and r2 == 0.

Figure 3.4 shows another surprising behavior. In order to see the result r1 == r2 == r3 == 1, it would seem as if Thread 1 would need to write 1 to y before reading x. It also seems as if Thread 1 can't know what value r2 will be until after x is read. In fact, a compiler could perform an inter-thread analysis that shows that only the values 0 and 1 will be written to x.
Knowing that, the compiler can determine that the operation with the bitwise or will always return 1, resulting 27 Initially, x == y == 0 Thread 1 Thread 2 r1 = x; r3 = y; r2 = r1 | 1; x = r3; y = r2; r1 == r2 == r3 == 1 is legal behavior Figure 3.4: An Example of a More Sophisticated Analysis Initially, x == y == 0 Thread 1 Thread 2 r1 = x; r3 = y; if (r1 == 1) if (r2 == 1) y = 1; x = 1; if (r2 == 0) x = 1; r1 == r2 == 1 is legal behavior Figure 3.5: Sometimes Dependencies are not Obvious in Thread 1?s always writing 1 to y. Thread 1 may, therefore, write 1 to y before reading x. The write to y is not dependent on the values seen for x. This kind of analysis of the program reveals that there is no real dependency in Thread 1. A similar example of an apparent dependency can be seen in the code in Figure 3.5. As it does for Figure 3.4, a compiler can determine that only the values 0 and 1 are ever written to x. As a result, the compiler can remove the dependency in Thread 2 and move the write to x to the start of that thread. If the resulting code were executed in a sequentially consistent way, it would result in the circular behavior described. 28 Initially, x = 0 Thread 1 Thread 2 r1 = x; r2 = x; x = 1; x = 2; r1 == 2 and r2 == 1 is a legal behavior Figure 3.6: An Unexpected Reordering It is clear, then, that compilers can perform many optimizations that remove dependencies. So we make another guarantee: Reorder2 If a compiler can detect that an action will always happen (with the same value written to the same variable), it can be reordered regardless of apparent dependencies. Like Reorder1, this guarantee does not allow an implementation to reorder actions around synchronization actions arbitrarily. As was mentioned before, actions can usually not be moved outside locking regions. Another example of this will be shown later (Figure 3.11). Even though Reorder1 and Reorder2 are strong guarantees for compilers, they are not a complete set of allowed reorderings. They are simply a set that is always guaranteed to be allowed. More transformations are possible; it is simply necessary to ensure that the results are allowable by the memory model. 3.3.1 Reordering Not Visible to Current Thread Figure 3.6 contains a small but interesting example. The behavior r1 == 2 and r2 == 1 is a legal behavior, although it may be difficult to see how it could occur. A compiler would not reorder the statements in each thread; this code 29 Initially, x == 0, ready == false. ready is a volatile variable. Thread 1 Thread 2 x = 1; if (ready) ready = true r1 = x; If r1 = x; executes, it will read 1. Figure 3.7: Simple Use of Volatile Variables must never result in r1 == 1 or r2 == 2. However, the behavior r1 == 2 and r2 == 1 might be allowed by an optimizer that performs the writes early, but does so without allowing them to be visible to local reads that came before them in program order. This behavior, while surprising, is allowed by several processor memory architectures, and therefore is one that should be allowed by a programming language memory model. 3.4 Synchronization We haven?t really discussed how programmers can use explicit synchronization (in whatever form we give it) to make sure their code is correctly synchronized. In general, we use synchronization to enforce the happens-before relationships that we briefly discussed in Chapter 2. The typical way of doing this is by using locking. Another way is to use volatile variables. 
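Before turning to volatiles, here is a minimal sketch of the locking idiom; the class and names are illustrative, not from the dissertation. The unlock at the end of set() synchronizes-with a later lock in get(), so a thread that observes ready == true is also guaranteed to observe x == 1.

// Locking used to establish happens-before edges between threads. If get()
// observes ready == true, the unlock in set() happened-before the lock in
// get(), so the earlier write x = 1 is visible as well.
class LockedPublication {
    private int x;
    private boolean ready;

    synchronized void set() {    // called by one thread
        x = 1;
        ready = true;
    }                            // unlock: a release

    synchronized int get() {     // called by another thread; lock: an acquire
        return ready ? x : -1;   // if ready is true here, x must be 1
    }
}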
The properties of volatile variables arose from the need to provide a way to communicate between threads without the overhead of ensuring mutual exclusion. A very simple example of their use can be seen in Figure 3.7. If ready were not volatile, the write to it in Thread 1 could be reordered with the write to x. This might result in r1 containing the value 0. We define volatiles so that this reordering cannot take place; if Thread 2 reads true for ready, it must also read 1 for x. Communicating between threads in this way is clearly useful for non-blocking algorithms (such as, for example, wait-free queues [MS96]). Volatiles are discussed in more detail in Section 3.5.2.

Locks and unlocks work in a way similar to volatiles: actions that take place before an unlock must also take place before any subsequent locks on that monitor. The resulting property reflects the way synchronization is used to communicate between threads, the happens-before property:

HB Volatile writes are ordered before subsequent volatile reads of the same variable. Unlocks are ordered before subsequent locks of the same monitor.

The word subsequent is defined as it was in Chapter 2 (reiterated here for your convenience). Synchronization actions include locks, unlocks, and reads of and writes to volatile variables. There is a total order over all synchronization actions in an execution of a program; this is called the synchronization order. An action y is subsequent to another action x if x comes before y in the synchronization order.

The happens-before relationship between two actions described in Chapter 2 is what enforces an ordering between those actions. For example, if one action occurs before another in the program order for a single thread, then the first action happens-before the second. The program has to be executed in a way that does not make it appear to the second that it occurred out of order with respect to the first.

This notion may seem at odds with the results shown in many of our examples (for example, Figure 3.2). However, a "reordering" is only visible in these examples if we are reasoning about both threads. The two threads do not share a happens-before relationship, so we do not reason about both. The individual threads can only be examined in isolation. When this happens, no reordering is visible; the only mystery is where the values seen by the reads are written. The basic principle at work here is that threads in isolation will appear to behave as if they are executing in program order; however, the memory model will tell you what values can be seen by a particular read.

Synchronization actions create happens-before relationships between threads. We call such relationships synchronizes-with relationships. In addition to the happens-before relationship between actions in a single thread, we also have (in accordance with HB):

- An unlock on a particular monitor happens-before a lock on that monitor that comes after it in the synchronization order.
- A write to a volatile variable happens-before a read of that volatile variable that comes after it in the synchronization order.
- A call to start a thread happens-before the actual start of that thread.
- The termination of a thread happens-before a join performed on that thread.
- Happens-before is transitive. That is, if a happens-before b, and b happens-before c, then a happens-before c.
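The thread start and join rules in the list above can be illustrated with a small sketch; the class is illustrative, not from the dissertation.

// Sketch of the start()/join() happens-before rules. The parent's write to
// data happens-before everything the child does (via start()), and everything
// the child does happens-before the parent's actions after join() returns.
class StartJoinSketch {
    int data;     // plain, non-volatile fields
    int result;

    void parent() throws InterruptedException {
        data = 42;                          // (1)
        Thread child = new Thread() {
            public void run() {
                result = data + 1;          // (2) guaranteed to see data == 42
            }
        };
        child.start();                      // (1) happens-before (2)
        child.join();                       // (2) happens-before (3)
        int r = result;                     // (3) guaranteed to see result == 43
    }
}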
3.5 Additional Synchronization Issues 3.5.1 Optimizations Based on Happens-Before Notice that lock and unlock actions only have happens-before relationships with other lock and unlock actions on the same monitor. Similarly, accesses to 32 a volatile variable only create happens-before relationships with accesses to the same volatile variable. A happens-before relationship can be thought of as an ordering edge with two points; we call the start point a release, and the end point an acquire. Unlocks and volatile writes are release actions, and locks and volatile reads are acquire actions. Synchronization Removal There have been many optimizations proposed ([Ruf00, CGS+99]) that have tried to remove excess, or ?redundant? synchronization. One of the requirements of the Java memory model was that redundant synchronization (such as locks that are only accessed in a single thread) could be removed. One possible memory model would require that all synchronization actions have happens-before relationships with all other synchronization actions. If we forced all synchronization actions to have happens-before relationships with each other, none of them could ever be described as redundant ? they would all have to interact with the synchronization actions in other threads, regardless of what variable or monitor they accessed. Java does not support this; it does not simplify the programming model sufficiently to warrant the additional synchronization costs. This is therefore another of our guarantees: RS Synchronization actions that only introduce redundant happens-before edges can be treated as if they don?t introduce any happens-before edges. This is reflected in the definition of happens-before. For example, a lock that is only accessed in one thread will only introduce happens-before relationships 33 Before Coarsening After Coarsening x = 1; synchronized(someLock) { // lock contents } y = 2; synchronized(someLock) { x = 1; // lock contents y = 2; } Figure 3.8: Example of Lock Coarsening that are already captured by the program order edges. This lock is redundant, and can therefore be removed. Lock Coarsening One transformation that is frequently effective in increasing concurrency is com- putation lock coarsening [DR98]. If a computation frequently acquires and re- leases the same lock, then computation lock coarsening can coalesce those multi- ple locking regions into a single region. This requires the ability to move accesses that occur outside a locking region inside of it, as seen in Figure 3.8. An acquire ensures an ordering with a previous release. Consider an action that takes place before an acquire. It may or may not have been visible to actions that took place before the previous release, depending on how the threads are scheduled. If we move the access to after the acquire, we are simply saying that the access is definitely scheduled after the previous release. This is therefore a legal transformation. This is reflected in Figure 3.8. The write to x could have been scheduled before or after the last unlock of someLock. By moving it inside the synchronized block, the compiler is merely ensuring that it was scheduled after that unlock. 34 Initially, v1 == v2 == 0 Thread 1 Thread 2 Thread 3 Thread 4 v1 = 1; v2 = 2; r1 = v1; r3 = v2; r2 = v2; r4 = v1; Is r1 == 1, r3 == 2, r2 == r4 == 0 legal behavior? Figure 3.9: Volatiles Must Occur In A Total Order Similarly, the release ensures an ordering with a subsequent acquire. Consider an action that takes place after a release. 
It may or may not be visible to particular actions after the subsequent acquire, depending on how the threads are scheduled. If we move the access to before the release, we are simply saying that the access is definitely scheduled before the next acquire. This is therefore also a legal transformation. This can also be seen in Figure 3.8, where the write to y is moved up inside the synchronized block.

All of this is simply a roundabout way of saying that accesses to normal variables can be reordered with a following volatile read or monitor enter, or a preceding volatile write or monitor exit. This implies that normal accesses can be moved inside locking regions, but not out of them; for this reason, we sometimes call this property roach motel semantics.

It is relatively easy for compilers to ensure this property; indeed, most do already. Processors, which also reorder instructions, often need to be given memory barrier instructions to execute at these points in the code to ensure that they do not perform the reordering. Processors often provide a wide variety of these barrier instructions; which of these is required, and on what platform, is discussed in greater detail in Chapter 8 and in [Lea04].

Figure 3.9 gives us another interesting glimpse into the guarantees we provide to programmers. The reads of v1 and v2 should be seen in the same order by both Thread 3 and Thread 4; if they are not, the behavior r1 == 1, r3 == 2, r2 == r4 == 0 can be observed. (Specifically, Thread 3 sees the write to v1, but not the write to v2; Thread 4 sees the write to v2, but not the write to v1.) The memory model prohibits this behavior: it does not allow writes to volatiles to be seen in different orders by different threads. In fact, it makes a much stronger guarantee:

VolatileAtomicity All accesses to volatile variables are performed in a total order.

This is clear cut, implementable, and has the unique property that the original Java memory model not only came down on the same side, but was also clear on the subject.

3.5.2 Additional Guarantees for Volatiles

Another issue that arises with volatiles has come to be known as strong versus weak volatility. There are two possible interpretations of volatile, according to the happens-before order:

• Strong interpretation There is a happens-before relationship from each write to each subsequent read of that volatile.
• Weak interpretation There is a happens-before relationship from each write to each subsequent read of that volatile that sees that write. This interpretation reflects the ordering constraints on synchronization variables in the memory model referred to as weak ordering [DSB86, AH90].

Initially, x == y == v == 0, v is volatile.
Thread 1: r1 = x; v = 0; r2 = v; y = 1;
Thread 2: r3 = y; v = 0; r4 = v; x = 1;
Is the behavior r1 == r3 == 1 possible?
Figure 3.10: Strong or Weak Volatiles?

In Figure 3.10, under the weak interpretation, the read of v in each thread might see its own volatile write. If this were the case, then the happens-before edges would be redundant, and could be removed. The resulting code could behave much like the simple reordering example in Figure 3.2.

To avoid confusion when multiple writer threads are communicating to reader threads via a single volatile variable, Java supports the strong interpretation.

StrongVolatile There must be a happens-before relationship from each write to each subsequent read of that volatile.
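To make the distinction concrete, the code below transcribes Figure 3.10 into a Java class; the class and method names are invented, and only the fields and accesses come from the figure. Under the strong interpretation that Java adopts, any total synchronization order must place at least one thread's write to v before the other thread's read of v. The resulting happens-before edge then orders the first thread's read of x (or y) before the other thread's write to it, so that read must return 0, and r1 == r3 == 1 is impossible.

    class StrongVsWeak {
        int x = 0, y = 0;
        volatile int v = 0;

        void thread1() {
            int r1 = x;
            v = 0;          // volatile write (a release)
            int r2 = v;     // volatile read (an acquire)
            y = 1;
        }

        void thread2() {
            int r3 = y;
            v = 0;          // volatile write (a release)
            int r4 = v;     // volatile read (an acquire)
            x = 1;
        }
    }

Under the weak interpretation, each read of v could be paired only with the write of v in its own thread; the cross-thread happens-before edges would then be redundant and could be dropped, leaving the example free to behave like the reordering example in Figure 3.2.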
3.5.3 Optimizers Must Be Careful Optimizers have to consider volatile accesses as carefully as they consider locking. In Figure 3.11, we have a correctly synchronized program. When executed in a sequentially consistent way, Thread 2 will loop until Thread 1 writes to v or b. Since the only value available for the read of a to see is 0, r1 will have that value. As a result, the value 1 will be written to v, not b. There will therefore be a happens-before relationship between the read of a in Thread 1 and the write to a in Thread 2. 37 Initially, a == b == v == 0, v is volatile. Thread 1 Thread 2 r1 = a; do { if (r1 == 0) r2 = b; v = 1; r3 = v; else } while (r2 + r3 < 1); b = 1; a = 1; Correctly synchronized, so r1 == 1 is illegal Figure 3.11: Another Surprising Correctly Synchronized Program Even though we know that the write to a will always happen, we cannot reorder that write with the loop. If we did perform that reordering, Thread 1 would be able to see the value 1 for a, and perform the write to b. Thread 2 would see the write to b and terminate the loop. Since b is not a volatile variable, there would be no ordering between the read in Thread 1 and the write in Thread 2. There would therefore be data races on both a and b. The result of this would be a correctly synchronized program that does not behave in a sequentially consistent way. This violates DRF, so we do not allow it. The need to prevent this sort of reordering caused many difficulties in formulating a workable memory model. Compiler writers need to be very careful when reordering code past all syn- chronization points, not just those involving locking and unlocking. 3.6 Infinite Executions, Fairness and Observable Behavior The Java specification does not guarantee preemptive multithreading or any kind of fairness guarantee. There is no hard guarantee that any thread will surrender 38 Thread 1 Thread 2 while (true) synchronized (o) { synchronized (o) { done = true; if (done) break; } think } Figure 3.12: Lack of fairness allows Thread 1 to never surrender the CPU Initially, v is volatile and v = false Thread 1 Thread 2 while (!v); v = true; print("Thread 1 done"); print("Thread 2 done"); Figure 3.13: If we observe print message, Thread 1 must see write to v and terminate 39 the CPU and allow other threads to be scheduled. The lack of such a guarantee is partially due to the fact that any such guarantee would be complicated by issues such as thread priorities and real-time threads (in real-time Java implementa- tions). Most Java implementations will provide some sort of fairness guarantee, but the details are implementation specific and are treated as a quality of service issue, rather than a rigid requirement. To many, it may seem as if this is not a memory model issue. However, the issues are closely related. An example of their interrelation can be seen in Figure 3.12. Due to the lack of fairness, it is legal for the CPU running Thread 1 never to surrender the CPU to Thread 2; thus the program may never terminate. Since this behavior is legal, it is also legal for a compiler to hoist the synchronized block outside the while loop, which has the same effect. This is a legal compiler transformation, but an undesirable one. As mentioned in Section 3.5.1, the compiler is allowed to perform lock coarsening (e.g., if the compiler sees two successive calls to synchronized methods on the same object, it doesn?t have to give up the lock between calls). 
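The coarsening of two successive synchronized calls mentioned above can be sketched directly in source form. The Counter class and its method are invented for illustration; the rewritten version shows the code the compiler may behave as if it had been given. Holding the monitor across both calls is legal because Java monitors are reentrant, and, taken to its extreme, the same reasoning allows the synchronized block in Figure 3.12 to be hoisted out of the loop.

    class Counter {
        private int value;
        synchronized void increment() { value++; }
    }

    class Caller {
        // Before: two successive calls to synchronized methods on the same object;
        // the monitor is acquired and released twice.
        void beforeCoarsening(Counter c) {
            c.increment();
            c.increment();
        }

        // After coarsening: the compiler may behave as if the lock were held
        // across both calls. The nested acquisitions inside increment() are
        // reentrant acquisitions of a monitor the thread already holds.
        void afterCoarsening(Counter c) {
            synchronized (c) {
                c.increment();
                c.increment();
            }
        }
    }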
The exact tradeoffs here are subtle, and improved performance needs to be balanced by the longer delays threads will suffer in trying to acquire a lock. However, there should be some limitations on compiler transformations that reduce fairness. For example, in Figure 3.13, if we observe the print message from Thread 2, and no threads other than Threads 1 and 2 are running, then Thread 1 must see the write to v, print its message and terminate. This prevents the compiler from hoisting the volatile read of v out of the loop in Thread 1. This motivates another requirement: Observable The only reason that an action visible to the external world (e.g., a 40 Initially, x == y == 0 Thread 1 Thread 2 do { do { r1 = x; r2 = y; } while (r1 == 0); } while (r2 == 0); y = 42; x = 42; Correctly synchronized, so non-termination is the only legal behavior Figure 3.14: Correctly Synchronized Program file read / write, program termination) might not be externally observable is if there is an infinite sequence of actions that might happen before it or come before it in the synchronization order. 3.6.1 Control Dependence As a result of some of these requirements, the new Java memory model makes subtle but deep changes to the way in which implementors must reason about Java programs. For example, the standard definition of control dependence assumes that execution always proceeds to exit. This must not be casually assumed in multithreaded programs. Consider the program in Figure 3.14. Under the traditional definitions of control dependence, neither of the writes in either thread are control dependent on the loop guards. This might lead a compiler to decide that the writes could be moved before the loops. However, this would be illegal in Java. This program is correctly synchronized: in all sequentially consistent executions, neither thread writes to shared variables and there are no data races (this figure is very similar to Figure 3.1). A compiler must create a situation where the loops terminate. 41 The notion of control dependence that correctly encapsulates this is called weak control dependence [PC90] in the context of program verification. This property has also been restated as loop control dependence [BP96] in the context of program analysis and transformation. 42 Chapter 4 Causality ? Approaching a Java Memory Model It is an old maxim of mine that when you have excluded the im- possible, whatever remains, however improbable, must be the truth. - Sir Arthur Conan Doyle, The Adventure of the Beryl Coronet In Section 2.1, we described sequential consistency. It is too strict for use as the Java memory model, because it forbids standard compiler and processor optimizations. We must formulate a better memory model. In this chapter, we address this need by carefully examining the necessary guarantees for unsynchro- nized code. In other words, we address the question: what is acceptable behavior for a multithreaded program? Answering this question allows us to synthesize our requirements and formulate a workable memory model. 4.1 Sequential Consistency Memory Model Section 2.1 discusses the implications of sequential consistency. For convenience, it is presented again here, and formalized. In sequential consistency, all actions occur in a total order (the execution order). The actions in the execution order occur in the same order they do in the program (that is to say, they are consistent with program order). Furthermore, 43 each read r of a variable v sees the value written by the write w to v such that: ? 
w comes before r in the execution order, and

• there is no other write w′ such that w comes before w′ and w′ comes before r in the execution order.

4.2 Happens-Before Memory Model

We can describe a simple, interesting memory model using the HB guarantee introduced in Chapter 3 by abstracting a little from locks and unlocks. We call this model the happens-before memory model. Many of the requirements of our simple memory model are built out of the requirements in Chapters 2 and 3:

• There is a total order over all synchronization actions, called the synchronization order (Section 2.2).
• Synchronization actions induce synchronizes-with edges between matched actions. Together with program order, these two relationships form the happens-before order (Section 3.4).
• A volatile read sees the value written by the previous volatile write in the synchronization order (Section 3.4).

The only additional constraint on the Happens-Before Memory Model is that the value seen by a normal read is determined by happens-before consistency. Formally, we say that it is happens-before consistent for a read r of a variable v to see a write w to v if:

• w is ordered before r by happens-before and there is no intervening write w′ to v (i.e., w →hb r and there is no w′ such that w →hb w′ →hb r), or
• w and r are not ordered by happens-before (i.e., it is not the case that w →hb r or r →hb w).

Initially, x == y == 0
Thread 1: 1: r2 = x;  2: y = 1;
Thread 2: 3: r1 = y;  4: x = 2;
May return r2 == 2, r1 == 1
Figure 4.1: Behaves Surprisingly

Less formally, it is happens-before consistent for a read to see a write in an execution of a program in two cases. First, a read is happens-before consistent if the write happens-before the read and there is no intervening write to the same variable. So, if a write of 1 to x happens-before a write of 2 to x, and the write of 2 happens-before a read of x, then that read cannot see the value 1. Alternatively, it can be happens-before consistent for the read to see the write if the read is not ordered by happens-before with the write.

As an example, consider Figure 4.1 (the same as Figures 3.2 and 2.1). The code in this figure has two writes and two reads; the reads are not ordered by happens-before with the writes of the same variables. Therefore, it is happens-before consistent for the reads to see the writes.

If all of the reads in an execution see writes which it is happens-before consistent for them to see, then we say that execution is happens-before consistent. Note that happens-before consistency implies that every read must see a write that occurs somewhere in the program.

As far as first approximations go, the Happens-Before Memory Model is not a bad one. It provides some necessary guarantees. For example, the result of the code in Figure 4.1 is happens-before consistent: the reads are not prevented from seeing the writes by any happens-before relationship.

Initially, x == y == 0
Thread 1: r1 = x;  if (r1 != 0) y = 42;
Thread 2: r2 = y;  if (r2 != 0) x = 42;
Correctly synchronized, so r1 == r2 == 0 is the only legal behavior
Figure 4.2: Surprising Correctly Synchronized Program

Unfortunately, happens-before consistency is not a good memory model; it simply isn't sufficient. Figure 4.2 (the same as Figure 3.1) shows an example where happens-before consistency does not produce the desired results.
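The definition just given can be restated as a check on a single read/write pair. The sketch below assumes some representation of the actions of an execution and a precomputed happens-before relation; the names Action, maySee and hb are invented for this illustration and are not part of the model.

    import java.util.Set;
    import java.util.function.BiPredicate;

    // Sketch only: Action stands in for the actions of an execution, and hb is
    // assumed to encode its happens-before partial order.
    interface Action {}

    class HappensBeforeConsistency {
        static boolean maySee(Action r, Action w, BiPredicate<Action, Action> hb,
                              Set<Action> writesToSameVariable) {
            if (hb.test(r, w)) {
                return false;            // a read cannot see a write it happens-before
            }
            if (hb.test(w, r)) {
                for (Action w2 : writesToSameVariable) {
                    if (w2 != w && hb.test(w, w2) && hb.test(w2, r)) {
                        return false;    // an intervening write w2 hides w from r
                    }
                }
            }
            return true;                 // w and r unordered, or w visible with no intervening write
        }
    }

Both reads in Figure 4.1 pass this check, and so do the reads in Figure 4.2 once the writes of 42 are present; as the following discussion explains, that second case is exactly the problem.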
Remem- ber that this code is correctly synchronized: if it is executed under sequential consistency, neither the write to y nor the write to x can occur, because the reads of those variable will see the value 0. Therefore, there are no data races in this code. However, under the happens-before memory model, if both writes occur, it is happens-before consistent for the reads to see those writes. Even though this is a correctly synchronized program, under happens-before consistency, exe- cutions that are not sequentially consistent are legal. Therefore, happens-before consistent executions can violate our DRF guarantee. Nevertheless, happens-before consistency provides a good outer bound for our model; based on HB, all executions must be happens-before consistent. 46 Initially, x == y == 0 Thread 1 Thread 2 r1 = x; r2 = y; y = r1; x = r2; Incorrectly Synchronized: But r1 == r2 == 42 Still Cannot Happen Figure 4.3: An Out Of Thin Air Result 4.3 Causality Basic Notions So, the happens-before memory model provides necessary constraints on our final memory model, but it is not complete. We saw why this was when examining Figure 4.2, in which the writes are control dependent on the reads. A similar example can be seen in Figure 4.3. In this case, the writes will always happen, and the values written are data dependent on the reads. The happens-before memory model also allows the undesirable result in this case. Say that the value 42 was written to x in Thread 2. Then, under the happens-before memory model, it would be legal for the read of x in Thread 1 to see that value. The write to y in Thread 1 would then write the value 42. It would therefore be legal for the read of y in Thread 2 to see the value 42. This allows the value 42 to be written to x in Thread 2. The undesirable result justifies itself, using a circular sort of reasoning. This is no longer a correctly synchronized program, because there is a data race between Thread 1 and Thread 2. However, as it is in many ways a very similar example, we would like to provide a similar guarantee. In this case, we say that the value 42 cannot appear out of thin air. 47 Initially, x = null, y = null. o is an object with a field f that refers to o. Thread 1 Thread 2 r1 = x; r3 = y; r2 = x.f; x = r4; y = r2; r1 == r2 == o is not an acceptable behavior Figure 4.4: An Unexpected Reordering In fact, the behavior shown in Figure 4.3 may be even more of a cause for concern than the behavior shown in Figure 4.2. If, for example, the value that was being produced out of thin air was a reference to an object which the thread was not supposed to have, then such a transformation could be a serious security violation. There are no reasonable compiler transformations that produce this result. An example of this can be seen in Figure 4.4. Let?s assume that there is some object o which we do not wish Thread 1 or Thread 2 to see. o has a self-reference stored in the field f. If our compiler were to decide to perform an analysis that assumed that the reads in each thread saw the writes in the other thread, and saw a reference to o, then r1 = r2 = r3 = o would be a possible result. The value did not spring from anywhere ? it is simply an arbitrary value pulled out of thin air. Determining what constitutes an out-of-thin-air read is complicated. A first (but inaccurate) approximation would be that we don?t want reads to see values that couldn?t be written to the variable being read in some sequentially consistent execution. 
Because the value 42 is never written in Figure 4.3, no read can ever 48 see it. The problem with this solution is that a program can contain writes whose program statements don?t occur in any sequentially consistent executions. Imag- ine, as an example, a write that is only performed if the value of r1 + r2 is equal to 3 in Figure 4.1. This write would not occur in any sequentially consistent execution, but we would still want a read to be able to see it. One way to think about these issues is to consider when actions can occur in an execution. The transformations we have examined all involve moving actions earlier than they would otherwise have occurred. For example, to get the out-of- thin-air result in Figure 4.3, we have, in some sense, moved the write of 42 to y in Thread 1 early, so that the read of y in Thread 2 can see it and allow the write to x to occur. The read of x in Thread 1 then sees the value 42, and justifies the execution of the write to y. If we assume that the key to these issues is to consider when actions can be moved early, then we must consider this issue carefully. The question to answer is, what is it that causes an action to occur? When can the action be performed early? One potential answer to this question involves starting at the point where we want to execute the action, and then considering what would happen if we carried on the execution in a sequentially consistent way from that point. If we did, and it would have been possible for the action to have occurred afterward, then perhaps the action can be considered to be caused. In the above case, we identify whether an action can be performed early by identifying some well-behaved execution in which it takes place, and using that execution to justify performing the action. Our model therefore builds an execution iteratively; it allows an action (or a group of actions) to be committed 49 (in essence, performed early) if it occurs in some well-behaved execution that also contains the actions committed so far. Obviously, this needs a base case: we simply assume that no actions have been committed. The resulting model can therefore be described with two, iterative, phases. Starting with some committed set of actions, generate all the possible ?well- behaved? executions. Then, use those well-behaved executions to determine fur- ther actions that can be reasonably performed early. Commit those actions. Rinse and repeat until all actions have been committed. Identifying what entails a ?well-behaved? execution is crucial to our model, and key to our notions of causality. If we had, for example, a write that was control dependent on the value of r1 + r2 being equal to 3 in Figure 4.1, we would know that write could have occurred in an execution of the program that behaves in a sequentially consistent way after the result of r1 + r2 is determined. We can apply this notion of well-behavedness to our other example, as well. In Figure 4.1, the writes to x and y can occur first because they will always occur in sequentially consistent executions. In Figure 3.3, the write to b can occur early because it occurs in a sequentially consistent execution when r1 and r2 see the same value. In Figure 4.3, the writes of 42 to y and x cannot happen, because they do not occur in any sequentially consistent execution. This, then, is our first (but not only) ?out of thin air? guarantee: ThinAir1 A write can occur earlier in an execution than it appears in program order. 
However, that write must have been able to occur without the as- sumption that any subsequent reads see non-sequentially consistent values. This is only a first approximation to causality: it, too, is a good starting point, but does not cover all of our bases. 50 Initially, x == y == z == 0 Thread 1 Thread 2 Thread 3 Thread 4 r1 = x; r2 = y; z = 42; r0 = z; y = r1; x = r2; x = r0; Is r0 == 0, r1 == r2 == 42 legal behavior? Figure 4.5: Can Threads 1 and 2 see 42, if Thread 4 didn?t write 42? Initially, x == y == z == 0 Thread 1 Thread 2 Thread 3 Thread 4 r1 = x; r2 = y; z = 1; r0 = z; if (r1 != 0) if (r2 != 0) if (r0 == 1) y = r1; x = r2; x = 42; Is r0 == 0, r1 == r2 == 42 legal behavior? Figure 4.6: Can Threads 1 and 2 see 42, if Thread 4 didn?t write to x? 4.3.1 When Actions Can Occur Disallowing Some Results It is difficult to define the boundary between the kinds of results that are reason- able and the kind that are not. The example in Figure 4.3 provides an example of a result that is clearly unacceptable, but other examples may be less straight- forward. The examples in Figures 4.5 and 4.6 are similar to the examples in Figures 4.2 and 4.3, with one major distinction. In those examples, the value 42 could never be written to x in any sequentially consistent execution. Thus, our ThinAir1 guarantee prevented the value 42 from appearing. In the examples in Figures 4.5 and 4.6, 42 can be written to x in some sequentially consistent executions (specif- 51 ically, ones in which the write to z in Thread 3 occurs before the read of z in Thread 4). Should these new examples also get an out-of-thin-air guarantee, even though they are not covered by our previous guarantee? In other words, could it be legal for the reads in Threads 1 and 2 to see the value 42 even if Thread 4 does not write that value? This is a potential security issue. Consider what happens if, instead of 42, we write a reference to an object that Thread 4 controls, but does not want Threads 1 and 2 to see without Thread 4?s first seeing 1 for z. If Threads 1 and 2 see this reference, they can be said to manufacture it out-of-thin-air. This sort of behavior is not known to result from any combination of known reasonable and desirable optimizations. However, there is also some question as to whether this reflects a real and serious security requirement. In Java, the semantics usually side with the principle of having safe, simple and unsurprising semantics when possible. Thus, the Java memory model prohibits the behaviors shown in Figures 4.5 and 4.6. Allowing Other Results Now consider the code in Figure 4.7. A compiler could determine that the only values ever assigned to x are 0 and 42. From that, the compiler could deduce that, at the point where we execute r1 = x, either we had just performed a write of 42 to x, or we had just read x and seen the value 42. In either case, it would be legal for a read of x to see the value 42. By the principle we articulated as Reorder2, it could then change r1 = x to r1 = 42; this would allow y = r1 to be transformed to y = 42 and performed earlier, resulting in the behavior in question. 52 Initially, x == y == z == 0 Thread 1 Thread 2 r3 = x; r2 = y; if (r3 == 0) x = r2; x = 42; r1 = x; y = r1; r1 == r2 == r3 == 42 is a legal behavior Figure 4.7: A Complicated Inference This is a reasonable transformation that needs to be balanced with the out- of-thin-air requirement. Notice that the code in Figure 4.7 is quite similar to the code in Figures 4.5 and 4.6. 
The difference is that Threads 1 and 4 are now joined together; in addition, the write to x that was in Thread 4 is now performed in every sequentially consistent execution ? it is only when we try to get non-sequentially consistent results that the write does not occur. The ThinAir1 guarantee is therefore not strong enough to encapsulate this notion. The examples in Figure 4.5 and 4.6 are significantly different from the example in Figure 4.7. One way of articulating this difference is that in Figure 4.7, we know that r1 = x can see 42 without reasoning about what might have occurred in another thread because of a data race. In Figures 4.5 and 4.6, we need to reason about the outcome of a data race to determine that r1 = x can see 42. This, then, is another, stronger way of differentiating out-of-thin-air reads. A read is not considered to be out-of-thin-air if you can determine whether it happens without considering data races that influenced its execution. This is also our second out-of-thin-air principle: 53 Initially, x = y = 0; a[0] = 1, a[1] = 2 Thread 1 Thread 2 r1 = x; r3 = y; a[r1] = 0; x = r3; r2 = a[0]; y = r2; r1 == r2 == r3 == 1 is unacceptable Figure 4.8: Another Out Of Thin Air Example ThinAir2 An action can occur earlier in an execution than it appears in program order. However, that write must have been able to occur in the execution without assuming that any additional reads see values via a data race. We can use ThinAir2 as a basic principle to reason about multithreaded pro- grams. Consider Figures 4.5 and 4.6. We want to determine if a write of 42 to y can be performed earlier than its original place in the program. We therefore examine the actions that happen-before it. Those actions (a read of x, and the initial actions) do not allow for us to determine that a write of 42 to y occurred. Therefore, we cannot move the write early. As another example, consider the code in Figure 4.8. In this example, the only way in which r2 could be set to 1 is if r1 was not 0. In this case, r3 would have to be 1, which means r2 would have to be 1. The value 1 in such an execution clearly comes out of thin air. The only way in which the unacceptable result could occur is if a write of 1 to one of the variables were performed early. However, we cannot reason that a write of 1 to x or y will occur without reasoning about data races. Therefore, this result is impossible. The memory model that results from this causality constraint is our final 54 model. We build executions based on the notion that they must be ?well- behaved? (as described in Section 4.3). However, the executions must be ?well- behaved? based on the notion that additional reads cannot see any values from data races: a ?well-behaved? execution requires that all reads see writes that happen-before them. The model is presented formally in the next chapter. 4.4 Some Things That Are Potentially Surprising about This Memory Model Many of the requirements and goals for the Java memory model were straightfor- ward and non-controversial (e.g., DRF). Other decisions about the requirements generated quite a bit of discussion; the final decision often came down to a mat- ter of taste and preference rather than any concrete requirement. This section describes some of the implications of the memory model that are not a result of requirements, but, rather, more artifacts of our reasoning. 4.4.1 Isolation Sometimes, when debugging a program, we are given an execution trace of that program in which the error occurred. 
Given a particular execution of a program, the debugger can create a partition of the threads and variables in the program so that if a thread accessed a variable in that execution, then the thread and variable are in the same partition. Monitors can be considered variables for the purposes of this discussion. Given this partitioning, you can explain the behavior in the execution of the threads in each partition without having to examine the behavior or code for the other threads. If a thread or a set of threads is isolated from the other threads 55 in an execution, the programmer can reason about that isolated set separately from the other threads. This is called the isolation principle: Isolation Consider a partition P of the threads and variables in the program so that if a thread accessed a variable in that execution, then the thread and variable are in the same partition. Given P, you can explain the behavior in the execution of the threads in each partition without having to examine the behavior or code for the other threads. How is this helpful? Consider the code in Figure 4.6. If we allowed the unacceptable execution, then we could say that the actions in Threads 3 and 4 affected the actions in Threads 1 and 2, even though they touched none of the same variables. Reasoning about this would be difficult, at best. The Isolation principle closely interacts with our out-of-thin-air properties. If a thread A does not access the variables accessed by a thread B, then the only way A could have really affected B is if A might have accessed those variables along another program path not taken. The compiler might speculate that the other program path would be taken, and that speculation might affect B. The speculation could only really affect B if B could happen at the same time as A. This would imply a data race between A and B, and we would be speculating about that race; this is something ThinAir2 is designed to avoid. Isolation is not necessarily a property that should be required in all memory models. It seems to capture a property that is useful and important, but all of the implications of it are not understood well enough for us to decide if it must be true of any acceptable memory model. 56 Initially, a = b = c = d = 0 Thread 1 Thread 2 Thread 3 Thread 4 r1 = a; if (r1 == 0) b = 1; r2 = b; if (r2 == 1) c = 1; r3 = c; if (r3 == 1) d = 1; r4 = d; if (r4 == 1) { c = 1; a = 1; } Behavior in question: r1 == r3 == r4 == 1; r2 == 0 Figure 4.9: Behavior disallowed by semantics Initially, a = b = c = d = 0 Thread 1/2/3 Thread 4 r1 = a; if (r1 == 0) b = 1; r2 = b; if (r2 == 1) c = 1; r3 = c; if (r3 == 1) d = 1; r4 = d; if (r4 == 1) { c = 1; a = 1; } Behavior in question: r1 == r3 == r4 == 1; r2 == 0 Figure 4.10: Result of thread inlining of Figure 4.9; behavior allowed by semantics 57 4.4.2 Thread Inlining One behavior that is disallowed by a straightforward interpretation of the out- of-thin-air property that we have developed is shown in Figure 4.9. An imple- mentation that always scheduled Thread 1 before Thread 2 and Thread 2 before Thread 3 could conceivably decide that the write to d by Thread 3 could be performed before anything in Thread 1 (as long as the guard r3 == 1 evaluates to true). This could lead to a result where the write to d occurs, then Thread 4 writes 1 to c and a. The write to b does not occur, so the read of b by Thread 2 sees 0, and does not write to c. The read of c in Thread 3 then sees the write by Thread 4. 
However, this requires reasoning that Thread 3 will see a value for c that is given by a data race. A straightforward interpretation of ThinAir2 therefore disallows this. In Figure 4.10, we have another example, similar to the one in Figure 4.9, where Threads 1, 2 and 3 are combined (or ?inlined?). We can use the same reasoning that we were going to use for Figure 4.9 to decide that the write to d can occur early. Here, however, it does not clash with ThinAir2: we are only reasoning about the actions in the combined Thread 1/2/3. The behavior is therefore allowed in this example. As a result of this distinction, it is clear that when a compiler performs thread inlining, the resulting thread is not necessarily allowed to be treated in the same way as a thread that was written ?inline? in the first place. Thus, a compiler writer must be careful when considering inlining threads. When a compiler does decide to inline threads, as in this example, it may not be possible to utilize the full flexibility of the Java memory model when deciding how the resulting code 58 Initially, x == y == 0 Thread 1 Thread 2 Thread 3 1: r1 = x 4: r2 = x 6: r3 = y 2: if (r1 == 0) 5: y = r2 7: x = r3 3: x = 1 Must not allow r1 == r2 == r3 == 1 Figure 4.11: A variant ?bait-and-switch? behavior Initially, x == y == 0 Thread 1 Thread 2 1: r1 = x 6: r3 = y 2: if (r1 == 0) 7: x = r3 3: x = 1 4: r2 = x 5: y = r2 Compiler transformations can result in r1 == r2 == r3 == 1 Figure 4.12: Behavior that must be allowed can execute. Another example of how thread inlining is generally disallowed can be seen from Figures 4.11 and 4.12. The behavior in Figure 4.11 is very similar to that shown in Figure 4.6. Given that our isolation principle states that the behavior in Figure 4.6 is unacceptable, it seems reasonable to prohibit both. Figure 4.12 shows a code fragment very similar to that of Figure 4.11. How- ever, for the code in Figure 4.12, we must allow the behavior that was prohibited in Figure 4.11. We do this because that behavior can result from well understood 59 and reasonable compiler transformations. ? The compiler can deduce that the only legal values for x and y are 0 and 1. ? The compiler can then replace the read of x on line 4 with a read of 1, because either ? 1 was read from x on line 1 and there was no intervening write, or ? 0 was read from x on line 1, 1 was assigned to x on line 3, and there was no intervening write. ? Via forward substitution, the compiler is allowed to transform line 5 to y = 1. Because there are no dependencies, this line can be made the first action performed by Thread 1. After these transformations are performed, a sequentially consistent execution of the program will result in the behavior in question. The fact that the behavior in Figure 4.11 is prohibited and the behavior in Figure 4.12 is allowed is, perhaps, surprising. However, this is another example of inlining: we could derive Figure 4.12 from Figure 4.11 by combining Threads 1 and 2 into a single thread. This is an example where adding happens-before relationships can increase the number of allowed behaviors. This property can be seen as an extension of the way in which causality is handled in the Java memory model (as per, for example, ThinAir1 and ThinAir2). The happens-before relationship is used to express causality between two actions; if an additional happens-before relationship is inserted, the causal relationships change. 60 Chapter 5 The Java Memory Model Everything should be as simple as possible, but no simpler. ? 
Albert Einstein

This chapter provides the formal specification of the Java memory model (excluding final fields, which are described in Chapter 7).

5.1 Actions and Executions

An action a is described by a tuple ⟨t, k, v, u⟩, comprising:

t - the thread performing the action

k - the kind of action: volatile read, volatile write, (normal or non-volatile) read, (normal or non-volatile) write, lock, unlock or other synchronization action. Volatile reads, volatile writes, locks and unlocks are synchronization actions, as are the (synthetic) first and last action of a thread, actions that start a thread or detect that a thread has terminated, as described in Section 5.2. There are also external actions, and thread divergence actions.

v - the (runtime) variable or monitor involved in the action

u - an arbitrary unique identifier for the action

An execution E is described by a tuple ⟨P, A, →po, →so, W, V, →sw, →hb⟩, comprising:

P - a program

A - a set of actions

→po - program order, which for each thread t, is a total order over all actions performed by t in A

→so - synchronization order, which is a total order over all synchronization actions in A

W - a write-seen function, which for each read r in A, gives W(r), the write action seen by r in E

V - a value-written function, which for each write w in A, gives V(w), the value written by w in E

→sw - synchronizes-with, a partial order over synchronization actions

→hb - happens-before, a partial order over actions

Note that the synchronizes-with and happens-before orders are uniquely determined by the other components of an execution and the rules for well-formed executions.

Two of the action types require special descriptions, and are detailed further in Section 5.5. These actions are introduced so that we can explain why such a thread may cause all other threads to stall and fail to make progress.

external actions - An external action is an action that may be observable outside of an execution, and has a result based on an environment external to the execution. An external action tuple contains an additional component, which contains the results of the external action as perceived by the thread performing the action. This may be information as to the success or failure of the action, and any values read by the action. Parameters to the external action (e.g., which bytes are written to which socket) are not part of the external action tuple. These parameters are set up by other actions within the thread and can be determined by examining the intra-thread semantics. They are not explicitly discussed in the memory model. In non-terminating executions, not all external actions are observable. Non-terminating executions and observable actions are discussed in Section 5.5.

thread divergence action - A thread divergence action is only performed by a thread that is in an infinite loop in which no memory, synchronization or external actions are performed. If a thread performs a thread divergence action, that action is followed in program order by an infinite number of additional thread divergence actions.

5.2 Definitions

1. Definition of synchronizes-with. The source of a synchronizes-with edge is called a release, and the destination is called an acquire. They are defined as follows:

• An unlock action on monitor m synchronizes-with all subsequent lock actions on m (where subsequent is defined according to the synchronization order).
• A write to a volatile variable v synchronizes-with all subsequent reads of v by any thread (where subsequent is defined according to the synchronization order).

• An action that starts a thread synchronizes-with the first action in the thread it starts.

• The final action in a thread T1 synchronizes-with any action in another thread T2 that detects that T1 has terminated. T2 may accomplish this by calling T1.isAlive() or doing a join action on T1.

• If thread T1 interrupts thread T2, the interrupt by T1 synchronizes-with any point where any other thread (including T2) determines that T2 has been interrupted (by invoking Thread.interrupted, invoking Thread.isInterrupted, or by having an InterruptedException thrown).

• The write of the default value (zero, false or null) to each variable synchronizes-with the first action in every thread. Although it may seem a little strange to write a default value to a variable before the object containing the variable is allocated, conceptually every object is created at the start of the program with its default initialized values. Consequently, the default initialization of any object happens-before any other actions (other than default writes) of a program.

• At the invocation of a finalizer for an object, there is an implicit read of a reference to that object. There is a happens-before edge from the end of a constructor of an object to that read. Note that all freezes for this object (see Section 7.1) happen-before the starting point of this happens-before edge.

2. Definition of happens-before. If we have two actions x and y, we use x →hb y to mean that x happens-before y. If x and y are actions of the same thread and x comes before y in program order, then x →hb y. If an action x synchronizes-with a following action y, then we also have x →hb y. Furthermore, happens-before is transitively closed. In other words, if x →hb y and y →hb z, then x →hb z.

3. Definition of sufficient synchronization edges. A set of synchronization edges is sufficient if it is the minimal set such that if the transitive closure of those edges with program order edges is taken, all of the happens-before edges in the execution can be determined. This set is unique.

4. Restrictions of partial orders and functions. We use f|d to denote the function given by restricting the domain of f to d: for all x ∈ d, f(x) = f|d(x), and for all x ∉ d, f|d(x) is undefined. Similarly, we use →e|d to represent the restriction of the partial order →e to the elements in d: for all x, y ∈ d, x →e y if and only if x →e|d y. If either x ∉ d or y ∉ d, then it is not the case that x →e|d y.

5.3 Well-Formed Executions

We only consider well-formed executions. An execution E = ⟨P, A, →po, →so, W, V, →sw, →hb⟩ is well formed if the following conditions are true:

1. Each read of a variable x sees a write to x. All reads and writes of volatile variables are volatile actions. For all reads r ∈ A, we have W(r) ∈ A and W(r).v = r.v. The variable r.v is volatile if and only if r is a volatile read, and the variable w.v is volatile if and only if w is a volatile write.

2. Synchronization order is consistent with program order and mutual exclusion. Having synchronization order consistent with program order implies that the happens-before order, given by the transitive closure of synchronizes-with edges and program order, is a valid partial order: reflexive, transitive and antisymmetric.
Having synchronization order consistent with mutual exclusion means that on each monitor, the lock and unlock actions are properly nested.

3. The execution obeys intra-thread consistency. For each thread t, the actions performed by t in A are the same as would be generated by that thread in program-order in isolation, with each write w writing the value V(w), given that each read r sees / returns the value V(W(r)). Values seen by each read are determined by the memory model. The program order given must reflect the program order in which the actions would be performed according to the intra-thread semantics of P, as specified by the parts of the Java language specification that do not deal with the memory model.

4. The execution obeys synchronization-order consistency. Consider all volatile reads r ∈ A. It is not the case that r →so W(r). Additionally, there must be no write w such that w.v = r.v and W(r) →so w →so r.

5. The execution obeys happens-before consistency. Consider all reads r ∈ A. It is not the case that r →hb W(r). Additionally, there must be no write w such that w.v = r.v and W(r) →hb w →hb r.

5.4 Causality Requirements for Executions

A well-formed execution E = ⟨P, A, →po, →so, W, V, →sw, →hb⟩ is validated by committing actions from A. If all of the actions in A can be committed, then the execution satisfies the causality requirements of the Java memory model.

Starting with the empty set as C0, we perform a sequence of steps where we take actions from the set of actions A and add them to a set of committed actions Ci to get a new set of committed actions Ci+1. To demonstrate that this is reasonable, for each Ci we need to demonstrate an execution Ei containing Ci that meets certain conditions.

Formally, an execution E satisfies the causality requirements of the Java memory model if and only if there exist

• Sets of actions C0, C1, ... such that
  – C0 = ∅
  – Ci ⊆ Ci+1
  – A = ∪(C0, C1, C2, ...)
  such that E and (C0, C1, C2, ...) obey the restrictions listed below. If A is finite, then the sequence C0, C1, ... will be finite, ending in a set Cn = A. However, if A is infinite, then the sequence C0, C1, ... may be infinite, and it must be the case that the union of all elements of this infinite sequence is equal to A.

• Well-formed executions E1, ..., where Ei = ⟨P, Ai, →poi, →soi, Wi, Vi, →swi, →hbi⟩.

Given these sets of actions C0, ... and executions E1, ..., every action in Ci must be one of the actions in Ei. All actions in Ci must share the same relative happens-before order and synchronization order in both Ei and E. Formally,

1. Ci ⊆ Ai
2. →hbi |Ci = →hb |Ci
3. →soi |Ci = →so |Ci

The values written by the writes in Ci must be the same in both Ei and E. Only the reads in Ci−1 need to see the same writes in Ei as in E. Formally,

4. Vi|Ci = V|Ci
5. Wi|Ci−1 = W|Ci−1

All reads in Ei that are not in Ci−1 must see writes that happen-before them. Each read r in Ci − Ci−1 must see writes in Ci−1 in both Ei and E, but may see a different write in Ei from the one it sees in E. Formally,

6. For any read r ∈ Ai − Ci−1, we have Wi(r) →hbi r
7. For any read r ∈ Ci − Ci−1, we have Wi(r) ∈ Ci−1 and W(r) ∈ Ci−1

Given a set of sufficient synchronizes-with edges for Ei, if there is a release-acquire pair that happens-before an action in Ci − Ci−1, then that pair must be present in all Ej, where j ≥ i. Formally,

8. Let →sswi be the →swi edges that are in the transitive reduction of →hbi but not in →poi. We call →sswi the sufficient synchronizes-with edges for Ei. If x →sswi y →hbi z and z ∈ Ci − Ci−1, then x →swj y for all j ≥ i.

If an action y is committed, all external actions that happen-before y are also committed.

9. If y ∈ Ci, x is an external action and x →hbi y, then x ∈ Ci.
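Viewed operationally, the definition above is an iterative procedure: grow a chain of committed sets C0 ⊆ C1 ⊆ ... until all of A has been committed, where each step must be justified by a well-formed execution Ei satisfying rules 1-9. The sketch below captures only that outer loop; the Action type and the extend step are placeholders, and rules 1-9 are not encoded here, so this is a structural illustration rather than an implementation of the model. A checker such as the simulator described in Chapter 6 would have to supply the justification step, and A is assumed to be finite.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.function.BinaryOperator;

    // Structural sketch of the committing procedure of Section 5.4.
    final class CommitSequence<Action> {
        boolean validates(Set<Action> allActions, BinaryOperator<Set<Action>> extend) {
            Set<Action> committed = new HashSet<>();            // C0 = the empty set
            while (!committed.containsAll(allActions)) {
                // extend must return some C(i+1) justified by a well-formed execution Ei
                // obeying rules 1-9, or return Ci unchanged if no such set exists.
                Set<Action> next = extend.apply(allActions, committed);
                if (next.size() <= committed.size() || !next.containsAll(committed)) {
                    return false;                               // the committed set cannot grow
                }
                committed = next;                               // Ci+1 becomes the new Ci
            }
            return true;                                        // Cn = A: causality is satisfied
        }
    }

Section 5.7.1 walks this sequence through by hand for the program in Figure 5.1.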
5.5 Observable Behavior and Nonterminating Executions

For programs that always terminate in some bounded finite period of time, their behavior can be understood (informally) simply in terms of their allowable executions. For programs that can fail to terminate in a bounded amount of time, more subtle issues arise.

The observable behavior of a program is defined by the finite sets of external actions that the program may perform. A program that, for example, simply prints "Hello" forever is described by a set of behaviors that, for any non-negative integer i, includes the behavior of printing "Hello" i times.

Termination is not explicitly modeled as a behavior, but a program can easily be extended to generate an additional external action "executionTermination" that occurs when all threads have terminated.

We also define a special hang action. If a behavior is described by a set of external actions including a hang action, it indicates a behavior where after the (non-hang) external actions are observed, the program can run for an unbounded amount of time without performing any additional external actions or terminating. Programs can hang:

• if all non-terminated threads are blocked, and at least one such blocked thread exists, or
• if the program can perform an unbounded number of actions without performing any external actions.

A thread can be blocked in a variety of circumstances, such as when it is attempting to acquire a lock or perform an external action (such as a read) that depends on external data. If a thread is in such a state, Thread.getState will return BLOCKED or WAITING. An execution may result in a thread being blocked indefinitely and the execution not terminating. In such cases, the actions generated by the blocked thread must consist of all actions generated by that thread up to and including the action that caused the thread to be blocked indefinitely, and no actions that would be generated by the thread after that action.

5.6 Formalizing Observable Behavior

To reason about observable behaviors, we need to talk about sets of observable actions. If O is a set of observable actions for E, then set O must be a subset of A, and must contain only a finite number of actions, even if A contains an infinite number of actions. Furthermore, if an action y ∈ O, and either x →hb y or x →so y, then x ∈ O.

Note that a set of observable actions is not restricted to external actions. Rather, only external actions that are in a set of observable actions are deemed to be observable external actions.

A behavior B is an allowable behavior of a program P if and only if B is a finite set of external actions and either

• There exists an execution E of P, and a set O of observable actions for E, and B is the set of external actions in O (if any threads in E end in a blocked state and O contains all actions in E, then B may also contain a hang action), or

• There exists a set O of actions such that
  – B consists of a hang action plus all the external actions in O, and
  – for all K ≥ |O|, there exists an execution E of P and a set of actions O′ such that:
    · Both O and O′ are subsets of A that fulfill the requirements for sets of observable actions.
    · O ⊆ O′ ⊆ A
    · |O′| ≥ K
    · O′ − O contains no external actions
Note that a behavior B does not describe the order in which the external actions in B are observed, but other (implicit and unstated) constraints on how the external actions are generated and performed may impose such constraints.

5.7 Examples

There are a number of examples of behaviors that are either allowed or prohibited by the Java memory model. Most of these are either deliberately prohibited examples that show violations of the causality rules, or permissible examples that seem to be a violation of causality, but can result from standard compiler optimizations. In this section, we examine how some of these examples can be worked through the formalism.

5.7.1 Simple Reordering

Initially, x = y = 0
Thread 1: r1 = x; y = 1;
Thread 2: r2 = y; x = r2;
r1 == r2 == 1 is a legal behavior
Figure 5.1: A Standard Reordering

As a first example of how the memory model works, consider Figure 5.1. Note that there are initially writes of the default value 0 to x and y. We wish to get the result r1 == r2 == 1, which can be obtained if a compiler reorders the statements in Thread 1. This result is consistent with the happens-before memory model, so we only have to ensure that it complies with the causality rules in Section 5.4.

The set of actions C0 is the empty set, and there is no execution E0. As a result of this, execution E1 will be an execution where all reads see writes that happen-before them, as per Rule 6. In E1, both reads must see the value 0. We first commit the initial writes of 0 to x and y, as well as the write of 1 to y by Thread 1; these writes are contained in the set C1.

Action   Final Value   First Committed In   First Sees Final Value In
x = 0    0             C1                   E1
y = 0    0             C1                   E1
y = 1    1             C1                   E1
r2 = y   1             C2                   E3
x = r2   1             C3                   E3
r1 = x   1             C4                   E
Figure 5.2: Table of commit sets for Figure 5.1

We wish the action r2 = y to see the value 1. C1 cannot contain this action seeing this value: neither write to y had been committed. C2 may contain this action; however, the read of y must return 0 in E2, because of Rule 6. Execution E2 is therefore identical to E1. In E3, by Rule 7, r2 = y can see any conflicting write that occurs in C2 (as long as that write is happens-before consistent). This action can now see the write of 1 to y in Thread 1, which was committed in C1. We commit one additional action in C3: a write of 1 to x by x = r2. C4, as part of E4, contains the read r1 = x; it still sees 0, because of Rule 6. In our final execution E = E5, however, Rule 7 allows r1 = x to see the write of 1 to x that was committed in C3.

5.7.2 Correctly Synchronized Programs

It is easy to see how most of the guarantees that we wished to provide for volatile variables are fulfilled; Section 5.2 explicitly provides most of them. A slightly more subtle issue (mentioned in Section 3.5.3) is reproduced as Figure 5.3.

Initially, a == b == v == 0, v is volatile.
Thread 1: r1 = a; if (r1 == 0) v = 1; else b = 1;
Thread 2: do { r2 = b; r3 = v; } while (r2 + r3 < 1); a = 1;
Correctly synchronized, so r1 == 1 is illegal
Figure 5.3: A Correctly Synchronized Program

This code is correctly synchronized: in all sequentially consistent executions, the read of a by Thread 1 sees the value 0; the volatile variable v is therefore written, and there is a happens-before relationship between that read of a by Thread 1 and the write to a in Thread 2.
In order for the read of a to see the value 1 (and therefore result in a non- sequentially consistent execution), the write to a must be committed before the read. We may commit that write first, in E1. We then would try to commit the read of a by Thread 1, seeing the value 1. However, Rule 2 requires that the happens-before orderings between an action being committed and the actions already committed remain the same when the action is committed. In this case, we are trying to commit the read, and the write is already committed, so the read must happen-before the write in E2. This makes it impossible for the read to see the write. Note that in addition to this, Rule 8 states that any release-acquire pair that happens-before the write to a when that write is committed must be present in 74 Initially, x = y = v = 0, v is volatile Thread 1 Thread 2 Thread 3 Thread 4 join Thread 4 join Thread 4 v = 1 r3 = v r1 = x r2 = y if (r3 == 1) { y = r1 x = r2 x = 1 y = 1 } Behavior in question:r1 = r2 = 1, r3 = 0 Figure 5.4: Must Disallow Behavior by Ensuring Happens-Before Order is Stable Ci. This requirement implies that the volatile accesses would have to be added to Ci. Since they have to be present in the final execution for the write to take place early, the data race is not possible. Rule 8 is not redundant; it has other implications. For example, consider the code in Figure 5.4. We wish to prohibit the behavior where the read of v returns 0, but Threads 1 and 2 return the value 1. Because v is volatile, Rule 8 protects us. In order to commit one of the writes in Threads 1 or 2, the release-acquire pair of (v = 1, r3 = v) must be committed before the reads in Threads 1 and 2 can see the value 1 ? r3 must be equal to 1. Note that if v were not volatile, the read in Thread 4 would have to be committed to see the write in Thread 3 before the reads in Threads 1 and 2 could see the value 1 (by Rule 6). 5.7.3 Observable Actions Figure 5.5 is the same as Figure 3.13 in Section 3.6. Our requirement for this program was that if we observe the print message from Thread 2, and no other 75 Initially, v is volatile and v = false Thread 1 Thread 2 while (!v); v = true; print("Thread 1 done"); print("Thread 2 done"); Figure 5.5: If we observe print message, Thread 1 must see write to v and termi- nate threads other than Threads 1 and 2 run, then Thread 1 must see the write to v, print its message and terminate. The compiler should not be able to hoist the volatile read of v out of the loop in Thread 1. The fact that Thread 1 must terminate if the print by Thread 2 is observed follows from the rules on observable actions described in Section 5.5. If the print by Thread 2 is in a set of observable actions O, then the write to v and all reads of v that see the value 0 must also be in O. Additionally, the program cannot perform an unbounded amount of additional actions that are not in O. Therefore, the only observable behavior of this program in which the program hangs (runs forever without performing additional external actions) is one in which it performs no observable external actions other than hanging. This includes the print action. 76 Chapter 6 Simulation and Verification You cannot proceed formally from an informal specification In Chapter 3, we looked at the informal requirements for our memory model, exposing the principles that guided its development. In Chapter 5, we provided a formal specification for the model. However, as the maxim goes, you cannot proceed formally from an informal specification. 
How do we know that the formal specification meets our informal criteria? The answer is clearly verification. For the Java memory model, we did two types of verification. The first was a simulator, which was useful for determining that the behavior of the model was correct on individual test cases. The other was a set of proofs that the properties that we specified (in Chapter 3) for our memory model did, in fact, hold true for it. This chapter is broken into two parts. Section 6.1 describes the simulator: how it works and its runtime complexity. Section 6.2 recapitulates the properties that we outlined for our memory model, and discusses how we ensured that those properties were realized. 77 6.1 The Simulator As shown in Chapters 3 and 4, the informal requirements for our memory model were motivated by a dozen (or so) test cases. When developing the memory model, we actually formulated dozens of these test cases, each of which was created specifically to demonstrate a particular property that our model had to have. Every change we made to the model could conceivably affect any number of those properties; we needed a way to review the test cases systematically. Being computer scientists, we naturally developed the idea of simulating the model. The simulator can be used in a number of ways. For example, you could feed a program P into the simulator to obtain a set of results R. Then, you can apply a compiler transformation by hand to convert P into Pprime, and feed Pprime into the simulator, giving a set of results Rprime. The transformation from P to Pprime is legal if and only if Rprime ? R. The insight here is that while a compiler can do transfor- mations that eliminate possible behaviors (e.g., performing forward substitution or moving memory references inside a synchronized block), transformations must not introduce new behaviors. The simulator provides three important benefits: ? It gives us confidence that the formal model means what we believe, and that we believe what it means. ? As we fine tuned and modified the formal model, we gained confidence that we were only changing the things we intended to change. ? The formal model is not easy to understand; it is likely that only a subset of the people who need to understand the Java memory model will understand the formal description of model. People such as JVM implementors and 78 authors of books and articles on thread programming may find the simulator a useful tool for understanding the memory model. 6.1.1 Simulating Causality The simulator is built on top of a global system that executes one operation from one thread in each step, in program order. This is used to provide all possible interleavings of a given program, when the actions appear in the interleavings in the same order in which they occur in the program. This framework has been useful in simulating several versions of the model [MP01a, MP01b]. Although this is a useful framework, it does not allow for the full flexibility of the model. As discussed in Chapter 4, the difficulty in developing a sufficient model lies in determining which actions can occur early (out of program order). The notion of causality underlies this; how can an action be performed early (or presciently) while also ensuring that the action is not causing itself to occur? A basic simulator that executes instructions in program order does not have to deal with causality, because no actions are performed early. 
However, it is also incomplete, because it does not provide results of executions with prescient ac- tions. A na??ve solution would be to attempt to perform all actions early, at every possible place where they might be performed. However, this is far too compu- tationally expensive. The single greatest difficulty in developing a simulator is in determining where prescient actions may be executed. An early version of the simulator was able to use some heuristics to decide where early reads could occur, but it largely relied on manual placement [MP02]. However, this is an incomplete approach; that simulator could not generate and verify all legal executions. 79 However, this is not as computationally expensive as it sounds. Remember that the formal model is given an execution E to verify, and takes an iterative, two phase approach (as mentioned in Chapters 4 and 5): 1. Starting with some set of actions that are ensured to occur (the ?commit- ted? actions), generate all possible executions that include those actions, and in which all reads that are not committed see writes that happen-before them. 2. Given those resulting executions, determine what actions can be added to the set of ?committed? actions: these represent additional actions that can be performed early. This process is repeated until there are no actions left to be added to the set of committed actions. If the resulting execution Eprime is the same as the execution E provided, then Eprime is a legal execution. The simulator therefore simply generates all sets of legal executions. The tricky part, of course, is picking the next actions to commit, while still preventing the execution time from being too long. A simple answer to this can be found in the model itself. The ?obvious? extension (as mentioned in the first bullet above) is to include all executions where reads see only writes that happen before those reads. The only executions that this does not produce are ones where a read r is in a data race with a write w, and r returns the value of w. It is therefore simply a matter of detecting writes that are involved in data races with r, and committing those writes so that r can see them. As an optimization, we only commit those writes that can actually allow the read to see a value that it did not see in the justifying execution. 80 To detect data races, the simulator uses a technique commonly used for dy- namic on-the-fly data race detection [Sch89, NR86, DS90]. The technique is quite simple; it works by examining the partial ordering induced by the happens-before relationship (referred to as a partially-ordered execution graph in the literature). If two accesses in the graph are not strictly ordered with respect to each other, and one is a write, then they are in a data race. This is easily supported by the simulator, which needs to keep track of the happens-before relationships anyway. The simulator simply commits writes that are in data races, then commits reads in the same data races that see those writes. It then executes all inter- leavings the program, ensuring that uncommitted reads can only see the result of writes that happen-before them (as described in Chapter 5). Additionally, it should be noted that when actions in a data race are committed, acquire and release actions that happen-before them also have to be committed (as per Rule 8 in Section 5.4). Most of the other functional details are easy to extract from the definition of the model as found in Chapter 5. One additional point is worth noting. 
While the simulator is generating executions, it is important that it ensure that those executions obey the well-formedness criteria listed in Section 5.3. The simulator must, for example, ensure that volatile variables read the correct values and that intra-thread consistency is maintained. 6.1.2 Simulator Performance In general, the performance of the simulator scales with the number of data races in a program. We can take some time to describe this in more detail. Consider a multithreaded program with t threads, where the ith thread has ni 81 instructions and the total number of instructions in all threads is n. The number of interleavings of a given multithreaded program can be determined by taking the total number of permutations of all instructions (n!) and removing those permutations where the instructions occur out of program order. Since, for each thread, there are ni! orderings in which the instructions occur out of program order, we have: n!producttext n i=1(ni!) Since the simulator works by committing all permutations of data races in the program, we have a bound on runtime of (where dr is the number of data races in the program): O(dr!( n!producttextn i=1(ni!) )) The number dr is a very rough one, as the number of data races in a program is by no means fixed. For example, a data race may be control or data dependent on the value seen by another data race read. This runtime looks ugly, but it isn?t actually so bad, as most of the examples we care about are relatively short. As they say, the proof of the pudding is in the eating: what is the practical runtime? The programs we ran through the simulator ranges in size from 2 to 5 threads, and in length from 2 to 17 instructions. Most of our examples contain 1-3 data races. A sampling of the results, which can be seen in Table 6.1, indicate that for the kind of examples we have been concerned about in this dissertation, performance has not been much of an issue. The tests were performed under Java 1.5.0 beta 3 on a 1.8 GHz Pentium 4 Xeon with 1 GB of RAM. 82 Test Case Elapsed Time (Seconds) Figure 3.1 0.48 Figure 3.3 1.23 Figure 3.11 .54 Figure 4.3 0.48 Figure 4.5 5.12 Figure 4.6 2.85 Figure 4.12 1.14 Figure 5.1 .54 Table 6.1: Simulator Timing Results The results are worth some discussion. Note, for example, that the fastest examples are those with no data races (Figures 3.1 and 3.11). Figure 4.3 (the simplest out of thin air example) is also fast: the data races in it cannot allow the reads to see different values, so those data races need not be committed. An increase in thread count (as evidenced in Figures 4.5 and 4.6) increases the run time, but an increase in data races does so more: Figure 4.5 is a data dependent version of Figure 4.6, so it more frequently evinces data races. Thus, its run time is substantially higher (although not so substantially that the run time is meaningful). 6.2 Proofs The most important guarantee that we can make that the Java memory model has the properties that we wish for it, of course, is to perform a set of proofs. The fact that the Java memory model provides many of the properties (reproduced in 83 Name Description Exemplar Guarantees for Optimizers Reorder1 Independent adjacent statements can be reordered. Figure 2.1 Reorder2 If a compiler can detect that an action will always happen (with the same value written to the same variable), it can be re- ordered regardless of apparent dependen- cies. 
Figures 3.3, 3.4, 3.5, 3.6, 4.7 RS Synchronization actions that only intro- duce redundant happens-before edges can be treated as if they don?t introduce any happens-before edges. Section 3.5.1 Guarantees for Programmers DRF Correctly synchronized programs have se- quentially consistent semantics. Figures 3.1, 3.11 HB Volatile writes are ordered before sub- sequent volatile reads of the same vari- able. Unlocks are ordered before subse- quent locks of the same monitor. Figure 3.7 VolatileAtomicity All accesses to volatile variables are per- formed in a total order. Figure 3.9 StrongVolatile There is a happens-before relationship from each write to each subsequent read of that volatile. Figure 3.10 ThinAir1 A write can occur earlier in an execution than it appears in program order. How- ever, that write must have been able to oc- cur without the assumption that any sub- sequent reads see non-sequentially consis- tent values. Figures 4.3, 4.4, 4.8 Table 6.2: Table of Requirements, Part 1 84 Name Description Exemplar ThinAir2 An action can occur earlier in an execu- tion than it appears in program order. However, that write must have been able to occur in the execution without assum- ing that any additional reads see values via a data race. Figures 4.5, 4.6, 4.9, 4.10 Isolation Consider a partition P of the threads and variables in the program so that if a thread accessed a variable in that execu- tion, then the thread and variable are in the same partition. Given P, you can ex- plain the behavior in the execution of the threads in each partition without having to examine the behavior or code for the other threads. Figures 4.5, 4.6 Observable The only reason that an action visible to the external world (e.g., a file read / write, program termination) might not be ob- servable is if there is an infinite sequence of actions that might happen before it or come before it in the synchronization or- der. Figure 3.13 Table 6.3: Table of Requirements, Part 2 85 Tables 6.2 and 6.3) is fairly straightforward. For example, the VolatileAtomicity property, the StrongVolatile property and the HB property are all explicitly stated as parts of the memory model. The ThinAir1 and ThinAir2 properties are trivial extensions of the causality description. Actions can only be committed early if they occur in an execution where reads only see writes that happen-before them. Since no out of thin air reads occur in executions where reads only see writes that happen-before them, no writes can be committed that are dependent on out of thin air reads. Thus, the cycles that cause out-of-thin-air executions are not possible. There are only really two key properties that require proof. In many ways, these two properties exemplify the fundamental dichotomy we faced in construct- ing the memory model. The first of these properties is that the semantics allows for reordering: this is the cornerstone of understanding why program transforma- tions and optimizations are legal under the model. The second of these properties is that correctly synchronized programs obey sequentially consistent semantics: this is the strongest of the guarantees that we give to programmers. The rest of this section outlines the proofs that the model obeys these properties. 6.2.1 Semantics Allows Reordering We mentioned earlier that a key notion for program optimization was that of reordering. 
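Before turning to the proof, the following sketch (with illustrative field names; it is not Figure 2.1 itself) shows the kind of reordering at issue. The two stores in writer() are independent, non-conflicting, and not synchronization actions, so they satisfy the conditions of the theorem proved below.

    // Illustrative sketch of a legal reordering of two independent stores.
    public class ReorderingExample {
        int a = 0;
        int b = 0;

        void writer() {
            a = 1;   // independent of the store to b
            b = 1;   // a compiler or processor may make this store visible first
        }

        void reader() {
            int r1 = b;
            int r2 = a;
            // In a sequentially consistent execution, r1 == 1 forces r2 == 1.
            // If the stores in writer() are reordered, a concurrent reader()
            // may observe r1 == 1 and r2 == 0; single-threaded code cannot
            // tell the difference.
        }
    }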
We demonstrated in Figure 2.1 that standard compiler reorderings, unobservable in single threaded programs, can have observable effects in mul- tithreaded programs. However, reorderings are crucial in many common code optimizations. In compilers, instruction scheduling, register allocation, common sub-expression elimination and redundant read elimination all involve reordering. 86 On processors, the use of write buffers, out-of-order completion and out-of-order issue all require the use of reordering. In this section, we demonstrate that many of the reorderings necessary for these optimizations are legal. This is not a complete list of legal reorderings; others can be derived from the model. However, this demonstrates a very common sample of reorderings, used for many common optimizations. Specifically, we demonstrate the legality of reordering two independent actions when doing so does not change the happens-before relationship for any other actions. Theorem 1 Consider a program P and the program Pprime that is obtained from P by reordering two adjacent statements sx and sy. Let sx be the statement that comes before sy in P, and after sy in Pprime. The statements sx and sy may be any two statements such that ? reordering sx and sy doesn?t eliminate any transitive happens-before edges in any valid execution (it will reverse the direct happens-before edge between the actions generated by sx and sy) ? sx and sy are not conflicting accesses to the same variable, ? sx and sy are not both synchronization actions or external actions, and ? Reordering sx and sy does not hoist an action above an infinite loop. ? the intra-thread semantics of sx and sy allow reordering (e.g., sx doesn?t store into a register that is read by sy). Transforming P into Pprime is a legal program transformation. Proof: 87 Assume that we have a valid execution Eprime of program Pprime. It has a legal set of behaviors Bprime. To show that the transformation of P into Pprime is legal, we need to show that there is a valid execution E of P that has the same observable behaviors as Eprime. Let x be the action generated by sx and y be the action generated by sy. If x and y are executed multiple times, we repeat this analysis for each repetition. The execution Eprime has a set of observable actions Oprime, and the execution E has a set of observable actions O. If O includes x, O must also include y because x hb? y in E. If Oprime does not include y (and therefore y is not an external action, as it must not take place after an infinite series of actions), we can use O as the set of observable actions for Eprime instead: they induce the same behavior. Since Eprime is legal, we have a sequence of executions, Eprime0,Eprime1,... that eventually justifies E. Eprime0 doesn?t have any committed actions and ?i,0 ? i, Eprimei is used to justify the additional actions that are committed to give Eprimei+1. We will show that we can use Ei ? Eprimei to show that E ? Eprime is a legal execution of P. If x and y are both uncommitted in Eprimei, the happens-before ordering between x and y doesn?t change the possible behaviors of actions in Ei and Eprimei. Any action that happens-before x or y happens-before both of them. If either x or y happens-before an action, both of them do (excepting, of course, x and y themselves). Thus, the reordering of x and y can?t affect the write seen by any uncommitted read. Similarly, the reordering doesn?t affect which (if any) incorrectly synchronized write a read can be made to see when the read is committed. 
If Eprimei is used to justify committing x in Eprimei+1, then Ei may be used to justify 88 committing x in Ei+1. Similarly for y. If one or both of x or y is committed in Eprimei, it can also be committed in Ei, without behaving any differently, with one caveat. If y is a lock or a volatile read, it is possible that committing x in Eprimei will force some synchronization actions that happen-before y to be committed in Eprimei. However, we are allowed to commit those actions in Ei, so this does not affect the existence of Ei. Thus, the sequences of executions used to justify Eprime will also justify E, and the program transformation is legal. 6.2.2 Considerations for Programmers The most important property of the memory model that is provided for program- mers is the notion that if a program is correctly synchronized, it is unnecessary to worry about reorderings. In this section, we prove this property holds of the Java memory model. Correctly Synchronized Programs Exhibit Only Sequentially Consis- tent Behaviors As described in Chapter 2, we say an execution has sequentially consistent (SC) behavior if there is a total order over all actions consistent with the program order of each thread such that each read returns the value of the most recent write to the same variable. Two memory accesses are conflicting if they access the same variable and one or both of them are writes. A program is correctly synchronized if and only if in all sequentially consistent executions, all conflicting accesses to non-volatile variables are ordered by happens-before edges. The most important property of the memory model that is provided for pro- 89 grammers is the notion that if a program is correctly synchronized, it is unnec- essary to worry about reorderings. In this section, we prove this property holds of the Java memory model. First, we prove a lemma that shows that when each read sees a write that happens-before it, the resulting execution behaves in a sequentially consistent way. We then show that reads in executions of correctly synchronized programs can only see writes that happen-before them. Thus, by the lemma, the resulting behavior of such programs is sequentially consistent. Lemma 2 Consider an execution E of a correctly synchronized program P that is legal under the Java memory model. If, in E, each read sees a write that happens-before it, E has sequentially consistent behavior. Proof: Since the execution is legal according to the memory model, the execution is synchronization order consistent and happens-before consistent. A topological sort on the happens-before edges of the actions in an execution gives a total order consistent with program order and synchronization order. Let r be the first read in E that doesn?t see the most recent conflicting write w in the sort but instead sees wprime. Let the topological sort of E be ?wprime?w?r?. Let Eprime be an execution whose topological sort is ?wprime?w?rprime?prime. Eprime is obtained exactly as E, except that instead of r, it performs the action rprime, which is the same as r except that it sees w; ?prime is any sequentially consistent completion of the program such that each read sees the previous conflicting write. The execution Eprime is sequentially consistent, and it is not the case that wprime hb? w hb? r, so P is not correctly synchronized. Thus, no such r exists and the program has sequentially consistent behav- ior. 
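As a concrete, hedged illustration of what this guarantee means for programmers (the class below is ours, not an example from earlier chapters): every conflicting access in the following program is made while holding the same lock, so in every sequentially consistent execution the conflicting accesses are ordered by happens-before edges. The program is therefore correctly synchronized, and by the theorem that follows it can exhibit only sequentially consistent behaviors.

    // All accesses to the shared fields occur while holding the receiver's
    // lock, so the program is correctly synchronized: no sequentially
    // consistent execution contains a data race.
    public class CorrectlySynchronized {
        private int data = 0;
        private boolean ready = false;

        public synchronized void writer() {
            data = 42;
            ready = true;
        }

        public synchronized int reader() {
            // Sees either both writes or neither; the racy outcome
            // "ready == true but data == 0" is impossible.
            return ready ? data : -1;
        }
    }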
90 Theorem 3 If an execution E of a correctly synchronized program is legal under the Java memory model, it is also sequentially consistent. Proof: By Lemma 2, if an execution E is not sequentially consistent, there must be a read r that sees a write w such that w does not happen-before r. The read must be committed, because otherwise it would not be able to see a write that does not happen-before it. There may be multiple reads of this sort; if so, let r be the first such read that was committed. Let Ei be the execution that was used to justify committing r. The relative happens-before order of committed actions and actions being committed must remain the same in all executions considering the resulting set of committed actions. Thus, if we don?t have w hb? r in E, then we didn?t have w hb? r in Ei when we committed r. Since r was the first read to be committed that doesn?t see a write that happens-before it, each committed read in Ei must see a write that happens- before it. Non-committed reads always sees writes that happens-before them. Thus, each read in Ei sees a write that happens-before it, and there is a write w in Ei that is not ordered with respect to r by happens-before ordering. A topological sort of the actions in Ei according to their happens-before edges gives a total order consistent with program order and synchronization order. This gives a total order for a sequentially consistent execution in which the conflicting accesses w and r are not ordered by happens-before edges. However, Lemma 2 shows that executions of correctly synchronized programs in which each read sees a write that happens-before it must be sequentially consistent. Therefore, this program is not correctly synchronized. This is a contradiction. 91 Chapter 7 Immutable Objects I am constant as the northern star, Of whose true-fix?d and resting quality There is no fellow in the firmament. ? William Shakespeare, Julius Caesar (III, i, 60 ? 62) In Java, a final field is (intuitively) written to once, in an object?s constructor, and never changed. The original Java memory model contained no mention of final fields. However, programmers frequently treated them as immutable. This resulted in a situation where programmers passed references between threads to objects they thought were immutable without synchronization. In this chapter, we cover how our memory model deals with final fields. One design goal for the Java memory model was to provide a mechanism whereby an object can be immutable if all of its fields are declared final. This immutable object could be passed from thread to thread without worrying about data races. This relatively simple goal proved remarkably difficult, as this chapter describes. Figure 7.1 gives an example of a typical use of final fields in Java. An object of 92 type FinalFieldExample is created by the thread that invokes writer(). That thread then passes the reference to a reader thread without synchronization. A reader thread reads both fields x and y of the newly constructed object. Under Java?s original memory model, it was possible to reorder the write to f with the invocation of the constructor. Effectively, the code: r1 = new FinalFieldExample; r1.x = 3; r1.y = 4; f = r1; would be changed to: r1 = new FinalFieldExample; f = r1; r1.x = 3; r1.y = 4; This reordering allowed the reader thread to see the default value for the final field and for the non-final (or normal field). 
One requirement for Java?s memory model is to make such transformations illegal; it is now required that the assignment to f take place after the constructor completes. A more serious example of how this can affect a program is shown in Fig- ure 7.2. String objects are intended to be immutable; methods invoked on Strings do not perform synchronization. This class is often implemented as a pointer to a character array, an offset into that array, and a length. This approach allows a character array to be reused for multiple String objects. However, this can create a dangerous security hole. 93 class FinalFieldExample { final int x; int y; static FinalFieldExample f; public FinalFieldExample() { x = 3; y = 4; } static void writer() { f = new FinalFieldExample(); } static void reader() { if (f != null) { int i = f.x; int j = f.y; } } } A reader must never see x == 0 Figure 7.1: Example Illustrating Final Field Semantics 94 Thread 1 Thread 2 Global.s = "/tmp/usr".substring(4); String myS = Global.s; if (myS.equals("/tmp")) System.out.println(myS); Figure 7.2: Without final fields or synchronization, it is possible for this code to print /usr In particular, if the fields of the String class are not declared final, then it would be possible for Thread 2 initially to see the default value of 0 for the offset of the String object, allowing it to compare as equal to "/tmp". A later operation on the String object might see the correct offset of 4, so that the String object is perceived as being "/usr". This is clearly a danger; many security features of the Java programming language depend upon Strings being perceived as truly immutable. In the rest of this chapter, we discuss the way in which final fields were formalized in the Java memory model. Section 7.1 lays out the full, informal requirements for the semantics of final fields. The bulk of the document discusses the motivation; the full semantics of final fields are presented in Section 7.3. Finally, we present some implementation issues in Section 7.5. 7.1 Informal Semantics The detailed semantics of final fields are somewhat different from those of normal fields. In particular, we provide the compiler with great freedom to move reads of final fields across synchronization barriers and calls to arbitrary or unknown methods. Correspondingly, we also allow the compiler to keep the value of a final field cached in a register and not reload it from memory in situations where a 95 non-final field would have to be reloaded. Final fields also provide a way to create thread-safe immutable objects that do not require synchronization. A thread-safe immutable object is seen as immutable by all threads, even if a data race is used to pass references to the immutable object between threads. The original Java semantics did not enforce an ordering between the writer and the reader of the final fields for the example in Figure 7.1. Thus, the read was not guaranteed to see the write. In the abstract, the guarantees for final fields are as follows. When we say an object is ?reachable? from a final field, that means that the field is a reference, and the object can be found by following a chain of references from that field. When we say the ?correctly initialized? value of a final field, we mean both the value of the field itself, and, if it is a reference, all objects reachable from that field. ? At the end of an object?s constructor, all of its final fields are ?frozen? by an implicit ?freeze? action. 
The freeze for a final field takes place at the end of the constructor in which it was set. In particular, if one constructor invokes another constructor, and the invoked constructor sets a final field, the freeze for the final field takes place at the end of the invoked constructor. ? If a thread only reads references to an object that were written after the last freeze of its final fields, that thread is always guaranteed to see the frozen value of the object?s final fields. Such references are called correctly published, because they are published after the object is initialized. There may be objects that are reachable by following a chain of references from such a final field. Reads of those objects will see values at least as up to 96 date as they were when the freeze of the final field was performed. ? Conversely, if a thread reads a reference to an object written before a freeze, that thread is not automatically guaranteed to see the correctly initialized value of the object?s final fields. Similarly, if a thread reads a reference to an object reachable from the final field without reaching it by following pointers from that final field, the thread is not automatically guaranteed to see the value of that object when the field was frozen. ? If a thread is not guaranteed to see a correct value for a final field or anything reachable from that field, the guarantees can be enforced by a normal happens-before relationship. In other words, those guarantees can be enforced by normal synchronization techniques. ? When you freeze a final which points to an object, then freeze a final field of that object, there is a happens-before relationship between the first freeze and the second. 7.1.1 Complications Retrofitting the semantics to the existing Java programming language requires that we deal with a number of complications: ? Serialization is the writing of an object to an input or output stream, usually so that object can be stored or passed across a network. When the object is read, that is called deserialization . Using serialization in Java to read an object requires that the object first be constructed, then that the final fields of the object be initialized. After this, deserialization code is invoked to set the object to what is specified in the serialization stream. This means 97 that the semantics must allow for final fields to change after objects have been constructed. Although our semantics allow for this, the guarantees we make are some- what limited; they are specialized to deserialization. These guarantees are not intended to be part of a general and widely used mechanism for chang- ing final fields. In particular, you cannot make a call to native code to modify final fields; the use of this technique will invalidate the semantics of the VM. To formalize the semantics for multiple writes / initializations of a final field, we allow multiple freezes. A second freeze action might, for example, take place after deserialization is complete. ? System.in, System.out and System.err are static final fields that allow access to the system?s stdin, stdout and stderr files. Since programs often redirect their input and output, these fields are defined to be mutable by public methods. Thus, we give these three fields (and only these three fields) different semantics. This is discussed in detail in Section 7.3. 7.2 Motivating Examples 7.2.1 A Simple Example Consider Figure 7.3. 
We will not start out with the complications of multiple writes to final fields; a freeze, for the moment, is simply what happens at the end of a constructor. Although r1, r2 and r3 can see the value null, we will not concern ourselves with that; that just leads to a null pointer exception. The reference q is correctly published after the end of o?s constructor. Our 98 f1 is a final field; its default value is 0 Thread 1 Thread 2 Thread 3 o.f1 = 42 r1 = p; r3 = q; p = o; i = r.f1; j = r3.f1; freeze o.f1 r2 = q; q = o; if (r2 == r) k = r2.f1; We assume r1, r2 and r3 do not see the value null. i and k can be 0 or 42, and j must be 42. Figure 7.3: Example of Simple Final Semantics semantics guarantee that if a thread only sees correctly published references to o, that thread will see the correct value for o?s final fields. We therefore want to construct a special happens-before relationship between the freeze of o.f1 and the read of it as q.f1 in Thread 3. The read of p.f1 in Thread 2 is a different case. Thread 2 sees p, an incor- rectly published reference to object o; it was made visible before the end of o?s constructor. A read of p.f1 could easily see the default value for that field, if a compiler decided to reorder the write to p with the write to o.f1. No read of p.f1 should be guaranteed to see the correctly constructed value of the final field. What about the read of q.f1 in Thread 2? Is that guaranteed to see the correct value for the final field? A compiler could determine that p and q point to the same object, and therefore reuse the same value for both p.f1 and q.f1 for that thread. We want to allow the compiler to remove redundant reads of final fields wherever possible, so we allow k to see the value 0. 99 One way to conceptualize this is by thinking of an object being ?tainted? for a thread if that thread reads an incorrectly published reference to the object. If an object is tainted for a thread, the thread is never guaranteed to see the object?s correctly constructed final fields. More generally, if a thread t reads an incorrectly published reference to an object o, thread t forever sees a tainted version of o without any guarantees of seeing the correct value for the final fields of o. In Figure 7.3, the object is not tainted for Thread 3, because Thread 3 only sees the object through p. Thread 2 sees the object through both both the q reference and the p reference, so it is tainted. 7.2.2 Informal Guarantees for Objects Reachable from Final Fields In Figure 7.4, the final field o.f2 is a reference instead of being a scalar (as it was in Section 7.2.1). It would not be very useful if we only guaranteed that the values read for references were correct, without also making some guarantees about the objects pointed to by those references. In this case, we need to make guarantees for f2, the object p to which it points and the array pointed to by p.b. Thread 1 executes threadOne, Thread 2 executed threadTwo, and Thread 3 executed threadThree. We make a very simple guarantee: if a final reference is published correctly, and its correct value was guaranteed to be seen by an accessing thread (as de- scribed in Section 7.2.1), everything transitively reachable from that final reference is also guaranteed to be up to date as of the freeze. In Figure 7.4, o?s reference to p, p?s reference to a and the contents of a are all guaranteed to be seen by Thread 3. We call this idiom a dereference chain. We make one exception to this rule. 
In Figure 7.4, Thread 2 reads a[0] 100 class First { final Second f2; static Second p = new Second(); static First pub; static int [] a = {0}; public First() { f2 = p; p.b = a; a[0] = 42; } public void threadOne() { pub = new First(); } // First continues ... public void threadTwo() { int i = a[0]; Second r1 = pub.f2; int [] r2 = r1.b; int r3 = r2[0]; } public void threadThree() { Second s1 = pub.f2; int [] s2 = s1.b; int s3 = s2[0]; } } class Second { int [] b; } We assume r1 and s1 do not see the value null. r2 and s2 must both see the correct pointer to array a. s3 must be 42, but r3 does not have to be 42. Figure 7.4: Example of Transitive Final Semantics 101 Thread 1 Thread 2 o.f = p; i = pub.x; p.g = 42; j = pub.y; pub.x = o; k = j.f; freeze p.g; l = k.g; pub.y = o; Figure 7.5: Example of Reachability through two different references. A compiler might determine that these refer- ences are the same, and reuse i for r3. Here, a reference reachable from a final field is read by a thread in a way that does not provide guarantees; it is not read through the final field. If this happens, the thread ?gives up? its guarantees from that point in the dereference chain; the address is now tainted. In this example, the read of a[0] in Thread 2 can return the value 0. The definition of reachability is a little more subtle than might immediately be obvious. Consider Figure 7.5. It may seem that the final field p.g, read as k.g in Thread 2, can only be reached through one dereference chain. However, consider the read of pub.x. A global analysis may indicate that it is feasible to reuse its value for j. o.f and p.g may then be read without the guarantees that are provided when they are reached from pub.y. As with normal fields, an apparent dependency can be broken by a compiler analysis (see Section 3.3 for more discussion of how this affects normal fields). The upshot of this is that a reachability chain is not solely based on syntactic rules about where dereferences occur. There is a link in a dereference chain from any dynamic read of a value to any action that dereferences that value, no matter where the dereference occurs in the code. 102 Thread 1 Thread 2 Thread 3 o.f1 = 42; r1 = Global.a; s1 = Global.b; freeze o.f1; Global.b = r2; s2 = s1.f1; Global.a = o; s2 is guaranteed to see 42, if s1 is a reference to o. Figure 7.6: Freezes are Passed Between Threads 7.2.3 Additional Freeze Passing In this section, we will discuss some of the other ways that a read can be guar- anteed to see a freeze. Freezes Are Passed Between Threads Figure 7.6 gives an example of another guarantee we provide. If s1 is a reference to o, should s2 have to see 42? The answer to this lies in the way in which Thread 3 saw the reference to o. Thread 1 correctly published a reference to o, which Thread 2 then observed. Had Thread 2 then read a final field of o, it would have seen the correct value for that field; the thread would have to have ensured that it saw all of the updates made by Thread 1. To do this on SMP systems, Thread 2 does not need to know that it was Thread 1 that performed the writes to the final variable, it needs only to know that updates were performed. On systems with weaker memory constraints (such as DSMs), Thread 2 would need this information; we shall discuss implementation issues for these machines later. How does this impact Thread 3? Like Thread 2, Thread 3 cannot see a reference to o until the freeze has occurred. Any implementation that allows 103 p.x is initialized to 42. o.f is final. 
Thread 1 Thread 2 Thread 3 lock m; lock m; k = Global.b; Global.a = o; i = Global.a; l = k.x; o.f = p; Global.b = i; freeze o.f; j = i.x; unlock m unlock m; If the unlock in Thread 1 happens-before the unlock in Thread 2: i will see o. j will see 42. k will see o or null. l will see 42 or throw a null pointer exception.. Figure 7.7: Example of Happens-Before Interaction Thread 2 to see the writes to o that occurred prior to the freeze will therefore allow Thread 3 to see all of the writes prior to the freeze. There is therefore no reason not to provide Thread 3 with the same guarantees with which we provide Thread 2. Semantics? Interaction with Happens-Before Edges Now consider Figure 7.7. We want to describe the interaction between ordinary happens-before relationships and final field guarantees. In Thread 1, o is pub- lished incorrectly (before the freeze). However, if the code in Thread 1 happens- before the code in Thread 2, the normal happens-before relationships ensure that Thread 2 will see all of the correctly published values. As a result, j will be 42. 104 i = p.x; q = deserialize(p); j = p.x; k = q.x; i may be 42 or 0. j may be 42 or 0. k must be 42. i = p.x; deserialize(p); i may be 42 or 0. q = deserialize(p); j = q.x; j must be 42. // In p?s // constructor q.x = 42; i = q.x; i must be 42. The deserialize() method sets the final field p.x to 42 and then performs a freeze on p.x. It passes back a reference to the p object. This is done in native code. Figure 7.8: Four Examples of Final Field Optimization What about the reads in Thread 3? We assume that k does not see a null value: should the normal guarantees for final fields be made? We can answer this by noting that the write to Global.b in Thread 2 is the same as a correct publication of o, as it is guaranteed to happen after the freeze. We therefore make the same guarantees for any read of Global.b that sees o as we do for a read of any other correct publication of o. 105 7.2.4 Reads and Writes of Final Fields in the Same Thread Up to this point, we have only made guarantees about the contents of final fields for reads that have seen freezes of those final fields. This implies that a read of a final field in the same thread as the write, but before a freeze, might not see the correctly constructed value of that field. Sometimes this behavior is acceptable, and sometimes it is not. We have four examples of how such reads could occur in Figure 7.8. In three of the examples, a final field is written via deserialization; in one, it is written in a constructor. We wish to preserve the ability of compiler writers to optimize reads of final fields wherever possible. When the programs shown in Figure 7.8 access p.x before calling the deserialize() method, they may see the uninitialized value of p.x. However, because the compiler may wish to reorder reads of final fields around method calls, we allow reads of p.x to see either 0 or 42, the correctly written value. On the other hand, we do want to maintain the programmer?s ability to see the correctly constructed results of writes to final fields. We have a simple metric: if the reference through which you are accessing the final field was not used before the method that sets the final field, then you are guaranteed to see the last write to the final field. We call such a reference a new reference to the object. This rule allows us to see the correctly constructed value for q.x. 
Because the reference deserialize() returns is a new reference to the same object, it provides the correct guarantees. For cases where a final field is set once in the constructor, the rules are simple: the reads and writes of the final field in the constructing thread are ordered according to program order. 106 q.this$0 and p.x are final Thread 1 Thread 2 // in constructor for r = Global.b; // p s = r.this$0; q.this$0 = p; t = s.x; freeze q.this$0; p.x = 42; freeze p.x; Global.b = p; t should be 42 Figure 7.9: Guarantees Should be made via the Enclosing Object We must treat the cases (such as deserialization) where a final field can be modified after the constructor is completed a little differently. 7.2.5 Guarantees Made by Enclosing Objects Consider Figure 7.9. In Java, objects can be logically nested inside each other ? when an inner object is constructed, it is given a reference to the outer object, which is denoted in bytecode by $0. In Thread 1, the inner object q is constructed inside the constructor for p. This allows a reference to p to be written before the freeze of p.x. The reference is now tainted, according to the intuition we have built up so far: no other thread reading it will be guaranteed to see the correctly constructed values for p.x. However, Thread 2 is guaranteed not to see the final fields of p until after p?s constructor completes, because it can only see them through the correctly published variable Global.b. Therefore, it is not unreasonable to allow this 107 Initially, a.ptr points to b, and b.ptr points to a. a.o, b.o and obj.x are all final. Thread 1 Thread 2 b.o = obj; r1 = A.ptr; freeze b.o; r2 = r1.o; a.o = obj; r3 = r2.x freeze a.o; obj.x = 42; freeze obj.x; s1 = B.ptr; A = a; s2 = s1.o; B = b; s3 = s2.x; Figure 7.10: Cyclic Definition Causes Problems thread to be guaranteed to see the correct value for p.x. In general, we the semantics to reflect the notion that a freeze for an object o is seen by a thread reading a final field o.f if o is only read through a dereference chain starting at a reference that was written after the freeze of o.f. 7.3 Full Semantics The semantics for final fields are as follows. A freeze action on a final field f of an object o takes place when a constructor for o in which f is written exits, either normally or abruptly (because of an exception). Reflection and other special mechanisms (such as deserialization) can be used to change final fields after the constructor for the object completes. The set(...) method of the Field class in java.lang.reflect may be used to this effect. If the underlying field is final, this method throws an IllegalAccessException 108 unless setAccessible(true) has succeeded for this field and the field is non- static. If a final field is changed via such a special mechanism, a freeze of that field is considered to occur immediately after the modification. Final Field Safe Contexts An implementation may provide a way to execute a block of code in a final field safe context. Actions executed in a final field safe context are considered to occur in a separate thread for the purposes of Section 7.3, although not with respect to other aspects of the semantics. The actions performed within a final field safe context are immediately followed in program order by a synthetic action marking the end of the final field safe context. Replacement and/or Supplemental Ordering Constraints For each execution, the behavior of reads is influenced by two additional partial orders, dereference chain (dc?) 
and memory chain (mc?), which are considered to be part of the execution (and thus, fixed for any particular execution). These partial orders must satisfy the following constraints (which need not have a unique solution): ? Dereference Chain If an action a is a read or write of a field or element of an object o by a thread t that did not construct o, then there must exist some read r by thread t that sees the address of o such that r dc? a. ? Memory Chain There are several constraints on the memory chain order- ing: a) If r is a read that sees a write w, then it must be the case that w mc? r. 109 b) If r and a are actions such that r dc? a, then it must be the case that r mc? a. c) If w is a write of the address of an object o by a thread t that did not construct o, then there must exist some read r by thread t that sees the address of o such that r mc? w. d) If r is a read of a final instance field of an object constructed within a final field safe context ending with the synthetic action a such that a po? r, then it must be the case that a mc? r. With the addition of the semantics for final fields, we use a different set of ordering constraints for determining which writes occur before a read, for purposes of determining which writes can be seen by a read. We start with normal happens-before orderings, except in cases where the read is a read of a final instance field and either the write occurs in a different thread from the read or the write occurs via a special mechanism such as reflection. In addition, we use orderings derived from the use of final instance fields. Given a write w, a freeze f, an action a (that is not a read of a final field), a read r1 of the final field frozen by f and a read r2 such that w hb? f hb? a mc? r1 dc? r2, then when determining which values can be seen by r2, we consider w hb? r2 (but these orderings do not transitively close with other hb? orderings). Note that the dc? order is reflexive, and r 1 can be the same as r2. Note that these constraints can arise regardless of whether r2 is a read of a final or non-final field. We use these orderings in the normal way to determine which writes can be seen by a read: a read r can see a write w if r is ordered before w, and there is no intervening write wprime ordered after w but before r. 110 Static Final Fields The rules for class initialization ensure that any thread that reads a static field will be synchronized with the static initialization of that class, which is the only place where static final fields can be set. Thus, no special rules in the JMM are needed for static final fields. Static final fields may only be modified in the class initializer that defines them, with the exception of the java.lang.System.in, java.lang.System.out, and java.lang.System.err static fields, which can be modified respectively by the java.lang.System.setIn, java.lang.System.setOut, and java.lang. System.setErr methods. 7.4 Illustrative Test Cases and Behaviors of Final Fields In order to determine if a read of a final field is guaranteed to see the initialized value of that field, you must determine that there is no way to construct a partial order mc? without providing the chain f hb? a mc? r1 from the freeze f of that field to the read r1 of that field. An example of where this can go wrong can be seen in Figure 7.11. An object o is constructed in Thread 1 and read by Threads 2 and 3. Dereference and memory chains for the read of r4.f in Thread 2 can pass through any reads by Thread 2 of a reference to o. 
On the chain that goes through the global variable p, there is no action that is ordered after the freeze operation. If this chain is used, the read of r4.f will not be correctly ordered with regards to the freeze operation. Therefore, r5 is not guaranteed to see the correctly constructed value for the final field. The fact that r5 does not get this guarantee reflects legal transformations 111 f is a final field; its default value is 0 Thread 1 Thread 2 Thread 3 r1.f = 42; r2 = p; r6 = q; p = r1; r3 = r2.f; r7 = r6.f; freeze r1.f; r4 = q; q = r1; if (r2 == r4) r5 = r4.f; We assume r2, r4 and r6 do not see the value null. r3 and r5 can be 0 or 42, and r7 must be 42. Figure 7.11: Final field example where reference to object is read twice by the compiler. A compiler can analyze this code and determine that r2.f and r4.f are reads of the same final field. Since final fields are not supposed to change, it could replace r5 = r4.f with r5 = r3 in Thread 2. Formally, this is reflected by the dereference chain ordering (r2 = p) dc? (r5 = r4.f), but not ordering (r4 = q) dc? (r5 = r4.f). An alternate partial order, where the dereference chain does order (r4 = q) dc? (r5 = r4.f) is also valid. However, in order to get a guarantee that a final field read will see the correct value, you must ensure the proper ordering for all possible dereference and memory chains. In Thread 3, unlike Thread 2, all possible chains for the read of r6.f include the write to q in Thread 1. The read is therefore correctly ordered with respect to the freeze operation, and guaranteed to see the correct value. In general, if a read R of a final field x in thread t2 is correctly ordered with respect to a freeze F in thread t1 via memory chains, dereference chains, and happens-before, then the read is guaranteed to see the value of x set before the 112 a is a final field of a class A Thread 1 Thread 2 r1 = new A; r3 = p; r2 = new int[1]; r4 = r3.a; r1.a = r2; r5 = r4[0] r2[0] = 42 freeze r1.a; p = r1; Assuming Thread 2 read of p sees the write by Thread 1, Thread 2 reads of r3.a and r4[0] are guaranteed to see the writes to Thread 1. Figure 7.12: Transitive guarantees from final fields freeze F. Furthermore any reads of elements of objects that were only reached in thread t2 by following a reference loaded from x are guaranteed to occur after all writes w such that w hb? F. Figure 7.12 shows an example of the transitive guarantees provided by final fields. For this example, there is no dereference chain in Thread 2 that would permit the reads through a to be traced back to an incorrect publication of p. Since the final field a must be read correctly, the program is not only guaranteed to see the correct value for a, but also guaranteed to see the correct value for contents of the array. Figure 7.13 shows two interesting characteristics of one example. First, a ref- erence to an object with a final field is stored (by r2.x = r1) into the heap before the final field is frozen. Since the object referenced by r2 isn?t reachable until the store p = r2, which comes after the freeze, the object is correctly published, and guarantees for its final fields apply. 113 f is a final field; x is non-final Thread 1 Thread 2 Thread 3 r1 = new ; r3 = p; r5 = q; r2 = new ; r4 = r3.x; r6 = r5.f; r2.x = r1; q = r4; r1.f = 42; freeze r1.f; p = r2; Assuming that Thread 2 sees the writes by Thread 1, and Thread 3?s read of q sees the write by Thread 2, r6 is guaranteed to see 42. 
Figure 7.13: Yet Another Final Field Example

This example also shows the use of rule (c) for memory chains. The memory chain that guarantees that Thread 3 sees the correctly initialized value for f passes through Thread 2. In general, this allows for immutability to be guaranteed for an object regardless of which thread writes out the reference to that object.

7.5 Permissible Optimizations

The question for optimizing final fields is the same one we address when optimizing normal fields: what reorderings can a compiler writer prise out of these semantics? To be more precise, we must address two issues: first, what transformations can be performed on normal fields but not final fields? Next, what transformations can be performed on final fields but not on normal fields?

7.5.1 Prohibited Reorderings

The most important guarantee that we make for the use of final fields is that if an object is only made visible to other threads after its constructor ends, then those other threads will see the correctly initialized values for its final fields. This can be easily derived from the semantics. Freezes occur at the end of constructors; the only way for another thread to read a final field of an object that has been properly constructed is by a combination of dereference and memory chains from the point at which that object was properly published (i.e., after the constructor ends). Thus, the requirement that the write in the chain w -hb-> f -hb-> a -mc-> r1 -dc-> r2 must be seen by the read is trivially true.

It is therefore of paramount importance that a write of a reference to an object where it might become visible to another thread never be reordered with respect to a write to a final field of the object pointed to by the reference. In addition, such a write should never be reordered with anything reachable from a final field that was written before the end of the constructor.

As an example of this, we look back at Figure 7.4. In that figure, the write to pub in Thread 1 must never be reordered with respect to anything that takes place before the read. If such a reordering occurred, then Thread 3 might be able to see the reference to the object without seeing the correctly initialized final fields.

7.5.2 Enabled Reorderings

There is one principle that guides whether a reordering is legal for final fields: the notion that "all references are created equal". If a thread reads a final field via multiple references to its containing object, it doesn't matter which one of those references is used to access the final field. None of those references will make more guarantees about the contents of that final field than any other.

ready is a boolean volatile field, initialized to false. a is an array.

    Thread 1          Thread 2
    a = {1,2,3};      i = pub.x;
    ready = true;     j = i.f;
    o.f = a;          if (ready) {
    pub.x = o;            k = j[0];
    freeze o.f;       }

Figure 7.14: Happens-Before Does Matter

The upshot of this is that as soon as a thread sees a reference to an object, it may load all of that object's final fields, and reuse those values regardless of intervening control flow, data flow, or synchronization operations.

Consider once more the code in Figure 7.5. As soon as the read of pub.x occurs, all of the loads of o's final fields may occur; the reference to o read from pub.x is "equal" to the reference read from pub.y. This might cause uninitialized values to be seen for p.g and o.f, as the read of pub.x can occur before the freeze of p.g.
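The following hedged sketch (class and field names are illustrative) shows a transformation this principle permits: because h.f is final, its value may be loaded once and reused, even across an intervening synchronized block, whereas a non-final field would have to be reloaded after the lock is acquired.

    class Holder {
        final int f;
        Holder(int f) { this.f = f; }
    }

    class FinalFieldCaching {
        static void doSomething() { }

        static int twoReads(Holder h, Object lock) {
            int r1 = h.f;              // first read of the final field
            synchronized (lock) {
                doSomething();         // synchronization does not invalidate r1
            }
            int r2 = h.f;              // may legally be rewritten as: int r2 = r1;
            return r1 + r2;
        }
    }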
This should not be taken to mean that normal fields reachable through final fields can always be treated in the same way. Consider Figure 7.14. As a reminder, a happens-before ordering is enforced between a write to and a read of a volatile. In this figure, the volatile enforces a happens-before ordering between the write to the array and the read of j[0]: assuming that the other reads see the correct values (which is not guaranteed by the rest of the semantics), then k is required to have the value 1. 116 7.6 Implementation on Weak Memory Orders One of the problems with guaranteeing where writes are seen without explicit lock and unlock actions to provide ordering is that it is not always immediately obvious how the implementation will work. One useful thing to do is consider how this approach might be implemented on a system where few guarantees are given about memory coherence. On such machines, it is often the case that the hardware will reorder your actions, spoiling some of the guarantees we want to give to final fields. In this section, we discuss sample implementation strategies that allow im- plementation of final fields on symmetric multiprocessor systems under various architectures and under lazy release consistent distributed shared memory sys- tems. 7.6.1 Weak Processor Architectures To implement the semantics as we have described them, a processor must make guarantees for both reading and writing threads. For the writer thread, a proces- sor must not allow stores to final fields to be seen to be reordered with later stores of the object that contains the final field. On many architectures, the processor must be explicitly prevented from doing this by some sort of explicit memory barrier instruction, which can be performed after the constructor finishes. Many processor architectures, including SPARC TSO [WG94] and Intel x86, do not require this memory barrier ? there is an implicit barrier between stores in the program. Other, more relaxed architectures, such as Intel?s IA-64 [Int02] and Alpha [Com98] architectures, do require an explicit memory barrier. On the reader side, there needs to be an ordering between a dereference of 117 an object, and a dereference of final fields of that object. Most architectures do not require such a memory barrier; generally, ordering is preserved between dependent loads. The notable exception to this is the Alpha architecture. On the Alpha, no such ordering is provided. One simple strategy would be to insert a memory barrier after each load of an object that has a final field. This, however, is an extremely heavyweight strategy, and is not recommended. Before we proceed with alternate strategies, it is worth mentioning that there are no plans for future versions of Java to be implemented on Alpha processors. The Intel IA-64 platform was originally an exception to this as well, but discus- sions between Intel and the JSR group overseeing the new Java memory model caused a rewrite in the appropriate section of the IA-64 memory model. An alternative is to maintain an invariant that if a thread or stack has a ref- erence to a heap allocated object, then the thread has done the memory barriers required to see the constructed version of that object. The simple implementation of this is to put a memory barrier after each operation that loads a reference (get- field, getstatic or aaload). 
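The sketch below illustrates, in hedged form, where such a barrier would sit. The class is illustrative, and the barrier itself appears only as a comment, since it would be emitted by the JIT compiler for a weakly ordered processor rather than written in Java source.

    class Node {
        final int payload;
        Node next;
        Node(int payload) { this.payload = payload; }
    }

    class Reader {
        static int readPayload(Node head) {
            Node n = head.next;   // getfield that loads a reference
                                  // <-- under this strategy, a read barrier is
                                  //     emitted here, so the dependent loads
                                  //     below see the fully constructed object
            return n.payload;     // the final field's constructed value is now visible
        }
    }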
Since we only need to put a barrier after getfields that load a reference, as opposed to putting one before all getfields, this is somewhat better than the simple strategy. A further optimization is possible if we note that it is not necessary to perform a memory barrier after a getfield of a final field. The reasoning here is that if a thread is up to date with respect to an object X, and then loads a final field of X that references Y, then Y must be older than X and the thread must therefore be up to date with respect to Y. Finally, if the reference we load is null, we can skip the memory barrier (this 118 Memory Barriers At Which Action? Number of Required Memory Barriers Without Making Fields Final getfields 202,869,185 loads of refs 64,993,844 nonfinal loads of refs 64,809,454 nonfinal nonnull loads of refs 50,011,951 After Making Fields Final nonfinal loads of refs 22,628,420 nonfinal nonnull loads of refs 7,832,852 Figure 7.15: Number of Required Memory Barriers on Weak Memory Orders would involve a runtime test to see if the ref is null, and doing the memory barrier only if it is not). To test the overhead for both of these solutions, we instrumented Sun?s JDK 1.2.2 JVM (a relatively early VM with an easily instrumented architecture). Final fields are rarely used in standard benchmarks; for some tests we adjusted the SPECjvm98 benchmarks so that all fields that are never modified were made into final fields. The total number of required memory barriers over all benchmarks can be seen in Figure 7.15. More implementation strategies for final fields on the Alpha, that decrease the number of required memory barriers further, are discussed in [Man01]. 7.6.2 Distributed Shared Memory Based Systems Imagine a Lazy Release Consistent (LRC) machine (see [KCZ92]): a processor acquires data when a lock action occurs, and releases it when an unlock action occurs. The data ?piggyback? on the lock acquire and release messages in the 119 form of ?diffs?, a listing of the differences made to a given page since the last acquire of that memory location. Let us assume that each object with a final field is allocated in space that had previously been free. The only way for a second processor to see a pointer to that object at all is to perform an acquire after the processor constructing the object performed a release. If the release and the acquire do not happen, the second processor will never see a pointer to that object: in this case, neither the object?s final fields nor anything reachable in a dereference chain from its final fields will appear to be incorrectly initialized. Let us now assume that the acquire and release do happen. As long as these actions take place after object has been constructed (and there is no code motion around the end of the constructor), the diffs that the second processor acquires are guaranteed to reflect the correctly constructed object. This property makes implementation of final fields on a LRC-based DSM possible. 120 Chapter 8 Related Work A scientist will never show any kindness for a theory which he did not start himself. - Mark Twain, A Tramp Abroad 8.1 Architectural Memory Models Most work on memory models has been done because of the need to verify prop- erties of hardware architectures. An excellent primer for this work is available [AG96]. An early discussion of memory models can be found in [Lam78], which provides the widely used definition for sequential consistency. 
The needs of memory models for hardware architectures differ from the needs of programming language memory models. The most obvious difference is the lack of a need for language-specific features, such as immutability support, type safety, class initialization, and finalization. The design of architecture-based models must focus on issues like when writes get sent to a cache or to main memory, whether instructions are allowed to issue out of their original order, whether reads must block waiting for their result, and so on. There is very little discussion of the interaction of compiler technology and processor architectures in much of the literature.

Perhaps the greatest difference between the research that has been done for architectures and our research is the lack of a full treatment of causality (as in Section 4.3). Architecture memory models are generally classified along a continuum from strong (which does not allow the results of instructions to be viewed out of order) to weak (which allows for much more speculation and reordering). Our work has to provide a "catch-all" for these memory models; Java should run efficiently on as many machines as possible. Our memory model can therefore be categorized as "relaxed".

8.2 Processor Memory Models

Each memory model provides synchronization operations that allow a strengthening of the model for that operation. On processors, there is customarily processor support for a "memory barrier" (membar), an operation that makes guarantees about whether instructions are finished. In addition, there may also be completion flags to the memory barriers, which determine what it means for an instruction to be finished. Finally, individual load and store operations for a processor may have special memory semantics. When these operations are needed depends on the strength of the memory model. Here, we examine the requirements of several different processors.

SPARC

The SPARC processor has three different memory models: Total Store Order (TSO), Partial Store Order (PSO) and Relaxed Memory Order (RMO) [WG94]. Sun's Solaris operating system is implemented in TSO; this mode allows the system to execute a write followed by a read out of program order, but forbids other reorderings.

Initially, a == b == 0

Thread 1      Thread 2          Thread 3
a = 1;        r1 = a;           r2 = b;
              if (r1 == 1)      if (r2 == 1)
                  b = 1;            r3 = a;

Figure 8.1: Can r2 == 1, r3 == 0?

One interesting additional relaxation that TSO allows is illustrated by Figure 8.1 (as seen in [AG96]). If the first two threads are on the same processor, the value for a written in Thread 1 can be seen early by Thread 2, but not by Thread 3. In the meantime, the write to b can occur, and be seen by Thread 3. The result of this would be r2 == 1 and r3 == 0. TSO allows this relaxation, which is sometimes called write atomicity relaxation.

An example of the way in which a SPARC membar can affect memory is the modifier #StoreLoad, which ensures that all stores before the memory barrier complete before any loads after it start. There are also #LoadStore, #StoreStore and #LoadLoad modifiers, which have the obvious related implications.

Alpha

The Alpha makes very few guarantees about the ordering of instructions. A read may be reordered with a write, and a write may be reordered with another write, as long as they are not to the same memory location. The Alpha does enforce ordering between two reads of the same memory location, however [Com98].
Finally, the Alpha allows for write atomicity relaxation when the threads involved share a single processor.

On the Alpha 21264, the MB instruction ensures that no other processor will see a memory operation that follows it before they see all of the memory operations that precede it. The strength of this operation is important because of the weak nature of the Alpha memory model. The Alpha also has a WMB (Write Memory Barrier) instruction, which prevents two writes from being reordered.

IA-64

The IA-64 memory model is similar to that of the Alpha. The IA-64 memory fence instruction (mf) takes two forms; the ordering form ensures that "all data memory accesses are made visible prior to any subsequent memory accesses" [Int99].

The IA-64 allows one relaxation that the Alpha does not allow: writes can be seen early by other processors, not just the one that performed the write. This relaxation may be a little obscure, but it is simply another reflection of Figure 8.1. We have already mentioned that the result r2 == 1, r3 == 0 can happen on most architectures, including Alpha, if the first two threads are on the same processor. IA-64, unlike Alpha, allows this result to be seen even if the threads are all running on different processors.

Other

There are a number of other memory orders, of both the purely theoretical and the implemented variety. The PowerPC model, for example, is substantially similar to the IA-64 model, including the write atomicity relaxation for multiple processors.

Release Consistency

Distributed shared memory systems, because of the high cost of communication between the processors, tend to have very weak memory models. A common model, called release consistency [GLL+90], allows a processor to defer making its writes available to other threads/processors until it performs a release. This allows arbitrary reordering of code, but must respect all synchronization actions; all acquires and releases affect global memory.

A relaxation of release consistency, called lazy release consistency [KCZ92], is similar to our model. Not only does lazy release consistency allow a processor to defer making its writes available to other processors until it performs a release operation; the publication of the writes can in fact be deferred until the other processor performs an acquire operation on the same lock that was released by the first processor. Because release consistency requires that all writes be released immediately, it is sometimes called eager release consistency.

Location consistency [GS00a] is also similar to our model. Each memory location is said to have a partial order of writes and synchronization actions associated with it. If a processor sees an acquire for a memory location l, it can read the last write made to l before the last release on l, or any write made to l after that point. This is similar to lazy release consistency; the main difference is that acquires and releases only affect a single memory location.

8.3 Programming Language Memory Models

Our work focuses on memory models for programming languages, whose needs differ significantly from those of hardware memory models. Memory models for programming languages need the ability to deal with programming-language-level constructs, such as final fields and volatile variables. In addition, unlike processor models, programming language models need to be applicable to a wide variety of architectures: any architecture which runs programs written in that language needs to be able to support the memory model.
8.3.1 Single-Threaded Languages

Programming languages that are either single-threaded or only support multiple threads through application libraries do not require their own memory models. C [KR88] and C++ [Str97] are prime examples of this: they use libraries such as POSIX threads (pthreads) [LB98] to support multi-threading. If a system uses pthreads, there is no guarantee of portability across all platforms that support that language.

In a single-threaded language, there is no concept equivalent to the one for which we use volatile fields. The sole purpose of a volatile field in our semantics is to enable efficient interthread communication; in an environment where there is no interthread communication, there is no meaningful use for volatile. C and C++ do, of course, have a volatile modifier. It is traditionally used for preventing code reordering when dealing with hardware-dependent issues like memory-mapped I/O. According to the C++ standard [Str97, §r7.1.6], there are "no implementation-independent semantics for volatile objects".

Similarly, it is not necessary for a single-threaded language to provide multi-threaded semantics for immutable fields. The const qualifier in C and C++ enforces single-threaded immutability, but there is no way for a const field to imply that any object or structure reachable through that field is visible to another thread.

8.3.2 Multi-Threaded Languages

Multi-threaded languages generally fall into one of two categories, message passing and shared memory. Each has its own specific memory model needs.

Message Passing Languages

In message passing languages, data are shared between threads by explicit communication. If one thread (a sender) wants to communicate with another (a receiver), it sends a message to it. The receiver then accepts the message. Because all communication between threads in message passing languages is explicit, they have no need for a detailed memory model. If a thread receives the data from the sender, it can be safely assumed that communication has taken place.

One example of a message passing language is Occam [Jon87], which uses a message passing framework similar to that of CSP [Hoa85]. Ada [Ada95] allows for message passing between threads using a rendezvous, but it also allows shared memory parallelism using locking and fork/join parallelism (see below).

Shared Memory Languages

Because they can rely on both implicit and explicit communication, memory models for shared memory languages are more complicated. Nearly every language with built-in multithreading has made some attempt to define what communication between threads entails. Many multi-threaded languages contain some sort of lock construct; when a thread a acquires a lock released by a thread b, the memory values visible to b at the time it released the lock must be visible to a. Languages with built-in locking include Java [GJS96], Ada [Ada95], Cilk [Gro00] and most of the other multithreaded languages.

The language Cilk employs a form of fork/join parallelism. A parent thread can spawn (or fork) child processes. If the parent wishes, it can execute a sync statement (or a join); the parent will then wait for all of the spawned threads to complete before it continues. The forked and joined threads form a DAG; writes are visible from a parent in the DAG to its children.
This provides an easy-to-understand memory model which is sometimes referred to as DAG consistency [BFJ+96].

Hiding the Memory Model

A certain amount of work has been done to try to hide the fact that the memory model for programming languages is not sequential consistency. A trivial way of doing this on many architectures is to insert memory and compiler barriers between every memory access; not only is this obviously tremendously expensive in and of itself, but it also negates many of the design features that provide for efficient execution.

Most programming language implementations make the guarantee that correctly synchronized programs have sequentially consistent semantics. This means that non-sequentially consistent results are only visible when data races are present. In fact, Shasha and Snir [SS88] show that non-sequentially consistent results can only occur when there is a cycle in the graph of happens-before and conflict edges (if two accesses conflict, there is a conflict edge between them). Therefore, detecting where to place memory barriers is a matter of detecting these cycles, and placing memory barriers accordingly.

Midkiff, Padua and Cytron [MPC90] extended this work to apply to arrays. Shasha and Snir's original analysis did not take locking and other explicit synchronization into account; this was addressed by the compiler analysis of Krishnamurthy and Yelick [KY95]. Lee and Padua [LP01] developed a similar compiler analysis, and showed that minimizing the number of memory barriers using their technique is NP-complete.

The work in this area places constraints on compilers, and provides no benefit to correctly written code. Our work takes the position that there are few, if any, correct programming patterns for high-level languages that involve data races. We therefore provide the minimum number of guarantees for such code to ensure basic safety properties. We do not constrain compilers or architectures to sacrifice performance or complexity for programs that are likely incorrect (usually because of a misunderstanding of the semantics of multithreaded code).

Semantics of Multithreaded Java

Java's original memory model [GJS96, §17] is extremely difficult to understand; there have been several attempts to formalize it [CKRW97, CKRW98, GS97]. Unfortunately, due to its complexity, the resulting formalisms were contradictory and difficult to understand, at best. Gontmakher and Schuster [GS97, GS98] believed that simple reordering of independent statements was illegal. However, they did not consider prescient stores (as described in the original JLS [GJS96]). They later presented a revised model [GS00b].

The same authors [GS97, GS00b] proved that the original Java memory model required a property called coherence (as originally described in [ABJ+93]). This innocent-sounding property requires that the order in which program actions occur in memory is seen to be the same on each thread. Specifically, coherence disallows a common case for reordering reads when those reads can see writes that occur in another thread via a data race [Pug99].

The language specification was, therefore, obviously not complete. A language's memory model must take into account the fact that both processors and compilers can perform optimizations that may change the nature of the code, and that if consistent behavior is desired, a specification must account for what may happen when synchronization is not used to emulate sequential consistency.
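To illustrate the kind of transformation that coherence forbids, here is a sketch in the spirit of the example in [Pug99] (added here; the class and parameter names are hypothetical).

    // Hypothetical sketch: p and q may refer to the same object, and another
    // thread may write p.x concurrently via a data race.
    class CoherenceExample {
        int x;

        static void reader(CoherenceExample p, CoherenceExample q) {
            int r1 = p.x;
            int r2 = q.x;
            int r3 = p.x;   // a compiler would like to reuse r1 here
            // If the compiler replaces this read with r3 = r1 (redundant
            // read elimination) and another thread writes p.x between the
            // first two reads, we can observe r1 == r3 == 0 but r2 == 1
            // when p == q. The two reads of p.x then appear out of order
            // with respect to the writes to x, which coherence forbids;
            // the original model therefore ruled out this common
            // optimization.
        }
    }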
A number of proposals for replacement memory models for Java have emerged since the deficiencies of the original model became apparent. Most of these are based around modeling techniques originally used for architecture memory models; most are also operational.

Commit / Reconcile / Fence Approach

Maessen, Arvind and Shen [AMS00] used the Commit / Reconcile / Fence protocol [SAR99] to propose a memory model for Java. That proposal has several major weaknesses. First, it does not distinguish between writes to final fields and writes to non-final fields; final field semantics are guaranteed through the use of a memory barrier at the end of a constructor. This means that writes to any fields of an object a in a's constructor must be guaranteed to be seen by other threads if they access a via a reference published after that memory barrier. This is an onerous burden for architectures with weak memory models and disallows optimizations that might be performed on objects with no final fields.

Initially, a == 0

Thread 1      Thread 2
a = 1;        a = 2;
i = a;        j = a;

Figure 8.2: CRF does not allow i == 2 and j == 1

Additionally, problems arise in their approach because all communication is performed through a single, global memory. For example, in Figure 8.2 the result i == 2 and j == 1 is prohibited. This is because CRF requires an ordering over writes to global memory; one thread cannot perceive that a = 1 happens before a = 2 while the other thread perceives that a = 2 happens before a = 1.

Finally, their model does not allow for as much elimination of "useless" synchronization operations. The CRF-based specification provides a special rule to skip the communication actions associated with a lock action if that thread is the last one that released the lock. However, there is no symmetric rule for unlock actions or volatile variables, so the communication associated with them can never be eliminated. Moreover, even though the inter-thread communications may be skipped for lock actions, it is unclear that instructions can be moved around thread-local locks.

Operational Semantics

The work that led to the development of the model in this dissertation [MP01b, MP01a] originally used an operational semantics to describe a memory model for Java. The semantics in those papers describes a machine that can take as input a Java program, and produce as output any legal execution of that program.

Initially, x == y == z == 0

Thread 1      Thread 2      Thread 3
z = 1;        r1 = y;       r3 = x;
              r2 = z;       y = 1;
              x = r2;

We must allow r1 == r2 == r3 == 1

Figure 8.3: A Purely Operational Approach Does Not Work

One of the major problems with that model is the operational approach it takes. The final Java memory model allows a write to occur early if an execution in which that write occurs can be demonstrated. The original, operational approach we described allows a write to be performed early if three conditions are fulfilled:

1. The write would occur (to the same variable, with the same value),

2. The write would not be seen by the thread that performed it before the point at which it occurred originally, and

3. The write would not appear to occur out of thin air.

A major problem with this approach is illustrated by the example in Figure 8.3. In order to allow the reads of y and x to return the value 1, they need to be able to be reordered with the writes to x and y, respectively. This is similar to the sorts of reorderings we have seen.
The approach allows the writes to occur early, but only if the write is guaranteed to occur and to write out the same value. In this case, the write is not guaranteed to occur with the same value: the read of z may return either 0 or 1, so either value may be written. Thus, the approach in that paper does not allow this result.

Unified Memory Model Approach

Yang, Gopalakrishnan and Lindstrom [YGL01, YGL02] attempt to characterize the approaches taken in [MP01b, MP01a, AMS00] using a single, unified memory model (called UMM). They [YGL02, Yan04] also use this notation to propose a third memory model.

The UMM-based approach does not attempt to solve some of the issues that a memory model for Java is required to solve. The paper suggests that it is more important to formulate a simple memory model than one that fulfills the informal requirements laid out by the Java community. First, they do not allow the result in Figures 3.3 and 8.4; an action in their model is only allowed to be reordered around a control or data dependence if the action is guaranteed to occur in all executions. This disallows, for example, redundant load elimination, an extremely important compiler optimization.

Initially, x == y == 0

Thread 1          Thread 2
r1 = x;           r3 = y;
r2 = r1 | 1;      x = r3;
y = r2;

r1 == r2 == r3 == 1 is legal behavior

Figure 8.4: Compilers Can Think Hard About When Actions Are Guaranteed to Occur

Their treatment of final fields is equally lacking. First, they do not allow for transitive visibility effects for finals. For the example of an immutable String object with an integer field length and a field contents referencing an array of characters, the integer would be correct and the reference would point to the correct array, but the array's contents would not be guaranteed to be up to date. This simplifies the model, but does not satisfy the safety conditions. Second, their rules for final fields have the effect that once a field has been frozen (at the end of a constructor), all other threads are guaranteed to see the final value rather than the default value. It is difficult to see how this could be implemented efficiently; this has therefore never been part of the agreed-upon safety guarantees for final fields. They do not address this issue.

Other

Kotrajaras [Kot01] proposes a memory model for Java that is based on the original, flawed model. It suffers not only from the complexity of that model, but from its reliance on a single, global shared memory. It also fails to describe adequately what happens in the case of a cycle in control dependency. Furthermore, the proposed model violates the informal rules for final fields by disallowing references to the object being constructed from escaping their constructors.

Saraswat [Sar04] presents a memory model for Java based on solving a system of constraints between actions for a unique fixed point, rather than depending on control and data dependence. Thus, Saraswat's model allows the behavior in Figure 3.3; given which writes are seen by each read, there is a unique fixed-point solution. However, Saraswat's model does not allow the behavior in Figure 8.4; in that example, given the binding of reads to writes, there are multiple fixed-point solutions.

8.4 Final Fields

The C# language [ECM02a] has a readonly keyword that is similar to Java's final modifier.
Fields marked as readonly do not have any additional threading semantics in C#'s runtime environment, the Common Language Infrastructure [ECM02b]; they must, as in earlier revisions of Java, be treated as normal fields for the purposes of synchronization.

Chapter 9

Future Work

Answers are easy. It's asking the right questions which is hard.
- Chris Boucher, The Face of Evil

This work is the first to explore programming language memory models in depth. While it has resolved the issue of what the Java programming language's memory model is to be, there are many potential areas for future work. This chapter explores some of these areas.

9.1 Performance Testing and Optimization

We have spent a lot of time in this dissertation detailing the results a Java program may have. We have also pointed out that as long as one of those results is obtained, any implementation is allowed. For example, we have seen how the Java memory model allows a wide variety of optimizations disallowed by the previous memory model.

Although we can never enumerate all possible implementations (there are an infinite number of choices that can be made), we do have a clearly defined notion of what is not allowed in an implementation. We have strengthened both volatile and final fields; having done so, it is important to measure the performance impact of that strengthening. This is a more difficult task than it may appear to be at first blush. There are several questions that need to be answered.

First is the question of what we use for comparison. The answer to this question probably must be an alternative model that does not strengthen volatile or final fields. This may come in the form of, say, an earlier implementation of Java, but older implementations do not include as many performance optimizations as newer ones, so direct comparisons are difficult.

More important than the basis for comparison is the problem of choosing benchmarks. In order to measure the impact of volatiles and finals effectively, one has to choose a benchmark that will be correct under both the old and new implementations. This has different implications for final and volatile fields.

For final fields, the impact is going to be that of a memory / compiler barrier at the end of a constructor of an object with at least one final field. We have performed some analysis of the impact of these barriers, as discussed in Chapter 7. In general, however, non-static final fields are not all that common in widely used benchmarks; our benchmarks made fields final that were not final previously. Furthermore, barriers are only needed at the end of a constructor when a reference to the object is passed to another thread via a data race. This is an uncommon case; more future work may eliminate a substantial number of barriers.

It is more difficult to evaluate the impact of volatile fields. All of the known algorithms that employ volatile fields require the semantics that we gave them. In developing the memory model, we found that programmers who were using volatile fields were invariably either tacitly assuming our semantics or writing code that would not have done what they wanted it to do. As a result, existing benchmarks that employ volatile require our semantics for correctness. Thus, it is useless to run those benchmarks on virtual machines that do not implement volatile correctly.
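For concreteness, the following sketch (added here; the class and field names are hypothetical) shows the kind of idiom those programmers were writing. It is correct only under the strengthened volatile semantics, which create a happens-before edge from the volatile write to the volatile read.

    // A publication idiom that tacitly assumes the new volatile semantics.
    class VolatilePublication {
        int data;
        volatile boolean ready;

        void writer() {
            data = 42;          // ordinary write
            ready = true;       // volatile write publishes it
        }

        void reader() {
            if (ready) {        // volatile read
                int r = data;   // guaranteed to see 42 under the new model;
                                // under the original model this could see 0
            }
        }
    }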
Future work that evaluates the performance impact of the new memory model must also take into account the new ways of improving performance that are allowed. That means exploiting newly available techniques to their fullest. This includes, but is not limited to:

- Newly permissible optimizations on monitors, such as synchronization elimination and lock coarsening.

- Newly available programming idioms, including lock- and wait-free algorithms that use volatile.

- Optimizations that reduce the cost of final and volatile variables, such as escape analysis to determine objects that do not need barriers at the end of constructors, and elimination of barriers from accesses to volatile fields of thread-local objects.

9.2 Better Programming Models

The need for backwards compatibility had a tremendous impact on the design of the Java memory model. Examples of this abound. Consider the semantics of final fields; they are substantially more complex simply because of the possibility that final fields can be changed by deserialization. Another example is the presence of potential data races in the language; this brings with it the need for causality semantics.

Java programmers should not have to deal with the full complexity introduced by the model. There are two ways of mitigating this complexity:

- Changes can be made to Java to reduce the difficulty of dealing with concurrency, or

- New programming languages and paradigms can be introduced with less complexity.

These two approaches are intertwined; features introduced in a new language (like lightweight transactions [HF03], for example) may be introduced in an older language by way of new APIs and system calls. Actual language changes in a language like Java must be made very carefully, because of backwards compatibility issues. New APIs are frequently used to introduce significant functionality (as per Isolates or Real-Time extensions [Jav02, Jav03]).

However, deciding what features to develop is a difficult task. For example, many attempts have been made to design a language that is completely free from data races [BST00, FF00]. However, because they eliminate data races by enforcing mutual exclusion, such languages only allow limited support for wait- and lock-free data structures. Some language designers are attempting to introduce the notion of atomicity (in the form of lightweight transactions, for example); however, such languages are still prototypical, and it is still unclear whether they can be implemented effectively and efficiently.

9.3 Other Programming Languages

Many of the issues for other languages were described in Chapter 8. There is some potential work to do for those languages that does not reflect existing work. As we have mentioned, most languages do not incorporate a well-designed memory model. Those that do tend to have limited options for concurrency control. It is quite easy to define semantics for languages where, for example, unsynchronized access to shared variables is impossible.

On the other hand, these issues may be more difficult for popular languages like C and C++. C and C++, for example, are single-threaded languages; this implies that there is no consistent specification that prohibits reordering synchronization operations with non-synchronization operations. The volatile modifier, for example, is defined to be "implementation dependent". This also implies that the semantics change depending on the threading library used and on the compiler / platform combination.
This leads to some serious potential anomalies when writing multithreaded code on these platforms. Consider the following C code:

    static short sharedArr[4];

    void t1() {
        sharedArr[0] = 1;
        sharedArr[2] = 3;
    }

    void t2() {
        sharedArr[1] = 2;
        sharedArr[3] = 4;
    }

Consider two threads, each acquiring and releasing a different lock, and executing t1 and t2 concurrently. One thread may read the array as a single 64-bit value, update it in a register and block. The other thread may read the array as a single 64-bit value, update it in a register and write out its changes. The first thread might then continue, and write out the value of the array stored in its register. This would cause the updates made in t2 to be lost. Java disallows this, but C does not; in fact, the pthread specification allows it.

In general, many of the most important questions for C and C++ are the same kind of questions we answered for Java. For example, should the volatile modifier have the same semantics? It is possible to make it stronger; for example, perhaps all writes to any volatile should be visible after a subsequent read of any other volatile. Decisions on this would affect the semantics of all synchronization mechanisms, including the use of locking and explicit memory barriers.

Another example is that of word tearing. In Java, reads and writes of 32-bit values are always atomic, while 64-bit values may be written or read in 32-bit chunks. C and C++ may require fewer guarantees. For example, if a 32-bit value is aligned improperly, it may be more difficult to provide atomicity guarantees.

Chapter 10

Conclusion

In this dissertation, we have outlined the necessary properties for a programming language memory model, and described how those properties can be achieved. The resulting model balances two crucial needs: it allows implementors flexibility in their ability to perform code transformations and optimizations, and it also provides a clear and simple programming model for those writing concurrent code.

The model meets the needs of programmers in several key ways:

- It provides a clear and simple semantics that explains how threads can use locking to interact through memory, providing crucial ordering and visibility properties.

- It provides a clear definition for the behavior of programs in the presence of data races, including the most comprehensive treatment to date of the dangers of causality and how they can be avoided.

- It refines the concept of volatile variables, which can be used in place of explicit memory barriers to design lock- and wait-free algorithms, while still providing the aforementioned crucial ordering and visibility properties.
- It strengthens the concept of final variables, which can now be used to provide thread-safe immutability regardless of the presence of data races.

This dissertation (and the model contained within) clarified and formalized these needs, balancing them carefully against a wide variety of optimizations and program transformations commonly performed by compilers and processor architectures. It also provided verification techniques to ensure that the model reflects this balancing act accurately and carefully. It is these two things in concert, the balance and the degree to which the model has been verified (both by peer review and by proof techniques), that has allowed this model to be used for practical applications.

The designers of the Itanium memory model, for example, changed their specification to account for the need for final fields to appear immutable without additional memory barriers. The architects of C and C++ are now looking into adapting some of the work that has been done for this model to their languages. And finally, of course, this model has been adopted as the foundation for concurrent programming in the Java programming language.

Appendix A

Finalization

"Dying is a very dull, dreary affair and my advice to you is to have nothing whatever to do with it."
- W. Somerset Maugham

In the Java programming language, the Object class has a protected method called finalize; this method can be overridden by other classes. The particular definition of finalize that can be invoked for an object is called the finalizer of that object. Before the storage for an object is reclaimed by the garbage collector, the Java virtual machine will invoke the finalizer of that object.

Finalizers provide a chance to free up resources that cannot be freed automatically by an automatic storage manager. In such situations, simply reclaiming the memory used by an object would not guarantee that the resources it held would be reclaimed.

The JLS does not specify how soon a finalizer will be invoked, except to say that it will occur before the storage for the object is reused. Also, the language does not specify which thread will invoke the finalizer for any given object. It is guaranteed, however, that the thread that invokes the finalizer will not be holding any user-visible synchronization locks when the finalizer is invoked. If an uncaught exception is thrown during finalization, the exception is ignored and finalization of that object terminates.

It should also be noted that the completion of an object's constructor (including all of its freezes for final fields) happens-before the execution of its finalize method (in the formal sense of happens-before).

Finalizers interact in a special way with final field semantics. Each finalizer takes place in a final field safe context. Additionally, as described in Section 5.2, each finalizer begins with a logical read of a reference to the object being finalized. The memory chain for that final field (see Section 7.1) passes through that read, providing immutability guarantees to the finalizer.

It is important to note that many finalizer threads may be active (this is sometimes needed on large SMPs), and that if a large connected data structure becomes garbage, all of the finalize methods for every object in that data structure could be invoked at the same time, each finalizer invocation running in a different thread.

The finalize method declared in class Object takes no action. The fact that class Object declares a finalize method means that the finalize method for any class can always invoke the finalize method for its superclass. This should always be done, unless it is the programmer's intent to nullify the actions of the finalizer in the superclass. Unlike constructors, finalizers do not automatically invoke the finalizer for the superclass; such an invocation must be coded explicitly.
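As an added illustration of that convention (a sketch with a hypothetical class name, not taken from the original text), the usual pattern is to chain to the superclass finalizer in a finally block, so that it runs even if the subclass finalizer throws:

    // Hypothetical example of explicitly chaining to the superclass finalizer.
    class LogFile {
        protected void finalize() throws Throwable {
            try {
                // release resources held by this object,
                // e.g. close an underlying file handle
            } finally {
                super.finalize();   // must be invoked explicitly,
                                    // unlike a superclass constructor
            }
        }
    }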
For efficiency, an implementation may keep track of classes that do not override the finalize method of class Object, or override it in a trivial way, such as:

    protected void finalize() throws Throwable {
        super.finalize();
    }

We encourage implementations to treat such objects as having a finalizer that is not overridden, and to finalize them more efficiently, as described in Section A.1.

A finalizer may be invoked explicitly, just like any other method.

The package java.lang.ref describes weak references, which interact with garbage collection and finalization. As with any API that has special interactions with the language, implementors must be cognizant of any requirements imposed by the java.lang.ref API. This specification does not discuss weak references in any way. Readers are referred to the API documentation for details.

A.1 Implementing Finalization

Every object has two attributes: it may be reachable, finalizer-reachable, or unreachable, and it may also be unfinalized, finalizable, or finalized.

A reachable object is any object that can be accessed in any potential continuing computation from any live thread. Any object that may be referenced from a field or array element of a reachable object is reachable. Finally, if a reference to an object is passed to a JNI method, then the object must be considered reachable until that method completes.

Optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naïvely be considered reachable. For example, a compiler or code generator may choose to set a variable or parameter that will no longer be used to null, to cause the storage for such an object to be potentially reclaimable sooner. Another example of this occurs if the values in an object's fields are stored in registers. The program may then access the registers instead of the object, and never access the object again. This would imply that the object is garbage.

Note that this sort of optimization is only allowed if references are on the stack, not stored in the heap. For example, consider the Finalizer Guardian pattern [Blo01]:

    class Foo {
        private final Object finalizerGuardian = new Object() {
            protected void finalize() throws Throwable {
                /* finalize outer Foo object */
            }
        };
    }

The finalizer guardian forces super.finalize to be called if a subclass overrides finalize and does not explicitly call super.finalize.

If these optimizations were allowed for references that are stored on the heap, then the compiler could detect that the finalizerGuardian field is never read, null it out, collect the object immediately, and call the finalizer early. This runs counter to the intent: the programmer probably wanted to call the Foo finalizer when the Foo instance became unreachable. This sort of transformation is therefore not legal: the inner class object should be reachable for as long as the outer class object is reachable.

Transformations of this sort may result in invocations of the finalize method occurring earlier than might otherwise be expected. In order to allow the user to prevent this, we enforce the notion that synchronization may keep an object alive. If an object's finalizer can result in synchronization on that object, then that object must be alive and considered reachable whenever a lock is held on it. Note that this does not prevent synchronization elimination: synchronization only keeps an object alive if a finalizer might synchronize on it.
Since the finalizer occurs in another thread, in many cases the synchronization could not be removed anyway.

A finalizer-reachable object can be reached from a finalizable object through some chain of references, but not from any live thread. An unreachable object cannot be reached by either means.

An unfinalized object has never had its finalizer automatically invoked; a finalized object has had its finalizer automatically invoked. A finalizable object has never had its finalizer automatically invoked, but the Java virtual machine may eventually automatically invoke its finalizer. An object cannot be considered finalizable until its constructor has finished.

Every pre-finalization write to a field of an object must be visible to the finalization of that object. Furthermore, none of the pre-finalization reads of fields of that object may see writes that occur after finalization of that object is initiated.

A.2 Interaction with the Memory Model

It must be possible for the memory model to decide when it can commit actions that take place in a finalizer. This section describes the interaction of finalization with the memory model.

Each execution has a number of reachability decision points, labeled di. Each action either comes-before di or comes-after di. Other than as explicitly mentioned, comes-before in this section is unrelated to all other orderings in the memory model.

If r is a read that sees a write w and r comes-before di, then w must come-before di. If x and y are synchronization actions on the same variable or monitor such that x comes before y in the synchronization order, and y comes-before di, then x must come-before di.

At each reachability decision point, some set of objects are marked as unreachable, and some subset of those objects are marked as finalizable. These reachability decision points are also the points at which References are checked, enqueued and cleared according to the rules provided in the specifications for java.lang.ref.

Reachability

The only objects that are considered definitely reachable at a point di are those that can be shown to be reachable by the application of these rules:

- An object B is definitely reachable at di from static fields if there exists a write w1 to a static field v of a class C such that the value written by w1 is a reference to B, the class C is loaded by a reachable classloader, and there does not exist a write w2 to v such that it is not the case that w2 happens-before w1, and both w1 and w2 come-before di.

- An object B is definitely reachable from A at di if there is a write w1 to an element v of A such that the value written by w1 is a reference to B, and there does not exist a write w2 to v such that it is not the case that w2 happens-before w1, and both w1 and w2 come-before di.

- If an object C is definitely reachable from an object B, and object B is definitely reachable from an object A, then C is definitely reachable from A.

An action a is an active use of X if and only if:

- it reads or writes an element of X;

- it locks or unlocks X and there is a lock action on X that happens-after the invocation of the finalizer for X;

- it writes a reference to X; or

- it is an active use of an object Y, and X is definitely reachable from Y.

If an object X is marked as unreachable at di:

- X must not be definitely reachable at di from static fields;

- All active uses of X in a thread t that come-after di must occur in the finalizer invocation for X, or as a result of thread t performing a read that comes-after di of a reference to X;
- All reads that come-after di that see a reference to X must see writes to elements of objects that were unreachable at di, or see writes that came after di.

If an object X is marked as finalizable at di, then:

- X must be marked as unreachable at di;

- di must be the only place where X is marked as finalizable; and

- actions that happen-after the finalizer invocation must come-after di.

BIBLIOGRAPHY

[ABJ+93] Mustaque Ahamad, Rida A. Bazzi, Ranjit John, Prince Kohli, and Gil Neiger. The Power of Processor Consistency. In The Fifth ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1993.

[Ada95] Ada Joint Program Office. Ada 95 Rationale. Intermetrics, Inc., Cambridge, Massachusetts, 1995.

[Adv93] Sarita Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, University of Wisconsin, Madison, December 1993.

[AG96] Sarita Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996.

[AH90] Sarita Adve and Mark Hill. Weak ordering – a new definition. In Proceedings of the Seventeenth International Symposium on Computer Architecture, pages 2–14, 1990.

[AMS00] Arvind, Jan-Willem Maessen, and Xiaowei Shen. Improving the Java memory model using CRF. In Object Oriented Programming Systems, Languages and Applications, pages 1–12, October 2000.

[BFJ+96] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. DAG-consistent distributed shared memory. In 10th International Parallel Processing Symposium (IPPS '96), pages 132–141, Honolulu, Hawaii, April 1996.

[Blo01] Joshua Bloch. Effective Java Programming Language Guide. Sun Microsystems, Inc., 2001.

[BP96] Gianfranco Bilardi and Keshav Pingali. A Framework for Generalized Control Dependence. In ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, Philadelphia, Pennsylvania, United States, June 1996.

[BST00] David F. Bacon, Robert E. Strom, and Ashis Tarafdar. Guava: A dialect of Java without data races. In Object Oriented Programming Systems, Languages and Applications, pages 382–400, 2000.

[CGS+99] Jong-Deok Choi, Manish Gupta, Mauricio Serrano, Vugranam C. Sreedhar, and Sam Midkiff. Escape analysis for Java. In Object Oriented Programming Systems, Languages and Applications, pages 1–19, 1999.

[CKRW97] Pietro Cenciarelli, Alexander Knapp, Bernhard Reus, and Martin Wirsing. From Sequential to Multi-Threaded Java: An Event Based Operational Semantics. In Sixth International Conference on Algebraic Methodology and Software Technology, October 1997.

[CKRW98] Pietro Cenciarelli, Alexander Knapp, Bernhard Reus, and Martin Wirsing. Formal Syntax and Semantics of Java. Springer-Verlag, 1998.

[Com98] Compaq Computer Corporation. Alpha Architecture Handbook, version 4. Compaq Computer Corporation, Houston, TX, USA, October 1998.

[DR98] Pedro C. Diniz and Martin C. Rinard. Lock coarsening: Eliminating lock overhead in automatically parallelized object-based programs. The Journal of Parallel and Distributed Computing, 49(2):218–244, 1998.

[DS90] Anne Dinning and Edith Schonberg. An empirical comparison of monitoring algorithms for access anomaly detection. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–10, June 1990.

[DSB86] M. Dubois, C. Scheurich, and F. Briggs. Memory Access Buffering in Multiprocessors. In Proceedings of the Thirteenth International Symposium on Computer Architecture, pages 434–442.
IEEE Computer Society Press, 1986.

[ECM02a] ECMA. C# Language Specification, December 2002. http://www.ecma-international.org/publications/standards/Ecma-334.htm.

[ECM02b] ECMA. Common Language Infrastructure (CLI), December 2002. http://www.ecma-international.org/publications/standards/Ecma-335.htm.

[FF00] Cormac Flanagan and Stephen N. Freund. Type-based race detection for Java. In ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, pages 219–232, 2000.

[Gha96] Kourosh Gharachorloo. Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Stanford University, 1996.

[GJS96] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison Wesley, 1996.

[GLL+90] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the Seventeenth International Symposium on Computer Architecture, pages 15–26, May 1990.

[Gro00] Supercomputing Technologies Group. Cilk 5.3.1 Reference Manual. MIT Laboratory for Computer Science, Cambridge, Massachusetts, 2000.

[GS97] Alex Gontmakher and Assaf Schuster. Java consistency: Non-operational characterizations for the Java memory behavior. Technical Report CS0922, Department of Computer Science, Technion, November 1997.

[GS98] Alex Gontmakher and Assaf Schuster. Characterization for Java Memory Behavior. In Twelfth International Parallel Processing Symposium and Ninth Symposium on Parallel and Distributed Processing, pages 682–686, 1998.

[GS00a] Guang R. Gao and Vivek Sarkar. Location consistency – a new memory model and cache consistency protocol. IEEE Transactions on Computers, 49(8):798–813, 2000.

[GS00b] Alex Gontmakher and Assaf Schuster. Java Consistency: Nonoperational Characterizations for Java Memory Behavior. ACM Transactions on Computer Systems, 18(4):333–386, 2000.

[HF03] Tim Harris and Keir Fraser. Language support for lightweight transactions. In Object Oriented Programming Systems, Languages and Applications, October 2003.

[Hoa85] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall International, 1985.

[Int99] Intel Corporation. IA-64 Application Developer's Architecture Guide, version 1. Intel Corporation, Santa Clara, CA, USA, May 1999.

[Int02] Intel Corporation. IA-64 Architecture Software Developer's Manual, volume 2. Intel Corporation, Santa Clara, CA, USA, October 2002.

[Jav02] Java Specification Request (JSR) 1. Real-time Specification for Java, 2002. http://jcp.org/jsr/detail/121.jsp.

[Jav03] Java Specification Request (JSR) 121. Application Isolation API Specification, 2003. http://jcp.org/jsr/detail/121.jsp.

[Jav04] Java Specification Request (JSR) 133. Java Memory Model and Thread Specification Revision, 2004. http://jcp.org/jsr/detail/133.jsp.

[Jon87] Geraint Jones. Programming in Occam. Prentice-Hall International, 1987.

[KCZ92] Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the Nineteenth International Symposium on Computer Architecture, pages 13–21, May 1992.

[Kot01] Vishnu Kotrajaras. Towards an Improved Memory Model for Java. PhD thesis, Department of Computing, Imperial College, August 2001.

[KR88] B. W. Kernighan and D. M. Ritchie. The C Programming Language. Prentice Hall, second edition, 1988.

[KY95] Arvind Krishnamurthy and Katherine Yelick. Optimizing parallel programs with explicit synchronization.
In ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation, pages 196–204, La Jolla, California, June 1995.

[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–564, 1978.

[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–691, 1979.

[LB98] Bil Lewis and Daniel J. Berg. Multithreaded Programming with Pthreads. Sun Microsystems, 2550 Garcia Avenue, Mountain View, CA 94043, USA, 1998.

[Lea04] Doug Lea. JSR-133 Cookbook, 2004. Available from http://gee.cs.oswego.edu/dl/jmm/cookbook.html.

[LP01] Jaejin Lee and David A. Padua. Hiding relaxed memory consistency with a compiler. IEEE Transactions on Computers, 2001.

[Man01] Jeremy Manson. Virtual Machine Implementation Issues for the Revised Java Memory Model, December 2001. M.S. thesis, available from http://www.cs.umd.edu/users/jmanson/java.html.

[MP01a] Jeremy Manson and William Pugh. Core semantics of multithreaded Java. In ACM Java Grande Conference, June 2001.

[MP01b] Jeremy Manson and William Pugh. Semantics of Multithreaded Java. Technical Report CS-TR-4215, Dept. of Computer Science, University of Maryland, College Park, March 2001.

[MP02] Jeremy Manson and William Pugh. The Java Memory Model Simulator. In Workshop on Formal Techniques for Java Programs, in association with ECOOP 2002, June 2002.

[MPC90] Samuel P. Midkiff, David A. Padua, and Ron G. Cytron. Compiling programs with user parallelism. In Proceedings of the 3rd Workshop on Languages and Compilers for Parallel Computing, August 1990.

[MS96] Maged Michael and Michael Scott. Simple, fast and practical non-blocking and blocking concurrent queue algorithms. In Fifteenth ACM Symposium on Principles of Distributed Computing, 1996.

[NR86] Itzhak Nudler and Larry Rudolph. Tools for the efficient development of efficient parallel programs. In Proceedings of the First Israeli Conference on Computer Systems Engineering, May 1986.

[PC90] Andy Podgurski and Lori Clarke. A formal model of program dependences and its implications for software testing, debugging and maintenance. IEEE Transactions on Software Engineering, 1990.

[Pug99] William Pugh. Fixing the Java memory model. In ACM Java Grande Conference, June 1999.

[Ruf00] Eric Ruf. Effective Synchronization Removal for Java. In ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, Vancouver, BC, Canada, June 2000.

[SAR99] Xiaowei Shen, Arvind, and Larry Rudolph. Commit-reconcile & fences (CRF): A new memory model for architects and compiler writers. Proceedings of the Twenty-Sixth International Symposium on Computer Architecture, 1999.

[Sar04] Vijay Saraswat. Concurrent Constraint-based Memory Machines: A framework for Java Memory Models. Technical report, IBM TJ Watson Research Center, March 2004.

[Sch89] Edith Schonberg. On-the-fly detection of access anomalies. In ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, pages 285–297, 1989.

[SS88] Dennis Shasha and Marc Snir. Efficient and correct execution of parallel programs that share memory. ACM Transactions on Programming Languages and Systems (TOPLAS), 10(2):282–312, April 1988.

[Str97] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley Longman, Reading, Mass., USA, 3rd edition, 1997.

[WG94] David Weaver and Tom Germond. The SPARC Architecture Manual, version 9. Prentice-Hall, 1994.
[Yan04] Yue Yang. Formalizing Shared Memory Consistency Models for Program Analysis. PhD thesis, University of Utah, 2004.

[YGL01] Yue Yang, Ganesh Gopalakrishnan, and Gary Lindstrom. Specifying Java thread semantics using a uniform memory model. In ACM Java Grande Conference, November 2001.

[YGL02] Yue Yang, Ganesh Gopalakrishnan, and Gary Lindstrom. Formalizing the Java Memory Model for Multithreaded Program Correctness and Optimization. Technical Report UUCS-02-011, University of Utah, April 2002.