ABSTRACT

Title of Document: DISK DESIGN-SPACE EXPLORATION IN TERMS OF SYSTEM-LEVEL PERFORMANCE, POWER, AND ENERGY CONSUMPTION
Nuengwong Tuaycharoen, Doctor of Philosophy, 2006
Directed By: Associate Professor Bruce L. Jacob, Department of Electrical and Computer Engineering, University of Maryland, College Park

To make the common case fast, most studies focus on the computation phase of applications, in which most instructions are executed. However, many programs spend significant time in the I/O-intensive phase because of I/O latency. To obtain a system with more balanced phases, we require greater insight into the effects of the I/O configuration on the entire system, in both the performance and power dissipation domains. Because no public tool captures the complete picture of the entire memory hierarchy, we developed SYSim. SYSim is a complete-system simulator aimed at complete memory hierarchy studies in both the performance and power consumption domains. In this dissertation, we used SYSim to investigate the system-level impact of several disk enhancements and technology improvements on the detailed interaction within the memory hierarchy during the I/O-intensive phase. The experimental results are reported in terms of both total system performance and power/energy consumption. With SYSim, we conducted complete-system experiments and revealed intriguing behaviors including, but not limited to, the following:
• During the I/O-intensive phase, which consists of both disk reads and writes, the average system CPI tracks only the average disk read response time, and not the overall average disk response time, which is the widely accepted metric in disk drive research.
• In disk-read-dominated applications, disk prefetching is more important than increasing the disk RPM. On the other hand, in applications with both disk reads and writes, the disk RPM matters.
• The execution time can be improved by up to an order of magnitude by applying some disk enhancements. Disk caching and prefetching can improve performance by a factor of 2, and write-buffering can improve performance by a factor of 10. Moreover, using disk caching/prefetching and write-buffering in conjunction can improve total system performance by at least an order of magnitude.
• Increasing the disk RPM and the number of disks in a RAID system also improves total system performance impressively. However, employing such techniques requires careful consideration of the trade-offs in power/energy consumption.

DISK DESIGN-SPACE EXPLORATION IN TERMS OF SYSTEM-LEVEL PERFORMANCE, POWER, AND ENERGY CONSUMPTION

By Nuengwong Tuaycharoen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2006

Advisory Committee:
Associate Professor Bruce Jacob, Chair
Associate Professor Manoj Franklin
Associate Professor Gang Qu
Assistant Professor Ankur Srivastava
Professor Lawrence Washington

© Copyright by Nuengwong Tuaycharoen 2006

Table of Contents

Chapter 1: Introduction ..........................................1
1.1 Problem Description ............................................................................ 1
1.2 Contribution and Significance ........................................................... 12
1.3 Organization of Dissertation..............................................................
14 Chapter 2: Memory Hierarchy ............................15 2.1 Memory Hierarchy............................................................................. 16 2.2 Virtual Memory ................................................................................. 17 2.3 Caches................................................................................................ 19 2.3.1. Cache Memory Cell ..................................................................................20 2.3.2. Cache Operations ......................................................................................21 2.3.3. CACTI: An Integrated Cache Timing, Power, and Area Model ..............26 2.3.4. Wattch .......................................................................................................31 2.4 Main Memory: DRAM ...................................................................... 34 2.4.1. DRAM Memory Cell ................................................................................35 2.4.2. Standard DRAM Device ...........................................................................36 2.4.3. DRAM-Based Memory System Organization ..........................................39 2.4.4. DRAM Commands ...................................................................................43 2.4.5. Memory Controller ...................................................................................50 Chapter 3: Overview of Disks ..............................56 3.1 Classifications of Disk Drives ........................................................... 57 3.2 Areal Density Growth Trend ............................................................. 59 3.3 Performance Metrics.......................................................................... 61 3.3.1. Command overhead ..................................................................................62 3.3.2. Seek time ...................................................................................................63 iii 3.3.3. Rotational latency .....................................................................................64 3.3.4. Data transfer time ......................................................................................64 3.4 The Physical Layer ............................................................................ 65 3.4.1. Principles of Rotating Storage Devices ....................................................65 3.4.2. Magnetic Recording ..................................................................................67 3.4.3. Mechanical and Magnetic Components ....................................................69 3.4.4. Electronics ................................................................................................76 3.5 The Data Layer .................................................................................. 79 3.5.1. Disk Blocks or Sectors ..............................................................................80 3.5.2. Tracks ........................................................................................................82 3.5.3. Cylinders ...................................................................................................82 3.5.4. Address Mapping ......................................................................................82 3.5.5. Internal Addressing ...................................................................................83 3.5.6. 
External Addressing ..................................................................................83 3.5.7. Logical Address to Physical Location Mapping .......................................83 3.5.8. Zoned Bit Recording .................................................................................85 3.5.9. Servo .........................................................................................................87 3.5.10. Sector ID and No-ID Formatting ..............................................................93 3.5.11. Defect Management ..................................................................................93 3.6 File System Caching.......................................................................... 96 Chapter 4: Related Work .....................................99 4.1 Complete-System Simulations........................................................... 99 4.2 Magnetic Disk Drive Enhancements and Physical Improvements.. 105 4.2.1. Disk Drive Enhancements ......................................................................107 4.2.2. Disk Drive Physical Improvements ........................................................119 Chapter 5: Methodology .....................................124 5.1 The Processor Simulator: Bochs...................................................... 126 5.2 The Cache Simulator: Wattch.......................................................... 127 5.3 The DRAM Simulator and Its Power Model................................... 129 5.4 The Disk Simulator: DiskSim.......................................................... 137 5.5 The Benchmarks: SPEC2000 .......................................................... 139 5.6 Interactions ...................................................................................... 140 iv 5.7 Parameter and Benchmark Selections ............................................. 145 5.8 SYSim and Real Systems Comparison............................................ 151 5.9 Sample Output ................................................................................. 154 Chapter 6: Experimental Results .......................159 6.1 I/O intensive phase .......................................................................... 159 6.2 Memory Size and I/O Behaviors ..................................................... 175 6.3 Power/Energy Consumption of the Disk due to Different Memory Size .....................................................................................................................182 6.4 Effects of Disk Physical Technology Improvement and Enhancements .....................................................................................................................187 6.4.1. Rotational Speed (RPM) .........................................................................187 6.4.2. Prefetching ..............................................................................................192 6.4.3. Parallel I/O: RAID5 ................................................................................198 6.4.4. Size of the Disk Cache ............................................................................205 6.4.5. Disk Cache Organization ........................................................................211 6.4.6. Bus Transmission Latency ......................................................................213 6.4.7. Perfect Write-Buffering ..........................................................................214 6.5 Total CPI v.s. 
Disk Response Time................................................. 221 6.6 The CPI Breakdown ........................................................................ 223 6.7 Power/Energy Consumption............................................................ 225 6.8 The System Bandwidth.................................................................... 237 6.9 Configuration Comparison .............................................................. 239 Chapter 7: Conclusions .......................................245 Appendix: SPEC CPU2000 ................................ 250 References...................................................................... 262 v List of Tables Table 1.1: Execution Time Breakdown for System #1: 750MHz CPU with 96MB memory............................................................................................................ 1 Table 1.2: Execution Time Breakdown for System #2: 750MHz CPU with 128MB of memory............................................................................................................ 1 Table 2.1: CACTI input parameters............................................................................... 28 Table 2.2: CACTI output implementation parameters .................................................. 28 Table 4.1: Attributes of various performance modeling techniques [3]...................... 100 Table 4.2: Latest Disk Interfaces and Their Data Rate................................................ 123 Table 5.1: The definitions of symbols in the DRAM datasheet .................................. 130 Table 5.2: Base Configuration for CPU, caches, and memory.................................... 146 Table 5.3: Disk Active and Idle Power Values ............................................................ 149 Table 5.4: Execution Time Breakdown for System #1: 750MHz CPU with 96MB memory........................................................................................................ 151 Table 5.5: Execution Time Breakdown for System #2: 750MHz CPU with 128MB of memory comparing with a SYSim system with 2GHz CPU with 128MB of memory........................................................................................................ 152 Table 5.6: Execution Time Breakdown for System #3: 2.4GHz CPU with 1GB of memory comparing with a SYSim system with 2GHz CPU with 512MB of memory........................................................................................................ 153 vi List of Figures Figure 1.1: The System CPI............................................................................................... 3 Figure 2.1: Memory Hierarchy in Typical Computer Systems....................................... 16 Figure 2.2: The basic SRAM cell--six-transistor memory cell(6T MC) ........................ 20 Figure 2.3: Block Diagram of a 2-way set associative cache organization, ................... 22 Figure 2.4: Cache structure.............................................................................................. 27 Figure 2.5: Schematic of wordlines and bitlines in Wattch array structure [19] ............ 32 Figure 2.6: a DRAM memory cell--one transistor one capacity (1T1C) ........................ 35 Figure 2.7: 64 Mbit Fast Page Mode DRAM Device (4096 x 1024 x 16) [17] ............. 37 Figure 2.8: The Classic Memory Topology [17]............................................................. 42 Figure 2.9: Command and data movement on generic SDRAM device. [17] ............... 
44 Figure 3.1: Basic Data Organization of a disk drive ....................................................... 79 Figure 3.2: Components of a sector................................................................................. 81 Figure 3.3: Components of a servo sector....................................................................... 92 Figure 5.1: SYSim architecture ..................................................................................... 124 Figure 5.2: Read Current with I/O Power Included [18]. ............................................. 132 Figure 5.3: TPM Power Modes [15].............................................................................. 138 Figure 5.4: : Abstract Illustration of a Load Instruction in a Processor-Memory System [17].................................................................................................. 141 Figure 5.5: Idle and Active Power of 47 Commercially Available Disk drives........... 148 Figure 5.6: Son and Kandemir?s Disk Power Projection for IBM Ultrastar 36Z15..... 149 Figure 5.7: (1) RAID5 Configuration for an 4-disk system.......................................... 150 Figure 5.7: (2) RAID5 Configuration for an 8-disk system.......................................... 150 Figure 5.8: Sample Output of Cache Accesses and Total system CPI ......................... 155 Figure 5.9: Sample Output of Cache miss rate and Disk Accesses .............................. 156 Figure 5.10: Sample Output of Cache power and Disk Power Dissipation.................... 157 Figure 5.11: Sample Output of DRAM and Disk Accesses and Power ......................... 158 Figure 6.1: The System CPI........................................................................................... 160 Figure 6.2: The interaction in memory hierarchy in a system with 512MB of memory.... ................................... .................................................................................. 164 Figure 6.3: I/O intensive phase of ammp ...................................................................... 165 Figure 6.4: I/O intensive phase of bzip2........................................................................ 166 Figure 6.5: I/O intensive phase of gcc........................................................................... 167 Figure 6.6: I/O intensive phase of gzip.......................................................................... 168 Figure 6.7: I/O intensive phase of mcf .......................................................................... 169 vii Figure 6.8: I/O intensive phase of mgrid....................................................................... 170 Figure 6.9: I/O intensive phase of parser....................................................................... 171 Figure 6.10: I/O intensive phase of twolf........................................................................ 172 Figure 6.11: I/O intensive phase of vortex ...................................................................... 173 Figure 6.12: Memory Size Exploration........................................................................... 176 Figure 6.13: ammp and mgrid Disk Activities................................................................ 178 Figure 6.14: gzip Disk Activities..................................................................................... 179 Figure 6.15: bzip2 Disk Activities................................................................................... 
180 Figure 6.16: parser and gcc Disk Activities .................................................................... 181 Figure 6.17: Power Dissipation and Energy Consumption of DRAM and a Disk......... 184 Figure 6.18: DRAM & Disk Power Dissipation and Energy Consumption................... 185 Figure 6.19: Energy Consumption v.s. CPI Trade-offs................................................... 186 Figure 6.20: CPI due to the disk RPM and disk cache.................................................... 188 Figure 6.21: The interaction in the memory hierarchy for a system with a 5k-RPM disk drive............................................................................................................. 189 Figure 6.22: The interaction in the memory hierarchy for a system with a 20k-RPM disk drive............................................................................................................. 190 Figure 6.23: The Effects of Disk Prefetching.................................................................. 193 Figure 6.24: The interaction of the memory components in a system without disk cache .. .......................................... ........................................................................... 195 Figure 6.25: The interaction of the memory components in a system with disk cache but no prefetching.............................................................................................. 196 Figure 6.26: The interaction of the memory components in a system with disk caching and prefetching............................................................................................ 197 Figure 6.27: RAID5 configuration .................................................................................. 199 Figure 6.28: Disk RAID5 Configuration with different RPMs ...................................... 201 Figure 6.29: RAID5 with no writes................................................................................. 202 Figure 6.30: The interaction between the memory components in the hierarchy of a system with 4-disk RAID system ............................................................... 203 Figure 6.31: The interaction between the memory components in the hierarchy of a system with 8-disk RAID system ............................................................... 204 Figure 6.32: The Effects of Disk Cache Size by varying the Segment Size................... 206 Figure 6.33: The Effects of Disk Cache Size by varying the Number of Segments ...... 208 Figure 6.34: The Trade--offs between Memory Sizes and Disk Cache Sizes ................ 209 Figure 6.35: Disk Cache Organization ............................................................................ 212 Figure 6.36: Bus Latency Exploration............................................................................. 213 Figure 6.37: The Limit of Write-Buffering Technique ................................................... 216 Figure 6.38: The interaction of the memory components in the hierarchy in a single disk system with perfect write buffering ............................................................ 218 Figure 6.39: The interaction of the memory components in the hierarchy in a system with 4-disk RAID disk subsystem along with perfect write buffering............... 219 viii Figure 6.40: The interaction of the memory components in the hierarchy in a system with 8-disk RAID disk subsystem along with perfect write buffering............... 220 Figure 6.41: CPI v.s. 
Disk Average Response Time ....................................................... 222
Figure 6.42: System CPI Breakdown .............................................................. 224
Figure 6.43: Power and Energy Consumption in the system with different memory size ................................ 226
Figure 6.44: Power and Energy Consumption for the system with different RPM and the presence of disk cache ................................ 228
Figure 6.45: Power and Energy Consumption of the system with Disk Caching/Prefetching ................................ 230
Figure 6.46: Power and Energy Consumption for Caching and Perfect Write Buffering ................................ 232
Figure 6.47: Power and Energy Consumption for the system with constant sum of memory size and disk cache ................................ 233
Figure 6.48: Power and Energy Consumption for the system with different size of disk cache ................................ 234
Figure 6.49: Power Dissipation/Energy Consumption v.s. CPI trade-offs ..................... 235
Figure 6.50: The System Bandwidth ............................................................................... 238
Figure 6.51: The interaction in memory hierarchy in our base configuration with 128MB of memory ................................ 241
Figure 6.52: The interaction in memory hierarchy in a system with 512MB of memory ................................ 242
Figure 6.53: The interaction in memory hierarchy in a system with 128MB of memory and a disk drive with perfect write buffering ................................ 243
Figure 6.54: The interaction in memory hierarchy in a system with the same configuration with RAID disk system ................................ 244

CHAPTER 1: INTRODUCTION

1.1. Problem Description

The 90/10 rule states that 90% of the execution time is spent in 10% of the code. Most studies, therefore, focus on the computation phase, in which the most frequently repeated instructions--i.e., the main loops--are executed, believing that it is the most important part of the entire course of execution. The argument for this is to make the most repeated case fast. However, this dissertation takes a different path. We are not looking at the duration in which the most repeated instructions are executed; we are looking at the duration in which the most time is spent during execution.
Run #             User (s)   Kernel (s)   I/O stall (s)   Total (s)
1 (cold cache)    93.11      15.06        600.83          709
2 (warm cache)    92.7       16.3         397.00          506
3 (warm cache)    92.8       14.3         425.90          533
4 (warm cache)    93.3       14.3         460.40          568
5 (warm cache)    93.6       14.3         441.10          549
Table 1.1: Execution Time Breakdown for System #1: 750MHz CPU with 96MB of memory

Run #             User (s)   Kernel (s)   I/O stall (s)   Total (s)
1 (cold cache)    90.4       6.4          164.20          261
2 (warm cache)    90.1       6            126.90          223
3 (warm cache)    89.8       5.7          129.50          225
4 (warm cache)    90.5       5.5          121.00          217
5 (warm cache)    90.3       6.1          168.60          265
Table 1.2: Execution Time Breakdown for System #2: 750MHz CPU with 128MB of memory

Table 1.1 shows the execution time breakdown for gzip on a real system. The system has a 750MHz CPU with 96MB of system memory and runs Fedora Core 3. Table 1.2 shows the execution time breakdown for gzip on another real system. The second system is the same as the first, but its system memory is set to 128MB. One would expect that the memory in both systems should be large enough to run a SPEC benchmark. Though the systems spent a significant amount of time executing the user code, they spent even more time stalling for I/O.

Figure 1.1 shows the simulation results of an entire execution of gzip, a SPEC benchmark, on our complete-system simulator--SYSim--in a single-processing environment. The system configuration used in this example is a 2-GHz Pentium processor, 128MB of main memory, and a 12k-RPM disk drive with a built-in disk cache. The figure shows the interaction between all components of the entire memory hierarchy, including the level-1 instruction cache, the level-1 data cache, the level-2 unified cache, the DRAM, and the disk drive. All graphs use the same x-axis, which represents the execution time in seconds. The x-axis does not start at zero since the system boot time is excluded. Each data point is collected over a 10-millisecond epoch. The CPI graph shows two system CPI values: one is the average CPI for each 10-millisecond epoch, and the other is the accumulated average CPI. A span with no data points means that no instructions were executed due to the I/O latency. The application is run in single-user mode to allow accurate calculation of the execution time; otherwise, the kernel would swap to other processes on a system call to read from disk. Therefore, disk delay shows up as stall time. The course of execution when the accumulated average CPI is over 100 is the I/O intensive phase (i.e., before the 140th second), while the course of execution when the CPI is below 100 is the computation phase.

[Figure 1.1 comprises two stacked panels: "Cache Accesses (per 10 ms) and System CPI" (L1 instruction cache accesses, L1 data cache accesses, L2 cache accesses, and CPI, both CPI@10ms and cumulative CPI, versus time in seconds) and "DRAM & Disk Accesses/Power (per 10 ms)" (DRAM power, DRAM accesses, disk power, and disk accesses versus time in seconds), for gzip with 128MB of memory run to completion, with the I/O intensive phase and the computation phase marked.]
Figure 1.1: The System CPI. The figure shows the system CPI over the entire run of gzip. The system configuration is a 2-GHz processor with 128MB of memory and a 12k-RPM disk. The CPI graph shows two CPI values: one is the instantaneous CPI for every 10ms, and the other is the accumulated average CPI. A span with no data points means that no instructions were executed due to the I/O latency. The course of execution when the accumulated CPI is over 100 is the I/O intensive phase, and the course of execution when the CPI is below 100 is the computation phase.
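To make the imbalance in Table 1.1 concrete, the short Python sketch below computes the I/O-stall fraction for Run 1 and the overall speedup that a hypothetical 2x improvement of the user-mode computation alone would deliver. The numbers come from Table 1.1; the 2x speedup is an assumed, idealized scenario used only for illustration.

```python
# Rough arithmetic on Run 1 of Table 1.1 (System #1: 750 MHz CPU, 96 MB RAM).
user, kernel, io_stall = 93.11, 15.06, 600.83   # seconds, from Table 1.1
total = user + kernel + io_stall                 # ~709 s, matching the table

print(f"I/O stall fraction: {io_stall / total:.1%}")        # ~84.7%

# Even an ideal 2x speedup of the user (computation) code barely moves the total:
sped_up_total = user / 2 + kernel + io_stall
print(f"Overall speedup from 2x faster user code: {total / sped_up_total:.3f}x")  # ~1.07x
```

Even an ideal halving of the computation time improves the total runtime by only about 7 percent, which is why the rest of this chapter concentrates on the I/O intensive phase.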
Note that in Figure 1.1 the CPI, the DRAM accesses, and the disk accesses are plotted on a log scale. During the course of the execution, there are an I/O intensive phase and a steady, or main, computation phase. The figure shows that the program spent a significant amount of time, if not most of it, in the I/O intensive phase due to I/O activities. Unlike the disk, the other components in the memory hierarchy show very little activity during the I/O intensive phase. On the other hand, those components are accessed regularly during the computation phase, which is where most instructions are executed and during which the disk is mostly idle. Therefore, the I/O intensive phase is exposed as a significant component of the total execution time, driven by the I/O latency. Most studies skip the I/O intensive phase, claiming it is unimportant because it is executed only once. However, despite being run only once, the I/O intensive phase takes far longer than the other phases. The underlying reason for the lack of attention to this issue is that the I/O intensive phase involves I/O activities, which only a small number of tools implement because of the complexity and time-consuming experimentation involved. These tools can take years to develop, and a single data point in an experiment can take weeks or even months to execute. For these reasons, most publications conduct experiments only on the computation phase and claim a single-digit CPI value. Unfortunately, they entirely ignore the effects of the I/O intensive phase. As Figure 1.1 shows, the CPI value can vary by many orders of magnitude due to the I/O activities during the long I/O intensive phase. The CPI finally falls to a single-digit number during the computation phase, as portrayed by many studies. Therefore, techniques that simply ignore the I/O intensive phase and claim a 10% or even a factor-of-2 improvement over only the computation phase, as exhibited in most processor, cache, and DRAM enhancements, may have only a minor impact on the entire course of execution. The solution to this problem is to investigate the I/O. To obtain a system with more balanced phases, we require more understanding of the effects of the I/O parameter configurations on the entire system, especially during the I/O intensive phase. A variety of disk optimization techniques, including caching, write buffering, prefetching, and parallel I/O, have been invented to optimize I/O operations. These techniques have been in place in server-class disk drives for over 10 years. As the technology gets cheaper over time, these techniques are shifting into PC disk drives as well. In general, an optimizing technique is first introduced in server drives. Then, a few years later, the technique is implemented in desktop drives. After being widely used in desktops for another few years, it is applied in mobile drives. As a result, an optimization technique takes approximately 10 years to shift from server-class disks to mobile drives. In addition, with better technology, the disk physical characteristics are also improved, including the RPM (rotational speed in revolutions per minute), the seek time, and the disk drive interface.
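As a rough guide to why RPM and seek time matter, the sketch below models a single request's service time as the sum of the components described later in Chapter 3 (command overhead, average seek, average rotational latency, and media transfer). The function and all numbers here are illustrative assumptions, not parameters of any particular drive or of SYSim.

```python
def avg_disk_service_time_ms(rpm, avg_seek_ms, transfer_mb_s, request_kb, overhead_ms=0.5):
    """Simplified single-request disk service time:
    command overhead + average seek + average rotational latency + media transfer."""
    rotational_latency_ms = 0.5 * 60_000.0 / rpm          # half a revolution on average
    transfer_ms = request_kb / 1024.0 / transfer_mb_s * 1000.0
    return overhead_ms + avg_seek_ms + rotational_latency_ms + transfer_ms

# Illustrative comparison of a 5,400 RPM and a 15,000 RPM drive for a 4 KB request.
for rpm, seek in [(5_400, 8.5), (15_000, 3.8)]:
    t = avg_disk_service_time_ms(rpm, seek, transfer_mb_s=60.0, request_kb=4)
    print(f"{rpm:>6} RPM: ~{t:.2f} ms per request")
```

With these assumed figures, the average rotational latency alone drops from roughly 5.6 ms at 5,400 RPM to 2 ms at 15,000 RPM, which is why higher RPM helps most when requests cannot be absorbed by caching or prefetching.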
The effectiveness of these techniques in terms of total system performance, however, is not clear, because they have been studied in isolation by different researchers using different methodologies. As the performance gap between the processor and disk-based storage continues to widen, increasingly aggressive optimization of the storage system is needed. This requires a good understanding of the real potential of the various I/O optimization techniques and how they work together. Therefore, we need to study the effects at the full-system scale. To our knowledge, this dissertation is the first to explore the disk design space during the I/O intensive phase, with the results reported in terms of both total system performance and power/energy consumption. We will show later that the overall system performance can be improved greatly by enhancements in the disk drives: using disk caching and prefetching can improve performance by a factor of 2, and write-buffering techniques can improve performance by a factor of 10. Moreover, the combination of disk caching/prefetching and write-buffering is the most important enhancement to focus on, since it can improve total system performance by over an order of magnitude without increasing the energy consumption significantly. Increasing the disk RPM and the number of disks in a RAID system also improves total system performance impressively. However, since such techniques can consume significant energy, they involve trade-offs that must be considered carefully. Our studies also revealed that as the main memory capacity decreases to the point that memory page swapping occurs during the I/O intensive phase, the overall system performance decreases greatly. This behavior will also continue to appear as applications' memory footprints grow, which is the trend for future applications. Therefore, increasing the memory size will only solve the problem in the short term. The long-term solution is to improve the disk system performance. Recently, the design trade-off of performance versus power consumption has received much attention because of the following [36]:
1. the ever-growing number of disk-based mobile systems that need to provide services with the energy supplied by a battery of limited weight and size;
2. the technical feasibility of high-performance computation being limited by heat extraction; and
3. concerns about the operating costs of large systems caused by electric power consumption, as well as the dependability of systems operating at high temperatures because of power dissipation.
For example, a data warehouse of an Internet service provider with 8000 servers requires 2 MW [36]. Thus, the demand for low-power systems is increasing not only for mobile systems but also for general-purpose systems. Additionally, the energy consumption of computer systems will scale up as they become more complex. In general, optimization techniques aiming at performance do not necessarily optimize power consumption and can often lead to more power being consumed. An et al. [33] demonstrate this fact with a spatial database application. They evaluate three spatial indexing methods for memory-resident databases in embedded systems from both the energy and performance angles. Their experimental results show that one indexing method is superior to the others from the performance angle while aggravating power dissipation.
Other publications also show that performance optimization techniques can aggravate power dissipation. For example, in [34], the simulation results of an embedded system running an MPEG video show that using a 4-way set-associative, burst SDRAM L2 cache in addition to the L1 cache improves performance by approximately 10% but almost doubles the total energy consumption. Therefore, since there are trade-offs between performance and energy consumption, care must be taken when applying performance optimizations. Energy-efficient design requires reducing power dissipation in all parts of the design. Design decisions in one part of a system (e.g., the micro-architecture of a computing element) can affect the energy consumption in another part (e.g., memory and/or memory-processor busses), or even affect the energy consumption in many parts. For example, reducing miss rates in the L1 cache can reduce the accesses to, and power consumption of, the lower levels of memory. Another example is that a power reduction technique in the memory can be responsible for increasing the power consumption in the processor, as demonstrated by Kandemir et al. [40]. They evaluate the effect of five state-of-the-art high-level compiler optimizations on energy consumption, considering both the processor core (datapath) and the memory system. They found that, while most performance-oriented optimizations reduce the overall energy consumption, they also increase the energy consumption in the datapath. As a result, energy-efficient system-level design must address the reduction and balance of power consumption in all constituents. Nowadays, the cost per memory bit is extremely low, and sheer memory size is rarely the main issue in system design. Instead, memory performance and power are now the key challenges. Memory accesses become slower with respect to the processor and consume more power with increasing memory size. Many studies show that memory power and access time dominate over 50% of the total power and performance for computations with large storage requirements [37, 38, 1]. As a result, memory becomes the main bottleneck. All advanced memory organizations rely on the concept of a memory hierarchy. High hierarchy levels are made of small memories, close to the computation units and tightly coupled with them. Low hierarchy levels are made of increasingly large memories, far from the computation units, and shared. Hierarchical organizations reduce memory power by exploiting non-uniformities in access frequencies. Most applications access a relatively small area of memory with high frequency, while most memory locations are accessed a small number of times. In a hierarchical memory, frequently accessed locations should be placed in high hierarchy levels, closer to the processor, thereby minimizing the average cost per access. One technique that is implemented at many levels of the memory hierarchy is caching. Caching temporarily holds data that is likely to be reused in a faster memory, called a cache. The term "cache" is used at every level of the memory hierarchy where the technique is applied. Therefore, terms such as Memory Cache, Disk Cache, and File System Cache confuse many people, even people in operating systems and disk research. Sometimes they use these terms interchangeably. To clarify the terms used in this dissertation, they are defined as follows. Memory Cache, Disk Cache, and File System Cache all serve the same purpose, which is to hide the disk latency by caching data in a location closer to the processor.
In this dissertation, Memory Cache indicates the level-1 data cache, level-1 instruction cache, level-2 cache, and so on. Disk Cache, sometimes called Disk Buffer, is a set of memory chips physically located in a disk drive. The Disk Cache usually exists without the knowledge of the operating system, and it is controlled by the processing unit embedded in the disk drive. The File System Cache is a part of the system memory, managed by the operating system and reserved for files that are read from the disk system. Therefore, the File System Cache is physically located in the main memory, which is usually DRAM, but it can be anywhere in the memory hierarchy. Approaches to memory optimization considering both power and performance in the literature can be grouped into three classes [36]:
1. Memory hierarchy design assumes a given dynamic trace of memory accesses, obtained by profiling an application, and produces a customized memory hierarchy. Many publications present strategies for optimal cache configuration [43, 44, 45, 46, 47], and others partition each memory level into subbanks that can be put into a low-power mode when not in use [47, 49, 50, 51]. Many recent publications insert specialized buffers into the hierarchy [52, 53, 54, 55, 56, 57] to improve data locality at each memory level and/or to reduce traffic between levels. These specialized buffers are also used as instruction compression buffers [58, 59, 60].
2. Computation transformations for memory optimization assume a fixed memory hierarchy and try to modify the storage requirements and access patterns of the target computation to optimally match the given hierarchy. Examples include data structure selection [63, 64], register and memory allocation [65, 66], dynamically switching devices to a low-power mode [67, 68], extending the low-power duration of a device by code transformation [69, 70], and reducing memory address bus transitions [71, 72, 73, 74].
3. Synergic memory and computation optimization tries to concurrently optimize memory access patterns and the memory architecture [41, 42, 62].
In this dissertation, we explore disk drive optimization techniques, which are physical improvements and enhancements using additional hardware. Since we only concentrate on customizing one level of the memory hierarchy, without attempting to modify the access pattern, our approach falls into the first group, memory hierarchy design. A magnetic disk has been considered a fundamental component of a computer system since 1965 [39]. It primarily serves as long-term, non-volatile storage for files, and as a level of the memory hierarchy below main memory. The disk is included in the virtual memory implemented in most popular operating systems, acting as a slow memory during program execution. Though the disk is an indispensable part of general-purpose computer systems, so far no literature addresses the complete picture of the memory hierarchy including the disk, or how the memory systems (caches and DRAM) interact with the disk in terms of both performance and power dissipation. As we will show in the next section, one reason is that there are no proper tools available in the public domain for such studies. Such a tool would have to demonstrate accurate interactions between the caches, the main memory, and the disk via I/O requests from the operating system. The components must be implemented in detail to capture the systemic interactions between them.
Furthermore, as low power consumption is another system requirement, the tool needs to estimate the instantaneous power dissipation in each component to reflect its power-consumption efficiency. Such tools are considered very complex to implement. Therefore, SYSim was created as a complete-system simulator aimed at complete memory hierarchy studies. SYSim focuses on demonstrating the detailed interaction in the memory hierarchy in both the performance and power domains. In this dissertation, we employed SYSim to explore several disk enhancements and disk physical technology improvements during the I/O intensive phase. The experimental results are reported in both the total system performance domain and the power/energy consumption domain.

1.2. Contribution and Significance

The contribution of this dissertation is two-fold. First, we explore several disk enhancements and disk physical technology improvements both in isolation and in combination, considering both total system performance and power/energy consumption, focusing on the I/O intensive phase. Secondly, we create a complete-system simulator, SYSim, to demonstrate the detailed interaction in the memory hierarchy in both the performance and power domains. With SYSim, we are able to conduct complete-system experiments to evaluate the effect of the disk optimization techniques on actual total system performance and power/energy consumption. To our knowledge, this dissertation is the first to explore several disk enhancements and technology improvements both in isolation and in combination during the I/O intensive phase of applications. The disk enhancements we studied include disk caching, prefetching, write buffering, and parallel I/O. In addition, the disk technology improvements we explored include the disk seek time, rotational speed, and interface data rate. The results are reported in terms of both total system performance and power/energy consumption. We captured the following intriguing behaviors:
• During the I/O intensive phase, which consists of both disk reads and writes, the average CPI tracks only the average disk read response time, not the overall average disk response time, which includes both read and write response times. This is important because most disk studies report performance in terms of average disk response time.
• The effect of the disk cache size saturates once the cache reaches a particular size. That is, increasing the size of the disk cache will not result in better performance if the disk cache is already large enough. This behavior also agrees with the file system cache size and disk cache size exploration by Zhu and Hu in [75].
• In disk-read-dominated benchmarks, disk prefetching is more important than increasing the disk RPM. That is, rotational latency and bandwidth limitations can be overcome by simple prefetching during the disk read phase.
• In benchmarks containing both disk reads and writes, the disk RPM matters. This is because the disk maintains the semantics of non-volatile storage, so disk write requests must be committed to the disk immediately. The experiments show that techniques such as using NVRAM to buffer the writes may improve performance significantly.
• Increasing the number of disks in a RAID system does not translate proportionally into better performance. For instance, increasing the number of disks from one to eight does not improve performance by a factor of 8.
Worse, the power/energy consumption does increase proportionally with the number of disks: a system with 8 disks dissipates roughly 8 times the power of a single-disk system.
• The cost of writing in a RAID system is significant, as RAID systems usually suffer from small writes [80]. If the cost of a write is reduced, such as by implementing a write-buffering mechanism, the overall system performance improves significantly.
• Individual DRAM chips dissipate little power, but a system must have a substantial amount of DRAM to keep the disk from dissipating significant power. Moreover, when there is sufficient DRAM capacity in the system, the total DRAM power can be significant.
• Energy consumption appears to be more significant than power dissipation. The energy consumed can grow significantly with different disk parameters because the I/O latency substantially prolongs the program execution time.
• In systems with fast disks, increasing the system bandwidth alone fails to improve performance directly. To significantly improve total system performance further, disk enhancement techniques are required.

1.3. Organization of Dissertation

The dissertation is organized as follows. Chapter 2 provides an overview of the memory hierarchy. Both caches and DRAM-based memory systems are discussed, from the system level down to the circuit level, to provide the reader with fundamental insights about the memory hierarchy. Chapter 3 gives background on disk drives in computer systems and describes a drive's physical components, data organization, and interfaces. Chapter 4 discusses related work in the literature. It covers system simulators in the research community, disk drive enhancements, and the disk drive technology improvements considered in this dissertation. The methodology behind the proposed simulator, SYSim, is discussed in Chapter 5. It details the parameter settings in our experiments, the simulator implementation, and sample output to provide more insight into our proposed complete-system simulator. Chapter 6 presents the experiments and results for the memory system studies, mainly concerning system-level behaviors with a focus on the disk system configuration. Finally, we end the dissertation with the conclusions.

CHAPTER 2: MEMORY HIERARCHY

This chapter provides background information on the memory hierarchy in a general-purpose computer system, focusing primarily on the caches and the main memory. The chapter first describes the memory hierarchy in general and then the first level of the hierarchy, the cache. A detailed background on cache design is given to serve as the foundation for the concepts used in the cache tools. The cache tools we used in our experiments, CACTI and Wattch, are also introduced, along with a general explanation of how the selections were made for the cache configuration and power dissipation calculation. After that, we discuss the main memory included in a general-purpose computer, which is based on DRAM (Dynamic Random Access Memory). The basic structures of DRAM devices and memory system organizations are then described in some detail: the discussion starts with the smallest unit of the DRAM, a memory cell, moves up to the DRAM device, and then to the DRAM system organization.
Next, the DRAM memory access protocol is discussed to explain the fundamental DRAM operations and the interactions between them. Finally, the chapter concludes with an abstract view of DRAM memory controller design.

2.1. Memory Hierarchy

A relatively unlimited amount of fast memory at low cost is always a requirement for future computer systems. Memory hierarchies have been invented to support this requirement. A memory hierarchy is defined as the hierarchical arrangement of storage in a computer architecture. The hierarchy takes advantage of both the locality of accesses and the cost-performance ratio of memory technologies. The principle of locality states that most programs do not access all code or data uniformly in space or time; the programs have some form of locality in their accesses. Another principle is the memory hierarchy organization: each level of the hierarchy is organized in such a way that it can be accessed with higher speed and lower latency than the levels below it. These two principles are the basis of the hierarchy, which is built from memories of different speeds and sizes. Since fast memory technology is expensive, a memory hierarchy organizes different speeds and sizes of memory into several levels, with the smaller, faster, and more expensive memory placed closer to the processor. The goal is to provide a memory system with cost comparable to the cheapest level of memory and speed comparable to the fastest level of memory. The levels of the hierarchy usually are inclusive, meaning that data located in the upper levels are also included in the levels below. The data in the memory hierarchy may also be exclusive, which means data is allocated in only one level at a time among multi-level caches.
[Figure 2.1: Memory Hierarchy in Typical Computer Systems. The figure shows the CPU/microprocessor with the Level 1 cache (fastest), the Level 2-N caches (fast), and main memory (slow).]
Note that each level maps addresses from a larger memory to a smaller but faster memory placed closer to the processor in the hierarchy. As part of address mapping, the memory hierarchy also provides address checking and protection schemes that prevent harmful address accesses.

2.2. Virtual Memory

Virtual memory allows programs to run in a memory address space whose size and addressing are independent of the computer's physical memory. If a program exceeds the physical memory capacity, virtual memory automatically loads or unloads pages without the user program knowing. A page is simply a chunk of memory that is loaded or unloaded as a single unit. Therefore, virtual memory reduces the startup time for a program, since only the necessary pages are loaded at startup. Furthermore, in a multiprocessing environment, multiple processes can run simultaneously with their own independent address spaces. Virtual memory divides physical memory into pages and allocates them among different processes. It keeps unneeded pages in secondary storage and loads only the pages necessary for the running processes at a given moment. It automatically manages the two levels of the memory hierarchy involved, which are main memory and secondary storage. Virtual memory also protects the processes' address spaces by restricting the processes to only the address spaces to which they belong. With virtual memory, the address referenced by the processor is called a virtual address.
The virtual addresses are translated by a combination of hardware and software into physical addresses, which are used to access main memory. This process is called memory mapping or address translation. A data structure called a page table is used in address translation. Each page table entry is indexed by the virtual page number and contains the physical page address. Two virtual page numbers can map to the same physical page frame. However, the size of the page table is quite large compared with the size of the system memory: the page table can occupy approximately 0.1-1% of the system memory. Additionally, the address translation process would deteriorate system performance if every memory access required another access for address translation. The TLB (translation look-aside buffer) is introduced to mitigate the cost of the address translation process. Since the address translations for accesses have spatial locality, the TLB is used to improve address translation by caching recently accessed page table entries. The address translation via the TLB can be placed before or after the cache, depending on what scheme is used. Therefore, address translation is directly related to caching, and it will be discussed further in the next section regarding cache operations. The page table entries cached by the TLB have been recently referred to by the processor, and therefore their physical pages are located in main memory. When the processor refers to an address, the system looks it up in the TLB first. If the entry is found in the TLB, the referenced virtual address is translated to a physical address accordingly. This eliminates the need to look up the address in the actual page table located in main memory; accessing memory to look up a page table entry for address translation is not necessary in this case. However, when a TLB miss occurs, which is when the referenced page table entry is not found in the TLB, the system unavoidably accesses the page table for the entry. A TLB miss may cause a series of activities to allocate a new page for the entry, including memory page allocation, page table entry creation, and TLB entry insertion, depending on the status of the page. A TLB miss is handled by the operating system or by hardware. If the memory page was previously allocated, the operating system only needs to look up the page table entry in the page table and insert it into the TLB. Then the access can be restarted. In the process of scanning through the page table, the operating system may find that the referenced entry is actually unmapped. A mapped virtual page is defined as a virtual page previously allocated by the operating system, and whose mapping information is currently maintained by the operating system. In contrast, an unmapped page occurs in two situations: (1) it has not been previously allocated by the operating system, or (2) it has been de-allocated. In either situation, its mapping information is discarded, and the system initiates the page allocation process mentioned above. Virtual memory systems can be categorized into two classes: those with fixed-size blocks, called pages, and those with variable-size blocks, called segments.

2.3. Caches

The term cache is applied to techniques that buffer reusable data, exploiting the locality that exists in the reference stream. Caching can improve performance at many levels of a computer system; examples include file caches, disk caches, name caches, and web caches.
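The benefit that any of these caches provides can be summarized with the standard average-access-time relation. The minimal sketch below uses illustrative latencies (they are not measurements from this dissertation) to show how quickly the average cost per access degrades as the miss rate grows.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one cache level:
    AMAT = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative: a 2 ns cache in front of a 60 ns memory.
for miss_rate in (0.01, 0.05, 0.20):
    print(f"miss rate {miss_rate:.0%}: AMAT = {amat(2.0, miss_rate, 60.0):.1f} ns")
```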
Generally, caches in computer architecture refer to the first levels of the memory hierarchy. Caches have been an integral and ever-growing part of modern microprocessors. Approximately half of a modern microprocessor's die area is allocated to caches, and the number of transistors dedicated to caches is still growing. Caches are usually implemented with a number of arrays of SRAM cells, because an SRAM cell is superior to a DRAM cell in terms of performance. Caches also use the same fabrication process as the processor core.

2.3.1. Cache Memory Cell

Figure 2.2 shows the basic SRAM cell, which is an implementation of a six-transistor memory cell (6T MC). The 6T MC is connected in a way that restores the charge back into the memory cell, so the charge does not decay and the need for refreshing the cell is eliminated. The 6T MC has only one port, which can be used to either read or write a value. On a read access, the wordline (WL) is asserted and the voltage difference between the bitline pair is detected. The pair of cross-coupled transistors and the access transistors pull the voltage all the way up to Vdd or down to ground according to the stored bit value. After that, the bitlines are precharged to the original voltage for the next access. In the case of a write, the bitlines are driven with a differential voltage from an external source according to the new data to be written. The wordline WL is then asserted, and the value to be stored is latched into the memory cell. Most conventional designs use this full-CMOS six-transistor memory cell with different variations, including sizing, physical layout, and transistor threshold voltages for low power.
[Figure 2.2: The basic SRAM cell--a six-transistor memory cell (6T MC), showing the word line and the complementary bit-line pair.]

2.3.2. Cache Operations

A cache consists of multiple blocks of data. Data blocks are usually organized in a set-associative manner. For example, a cache with four sets and two blocks per set is called two-way set associative. A data block mapped onto a particular set can be freely placed in either of the two blocks in the set. A one-way set-associative cache is called a direct-mapped cache, and a cache with one set is called fully associative. Intuitively, more blocks per set translate into more freedom to place a data block in any available block in the set, but, as a result, more time and/or power is needed to process the lookup. A cache consists of two arrays, the tag array and the data array. The tag array contains an address tag for each block frame that gives the address of the data the block contains. It also contains a valid bit to identify the validity of the tag entry. Figure 2.3 shows a read operation in a cache. The virtual address from the processor is divided into a virtual page number and a page offset. The virtual page number is translated to a physical page address via a page table entry in the TLB. Then, the physical page address is combined with the page offset to produce a physical address. Next, the physical address is divided into tag, index, and block offset. The block offset field selects the desired bytes from the block, the index field selects the set in the cache, and the tag field is compared in parallel against the tags of the selected cache set for a hit. If the address tag matches any of the tags from the selected cache set, it is called a cache hit.
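The lookup just described can be sketched in a few lines. The block size (64 bytes) and number of sets (128) below are illustrative assumptions, not the configuration used in the experiments; the point is only how the physical address is split into tag, index, and block offset, and how the tag is compared against both ways of the selected set.

```python
def split_address(paddr, block_bytes=64, num_sets=128):
    """Split a physical address into tag, set index, and block offset
    (block size and set count are illustrative powers of two)."""
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = paddr & (block_bytes - 1)
    index = (paddr >> offset_bits) & (num_sets - 1)
    tag = paddr >> (offset_bits + index_bits)
    return tag, index, offset

def lookup(cache_sets, paddr):
    """2-way lookup: compare the tag against both ways of the selected set."""
    tag, index, offset = split_address(paddr)
    for way, (valid, stored_tag) in enumerate(cache_sets[index]):
        if valid and stored_tag == tag:
            return ("hit", way, offset)
    return ("miss", None, offset)

# A tiny 2-way cache: 128 sets, each holding two (valid, tag) pairs.
sets = [[(False, 0), (False, 0)] for _ in range(128)]
tag, index, _ = split_address(0x1234_5678)
sets[index][0] = (True, tag)                 # pretend this block was filled earlier
print(lookup(sets, 0x1234_5678))             # ('hit', 0, ...)
print(lookup(sets, 0x0BAD_BEEF))             # ('miss', None, ...)
```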
If no tag matches, it is called a cache miss, and the physical address is forwarded to the next level of the memory hierarchy. Note that accessing the cache and performing address translation at the TLB can be done simultaneously, depending on how the virtual address is divided and translated.

Figure 2.3: Block diagram of a two-way set-associative cache organization.

The address translation process can differ from the process described above. It depends on whether the virtual address or the physical address is used to tag and index the cache. The virtual address referenced by the processor core is not the same as the physical address used to refer to the physical location in memory, so the system requires address translation. Modern processors handle address translation through cooperation between the TLB and the operating system. Differences in cache and TLB placement in the memory hierarchy create different address translation schemes, which are described below.

The first scheme is physically indexed, physically tagged (PIPT). In PIPT caches, both the tag and the index of the cache are in the form of physical addresses, meaning that the tag and index are identified only after address translation completes. Therefore, the TLB lookup has to finish before the cache can be accessed, which can slow down the system. Moreover, if the referenced page table entry is not a valid entry in the TLB, the system has to access memory for the entry, translate the address, and then access the cache sequentially. As a result, the benefit of using the TLB can be overshadowed by the sequential memory accesses of the TLB miss process followed by the actual memory access.

The second scheme is virtually indexed, virtually tagged (VIVT). In VIVT caches, unlike PIPT, both the index and the tag are identified from the virtual address directly from the processor, without translation. The VIVT scheme improves the speed of cache access significantly by eliminating address translation from the access path. However, the scheme suffers from several problems, including:

1. Care must be taken with changes in TLB entries and changes in address space, since virtual address translations are usually changed as part of normal kernel operation. Cache lines must be flushed if their translations have changed.

2. Cache-line aliasing: multiple virtual addresses may exist for the same physical address, even in a single address space. No two of these virtual addresses should be present in the cache at the same time, even though they represent the same data.

To solve the problems of VIVT, the virtually indexed, physically tagged (VIPT) scheme is introduced. The VIPT scheme maintains a cache access speed comparable to VIVT. The scheme is the one described in the previous section: the index is identified from the virtual address, but the tag is identified from the physical address. VIPT can solve the aliasing problem because the tag refers to the physical address, so VIPT can detect aliasing when two identical tags exist in the cache. Depending on OS page mapping and the shared memory protocol, a VIPT cache can be constructed such that cache-line aliasing never occurs. Since, in VIPT, the cache lookup and the TLB lookup can proceed simultaneously, the cache access speed of VIPT is improved over PIPT. However, the processor cannot acknowledge a cache miss until the address translation is complete.
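As a rough illustration of a VIPT-style lookup, the sketch below indexes the cache with virtual-address bits while the tag comparison uses the translated physical address. The field widths, the cache layout, and the tlb_translate helper are assumptions made for this example, not the design of any particular processor.

```python
# Field widths are illustrative assumptions: 64-byte blocks, 128 sets, 2 ways.
BLOCK_BITS, INDEX_BITS = 6, 7
NUM_SETS, WAYS = 1 << INDEX_BITS, 2

def split(addr):
    """Divide an address into tag, index, and block offset fields."""
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & (NUM_SETS - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

def vipt_lookup(vaddr, cache, tlb_translate):
    """VIPT-style access: index with the virtual address while the TLB
    translates, then compare the physical tags of the selected set."""
    _, index, offset = split(vaddr)          # index is available before translation
    paddr = tlb_translate(vaddr)             # in hardware this proceeds in parallel
    ptag, _, _ = split(paddr)
    for way in range(WAYS):                  # parallel comparators in hardware
        entry = cache[index][way]
        if entry is not None and entry["valid"] and entry["tag"] == ptag:
            return entry["data"][offset]     # cache hit
    return None                              # cache miss: go to the next level
```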
Finally, the last scheme, physically indexed, virtually tagged (PIVT), is essentially never used and is not discussed further.

When a cache miss occurs, a block must be selected to be replaced with the referenced data. A direct-mapped cache selects the block specified by the address, since there is only one location the data can go. On the other hand, set-associative and fully-associative caches have many blocks to select from. There are a number of strategies for selecting the block to be replaced, collectively called the cache replacement policy. The most common policy is least-recently used (LRU), which selects the block that has not been accessed for the longest time.

There are two basic options when writing data to the cache: write back and write through. Write back writes the data only to the block in the cache; the modified cache block is written to a lower level of the memory hierarchy only when it is replaced. On the other hand, write through writes to both the block in the cache and the block in the lower-level memory. Because the existing data are not needed on a write, there are two common options on a write miss: write allocate and no-write allocate. Write allocate loads the block into the cache on a write miss and then restarts the write, which now hits. In contrast, no-write allocate does not load the data into the cache, but modifies the block in the lower level where the data is located.

To allow concurrent accesses, a cache can be given multiple ports. Multiported caches can be implemented in different ways; the most common methods are true multiporting and multiple independent banks. True multiported caches include additional access transistors for each port, which cause a significant increase in memory area and wire length within the cache. Multiple independent banks instead divide the cache into small banks, where each bank is a simple single-ported cache; multiple concurrent accesses can be satisfied if they target different banks. The disadvantage of multiple independent banks is that the cache controller requires additional complexity and intelligence to control each individual bank.

The cache controller's functionality includes controlling cache operations and accesses to comply with the cache strategies mentioned above. Its functions also include managing features such as multiporting with multiple banks, which requires intelligence to control accesses and prevent bank conflicts. Lastly, the cache controller controls the interface to the lower level of memory when a miss occurs. Although caches can be implemented in many different ways, the simple cache implementation described in this section serves as a fundamental cache design.

One approach to improving cache performance is to reduce the cache miss rate. All misses can be sorted into three categories: compulsory, capacity, and conflict. Compulsory misses are caused by accessing blocks for the first time; these misses occur to bring the blocks into the cache. Capacity misses are due to the limited size of the cache, which cannot hold all the blocks. Conflict misses are caused by multiple blocks mapping to the same set when the set cannot hold all of the blocks mapped to it.
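The replacement and write policies described above can be sketched as follows for a single cache set; the structure and the lower_level interface (with fetch and write_back methods) are hypothetical, not SYSim's implementation.

```python
from collections import OrderedDict

class LRUWriteBackCache:
    """Sketch of one cache set using LRU replacement with a
    write-back, write-allocate policy."""
    def __init__(self, ways=2):
        self.ways = ways
        self.blocks = OrderedDict()    # tag -> {"dirty": bool}; order tracks recency

    def _touch(self, tag):
        self.blocks.move_to_end(tag)   # most recently used moves to the end

    def access(self, tag, is_write, lower_level):
        if tag in self.blocks:                       # hit
            self._touch(tag)
            if is_write:
                self.blocks[tag]["dirty"] = True     # write back: mark dirty only
            return "hit"
        # miss: write allocate loads the block on both read and write misses
        if len(self.blocks) >= self.ways:
            victim_tag, victim = self.blocks.popitem(last=False)  # evict LRU block
            if victim["dirty"]:
                lower_level.write_back(victim_tag)   # dirty block written on eviction
        lower_level.fetch(tag)
        self.blocks[tag] = {"dirty": is_write}
        return "miss"
```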
To reduce misses, there are three common cache organization strategies: increasing the cache size, increasing the cache associativity, and increasing the block size. While increasing the cache size is a costly but fool-proof method, increasing the cache associativity and the block size each have an optimal point: increasing either too aggressively can cause the miss rate to increase.

2.3.3. CACTI: An Integrated Cache Timing, Power, and Area Model

CACTI [12] is a widely-accepted analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are the cache size, block size, and associativity, as well as array organization and process parameters. CACTI was originally written by Steven Wilton and Norman Jouppi at DEC WRL and is publicly available for academic purposes.

Figure 2.4 shows the organization of the SRAM cache considered in CACTI. First, the decoder decodes the address; then the appropriate row is selected by driving one wordline in the data array and the corresponding wordline in the tag array. Only one wordline in each array can be asserted at a time. Along the selected wordline, each memory cell is associated with a pair of bitlines, which are initially precharged high. Each memory cell in that row then pulls down one of its two bitlines according to the value stored in the cell. Each sense amplifier detects the changes on multiple pairs of bitlines, the number of which depends on the layout parameters; it determines the value of the memory cell by detecting which bitline in a pair is pulled down. Multiple pairs of bitlines can share one sense amplifier by inserting a multiplexor before the sense amps, and the select signals from the decoder are fed to the multiplexor to specify which pair of bitlines is detected.

Figure 2.4: Cache structure.

The data from the tag array is compared with the tag bits of the address. The number of comparators required depends on the associativity of the cache; for example, an N-way set-associative cache requires N comparators. The comparison results, hit or miss, drive the output multiplexors. These output multiplexors select the appropriate data from the data array in the case of a set-associative cache, or of a cache whose data array is wider than the output width, and drive the selected data out of the cache. Table 2.1 and Table 2.2 show the input and output parameters of CACTI.

Table 2.1: CACTI input parameters

  Input parameter   Use
  C                 Cache size in bytes
  B                 Block size in bytes (the number of bytes in a single cache entry)
  A                 Cache associativity
  TECH              Technology node in micrometers
  Nsubbanks         Number of cache subbanks
  b0                Number of bits of output data
  baddr             Number of bits of system address

Table 2.2: CACTI output implementation parameters

  Output parameter  Use
  Ndwl              Number of segmentations of the wordline (data array)
  Nspd              Aspect ratio control parameter (data array)
  Ndbl              Number of segmentations of the bitline (data array)
  Ntwl              Number of segmentations of the wordline (tag array)
  Ntspd             Aspect ratio control parameter (tag array)
  Ntbl              Number of segmentations of the bitline (tag array)

CACTI calculates the access and cycle times by estimating delays of the cache components, including:

• decoder
• wordlines (in both the data and tag arrays)
• bitlines (in both the data and tag arrays)
• sense amplifiers (in both the data and tag arrays)
• comparators
• multiplexor drivers
• output drivers (data output and valid signal output)
The delay of each of these components is estimated separately, and the results are combined to estimate the access and cycle time of the entire cache. The delay of each component is estimated by decomposing it into several equivalent RC circuits and using simple RC equations to estimate the delay of each stage.

There are two potential critical paths in a cache read access: the time to access the tag array and the time to access the data array. The time to read the tag array, perform the comparison, and drive the multiplexor select signals is compared with the time to read the data array. If the former is larger, the tag array is the critical path; otherwise, the data array is the critical path. Although cache designers try to make the tag path faster than the data path, this is not always possible, so both sides must be modeled in detail to determine the critical path. The cycle time is calculated by adding the access time and the precharge delay. The precharge delay is assumed to be dominated by the wordline fall time and the bitline rise time in the data array. The wordline fall time is approximately equal to the wordline rise time, and a constant bitline rise time equal to four inverter delays (each with a fan-out of four) is assumed.

To determine the optimal configuration, CACTI performs an exhaustive search over all combinations of the output parameters corresponding to the specified input parameters. The implementation with the best behavior among all criteria is considered optimal, based on CACTI's optimization algorithm, which differs between versions of CACTI.

CACTI 2.0 extends the first version of CACTI with support for fully-associative caches, multiported caches, feature size scaling, and power modeling. The inputs are the cache capacity, associativity, cacheline size, number of read/write ports, and feature size. Its analytical models compute the access time and the energy consumption of the cache for all possible configurations; the calculation for each configuration divides the data and tag arrays into smaller subarrays. Finally, CACTI 2.0 returns the configuration with the best access time and energy consumption as determined by its optimization function. The CACTI 2.0 optimization function takes into account only the access time and the energy consumption of the cache. CACTI 2.0 has no concept of the total area or the efficiency (the percentage of area occupied by the bits alone) of each configuration. Because it lacks a detailed area model, it approximates the wire capacitance and resistance associated with the wires in many parts of the cache. Additionally, since the optimization function considers only access time and energy consumption, the output cache configuration may not be efficient in area or aspect ratio.

CACTI 3.0 adds a detailed cache area model to CACTI 2.0. CACTI 3.0 calculates the area occupied by each component of the cache for each possible configuration. The model produces both the efficiency of the layout and the aspect ratio of the entire cache for each configuration. To determine the best configuration, the optimization function considers access time, power consumption, efficiency of the layout, and aspect ratio. The detailed area model accurately calculates the wire lengths and the associated capacitance and resistance of the address and data routing tracks, which results in more realistic power estimates. Finally, CACTI 3.0 also supports fully independent banking of caches.
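The exhaustive search and critical-path comparison described above might be sketched as follows; the delay/energy model passed in as estimate, the cost weighting, and the candidate partitionings are placeholders rather than CACTI's actual algorithm.

```python
import itertools

def access_time(tag_delay, data_delay):
    """The critical path is the slower of the tag side and the data side."""
    return max(tag_delay, data_delay)

def cacti_style_search(configs, estimate):
    """Sketch of a CACTI-style exhaustive search: 'configs' enumerates candidate
    (Ndwl, Ndbl, Nspd) partitionings and 'estimate' stands in for the RC-based
    delay/energy model."""
    best, best_cost = None, float("inf")
    for ndwl, ndbl, nspd in configs:
        tag_delay, data_delay, precharge, energy = estimate(ndwl, ndbl, nspd)
        t_access = access_time(tag_delay, data_delay)
        t_cycle = t_access + precharge          # cycle time = access + precharge delay
        cost = t_access + 0.5 * energy          # illustrative weighting only
        if cost < best_cost:
            best, best_cost = ((ndwl, ndbl, nspd), t_access, t_cycle), cost
    return best

# usage with a toy model: delays shrink as the arrays are segmented further
toy = lambda ndwl, ndbl, nspd: (1.0 / ndwl + 0.2, 1.2 / ndbl, 0.3, ndwl * ndbl * 0.1)
print(cacti_style_search(itertools.product([1, 2, 4], repeat=3), toy))
```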
2.3.4. Wattch

Wattch [19] is a framework for analyzing and optimizing microprocessor power dissipation at the architecture level. Wattch is claimed to be 1000 times or more faster than existing layout-level power tools while maintaining accuracy within 10% of their estimates, as verified against industry tools on leading-edge designs. It provides a power evaluation methodology within the popular SimpleScalar framework. Since Wattch is used in this dissertation as a power estimation tool for caches, this section discusses Wattch only in the context of cache power estimation.

Wattch calculates the dynamic power consumption Pd as

  Pd = C * Vdd^2 * a * f

where C is the load capacitance, Vdd is the supply voltage, f is the clock frequency, and a is the activity factor. The activity factor a is a fraction between 0 and 1 that represents the average switching activity on each clock cycle. Wattch estimates C based on the circuit and the transistor sizing, as described below. Vdd and f depend on the assumed process technology as defined in the header file power.h. The user can choose among 0.10, 0.18, 0.25, 0.35, 0.40, and 0.80 micron technologies, and Wattch automatically resizes the transistors accordingly. Wattch calculates the power consumption based only on the capacitance of each stage, rather than on both R and C. Additionally, Wattch analyzes and sums the power consumption of all paths, not only the critical path. In Wattch, certain critical transistors are automatically sized based on the model parameters to achieve reasonable delays.

Figure 2.5: Schematic of wordlines and bitlines in the Wattch array structure [19].

A cache in Wattch is implemented as an array structure. The power model of the cache is based on the number of rows, the number of columns, and the number of read/write ports. These parameters affect the size and number of decoders, the number of wordlines, and the number of bitlines. In addition, these parameters are used to estimate the length of the pre-decode wires and the lengths of the array's wordlines and bitlines. The wordline and bitline capacitances are computed in a similar way. The wordline capacitance includes the capacitance of the wordline driver, the gate capacitance of the cell access transistor multiplied by the number of bitlines, and the capacitance of the wordline's metal wire. The bitline capacitance includes the diffusion capacitance of the precharge transistor, the diffusion capacitance of the cell access transistor multiplied by the number of wordlines, and the metal capacitance of the bitline. The number of ports also affects the power consumption, because each additional port adds transistor connections on the wordlines, two additional bitlines, and longer wires on both wordlines and bitlines.

The Wattch authors estimate the physical implementations of cache structures with the help of the CACTI tools [12]. As described in the previous section, CACTI takes the cache size, block size, and associativity as inputs, and chooses the organization that gives the smallest access time. Wattch considers three different options for clock gating to disable unused resources in a multi-ported cache:

1. All-or-nothing. The full modeled power is consumed if any accesses occur in a given cycle, and zero power is consumed otherwise.

2. Linear scaling. If only a portion of a cache's ports are accessed, the power is scaled linearly with the number of ports accessed.

3. Linear scaling with a 10% floor. The same as the second option, except that unused units dissipate 10% of their maximum power rather than drawing zero power.
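To make the power model concrete, the sketch below evaluates the dynamic-power equation under the three clock-gating options just listed. The parameter values and function names are illustrative and are not taken from Wattch's source code.

```python
def dynamic_power(c_load, vdd, activity, freq):
    """Pd = C * Vdd^2 * a * f"""
    return c_load * vdd ** 2 * activity * freq

def cache_power(ports_used, total_ports, max_power, gating="linear_10pct"):
    """Per-cycle cache power under the three clock-gating styles above."""
    if gating == "all_or_nothing":
        return max_power if ports_used > 0 else 0.0
    if gating == "linear":
        return max_power * ports_used / total_ports
    if gating == "linear_10pct":
        idle = total_ports - ports_used
        return max_power * (ports_used + 0.10 * idle) / total_ports
    raise ValueError("unknown clock-gating option")

# usage: a hypothetical 2-port cache with one port active this cycle
p_max = dynamic_power(c_load=2.0e-12, vdd=1.8, activity=0.5, freq=1.0e9)  # watts
print(cache_power(ports_used=1, total_ports=2, max_power=p_max))
```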
To interface with SimpleScalar, the Wattch power model tracks which units are accessed on each cycle and how. The power model also varies the estimated power based on the number of ports used and on which clock-gating scheme is selected.

2.4. Main Memory: DRAM

The next level down in the memory hierarchy is the main memory. Main memory services requests from the cache and interfaces with the I/O. Main memory is usually made up of a set of DRAM chips organized so that memory requests can be serviced in an interleaved manner.

It is widely accepted that computer system performance is largely limited by the performance of DRAM-based memory systems, because the gap between the rate of DRAM memory system performance improvement and the rate of processor performance improvement has been growing continuously for the past thirty years. This phenomenon is well known as the memory gap. There are two main reasons for it. The first is the slow improvement of the interface between the processor and the DRAM, due to the off-chip location of DRAM. The second is that the decision to implement enhancements in DRAM chips relies heavily on manufacturing costs; therefore, only enhancements offering significant performance improvement for minimal manufacturing cost are considered for standard DRAM devices. This section discusses DRAM devices and memory system organizations. The content of this section is summarized from a part of David Wang's Ph.D. dissertation [17].

2.4.1. DRAM Memory Cell

Figure 2.6 shows the basic DRAM cell, which is implemented as a one-transistor, one-capacitor (1T1C) memory cell. This memory cell, widely used in modern DRAM devices, stores one data bit. The cell consists of one access transistor and one capacitor that holds the charge representing the bit value.

Figure 2.6: A DRAM memory cell--one transistor, one capacitor (1T1C).

Before a memory cell can be read, the bitlines have to be precharged to Vdd/2. The memory cell is read by asserting the wordline to turn on the access transistor; the voltage representing the data value is then placed on the bitline via the access transistor. The change in bitline voltage is only minimal, but it can be detected by the sense amplifier, which senses the value of the data and amplifies the signal up or down. Finally, the charge is restored into the capacitor, and the access transistor is turned off by removing the voltage on the wordline. The write process requires only two steps: first, the bitline is driven with the new data value; then, the row select is asserted on the wordline to turn on the access transistor, and the data bit is latched into the memory cell. The read process is therefore essentially a read followed by a restoring write. However, the charge stored in the memory cell capacitor leaks through the access transistor. As a result, data stored in DRAM cells must be periodically read out and written back to restore the charges representing the data value.
Otherwise, the charge stored in the capacitor will no longer represent the originally stored data bit. This process is called refresh. A DRAM device is typically refreshed every 32 or 64 milliseconds to keep its data usable.

In a DRAM array, the capacitance of a storage capacitor is much smaller than the capacitance of the bitline. Therefore, when the voltage representing the data value is placed on the bitline via the access transistor, the voltage on the bitline changes only minimally. This minimal voltage change is difficult to measure in an absolute sense, so a differential sense amplifier is included in DRAM devices to detect it by comparing the bitline voltage to a reference voltage.

2.4.2. Standard DRAM Device

Figure 2.7 shows a block diagram of a Fast-Page-Mode (FPM) DRAM device; the description of the standard DRAM device in this section is based on this diagram. All DRAM devices consist of one or more arrays of DRAM cells, organized into a number of rows and columns. A column represents the smallest unit of addressable memory on the device and can contain multiple bits of data. In this figure, the DRAM array consists of 4096 rows, 1024 columns per row, and 16 bits of data per column. Modern DRAM devices can have multiple arrays in each device; these arrays are referred to as banks. A DRAM device also includes a number of logic circuits to control the timing and sequence of device operation, such as the clock generator and the refresh controller.

Figure 2.7: 64 Mbit Fast Page Mode DRAM device (4096 x 1024 x 16) [17].

To access a row in a data array, the memory controller places the row address on the address bus and asserts the row address strobe (RAS) signal. The DRAM device buffers the address in the row address buffer and forwards it to the row decoder, which then asserts the specific row of the data arrays according to the accepted address. The data bits in the memory cells along the selected row are placed on their corresponding bitlines, where they are sensed and remain in the array of sense amplifiers.

Generally, one or more column accesses follow a row access. After the row access is completed, the bits held in the sense amplifiers are accessed according to the column access command. As with the row access, the memory controller places a column address on the address bus and asserts the appropriate column access strobe (CAS#) signals. The DRAM device accepts the column address, decodes it, and selects one column in the sense amplifiers according to the decoded column address. If the command is a read, the data for that column is placed onto the data bus and sent to the memory controller. If the command is a write, as specified by the write enable (WE) signal, the selected column is overwritten with data from the data bus.

A DRAM device with a particular capacity can be manufactured in different configurations. For example, a 1 Gbit DRAM device can be configured as 8 banks x 16384 rows x 2048 columns x 4 bits per column, 8 banks x 16384 rows x 1024 columns x 8 bits per column, or 8 banks x 8192 rows x 1024 columns x 16 bits per column. However, a configuration with more bits per row consumes significantly more current per row activation than a configuration with fewer bits per row, and a longer bitline also translates into more time to complete the refresh process for the entire array. Therefore, the different configurations lead to differences in current consumption, and the DRAM devices accordingly require different timing parameters to limit peak power consumption.
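The equivalence of these configurations is easy to check by multiplying out the organization parameters quoted above; the helper function below is ours, not part of any DRAM tool.

```python
def capacity_bits(banks, rows, columns, bits_per_column):
    """Total device capacity implied by a (banks, rows, columns, bits) organization."""
    return banks * rows * columns * bits_per_column

# the three 1 Gbit configurations quoted above
for cfg in [(8, 16384, 2048, 4), (8, 16384, 1024, 8), (8, 8192, 1024, 16)]:
    print(cfg, capacity_bits(*cfg) == 2**30)           # each prints True (1 Gbit)

# the 64 Mbit FPM device of Figure 2.7: one array of 4096 rows x 1024 columns x 16 bits
print(capacity_bits(1, 4096, 1024, 16) == 64 * 2**20)  # True
```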
Finally, a set of DRAM devices can be organized as a rank that processes each request together. A set of DRAM devices is connected to match a particular data bus width; for example, eight DRAM devices with 8-bit columns can be connected together to form a single rank of memory serving a 64-bit wide data bus.

In SDRAM and DDRx SDRAM devices, a column read command transports the data out of the DRAM devices in a burst manner. The column read command moves a variable number of columns, called a burst, as specified in the programmable mode register. Burst mode is possible in these DRAM devices because each column of the device is uniquely identifiable. Given the column address of a multiple-column burst, an SDRAM-based device rearranges the data in the burst and places the data of the requested address first. This capability is known as critical-word forwarding. In DDRx SDRAM burst mode, multiple columns are moved concurrently from the sense amplifiers to the read latch, and the data is then pipelined through a multiplexor to the external data bus. This process is called prefetching. With burst mode, the operating data rate of DDRx SDRAM devices can be improved significantly compared to SDRAM devices. However, the disadvantage of the prefetch architecture is that short-burst accesses are no longer available.

2.4.3. DRAM-Based Memory System Organization

This section discusses how multiple DRAM devices are organized to create a memory system. First, we clarify the terms used in the DRAM research community and in this dissertation. A channel is an interconnect to which DRAM devices are connected; the devices on a channel operate in lockstep with respect to each other. Usually, one channel is controlled by one DRAM memory controller. However, one DRAM controller can control more than one channel in concert to create a more efficient memory system; for example, multiple channels of FPM DRAM were used to sustain the throughput required by high-performance workstations and servers prior to SDRAM. The word bank is currently used by DRAM device manufacturers to describe the number of independent DRAM arrays within a DRAM device. Multiple banks in a DRAM device allow parallel accesses to data in different banks: read requests can be processed simultaneously if they access different banks. The word rank denotes a set of DRAM devices that respond to commands in lockstep in a memory system. In DRAM devices, a row is simply the group of storage cells that are activated in parallel in response to a row activation command. In general, DRAM devices are connected as ranks operating in lockstep, so a specific row in all DRAM devices of a rank is activated concurrently by a single row activation command. This means that a DRAM row actually spans the multiple DRAM devices of a given rank of memory. A column of data is the smallest independently addressable unit of memory in a DRAM device.
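Critical-word forwarding, described earlier in this section, amounts to rotating the burst so the requested column is transferred first. The sketch below shows one simple way this could be expressed; real SDRAM devices follow the sequential or interleaved burst orders defined in their data sheets, which this simplification ignores.

```python
def critical_word_first(row_data, requested_col, burst_length=4):
    """Return the columns of a burst with the requested column transferred first.
    The burst is the aligned group of 'burst_length' columns containing the request."""
    base = (requested_col // burst_length) * burst_length       # aligned burst start
    burst = [row_data[base + i] for i in range(burst_length)]
    rotate = requested_col - base
    return burst[rotate:] + burst[:rotate]                      # requested column first

# usage: a row of 8 columns, request column 6 with a burst of 4 -> columns 6, 7, 4, 5
row = [f"col{i}" for i in range(8)]
print(critical_word_first(row, requested_col=6))
```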
Memory system organizations in many computer systems are typically non-uniform: the system can contain DRAM devices of different sizes and organizations. The reason for this non-uniformity is flexibility. Most computer systems are designed to allow end users to arbitrarily upgrade the capacity of the memory system by inserting and removing commodity memory modules. To support such upgrades, DRAM controllers have to be flexible and handle the different configurations of DRAM devices and modules that the end user could place into the computer system.

The memory module was created to alleviate the cumbersome memory upgrade process. Essentially, memory modules are small electronic boards that carry a number of DRAM devices, so that multiple DRAM devices can be inserted into and removed from the system board conveniently. Memory modules also provide a standard interface to which different manufacturers can conform to produce compatible memory upgrades for different computer systems. Memory modules have been developed progressively over decades to provide flexibility and compatibility between different systems. Modern memory modules may also include an extra DRAM device to store ECC check bits.

Different memory modules have different configurations and timing parameters. To provide the memory controller with the necessary information, a small flash memory device known as a Serial Presence Detect (SPD) device is integrated onto the memory module. The SPD provides the configuration parameters and timing characteristics of the memory module to the memory controller at system initialization, so the controller can obtain the information it needs to access the DRAM devices on the module effectively.

The 30-pin Single In-line Memory Module (SIMM) was first standardized in the late 1980s. The advent of 72-pin SIMMs then made 30-pin SIMMs obsolete. SIMMs are single in-line, which means that the contacts on both sides of the module are electrically the same. In the late 1990s, 72-pin SIMMs in turn became obsolete with the arrival of dual in-line memory modules (DIMMs). DIMMs are larger in dimension than SIMMs and provide a 64- or 72-bit wide data bus interface. Unlike a SIMM, a DIMM has electrically distinct contacts on its two sides.

Registered memory modules have been introduced to reduce the electrical loads of a memory system with large numbers of DRAM devices. In a large memory system, the electrical loads are segmented through the use of registers that (1) separate the loads of the DRAM devices on the module from the system and (2) buffer the address and control signals at the interface of the memory module. The load segmentation limits the number of electrical loads and shortens the control signal paths of the memory system. However, registered memory modules introduce longer delays to memory accesses.

The topology and organization of the DRAM memory system are important because the topology determines the signal path lengths and electrical loading characteristics of the memory system. However, due to the sensitivity of memory systems to manufacturing costs, the memory system topology has remained essentially unchanged since the Fast Page Mode DRAM (FPM) era. Synchronous DRAM (SDRAM) and Double Data Rate SDRAM (DDR) also employ this memory system topology, and later memory systems adopt the topology with a trend toward fewer ranks.

Figure 2.8: The classic memory topology [17].

Figure 2.8 shows an example of the classic memory topology.
The figure shows a memory system in which 16 DRAM devices are organized into four separate ranks of memory connected to a single DRAM controller. The bi-directional data bus is divided into segments the width of one device's column, and each segment connects to one device in each rank. The uni-directional address and command bus connects to every DRAM device in the system. In this topology, a command is sent via the address and command busses to all DRAM devices in the memory system. The chip-select signal activates one selected rank to process a read command or to receive the data of a write command; the remaining DRAM devices in the system ignore the command and data sent by the memory controller.

2.4.4. DRAM Commands

The DRAM memory access protocol is complex and difficult to analyze in detail, largely because of the large number of command combinations in modern memory systems. This section provides an introduction to the basic DRAM commands and their functions; a more detailed analysis of the DRAM memory access protocol and the interaction of the various DRAM commands can be found in [17].

Figure 2.9 shows the data movement caused by different DRAM commands; the figure is used throughout this section as a generic DRAM device to define the basic memory access commands. The generic DRAM access protocol described by Wang [17] is based on a resource usage model: two different commands can be processed concurrently if they do not require access to the same resource at the same moment. However, other constraints must also be satisfied, such as timing parameters that limit the power dissipation of the DRAM system.

Figure 2.9: Command and data movement on a generic SDRAM device [17].

Figure 2.9 illustrates four overlapping operational phases for a DRAM command:

1. The command is transported over the address and command busses and decoded by the DRAM device.

2. Data are moved within a bank, in one of two directions: from the cells to the sense amplifiers for a read, or from the sense amplifiers back into the DRAM arrays for a write.

3. The data are moved through the shared I/O gating, read latches, and write drivers.

4. The DRAM device transfers the data to or from the host's memory controller. The data is placed onto the data bus by the DRAM device in the case of a read command, or by the memory controller in the case of a write command. Since the data bus may be connected to multiple ranks of memory, multiple commands to different ranks can cause conflicts on the data bus.

To operate DRAM memory systems, five generic DRAM commands are discussed here. Each command is associated with a number of timing parameters, usually defined in the DRAM device data sheet, that describe the specific command behavior. The five commands are row activation commands, column read commands, column write commands, precharge commands, and refresh commands. A row activation command moves data from the DRAM cells to the sense amplifiers; the data remain in the sense amplifiers so that subsequent column read/write commands can access multiple columns of data. A precharge command resets the array of sense amplifiers and the bitlines and prepares the sense amplifiers for the next row access command. Finally, a refresh command retains the electrical charges in the memory cells of a particular row.
Additionally, some modern DRAM devices support commands involving more complex actions, for example a compound column-read-and-precharge command, the posted-CAS command in DDR2 SDRAM, and additional commands that manage specialized hardware. However, only the generic commands are introduced in this section as background for DRAM operations.

2.4.4.1. Row Activation Command

To access data in the DRAM arrays, the first step is to move an entire row of data to the sense amplifiers; this is the job of the row activation command. Before another row of data can be accessed, the charges must be restored from the sense amplifiers back into the cells. Therefore, the row activation command is associated with two timing parameters: tRCD and tRAS.

The row-to-column command delay, or tRCD, is the time for the row activation command to move data from the DRAM cell arrays to the sense amplifiers. After tRCD, a column read or column write command can transport data between the sense amplifiers and the memory controller via the data bus.

The second timing parameter, tRAS, concerns the charge restoration process. Because of the data movement to the sense amplifiers, a row activation command discharges the DRAM cells of the accessed row. To prepare for a subsequent access to a different row, the data charges must be restored from the sense amplifiers back into the DRAM cells. The row access strobe latency, or tRAS, is defined as the time for a row access command to discharge and then restore the data of the accessed row of DRAM cells. After tRAS, the data restoration process is complete, the sense amplifiers are ready, and the DRAM array can be precharged for another row access to the same bank.

2.4.4.2. Column Read Command

The function of a column read command is to move particular columns of data from the array of sense amplifiers to the memory controller via the data bus. The column read command consists of four different but overlapping phases:

1. The column address and command are transported over the address and command bus and decoded by the DRAM device.

2. The specified data columns are accessed in the sense amplifier array of the specified bank and moved to the I/O gating.

3. The data are transported out to the data bus via the I/O gating.

4. The data are transported on the data bus for the duration of the data burst, tBurst.

A column read command is associated with two timing parameters: tCAS and tBurst. The column access strobe latency (tCAS, also written tCL) is the time for the DRAM device to place the first chunk of the requested data onto the data bus after the memory controller sends the column read command; tCAS covers steps 1 through 3, up to the point at which the first chunk of the requested data has moved from the sense amplifiers onto the data bus. In modern memory systems, data are sent over the data bus in bursts, usually of 2, 4, or 8 beats; one beat in double data rate systems usually corresponds to half a clock cycle. The data burst duration, tBurst, is conventionally defined in units of time rather than clock cycles.

2.4.4.3. Column Write Command

The function of a column write command is to move data from the memory controller to the sense amplifiers of a specific bank. The column write command is quite similar to the column read command, but with data moving in the opposite direction.
Its operating phases therefore mirror those of the column read command, with the data moving in the reverse direction:

1. The column address and column write command are transported over the address and command bus.

2. The memory controller places the data onto the data bus.

3. The data are transported from the data bus through the I/O gating.

4. The data arrive at the sense amplifiers of the appropriate bank.

A column write command is associated with one timing parameter, tCWD. The column write delay, or tCWD, is the time the memory controller waits after issuing the write command before placing the data onto the data bus. tCWD is defined differently in different memory systems: in earlier DRAM and SDRAM, memory controllers place the write data and the command on the bus at the same time, so tCWD is zero; in DDR SDRAM, the write data is delayed one full clock cycle; and in DDR2, the write delay is one cycle less than tCAS. Another relevant timing parameter is the write recovery time, tWR, defined as the time between the moment the data burst ends and the moment the data complete their movement into the DRAM arrays.

2.4.4.4. Precharge Command

There are two steps in accessing data on a DRAM device. First, data are moved from the DRAM cells to the sense amplifiers by a row access command. Second, a number of column access commands move data between the sense amplifiers and the DRAM controller. Before data from a new row can be accessed, a precharge command prepares the DRAM device by resetting the array of sense amplifiers and the bitlines to their original state. The precharge command consists of two phases:

1. The precharge command is transported to the DRAM device.

2. The selected bank is precharged.

The precharge command is associated with one timing parameter, tRP. The row cycle time, tRC, is the sum of the two row-access-related timing parameters, tRAS and tRP. The row cycle time of a DRAM device is usually an indicator of how quickly the device can access data, since it covers:

1. moving data from the DRAM cell arrays into the sense amplifiers,

2. restoring the data to the DRAM cells, and

3. precharging the bitlines to the reference voltage level, ready for another row access command.

Therefore, the row cycle time restricts the data retrieval speed of the DRAM device when successive accesses go to different rows of the same DRAM bank.

2.4.4.5. Refresh Command

A refresh command is used to periodically restore the data values in the DRAM cells, because the electrical charge in each storage capacitor gradually dissipates through its access transistor. The time interval between refresh commands must be shorter than the period over which the data in the storage cells deteriorate to indistinguishable values. The refresh command has the disadvantages of consuming bank bandwidth and power, so a variety of refresh mechanisms are used by different systems to reduce controller complexity and/or bandwidth impact. A refresh command reads the row address from an internal register, the DRAM device sends that row address to all banks, and each bank then refreshes that row concurrently. The refresh command is associated with the timing parameter tRFC. The refresh cycle time, tRFC, is at least as long as the row cycle time tRC. In modern DRAM memory systems, the memory controller typically issues a refresh command to each row in a bank once every 32 or 64 milliseconds.
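The timing parameters defined in this section combine into simple latency expressions. The sketch below computes an illustrative read latency for the three common row-buffer situations; the parameter values are made up for the example and are not taken from any data sheet.

```python
# Illustrative timing parameters in nanoseconds (not from a real data sheet).
timing = {"tRP": 15.0, "tRCD": 15.0, "tCAS": 15.0, "tRAS": 45.0, "tBurst": 10.0}

def read_latency(row_state, t):
    """Time from issuing a read until its data burst completes, by row-buffer state."""
    if row_state == "open_hit":        # requested row already in the sense amplifiers
        return t["tCAS"] + t["tBurst"]
    if row_state == "closed":          # bank precharged: activate, then read
        return t["tRCD"] + t["tCAS"] + t["tBurst"]
    if row_state == "open_miss":       # different row open: precharge, activate, read
        return t["tRP"] + t["tRCD"] + t["tCAS"] + t["tBurst"]
    raise ValueError("unknown row-buffer state")

t_rc = timing["tRAS"] + timing["tRP"]   # row cycle time, as defined above
for state in ("open_hit", "closed", "open_miss"):
    print(state, read_latency(state, timing), "ns; tRC =", t_rc, "ns")
```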
The sequence of DRAM commands can vary across systems depending on the policy of the memory controller. For applications with high locality in their memory accesses, it is beneficial to keep an active row of data at the sense amplifiers, because subsequent memory accesses can then retrieve data from the same row directly without activating another row, saving both latency and energy. On the other hand, applications with low locality of accesses favor memory systems that immediately precharge the DRAM array and prepare the bank for another row access. Memory systems of the former kind, which keep rows active at the sense amplifiers, are called open-page memory systems; memory systems of the latter kind, which precharge a bank right after a column access, are called close-page memory systems.

2.4.5. Memory Controller

The system controller correctly and efficiently manages the flow of data among the processors, the I/O devices, and the memory system. The DRAM memory controller is located inside the system controller, and its function is a subset of the system controller's: to manage the flow of data to and from the DRAM devices. The interface protocol of a DRAM memory controller is characterized by the DRAM access protocol and its timing parameters. Most DRAM devices are manufactured without any intelligence; the devices operate only when commands are sent to them. Data sheets provided by DRAM manufacturers specify the timing constraints for individual DRAM commands, and to maintain correct DRAM operation, the DRAM controller must operate within the timing parameters defined in the data sheet.

This section provides an overview of a number of issues important to the design and implementation of modern DRAM memory controllers. Specifically, the following items are particularly important:

• row-buffer management policy
• address mapping scheme
• memory transaction and DRAM command ordering scheme

2.4.5.1. Row-buffer Management Policy

Row-buffer management policies govern the operation of the sense amplifiers. In modern DRAM devices, arrays of sense amplifiers act as temporary buffers for a previously-accessed row of data. Modern memory controllers typically employ one of two policies to manage the sense amplifiers: the open-page policy or the close-page policy; other row-buffer management policies also exist. The row-buffer management policy affects the selection of the address mapping scheme, the memory command re-ordering mechanism, and the transaction re-ordering mechanism of the DRAM memory controller.

The open-page row-buffer management policy is aimed at favoring memory access sequences directed at the same row of memory. The open-page policy keeps the sense amplifiers open, meaning they are not immediately precharged, so each array of sense amplifiers holds an entire row of data for subsequent accesses. This policy assumes that different columns of the previously accessed row may be accessed again in the near future.
If a subsequent memory read access goes to the same row as the previous memory access, the read can complete with the minimal latency of tCAS, because only a column access command is needed to satisfy it. However, if the access is directed to a different row of the same bank, the memory controller must perform a series of actions: precharging the DRAM array, performing another row access, and then performing the column access.

Unlike the open-page policy, the close-page row-buffer management policy is aimed at favoring random accesses, which tend to map to different rows of memory. It is adopted in memory systems designed for large-scale multiprocessor systems and some specialty embedded systems. In such systems, memory request sequences from many sources are combined, so the spatial locality of the resulting memory access sequence is greatly reduced. Additionally, because the different command combinations have different timings, the resulting sequence of DRAM commands in an open-page system is much more difficult to schedule efficiently than the same memory access sequence in a close-page memory system.

2.4.5.2. Address Mapping Scheme

The purpose of an address mapping scheme is to reduce bank conflicts and increase parallelism in the memory system. In a DRAM memory system with an open-page row-buffer management policy, a sequence of consecutive read requests to the same row of data can be performed in pipelined fashion, while a similar sequence of read requests under a close-page row-buffer management policy incurs longer latency. In a memory system that uses the close-page row-buffer management policy, the latency of each access remains relatively constant. Given these different access preferences, the optimal address mapping schemes differ for open-page and close-page memory systems.

Open-page Baseline Address Mapping Scheme

In a system that uses the open-page row-buffer management policy, consecutive cacheline addresses should be placed into different channels, and then adjacent cachelines should be mapped into the same row, same bank, and same rank. The baseline address ordering is: row, rank, bank, cachelines per row, channel, and cacheline offset, from most significant to least significant bits.

Close-page Baseline Address Mapping Scheme

The key assumption of the close-page row-buffer management policy is that there is little spatial locality in the sequence of memory accesses. In close-page memory systems, mapping in a manner similar to the open-page scheme would result in bank conflicts, which greatly under-utilize the available memory bandwidth. To avoid bank conflicts, adjacent cachelines are mapped to different channels, then to different banks, then to different ranks. The baseline ordering is: row, cachelines per row, rank, bank, channel, and cacheline offset, from most significant to least significant bits.

In addition to the row-buffer management policy, flexibility and scalability must be considered: memory system organization parameters that can be varied by user upgrades, such as rank and channel, are typically assigned to the highest address range, so that the lower-order address assignment can remain unchanged when the memory modules in the system are altered. However, such a mapping allows an application to utilize only a subset of the memory address space and limits the memory available to it to fewer ranks; an address mapping scheme with more scalability thus provides less rank or channel parallelism to memory accesses.
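A minimal sketch of the two baseline orderings is given below; the field widths are assumptions chosen for illustration, whereas a real controller would derive them from the installed modules.

```python
# Illustrative field widths (bits): 1 channel bit, 1 rank bit, 3 bank bits,
# 14 row bits, 6 bits of cachelines per row, 6 bits of cacheline offset.
FIELDS = {"channel": 1, "rank": 1, "bank": 3, "row": 14, "col": 6, "offset": 6}

# Most significant to least significant, as described in the text.
OPEN_PAGE_ORDER  = ["row", "rank", "bank", "col", "channel", "offset"]
CLOSE_PAGE_ORDER = ["row", "col", "rank", "bank", "channel", "offset"]

def map_address(paddr, order):
    """Slice a physical address into DRAM coordinates using the given field order."""
    coords = {}
    shift = 0
    for name in reversed(order):              # least significant field first
        width = FIELDS[name]
        coords[name] = (paddr >> shift) & ((1 << width) - 1)
        shift += width
    return coords

# usage: under the open-page ordering, consecutive 64-byte cachelines alternate
# channels while staying in the same row, bank, and rank, as the text describes
for addr in (0x000000, 0x000040, 0x000080):
    print(hex(addr), map_address(addr, OPEN_PAGE_ORDER))
```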
2.4.5.3. Memory Transactions and DRAM Commands

A design engineer must consider the additional complexity of a high-performance DRAM memory controller. The controller design must take into account the specific behavior of the DRAM memory system, application-specific requirements, and the type and number of processing elements in the system. Fortunately, some basic strategies have been developed to aid in the design of a high-performance DRAM memory controller. Among them are bank-centric queue organization, write caching, and "seniors first" (servicing the oldest pending request first); these are common to many high-performance DRAM controllers. In addition, specific adaptive arbitration algorithms are unique to particular DRAM controllers.

Write Caching

Write caching has been used at many levels of the memory hierarchy. The basic idea is that write requests are typically not performance-critical, while read requests may be. Additionally, DRAM devices perform poorly when read and write requests alternate back to back. Therefore, caching write requests and allowing read requests to proceed ahead of them is beneficial to performance.

DRAM-Bank-Centric Request Queuing Organization

One approach that helps when multiple commands are in flight in a DRAM memory controller is to arrange multiple queues on a per-bank basis, so that DRAM commands that access the same bank are sent to the same queue. Per-bank queuing allows a memory controller to efficiently schedule requests to the same bank, whether to the same row or to different rows of that bank. Additionally, a bank-centric organization with a bank-rotation mechanism can process concurrent requests to different banks, which results in greater utilization of the memory system.

Feedback Directed Scheduling

Typically, transaction requests do not carry priority information that would allow a memory controller to schedule the transactions more effectively. With direct communication between a processor and an integrated DRAM memory controller, the controller can schedule DRAM commands based on the availability of resources and the DRAM command access history. To achieve high performance, such integrated DRAM memory controllers have to be aware of the state and access history of the processor contexts.
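The following toy scheduler combines the three strategies just described (per-bank queues, write caching, and servicing the oldest pending read first); its data structures and policy details are illustrative and do not correspond to any real controller.

```python
from collections import deque

class ToyDramScheduler:
    """Per-bank read queues plus a write cache; reads are served oldest-first,
    and cached writes are drained only when no reads are pending."""
    def __init__(self, num_banks=8):
        self.read_queues = [deque() for _ in range(num_banks)]
        self.write_cache = deque()
        self.next_bank = 0                      # bank-rotation pointer

    def enqueue(self, request):
        # request: {"type": "read"/"write", "bank": int, "addr": int}
        if request["type"] == "write":
            self.write_cache.append(request)    # writes are buffered, not issued now
        else:
            self.read_queues[request["bank"]].append(request)

    def next_command(self):
        # rotate over banks so concurrent reads to different banks all make progress
        for i in range(len(self.read_queues)):
            bank = (self.next_bank + i) % len(self.read_queues)
            if self.read_queues[bank]:
                self.next_bank = (bank + 1) % len(self.read_queues)
                return self.read_queues[bank].popleft()   # oldest read in that bank
        if self.write_cache:                    # no reads pending: drain a write
            return self.write_cache.popleft()
        return None
```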
CHAPTER 3: OVERVIEW OF DISKS

The disk drive is a highly complex electro-mechanical system developed over decades of research and experimentation. The disk drive incorporates many disciplines: physics for magnetic recording and the read/write heads; materials science for the materials used in the disk platters and coatings; mechanical engineering for the actuator and the slider carrying the recording head; electrical engineering for the spindle motor, its control, and the servo mechanism of the actuator; electronics for the read/write channel and the various control electronics; and computer science for the architecture of the drive controller and its cache, and for the firmware and algorithms that control the operation of the disk drive.

The magnetic disk has been considered a fundamental component of computer systems since 1965 [39], and magnetic hard disk drives will continue to be the dominant form of secondary storage for the foreseeable future. These drives primarily serve as long-term, non-volatile storage for files and as a level of the memory hierarchy below main memory. The disk is also included in the virtual memory system implemented in many popular operating systems, acting as a slower form of memory during program execution.

The performance gap between the processor and the disk drive is even greater than the gap between the processor and DRAM. Processor performance has doubled approximately every two years, and DRAM device data rates are increasing at a rate of 100% every three years [17] with each new generation of DRAM devices. Unfortunately, disk drive access time improves by only 10-15% a year [78]. Addressing this ever-growing problem requires investigating the I/O subsystem. A variety of disk optimization techniques, including caching, write buffering, prefetching, and parallel I/O, have been invented. These techniques were introduced in server-class disk drives over a decade ago; as technology becomes less expensive with time, they are increasingly applied to workstation disk drives as well. In addition, with better technology, the characteristics of the physical disk also improve, including the RPM (rotational speed in rotations per minute), the seek time, and the disk drive interface.

As the performance gap between the processor and disk-based storage continues to widen, increasingly aggressive optimization of the storage system is needed. This requires a profound understanding of the real potential of the various I/O optimization techniques and how they work together. Therefore, we must study their effects on the entire system.

This chapter presents a high-level discussion of disk drive technology, including an explanation of how a disk drive works. This background serves as the foundation for a better understanding of the rest of the dissertation, and it will help clarify a number of the design issues and trade-offs that affect the performance and power consumption of a disk drive and the disk-based storage subsystems discussed later. The content of this chapter derives greatly from the disk section in [102].

3.1. Classifications of Disk Drives

Disk drives can be classified in a variety of ways. One way is by the drive's form factor: modern disk drives are usually manufactured in one of four form factors, namely 3.5-inch, 2.5-inch, 1.8-inch, and 1-inch, where these numbers indicate the width of the sealed disk drive unit. However, form factor alone is not an effective way to classify a disk drive, because it does not indicate the underlying technologies inside the drive; disk drives with similar form factors can offer very different functionality, performance, and reliability.

A more conventional way to classify a disk drive is according to the application platform, which traditionally is one of the following:

• server class drives, used in high-end or enterprise systems;
• desktop class drives, used in personal computers and low-end workstations;
• mobile class drives, used in laptop or notebook computers.

The disk drive's characteristics and the requirements of each class are set by the application environment in which the drive is used. Server drives require high reliability and performance; desktop drives require low cost due to the highly price-competitive personal computer market; and mobile drives require low power consumption. Today, the boundaries between these classifications have become blurred, because features of one class have started to drift into the other classes.
For example, reliability is required of all disk drives, and there is an ever-growing trend for some higher-end systems to use desktop drives in certain applications to take advantage of their low cost.

Another disk drive classification is by the type of interface the drive provides. Current interfaces in modern disk drives are Fibre Channel (FC), parallel SCSI (Small Computer System Interface), parallel ATA (Advanced Technology Attachment), and the emerging serial ATA (SATA) and serial attached SCSI (SAS). Server class drives are commonly available with either an FC or a SCSI interface, while desktop, mobile, and consumer electronics drives invariably come with an ATA interface. Since server class drives, which use a SCSI interface, are more than twice as expensive as desktop drives, people often mistakenly think that the SCSI interface is expensive. However, the cost of server class drives is mostly due to the expensive technologies implemented in them to deliver high reliability and performance, not to the SCSI interface. Nevertheless, some high-end storage systems are starting to use ATA desktop drives in certain applications to achieve a lower system cost.

3.2. Areal Density Growth Trend

Areal density is measured as the number of bits that can be recorded per square inch, and it is one of the most important parameters of a disk drive. It determines the amount of data that can be stored on each platter for a given disk diameter, and thus the total storage capacity of a disk drive given the number of platters it contains. Even though there are many other contributing factors, areal density is ultimately the single most important parameter governing the cost per megabyte of a disk drive. The rapid growth of areal density over the past thirty years has driven the storage cost of disk drives down to the level that makes them the technology of choice for online data storage. Recently, areal density has reached the point where it has become economically feasible to miniaturize disk drives, opening up consumer electronics applications for such small drives.

Areal density also has a profound influence on performance. Areal density consists of two components: tpi and bpi. The recording density in the radial direction of a disk is measured in tracks per inch, or tpi; the recording density along a track is measured in bits per inch, or bpi. However, many factors prevent a disk drive from utilizing the maximum areal density it could provide. For a rotating storage device spinning at a constant angular speed, the highest bpi is at the innermost diameter (ID) of the recording area, while the outer tracks may utilize a lower bpi due to limitations in data organization. Some of the technology improvements that have enabled areal density growth include:

• thinner magnetic coatings with improved magnetic properties;
• better head designs and fabrication of smaller heads/sliders;
• lower flying height (the spacing between the head and the magnetic material), resulting in higher linear bit density (bpi);
• more accurate head-positioning servos, enabling narrower track pitch (higher tpi).

Hsu and Smith [78] report that, for IBM 3.5-inch server class disks, linear density has been increasing by approximately 21% per year, while track density has been going up by around 24% per year since 1988.
In the last few years, areal density has increased especially sharply, so that with a least-squares estimate (no weighting), the compound growth rate is as high as 62%. With only the areal density increasing, the average disk response and service times are improving by about 9% per year. This performance improvement is due to areal density increasing. The data is packed more closely together and can be accessed with a smaller physical movement. On the other hand, Hitachi GST's areal density growth rate is reported to be 60% per year since 1991, and the rate has further accelerated to an incredible 100% per year since 1997 [98]. This acceleration is the result of the introduction of MR read heads in 1991, GMR read heads in 1997, and AFC media in 2001. Since 1997, track densities have been increasing faster than linear densities, which is the principal factor for the continuing 61 increase in areal density. The track density growth rate is reported at 50% per year while the linear density compound growth rate is around 30%. However, to achieve the areal density in the range of terabits per square inch in 2010, both tpi and bpi growth rate have to be scaled to 25% and 14%, respectively, and a compound growth rate of 46%. 3.3. Performance Metrics The two widely used measurements of disk drive performance include response time and throughput. The response time is defined as the amount of time starting from the time a request is sent to the disk drive system until the moment the disk drive completes the data transfer. Requests to a disk drive system are usually referred to as I/Os, an abbreviation for input/output. Response time measures the speed of a drive to service a single request. On the other hand, throughput measures the disk drive?s ability to serve an amount of data or a number of requests in a unit of time. Throughput is usually defined in terms of one of two units: number of I/Os per second (IOPS) or an amount of data transferred per second (MB/s). Both units are equivalent to each other as they can be converted to the other through the unit of each I/O request. Response time and throughput are closely related and are usually closely correlated. As a result, a drive with fast response time will generally also have high throughput. For a given disk drive system, the characteristics of a workload can directly affect the performance of the disk drive system in terms of both response time and throughput. The characteristics of a workload that can influence performance include: ? Block size ? a large block size takes longer time to transfer than a small block size. ? Access pattern ? how much sequentiality or randomness does the sequence of 62 accesses have? ? Footprint - the size of the accessed disk area can affect the performance. Accessing only a small area of the disk translates into smaller seek distances between I/Os. ? Command type - performance of a read request and a write request can be different due to the disk enhancements equipped in the system. ? Command queue depth - the size of the queue can affect the performance. With a deeper queue, the request scheduler has more options to improve the disk performance by intelligent scheduling. ? Command arrival rate - since both read requests and write requests are sent to the disk system in the form of bursts, the longer the bursts, the longer the request wait time. 
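Before turning to the components of the I/O completion time, the equivalence between the two throughput units mentioned above (IOPS and MB/s) can be made concrete. The sketch below uses purely illustrative numbers, not measurements from this dissertation.

```python
# Convert between the two common throughput units via the average request size.
def iops_to_mbps(iops: float, avg_request_kb: float) -> float:
    """MB/s implied by an IOPS figure and an average request size."""
    return iops * avg_request_kb / 1024.0

def mbps_to_iops(mbps: float, avg_request_kb: float) -> float:
    """IOPS implied by a bandwidth figure and an average request size."""
    return mbps * 1024.0 / avg_request_kb

print(iops_to_mbps(200, 64))   # 200 IOPS of 64 KB requests -> 12.5 MB/s
print(mbps_to_iops(12.5, 64))  # back again -> 200 IOPS
```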
One fundamental element that determines both response time and throughput of a disk drive is the I/O completion time, which is defined as the time a disk drive requires to process and complete a user request. The I/O completion time consists of four major components: command overhead, seek time, rotational latency, and data transfer time.

3.3.1. Command overhead
Command overhead is defined as the time the disk drive's controller and electronics take to process an I/O request before the request is sent to the disk's mechanical parts. The command overhead at the disk drive's controller includes the time the controller takes to interpret the command and allocate the necessary resources to service the request. This controller overhead is the major portion of command overhead. Another portion of the command overhead is spent at the end of the I/O request, including sending a completion signal to the host and cleaning up unused resources. With better technology, command overhead has been steadily decreasing over the years. The main reason is that the hard drive's microcontroller and function-specific hardware have become faster.

3.3.2. Seek time
Seek time is defined as the time the disk drive takes to move the read/write head from its current track to the destination track to service the next request. Seek time consists of two components:
1. A travel time for the actuator to move from its current track to the destination track
2. A settle time for centering the decelerating head over the destination track and maintaining the center position of the track until the data access process starts.
Since seek time involves the mechanical parts of the disk drive, it is one of the largest components of an I/O access. As a result, the disk research community usually considers seek time to be very significant. However, we will show later that, in a single-user environment, seek time is unimportant in our experiments due to the application's behavior. Despite the fact that today's disk drives implement zoned bit recording (ZBR), which makes the data organization on the disk differ from zone to zone, treating ZBR disks as non-ZBR disks is in general an acceptably close approximation for seek time studies. The historical rate of improvement in seek time is 8% per year [78].

3.3.3. Rotational latency
After seeking and settling the head, the disk rotates the platters to position the head at the destination sector. Rotational latency is defined as the time the disk takes to rotate the start of the destination sector under the head. Disk manufacturers report the average rotational latency for a disk as half the time of one revolution. A more conventional way to specify the disk rotational latency is by the rotational speed. The rotational speed is usually reported in terms of RPM, or revolutions per minute. The rotational speed is therefore simply inversely proportional to the rotational latency. The historical rate of rotational speed increase is 9% per year [78]. Like seek time, rotational latency is one of the largest components of an I/O because it involves moving mechanical parts in the disk, which are relatively slow compared with electronic components.

3.3.4. Data transfer time
Data transfer time is defined as the time the data is transferred between the disk drive and the host system. Data transfer time is proportional to the transfer size and inversely proportional to the data rate.
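Putting the four components together, a simple service-time estimate can be sketched as follows. All parameter values are illustrative assumptions (no specific drive or SYSim model is implied), and the transfer-time term anticipates the discussion of transfer size and data rate that follows.

```python
# Rough I/O completion-time model: command overhead + seek + rotational latency
# + data transfer time. Every parameter below is an illustrative assumption.
def io_completion_time_ms(seek_ms: float,
                          rpm: float,
                          transfer_kb: float,
                          media_rate_mb_s: float,
                          overhead_ms: float = 0.2) -> float:
    rotational_latency_ms = 0.5 * 60_000.0 / rpm          # average = half a revolution
    transfer_ms = transfer_kb / 1024.0 / media_rate_mb_s * 1000.0
    return overhead_ms + seek_ms + rotational_latency_ms + transfer_ms

# Example: 7,200 RPM drive, 4 ms average seek, 64 KB request, 60 MB/s media rate
print(f"{io_completion_time_ms(4.0, 7200, 64, 60):.2f} ms")  # ~9.41 ms
```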
The average transfer size is specified by the operating system and is determined by the application characteristics. The data rate of a disk drive can be generally be categorized into two types: (1) the media data rate and (2) the interface data rate. Media data rate or the internal datarate (IDR) is defined as the rate that data may be transferred in and out of the magnetic recording media. Generally, the media data rate depends mainly on bpi and RPM. The internal or maximum media data rate of Hitachi GST hard drives [98] increased at about 40 per cent per year. Today's server class hard disk drives have the internal data rates beyond 100 MBytes/s. The increase in internal data rate is mostly 65 due to increasing bpi and disk RPM to 10,000-15,000 RPM. However, an increasing internal data rate comes with the cost of increasing power consumption. To mitigate the effects of this, disk drive designers have reduced the disk diameter. On the other hand, the interface data rate is defined as the rate that data is transferred between the disk drive and the host. The transfer is via an interface. The interface is defined as the communication channel, which I/O requests and the data are sent from/to the host to the disk drive. There are several most common standard interfaces for disk drives today. For example, the parallel and serial versions of ATA in the personal computer realm, and parallel and serial versions of SCSI and the serial Fibre Channel in the server realm. Though ATA disks are usually used in personal computers, ATA disks are also deployed as RAID disk systems in some server applications to lower the overall cost. 3.4. The Physical Layer 3.4.1.Principles of Rotating Storage Devices All rotating storage devices, with different recording methods and media, are based on common features and principles, including platter, read and write head concepts. Those rotating storage devices have a number of platters which hold recording material on their surfaces. The storage devices also have heads, which are transducers for detecting, extracting, and converting the signal on the recorded media into electrical signals and vice versa. The detailed functionality of the heads differs due to the different types of storage mechanisms and media. A read head is used to detect and retrieve the recorded data. Therefore, only the read head is required for read-only types of storage devices, such as a 66 DVD-ROM. On the other hand, recordable devices, such as magnetic disks and DVD-RW, require a write head in addition to record the data to the recording media. The location of a specific block of data on a disk can be specified by radial and the angular coordinates. A disk drive with many platters also adds another dimension to the data coordinates, which is the surface number or the head number. To access data, the head must be positioned at the data location intended to be accessed. The mechanical parts on which the head is mounted move the head to the destination radial position. This process of positioning the head at a particular radial coordinate is called ?seek?. The seek process ends when the head reaches the destination radial position and maintains position at the center of the track. Then, the disk is rotated to bring the destination data location under the head. The common approach is to rotate the disk with constant rotational speed rather than spin the disk from stationary state when needed. 
The reason for this is rotating the disk from a stationary state to a constant rotational speed takes a long time and much energy because the process involves moving mechanical parts. Therefore, rotating the disk with constant speed does not harm the performance as much as having the disk rotate starting from a stationary state. Additionally, since a disk request accesses many sequential bits at a time, rotating the disk at a constant speed will continuously bring many data bits under the head. Separate electronics participate in rotating storage devices; specific electronics control the rotation of the disks and the others control the servo mechanism of the head radial positioning. The servo directly affects the seek time, which is an important factor in performance. Additionally, rotating storage devices have electronics to perform as the interface between the storage device and the host system. As discussed previously, the storage devices can have multiple platters, which are mounted on the same motor spindle to increase 67 capacity. A disk drive with many platters also adds another dimension to the data coordinates, which is the surface number or the head number. Each surface usually has one dedicated head for it, so the positioning in this dimension is only a process of electronically switching to the right head on the destination surface. 3.4.2.Magnetic Recording Magnetic recording is founded on materials that can be permanently magnetized, magnetic fields, and the interaction between them. Permanently magnetizable materials are called ferromagnetic materials and they are often used as the storage media for recording since they can provide non-volatility of magnetization. To record data onto the magnetized storage media, external magnetic fields are applied to induce magnetism in ferromagnetic materials in the specified direction. To detect and retrieve the recorded data, the magnetic fields of magnetized ferromagnetic material must be detected. Ferromagnetic materials can be classified according to their magnetic behavior exhibited in hysteresis loops, into a hard magnetic material and a soft magnetic material. A hard or permanent magnetic material has high magnetic coercivity and high remanence, which is suitable for magnetic recording media. In contrast, a soft magnetic material is a material with low magnetic coercivity and low remanence, which is suitable for magnetic recording head. The transitions in magnetic orientation between adjacent magnetized grains on the magnetic recording media are used to represent binary data. By convention, the presence of a magnetic field reversal represents a digital 1 and the absence of a field reversal represents 68 a 0. This data representation has been used conventionally since the beginning of magnetic recording. 3.4.2.1. Writing Writing is the process of recording magnetic transition patterns onto a recording medium. Theoretically, the process of writing requires applying sufficiently strong magnetic field to the designated magnetized grains to induce saturation magnetization onto the media. This process changes the transitions in magnetic orientation between adjacent magnetized grains. In other words, a magnetic recording of digital data storage employs saturated recording. Since the data is either a digital 1 or a 0, only the polarity of magnetization is required to determine the digital value. Therefore, to obtain the strongest signal possible, the recording medium is magnetized to saturation. 
As a result, saturation magnetization maximizes the signal-to-noise ratio in the magnetic recording media. The write head and the write channel electronics perform the signal conversion that encodes the user's data into the appropriate magnetic fields to be applied to the magnetic grains on the recording media.

3.4.2.2. Reading
Reading is the process of detecting and retrieving the data located on the media by determining the magnetic pattern recorded in the magnetic grains. The different transitions in adjacent magnetic grains represent a 1 or a 0. Therefore, the magnetic pattern represents a set of sequential data on the media. The magnetic pattern is detected by sensing the transitions in adjacent magnetic grains.
The read head is the transducer that detects magnetic field transitions. The read head converts the magnetic field transitions to electrical signals that can be processed and interpreted by the drive's electronics. The details of the detection and conversion mechanism vary with the type of read head.

3.4.3. Mechanical and Magnetic Components
3.4.3.1. Disks
The recording medium for hard disk drives is basically a very thin layer of magnetically hard material on a rigid circular substrate. Some of the required characteristics of recording media include:
• A thin substrate to use less space
• A lightweight substrate to use less power while spinning
• High rigidity for low mechanical resonance and distortion under high rotational speeds
• A flat and smooth surface to allow the head to fly very low without making contact
• High coercivity (Hc) for stable magnetic recording
• High remanence (Mr) for a good signal-to-noise ratio
• A square hysteresis loop for sharp transitions between adjacent magnetized regions
Magnetic material is composed of numerous grains of magnetic domains, whose size directly affects the material's magnetic properties and the media transition noise: more grains are required at a transition boundary to reduce noise, which can be accomplished by reducing the grain size. However, decreasing the grain size for higher tpi causes the grains to become magnetically unstable.

3.4.3.2. Substrates
Recently, glass has been used as a substrate material in disk drives. Though brittle, glass has become widely used due to reduced disk diameters. Glass can be polished to produce a surface finish with a high level of smoothness. This property makes glass more attractive than aluminum, even though the cost of glass is higher. Additionally, glass is more attractive because it is hard and has better durability against head-contact damage. Therefore, it is a better solution for mobile applications. With higher tensile strength, glass can be manufactured in thinner and lighter forms.

3.4.3.3. Magnetic Layer
Originally, particulate media were used as the magnetic layer. However, thin film media are now widely used, and the originally-used particulate media have become obsolete. Thin film media have a layer of magnetic metal deposited directly onto the substrate and bound there. Unlike particulate media, thin film media eliminate the need to use polymers to bind the magnetic layer to the substrate. Therefore, the magnetic material in thin film media is not diluted by the nonmagnetic binder, i.e., polymers. This allows for a thinner layer of magnetic material and results in shorter magnetic transitions between adjacent magnetic grains. With narrower transitions, thin film media can provide higher areal density.
Thin film media can be produced by a sputtering method. In a sputtering method, a low pressure gas, such as argon, is accelerated with a high voltage towards the surface of the target magnetic 71 material. Then, the surface atoms of the target magnetic material are displaced by the energized argon ions accelerating toward them. As a result, the magnetic material atoms are ejected to bond with the substrate, resulting in the thin film of magnetic material on the substrate. 3.4.3.4. Disk Structure Besides the substrate and magnetic layer, a magnetic disk is composed of multiple layers of a variety of materials. Starting from the substrate, there are layers of nickel-phosphorus, chromium, magnetic media, wear-resistant overcoat, and lubricant. The first layer above the substrate is nickel-phosphorus. Nickel-phosphorus provides a much harder surface protecting the disk from damage and can be polished to a very fine surface finish. Above that is a layer of chromium. Chromium provides a basic microstructure foundation for the magnetic layer material to be deposited on. Next is the magnetic layer, for which a cobalt alloy is normally used. A layer of hard-but-not-brittle overcoat is atop the magnetic layer to protect it from wear and tear and other damages. Hard carbon is usually used as the overcoat since it satisfies the requirements of being a chemically inert material while still having the ability to bond well with the magnetic layer. Hard carbon is sputtered onto the disk to produce the overcoat layer protecting the magnetic layer of the disk. Finally, a layer of lubricant is placed on the top to prevent possible damages from the contact between the head and the disk. 72 3.4.3.5. Spindle Motor Modern disk drives use compact and efficient DC motors, such as the three-phase, eight- pole motor. The motor drives the spindle directly, whose stator is fixed to the bottom of the disk drive case. A part of the outer sleeve of the motor, the rotor, establishes the spindle. Disk platters are mounted on to the spindle. Speed of the spindle is electronically controlled by a servo system. 3.4.3.6. Bearings To maintain smooth and quiet disk drive operations, disk drives are equipped with bearings to separate the rotating parts from the stationary parts. The function of bearings is to support and separate the spindle hub from the stator shaft. Originally, disk drives were equipped with metal ball bearings. Recently, fluid dynamic bearings (FDBs) have increasingly been used as a replacement for ball bearings. FDBs produce quiet disk drives by replacing the ball bearings with a thin layer of lubrication oil. The oil is high in viscosity, and it resides in a specifically-manufactured container. FBD contains no ball bearings to cause contact since it utilizes the liquid movement of a lubricant film. As a result, the FDB spindle motors can produce a quieter and smoother solution, compared to ball bearings, due to softer impacts between the parts, such as part contact, wobble, and shock. Recently, FDB motor costs have been decreasing due to mass production and improvement in manufacturing techniques. It will eventually cost less than a ball bearing motor due to relative scarcity of parts required. 73 3.4.3.7. Heads Heads are the most important element of a disk drive. There are two types of heads in disk drives, which are read and write. The write head generates a magnetic field to change the magnetization direction on the magnetic media. 
On the other hand, the read head detects the magnetic recording pattern and retrieves the data from the media. Write Heads A basic inductive write head consists of a ring core and a coil of wire wrapped around the core. Ferrite is normally used as a magnetically soft material for a ring core. There is a short gap in the core to expose magnetic flux to the media. The head moves very closely above the magnetic recording media and the core gap is positioned just next to the media. Similar to electromagnets, which work by applying current through the coil, the current induces a magnetic field inside the core, whose direction depends on the direction of the applied current. Therefore, the head is called the inductive head. At the core gap, the two ends forming the gap establish two opposite magnetic poles. The magnetic flux moves outside the core from this gap. The magnetic flux from the gap magnetizes the media, which is a magnetically hard material. The result of magnetization is according to the material hysteresis loop characteristics or the media and the amount of magnetic flux applied from the core gap. The distance between the write head and the media has to be shorter than the write bubble, which is defined as the space which the magnetic field of the write head is strong enough for writing. After the write process, the resulting magnetization remains in the magnetically hard media, due to its non-volatile nature. Digital magnetic recording uses saturation magnetization to record data. To change the orientation of the magnetic field in the media, controlled current is applied via the head. The 74 direction and magnitude of the magnetic flux leaking to the media is the result of different directions and the magnitude of the applied current. Regardless of the previous orientation, the magnitude of the exposed magnetic flux has to be sufficient to completely change the orientation of the magnetic field of the grains next to the head to the desired orientation. For instance, to change the orientation of the media grain to the opposite, the current is applied to the head in the opposite direction of the grain?s magnetic field orientation, and the magnitude of the magnetic flux from the core is sufficient to reverse the direction. As a result, saturated magnetization in the opposite orientation is established in the magnetic media at the designated grains. The process creates a transition in the media where magnetization changes to the opposite orientation. To create immediate transition in the process, the media must have a sharp hysteresis loop. This sharp transition in the media translates into closer grains and higher in linear recording density (bpi). Another important factor to increase the areal density is the size of the head. With better technology, a head dimension can be precisely manufactured. Lithography is used to define the features of the head, and thin film process is used to construct all its components, including the core, the gap and the copper windings. While the dimension of the head is much more compact, the functionality of the head remains exactly the same as an inductive head. Read Heads The read head detects and retrieves the magnetic transitions that are recorded in the media in order to read the data located in magnetic recording. Unlike the inductive read head, magnetoresistive (MR) heads sense the flux directly from the media, not the changes in the flux. As a result, the MR sensor generates high peak differential voltage signals during 75 the transition phase. 
The MR sensors are shielded at the front and back to prevent the MR sensor from detecting unwanted magnetic fields from adjacent transitions. Therefore, it detects only the magnetic field from the transition right beneath it. The main reason for the rapid increase in areal density after the MR head was introduced in 1991 is that the MR head can generate a signal many times larger than an inductive read head can. An MR head has low inductance; therefore, the MR head is applicable to high-frequency systems. It is also independent of the rotational speed of the disk, since it detects magnetic flux rather than the rate of change of flux, which is how an inductive read head operates. The last feature is beneficial to small-diameter drives with low RPM.
In the early 1990s, a sensor with giant magnetoresistance (GMR) was invented. GMR is a composite sensor with multiple thin layers of different ferromagnetic and antiferromagnetic materials. GMR sensors are manufactured with a molecular beam epitaxy process.

Read/Write Heads
The term head, or read/write head, refers to the transducers for both reading and writing, because earlier disk drive models had only one inductive head used for both reading and writing. However, to improve performance, modern disk drives have separate read and write heads. The reason is that each type of head can be customized to better suit its specific purpose. Currently, the head is a composition of an inductive write transducer and an MR (or GMR) read transducer placed on the same arm. The write head is placed behind the read head, but both move in tandem. Since the read head is not required to be located at the exact center of the track, a technique called "write wide, read narrow" is applied to today's disk heads. The technique is based on a narrower MR read sensor but a wider write pole tip. Therefore, the write head writes a wider track than the width of the read head. This technique allows tracks to be packed closer together and results in increased tpi.

3.4.4. Electronics
The physical components of a disk drive that directly relate to the total system performance are the electronics that control its operation. Most of the disk drive electronics are in the form of a set of IC chips on a small circuit board located outside the sealed HDA chassis. Only the arm electronics module is located inside.

3.4.4.1. Controller
The controller is composed of a variety of electronic components, including a processor, ROM, memory controller, host interface, data formatter, and ECC & CRC encoder/decoder. It is considered the central intelligence of the disk drive. Its function is either to perform a disk drive task itself, or to control other components to accomplish the task. The major functions of the controller include:
• Receive and schedule commands (I/O requests) from the user and report completion of a command back to the user.
• Manage the built-in disk cache.
• Control the HDA operations, including seek and data access.
• Manage policies including error recovery, fault, and power.
• Start up and shut down the disk drive.

3.4.4.2. Memory
The functions of memory in the disk drive include:
1. The controller uses a part of the memory as a scratch pad memory.
2. The interface uses a part of memory for speed-matching between the media data transfer rate and the data rate of the interface of the disk drive.
3. A part of memory is used to perform caching for fast data access (a simple sketch of this idea follows).
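The caching role in item 3 above amounts to a small read cache inside the drive. The following is a minimal, hypothetical LRU sketch of that idea; it is not SYSim's model or any vendor's firmware, which also handles segment-based caching, prefetching, and write buffering as discussed elsewhere in this dissertation.

```python
from collections import OrderedDict

# Minimal sketch of a drive-resident read cache indexed by logical block address.
class DriveReadCache:
    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()          # LBA -> 512-byte block of data

    def read(self, lba: int, read_from_media):
        if lba in self.blocks:               # cache hit: no mechanical access needed
            self.blocks.move_to_end(lba)
            return self.blocks[lba]
        data = read_from_media(lba)          # cache miss: pay seek + rotational latency
        self.blocks[lba] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data
```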
3.4.4.3. Recording Channel
The recording channel receives control signals from the controller and then generates the appropriate signals for writing or reading the data from the media. It determines whether the read or the write head and its circuitry should be activated. For writes, the recording channel applies the appropriate direction and magnitude of the voltage to the write head according to the control signal from the controller. Additionally, the recording channel interprets the data retrieved from the media by the read head.
• Write Channel
The write channel is a set of electronics, which is a part of the recording channel. It translates the user data from their digital format into the required magnitude and direction of currents to be sent to the write head. Then, the write head manipulates the magnetic transitions in the media according to these currents.
• Read Channel
The read channel is a set of electronics, which is another part of the recording channel. It translates the magnetic signals retrieved by the read head back into the digital data originally recorded. The read channel circuit also performs error recovery, using a modulation code decoder to recover the original user data and the ECC/CRC information.

3.4.4.4. Motor Controls
The motor controls drive the two disk drive motors: the spindle motor and the actuator voice coil motor (VCM). These motors differ in both the form of motion and the mechanical functions. The spindle motor spins the disk platters at a constant rotational speed. On the other hand, the actuator VCM accurately moves the actuator to position the head at the destination radial coordinate. Both motors can be controlled by changing either the amplitude or the direction of the current to the motor.
The actuator servo control system requires special servo patterns, which are written on the disks at manufacturing time. These servo patterns provide accurate positioning information on the radial location of a surface. The actuator servo control system uses closed-loop feedback control. The servo patterns are detected by the read channel and then forwarded to the decoding circuit. The decoding circuit decodes the signals into positioning information, which is fed to the servo control logic. The servo control logic compares the positioning information with the request's destination track position, sent by the disk drive's controller. The difference between the positioning information from the servo patterns and the request's track position is calculated and used to generate a corrective action signal fed back to the VCM driver to adjust the position of the head. In today's disk drives, the actuator servo control includes a digital signal processor (DSP), an analog-to-digital converter (ADC), and a digital-to-analog converter (DAC). The DSP performs the main functions of the control. The ADC converts the analog servo signals to digital form, while the DAC converts the digital correction signal into an analog input for the power amplifier of the VCM driver.
For the spindle motor, the EMF voltage from the motor coil, generated by the spinning motor, is monitored to measure the motor's rotational speed in today's disk drives. Then, a closed-loop control adjusts the rotational speed accordingly to keep it spinning at a constant rate.
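The closed-loop behavior described above can be summarized in a few lines; the simple proportional correction used here is our own illustration, not what a real DSP-based servo controller implements.

```python
# Highly simplified view of one iteration of the actuator servo loop.
# A real controller runs on a DSP with ADC/DAC stages and a far more
# sophisticated control law; this proportional step is for illustration only.
def servo_step(measured_track: float, destination_track: float,
               gain: float = 0.5) -> float:
    """Return a corrective VCM command proportional to the position error."""
    position_error = destination_track - measured_track
    return gain * position_error

# Example: decoded servo patterns report track 1000.0 while seeking to track
# 1024 -> a positive command moves the actuator toward the destination.
print(servo_step(1000.0, 1024.0))  # 12.0
```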
3.5. The Data Layer
Figure 3.1: Basic data organization of a disk drive (platters, cylinders, tracks, and sectors).

3.5.1. Disk Blocks or Sectors
Today, all disk drives use fixed-size block formatting. The blocks are called sectors. Most disk drives, if not all, specify a sector size of 512 bytes for data. Each sector is separated from adjacent sectors by physical gaps, which hold no recorded data. The physical gap provides the read/write head with extra time and buffering while accessing an individual sector.
Figure 3.2 illustrates the basic fields in a sector. A sector is composed of a preamble field, a data address mark field, a data field, ECC, CRC, and a flush pad field. The first field of a sector is the preamble field, or sync field. The preamble field is approximately 10 bytes long. It defines the frequency and amplitude used to write the sector. The read channel uses this information to adjust its phase-locked loop (PLL) and its automatic gain control (AGC) circuits. Located next to the preamble field is the data sync or data address mark. The data address mark contains a few bytes which are used to separate the preamble and the data. The third field is the data field. Currently, the size of the data field is usually set to 512 bytes. However, people in the disk drive community are attempting to push the data field to 4 KBytes for compatibility and performance reasons. For data recovery, error correcting codes (ECC) are attached to a sector. Finally, a cyclic redundancy checksum (CRC) is also included as a part of a sector to further ensure data integrity. The size of the CRC is product dependent.
ECC and CRC are included to increase reliability. The difference between them is that the CRC of a sector is calculated over the sector's data, while the ECC is computed over a set of individual sector information that includes the sector's data, the CRC, and the implied logical block address (LBA) of the sector. The hard error rate in a sector with ECC is reduced to 1 in 10^14 for typical desktop drives and 1 in 10^15 for typical server class drives.
The last field in a sector is padding. The padding contains a few bytes to facilitate the process of flushing data through the read channel and other circuits. The padding field helps maintain the clock while the data is being flushed.
As areal density increases, data becomes more sensitive to errors due to a lower signal-to-noise ratio and the fact that the same size of physical damage can affect more bits. More redundancy bits per sector in the ECC are required to maintain data integrity. Therefore, the disk drive industry is under pressure to increase the sector size.
Figure 3.2: Components of a sector (preamble, data address mark, data, ECC, CRC, and flush pad).

3.5.2. Tracks
A platter of a magnetic hard disk drive is composed of a group of concentric circles. This characteristic has been a standard feature of magnetic hard disk drives from the start. Each circle is commonly known as a track. A location on each track is assigned as the beginning of the track. Since a track is a circle, the beginning of the track is also the end of the track. A track contains many sectors, which are evenly spaced along the track and numbered accordingly, starting with one. Each track is also numbered, with the outermost diameter (OD) being the first track (track 0) and the innermost diameter (ID) being the last track.

3.5.3. Cylinders
A disk drive has multiple surfaces with multiple tracks on each surface. Tracks with the same track number, one on each surface, form a cylinder. The disk drive generally reserves a number of outermost cylinders for internal uses, such as disk physical address mapping information.
This information is not accessible by end-users. Those reserved cylinders are sometimes called the negative cylinders since cylinder 0 is the first user data cylinder. 3.5.4.Address Mapping To be accessed by the host system, an individual sector in a disk drive is uniquely identified by an addressing scheme. The host system informs the disk drive which sector to be accessed by specifying the host sector address. Inside the disk drive, the disk drive maps the host sector address to disk physical address and then the correct destination sector is 83 accessed. Therefore, there are two addressing schemes involved in disk drive address mapping: internal addressing and external addressing. 3.5.5.Internal Addressing There are two addressing schemes generally used internally in a disk drive to map a physical address to a physical sector on a disk drive. First, a disk drive identifies a physical sector by using a physical block address (PBA) or absolute block address (ABA). The PBA or ABA is ranged between 0 and N-1 where N is the total number of sectors in the disk drive. Another scheme commonly used in disk drives is CHS or cylinder-head-sector addressing. CHS identifies a physical sector with 3 numbers, which include a cylinder number, a head number or surface number, and a sector number on the track. 3.5.6.External Addressing The host commonly uses an addressing scheme called LBA or Logical Block Address to identify a sector in the host aspect. Unlike CHS consisting of three numbers, LBA simply specifies the sector with only one number. Even though the previously-introduced logical addressing is limited by the number of sector, track, and head, LBA is not. Additionally, LBA allows the ATA disk drive to recognize the address for a capacity over 8.4 GB. 3.5.7.Logical Address to Physical Location Mapping When a request is sent from the host to the disk drive, the address of the request is converted from logical address in terms of LBA to physical address in terms of PBA. Then, 84 the request PBA is mapped to the physical CHS address to identify the destination head, track, and sector. When the disk head reaches the end of the currently accessed track, it moves to the next track. Two approaches are generally used to decide which track the head should move to. One is to move to the next track on the same platter, and the other is to move to the next track located on different platters but the same cylinder. 3.5.7.1. Cylinder Mode Cylinder mode moves the head to the next track on different platters but on the same cylinder. The benefit of this approach is it eliminates the need to reposition the actuator causing the head to seek. However, as the tpi rapidly grows, the actuator repositioning process to move to the next track on the same surface may not take as long as the process to move to the next surface. Therefore, with today?s ever-increasing track density, it is likely that the actuator can reposition to the next track on the same surface more easily than if it switch to the next surface. Therefore, the other approach, Serpentine Format is more commonly used than Cylinder Mode in modern disk drives. 3.5.7.2. Serpentine Format As we discussed earlier, today?s disk drive have the ability to move the head to the next track on the same surface more easily than to move the head to the next surface. This is due to the rapid increase in tpi in the disk drive. Therefore, the next track on the same surface is physically located closer than the next track on the next surface. 
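As a brief digression before the head-switching discussion continues, the two-step address conversion described above (LBA to PBA, then PBA to CHS) can be sketched for an idealized drive with a fixed number of sectors per track, no zoning, and no defect remapping; real drives consult zone tables and defect lists, so this is an illustration only.

```python
# Idealized LBA <-> CHS mapping for a drive with a fixed sectors-per-track
# geometry (no ZBR, no defect remapping). Sectors are numbered from one,
# matching the track layout described earlier.
def lba_to_chs(lba: int, heads: int, sectors_per_track: int):
    cylinder, remainder = divmod(lba, heads * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1

def chs_to_lba(cylinder: int, head: int, sector: int,
               heads: int, sectors_per_track: int) -> int:
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)

print(lba_to_chs(100_000, heads=4, sectors_per_track=63))     # (396, 3, 20)
print(chs_to_lba(396, 3, 20, heads=4, sectors_per_track=63))  # 100000
```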
Additionally, the exact location of the destination sector is also unknown after switching the head to the next surface. The head needs to read the servo information to locate the current position before 85 the destination sector is located. To prevent the problems, serpentine format is used to move the head to the next track on the same surface. Compared with cylinder mode, serpentine formatting performs better in random disk access streams. However, cylinder mode outperforms serpentine formatting in the disk access streams with high locality. In those access streams, the average seek time in serpentine formatting is increased. To assure the performance in both types of access streams, a combination approach, called banded serpentine formatting, is introduced. The approach limits the serpentine formatting within a group of tracks, called a band, and the cylinder mode is used in the head movement between the bands. 3.5.7.3. Skewing If all tracks on the same surface have their beginning position at the same angular position, when the head moves from the end of one track to the beginning of the next track, the beginning of the next track would pass the head already. Therefore, the head has to wait for a full rotation to access the beginning of the track. To prevent the performance loss due to the full rotation wait for the beginning of the track, skewing is introducing. In skewing, the beginning of each track is placed in different locations, depending on the head switching time. Therefore, when the head switches to the next track, the beginning of the next track comes after the head finishes switching. 3.5.8.Zoned Bit Recording If all tracks on a disk surface have the same number of sectors, the surface doesn?t utilize all the capacity it can provide. The reason is only the innermost track would utilize the 86 highest bpi. On the other hand, the other tracks would contain the same amount of data as the innermost track, even though they occupy more area on the surface. Therefore, most of the disk surface area is under-utilized. To utilize the surface area as much as possible, Zoned Bit Recording (ZBR) is introduced. ZBR divides tracks into groups, called zones. In each zone, the tracks have the same number of sectors. The tracks in the outer zone have more sectors than the tracks in the inner zone. Today?s drives tend to have 3 or more zones. Therefore, the original data formatting can be considered as a special case of ZBR where the number of zones is one. In ZBR, since the number of sectors per track is changed from zone to zone, the disk drive has to deal with variable data rates. 3.5.8.1. Variable Data Rate Variable Data Rate occurs when the disk applies ZBR, and it is also called CAV (constant angular velocity) recording. With constant disk rotational speed, ZBR varies the number of sectors per zone; therefore, the data rate of the disk varies from zone to zone. The read/write electronics take care of variable data rate. This approach has advantages, including: 1. More data is stored at outer diameter (OD) tracks compared with the amount of data stored at inner diameter (ID) tracks. 2. The performance of the OD tracks is better than the performance on the ID tracks because the data rate on the OD tracks is higher. Therefore, most accesses occur at the OD with higher data rate. As a result, the overall performance of the disk is improved. 87 3.5.9.Servo The function of the servo system in a disk drive is to control the movement of the read/write head. 
It has to maintain accuracy in the head movement in both the movement between the track and along the track. The servo system?s function includes: 1. Controlling the movement of the head actuator from the current track to the destination track including the movement on the same surface and the movement to another surface (switching head). These movements are collectively known as seek. One parameter that affects the performance of the seek operation is the seek distance. The seek distance is the distance in terms of number of tracks that the head has to move from the current track to the destination track. 2. Maintaining the correct position of the read/write head along the track at the center of the track. When the head is accessing the data on a track, the servo system continues making corrective adjustments to maintain the position of the head on the track. This is to prevent the head from drifting off the center of the track. This process is called track following. With ever-increasing tpi in modern disk drives, high accuracy in servo system is required. The servo system in modern disk drives is generally implemented with closed-loop control to accomplish the high accuracy in both of the previously mentioned functions. To provide the accurate position information of the sector on a disk to the read head, the servo information is written onto the disk surface directly. There are two approaches to place the servo information, which are dedicated servo and embedded servo. 88 3.5.9.1. Dedicated Servo In the dedicated servo approach, the servo information is written on one of the surfaces in a multiple-platter disk drive. The servo information is usually written onto the middle surface in the disk stack at the manufacturing time. The head on that surface is only a read head and is called a dedicated servo head. Since all the heads in a multiple-platter disk drive are connected to the same actuator and move together, the dedicated servo head plays the master role for all heads. However, since modern disk drives tend to have only a few platters due to power consumption, dedicating one surface for servo information is considered too costly. Additionally, temperature can cause different physical changes in arms and disks on different platters. Therefore, the servo information on the dedicated surface may be displaced with respect to other surfaces. This problem can be solved by periodically calibrate the servo information with respect to other surfaces due to the temperature changes. This calibration process causes significant performance degradation, and the process is more complicated as the tpi increases. As a result, the dedicated servo has been obsolete since the mid 1990?s. 3.5.9.2. Embedded Servo Unlike the dedicated servo, the servo information in embedded servo is written with the data on the surface. Therefore, there is no dedicated surface or dedicated head in the embedded servo approach. The head on each surface performs both read/write data on the surface and reads the servo information on the surface. The servo information in the embedded servo is in the form of wedges on a disk surface. The wedge area is specifically reserved for servo information, written at disk manufacturing time. The wedges are evenly 89 spaced around the disk. The embedded servo also provides two types of servo information, which are referred to as servo bursts and track id. The servo bursts are provided to prevent the head drifting off the center of the track. 
On the other hand, the track id is the servo information for seek operations. Since there are both a read head and a write head on every surface, care should be taken not to allow the write head to overwrite the servo information on the surface. Some problems should be mentioned in using embedded servo, including: ? The servo wedges should be placed closely to prevent the head drifting off the center of the track. ? The head requires servo information to determine the current track number and the destination track number. When seeking, the head must read the servo information as soon as possible to determine the track number. Therefore, there should be enough servo wedges on the surface, so the head does not have to wait for a long time to read the servo information to determine the track number. Both problems suggest that there should be as many servo wedges as possible on a surface. However, the area containing user data is reduced when increasing the number of servo wedges. Additionally, the benefit of having many servo wedges is dependent on the disk access pattern; meaning, having many servo wedges improves the performance in case of a random access pattern, but degrades the performance in case of sequential access pattern. Modern disk drives typically have approximately 100 to 200 servos per track, and the servo information takes up space of approximately 8% to 12% of the total disk capacity. 90 3.5.9.3. Servo ID and Seek In each servo sector, there are two servo coordinates: the radial coordinate and the angular coordinate. However, to seek, the servo system only requires the radial information to move the head to the destination track. The radial information is recorded in the form of the cylinder number, and it is repeated for all servo sectors on the same cylinder. The cylinder number is implemented as a Gray code, in which any two adjacent cylinder numbers will differ by only one bit. This is to guarantee the cylinder number value to be one of the two valid adjacent values rather than a totally different, unrelated value, which would be harder to distinguish from an erroneous value 3.5.9.4. Servo Burst and Track Following As discussed earlier, the servo bursts are provided and are used to prevent the head drifting off from the center of the track. The process is conducted in two situations: at the end of seek operation and while the head is moved along the track. At the end of seek operation, the process is called settle and the time to settle is called settling time. On the other hand, the process in the second situation is called track following. Both settling and track following require servo bursts to maintain the head position at the center of the track. There are four special magnetic patterns encircling each track on a surface at a servo wedge. They are called A, B, C, and D bursts. The A burst and the B burst are placed adjacent to a track but on different sides of the track. The A burst generates a signal called VA and B burst generates a signal called VB. If the read/write head is at the center of the track, VA and VB signal amplitude are equal. On the other hand, if the head is off the center of the track, either VA or VB signal amplitude is higher than the other to indicate which 91 direction the head is off the center and how far. Then the servo control can adjust the head position accordingly to maintain the head at the center of the track. 
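Conceptually, the track-following decision from the A and B bursts reduces to comparing the two amplitudes. The toy sketch below is our own illustration of that idea, and the C and D bursts discussed next resolve the case this simple comparison cannot.

```python
# Toy position-error signal from the A/B servo bursts (illustration only).
# Equal amplitudes mean the head is on the track center; the sign of the
# difference indicates the drift direction and the magnitude roughly how far.
def ab_position_error(va: float, vb: float) -> float:
    return (va - vb) / (va + vb)     # normalized so it is amplitude-independent

print(ab_position_error(1.0, 1.0))   # 0.0  -> head on track center
print(ab_position_error(1.2, 0.8))   # 0.2  -> drifted toward the A-burst side
```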
However, the C and D bursts are required for the case in which the head is located between two adjacent tracks, which causes both the VA and VB signals to go flat. Likewise, the signal generated from the C bursts is called VC and the one generated from the D bursts is called VD. The C and D bursts are placed immediately after the A and B bursts, but the C and D bursts' alignments are shifted from A and B's by half of the servo burst width. Therefore, if the head is at the flat part of the servo bursts, the amplitudes of VC and VD are used to identify the location of the head. As a result, the servo control can use these signals to adjust the head position accordingly to maintain the head at the center of the track.

3.5.9.5. Components of a Servo
Each servo sector is composed of a preamble field, an address mark field, a servo index field, the track id, and the four servo bursts. Servo sectors are also separated from user data by a gap. The first field in a servo sector is the preamble. Like the preamble in a data sector, the preamble of a servo sector is used to synchronize the read channel's PLL with the servo's clock. Next is the servo address mark or the servo sync mark. The address mark informs the head that the servo information is next. The third field, the servo index field, provides the coordinates of the servo. The fourth field, the track id field, provides the track number in the form of a Gray code. Finally, the four servo bursts provide the signals to maintain the head at the center of the track as explained in the previous section.
Figure 3.3: Components of a servo sector (preamble, address mark, servo index, track ID, and servo bursts).

3.5.9.6. ZBR and Embedded Servo
Embedded servo makes sector placement more difficult on a disk with zoned bit recording (ZBR). Without ZBR, the embedded servo wedges can simply be placed between sectors, spaced evenly around the surface. However, ZBR causes sectors with the same sector number but on different tracks not to align at the same angular position. Therefore, to place the embedded servo wedges on a disk surface, sectors at the wedges are split. Due to splitting data sectors, two additional overheads are introduced for embedded servo: extra fields and extra gaps. First, each split sector requires its own preamble field and address mark field. Second, extra gaps are also required both before and after the split part. Those gaps are needed to switch from the write head to the read head when the head enters the servo sector during a write (writing the first split part of the sector and then reading the adjacent servo sector). Also, the gaps are needed to switch from the read head back to the write head when the head finishes reading the servo sector and then continues to write the sector. These extra gaps and fields can be significant depending on the number of split data sectors.

3.5.10. Sector ID and No-ID Formatting
The information on an individual data sector was originally stored in a field referred to as a header or sector id. The header was placed just before the corresponding data sector. It contained (1) the sector's physical address (in CHS format), (2) a split flag indicating whether the sector is split by a servo sector, (3) a defective flag, and (4) the location to which the data is moved if the sector is defective. However, modern disk drives generally adopt the no-ID format. The no-ID format or headerless format eliminates the header physically placed along with the data.
It utilizes the servo information to provide the sector?s physical address, and stores other information at a no-ID table. The no-ID table is stored at the protected area of the disk, and it is loaded into the disk controller?s memory at startup. The no-ID format advantages include: ? Each track can contain more user data since the ID fields and their corresponding overheads are eliminated. ? tpi is increased because tracks can be placed closer. The ID-fields require wider tracks. ? Reliability is improved because there are no ID-fields which may get corrupted. ? Performance is improved as the total results because of the increase in tpi and track capacity. 3.5.11.Defect Management Defects can occur due to multiple causes at the time of manufacturing or during daily usages. Different schemes are used to relocate the data due to defects depending on when the defects occur. 94 Relocation Schemes utilize logical block address (LBA) and allocate a different sector to that defective LBA. Mapping is done by the disk drive controller. The host does not have the knowledge of the LBA to physical disk address mapping or the defected LBA. As a result, the host believes that all of a disk drive?s LBAs are usable for data storage. There are basically two common methods for re-allocating a new sector to replace one that is defective: 3.5.11.1.Relocation Schemes Sector Slipping Sector Slipping makes use of the very next good sector; if the sector after a defective sector is not defective, then the LBA of the defective sector is assigned to the following sector. In general, sector slipping does not directly degrade the performance. The reason is there is no interruption to the flow of sequential accessing. Only marginal extra time is required to skip the defective sector and access the immediate subsequent sector sequentially. Therefore, sector slipping is a preferable method for relocating defective sectors in case of minimal data storing on the disk. Otherwise, sector slipping has a disadvantage if data has already been stored on a disk. It must slip all data sectors after the defective sector, which requires the disk to read and re-write all sectors to their new assigned locations. Sector Sparing Instead of slipping all data to the next good location, sector sparing allocates a number of spare sectors in the disk drive for defective sectors. When a defect occurs, a defective sector can be allocated to a spare sector. For better performance, it is preferable to scatter 95 these spare sectors around the disk drive. Therefore, a defective sector can be relocated to the closest spare sector. However, the performance might be degraded if the spare sector location causes the drive to move the mechanical parts resulting in significant delay. A number of schemes are also used in modern disk drives. They are mostly based on these two basic schemes. 3.5.11.2.Types of Defects There are two types of defects as mentioned earlier: 1. the defects that occur at the manufacturing time, called primary defects, and 2. the defects that occur later after leaving the factory, during daily usages, which is called grown defects. These two types of defects are handled differently. Primary Defects. Before a disk drive leaves the factory, every sector of the disk drive has to be scanned for defective sectors. If found, those defects are called primary defects. The process of scanning sectors for defects is accomplished by reading and writing each sector in the disk drive. 
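A minimal sketch of the two relocation ideas described above is shown below; it is deliberately simplified and hypothetical, since real drives store the results in the P-List and G-List discussed next rather than recomputing them per access.

```python
# Toy illustration of the two defect-relocation schemes (not a drive's firmware).
def slipped_physical_block(lba: int, primary_defects: set[int]) -> int:
    """Sector slipping: skip every defective physical block at or below the target."""
    pba = lba
    for defect in sorted(primary_defects):
        if defect <= pba:
            pba += 1                      # every block after the defect slips by one
    return pba

def spared_physical_block(pba: int, grown_defects: dict[int, int]) -> int:
    """Sector sparing: redirect a defective block to its assigned spare sector."""
    return grown_defects.get(pba, pba)

print(slipped_physical_block(10, {3, 7}))          # 12: two defects were skipped
print(spared_physical_block(12, {12: 5_000_000}))  # 5000000: remapped to a spare
```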
Usually, since there is no data stored on the disk drive to be slipped, the defective sectors are relocated by the sector slipping method. Finally, a list of defective sectors in the form of ABAs (absolute block addresses) are generated and stored in the P-List or Primary List. These ABAs in P-List are skipped during the disk drive maps LBA to ABA. Grown Defects. In contrast, grown defects or non-recoverable errors are defined as the defects that develop after a disk drive has left the factory. Usually, the defective sectors are discovered 96 during the disk drive?s daily usages. The grown defects, unlike primary defects, are handled by the sector sparing method because user data has already been stored in the disk drive. A list of grown defects is called the G-List. Each entry in the G-List is a tuple of the defective sector address and the corresponding relocated sector address. 3.6. File System Caching In the operating system point of view, the most important factor in I/O performance is not the speed of the disk, or how efficiently it is used, but whether it is used. To hide the I/O latency, caching, which is widely used in many levels of memory hierarchy to hide the latency of the lower memory level, is also applied in UNIX-based operating systems. This type of caching under the control of the operating system is called file system caching. Originally, the technique used to be called file caching and disk caching, depending on whether the logical disk address or physical disk address is used. File caching uses logical address, while disk caching uses physical address. However, researchers simply use the term file system cache in general for such techniques, and usually use the term disk cache to refer to a memory physically built into a disk drive. In the operating system, file system caching uses main memory as a cache for disk data to improve I/O performance. Since main memory is much faster than disks, file caches significantly improve performance. Therefore, a file system cache is generally implemented in every UNIX-based operating system. However, different systems have very different I/O policies, and the performance of some I/O policies can differ by many orders of magnitude. These policies can be even more important than the underlying hardware. 97 I/O performance is limited to the interaction between the disk and the operating system. The hardware may determine the potential performance of the I/O, but the operating system determines how much potential is delivered. In particular, the file cache is critical to I/O performance for UNIX systems. File caching policy is important since file caches in high- end computers with better hardware can perform comparably to file caches in workstations with better file caching policies. Even though optimizations in memory systems can improve disk read performance, the operating system policy on writes can improve the file cache performance by many orders of magnitude. Like processor caches, write back and write through are applied to file system caches on writes as well. The operating systems community uses the term asynchronous writes to refer to writes that allow the processor to continue after transferring data to a write buffer. This approach improves the performance if writes occur infrequently. Like writer-buffering in other levels of the memory, if writes are too frequent, then the processor may eventually stall until the write buffer is flushed. 
This situation limits the system speed to the speed of the I/O, which is the slowest level of the memory hierarchy. Note that a write buffer does not directly reduce the number of writes to the next level. Writes can be merged or overwritten to reduce the number of writes when they are waiting in the queue. However, a write buffer only allows the processor to continue while I/O is in progress if the write buffer is not full. The effectiveness of caches for writes also depends on the policy of flushing dirty data to the disk, i.e. how often to flush and which data to be flushed. To protect against losing information in case of failures, applications will occasionally flush dirty data out of the cache in the main memory to the disk. Most UNIX operating systems have a policy of 98 periodically writing dirty data to disk. By default, a safety window is typically set for all applications to 30 seconds. We will see this behavior first hand in our experiments. 99 CHAPTER 4: RELATED WORK The contribution made in this dissertation is in two parts. First, we created a complete- system simulator, SYSim, to demonstrate the detailed interaction of a memory hierarchy in both the performance and power domains. Secondly, to study the I/O behavior during the I/O intensive phase of applications, we explored several disk enhancements and physical disk technology improvements in both isolation and combination. We studied the systems in terms of total system performance and the power/energy consumption. Therefore, this chapter consists of two parts: one is for Complete-System Simulations, and the other is the Disk Enhancements and Physical Technology Improvements. 4.1. Complete-System Simulations Simulation is the most widely accepted approach in computer architecture, as information is obtained in simulations that is impossible to obtain in a real system. Unlike an experiment in a real system, the system in a simulation is not perturbed during an experiment by an attempt to measure statistical information. Simulation can occur at various levels of abstraction. Cain et al. [3] compares various approaches of simulations as shown in Table 4.1. The Analytical models and CPI equations are fast, but they lack detail, which costs them precision. The precision of Trace-driven Simulation mainly depends on how the traces are collected. The software trace collection schemes pollute the traces due to software overhead. The hardware schemes require 100 expensive investment in proprietary hardware and are probably not feasible for multi gigahertz processors. Trace collection also records only committed instructions, which do not reflect the inaccuracies created by speculative instructions. Cain et al. conclude that the complete-system (or full-system in their words) execution-driven simulation is the most precise and accurate. Complete-system simulation means a system simulation that includes I/O, especially disk and OS effects in the simulation. Gurumurthi et al. [1] pointed out that the features of a low-power disk can also influence operating system routines such as the idle process running on the processor core. Hence, a model which includes the disk helps to characterize the processor power more accurately. 
During I/O operations, energy is consumed by the disk. Further, as the process requesting the I/O is blocked, the operating system schedules the idle process to execute, so energy is also consumed in both the processor and the memory subsystem. The cycles due to disk activity are accounted for as idle cycles in the execution profile, which can account for up to 10% of the execution cycles and up to 7% of the power in the processor and memory.

Table 4.1: Attributes of various performance modeling techniques [3].

  Analytical models
      Inputs:    cache miss rates; I/O rates
      Benefits:  flexible, fast, convenient, provide intuition
      Drawbacks: cannot model concurrency; lack of precision

  CPI equations
      Inputs:    core CPI, cache miss rates
      Benefits:  simple, intuitive, reasonably accurate
      Drawbacks: cannot model concurrency; lack of precision

  Trace-driven simulation
      Inputs:    hardware traces; software traces
      Benefits:  detailed, precise
      Drawbacks: trace collection challenges; lack of speculative effects; implementation complexity

  Execution-driven simulation
      Inputs:    programs, input sets
      Benefits:  detailed, precise, speculative paths
      Drawbacks: implementation complexity; simulation time overhead; correctness requirement; lack of OS and system effects

  Full-system, execution-driven simulation
      Inputs:    operating system, programs, input sets, disk images
      Benefits:  detailed, precise, accurate
      Drawbacks: implementation complexity; simulation time overhead; correctness requirement

Cain et al. [3] demonstrate that correct simulation of I/O behavior can significantly affect simulator accuracy. They show that the majority (50.9%) of the executed instructions in some benchmarks belong to the operating system. The operating system causes the IPC to differ from user-level simulation by as much as 20%, which translates into more than a 100% difference in energy consumption. The authors of both [2] and [3] agree that omitting operating system activity can introduce errors exceeding 100%. Chen et al. [2] reason that:
1. The operating system code has distinctive behavior, different from user code.
2. The state of the microarchitecture after an OS call is different from the state of the microarchitecture before the call was made.
3. The timing of activities scheduled in an OS call may affect the behavior of the microarchitecture.
Therefore, they emphasize that all architecture researchers should seriously consider adopting full-system simulation, despite the up-front cost of doing so.

Many simulators that can estimate power/energy numbers target embedded systems or other specific environments [22, 23, 24, 25, 26, 27, 29, 30, 31], since power consumption is obviously more important to such systems. Most of them target mobile systems, for which the energy source is very limited or unlikely to be replaced. However, none of them includes a disk model, since a disk is not a standard component of such systems.

The development of complete-system simulators for general-purpose systems has been directly motivated by the inability of user-level simulators to run complex workloads, e.g. database or network workloads. Their benefits are diverse and significant, including evaluation of hardware designs, development of operating systems, and performance tuning of workloads. There are also many complete-system simulators, e.g. SimOS [4], Simics [5], g88 [6], gsim [7], Talisman [8], Pharmsim [3], and TFsim [9]. g88, gsim, and Talisman are in-order functional simulators for the 88000 processor, which can simulate a modified version of UNIX.
SimOS is a dynamic full-system simulator that supports out-of-order processor models for the MIPS and Alpha instruction sets. Pharmsim is a dynamic full- system simulator based on SimOS and SimpleMP with an out-of-order processor model for the PowerPC instruction set. Simics is a commercial simulator that supports system-level simulation of five target architectures: Alpha, IA-32 (x86), PowerPC, SPARC, and x86-64. Simics can boot unmodified operating systems and it can be extended for cache timing simulations, but it only models simple (scalar, in-order) instruction execution. TFsim is the timing-first simulation which its timing simulator executes each dynamic instruction ahead of the functional simulator. It uses Simics as its functional component. However, no power consumption is reported by these simulators. A handful of publications are focused on architecture-level power estimation tools for processors, i.e. Wattch [19], SimplePower [20], Architecture-level power estimation in [32], and Architectural Power Evaluation [21]. These simulators focus only on CPU power consumption; neither memory nor disk is included in their power estimation systems. 103 Additionally, they are user-level simulators which do not include the operating system effects. Only a small number of publications actually implement complete-system simulators with power estimation. To our knowledge, the execution-driven, power estimation, complete-system simulators, which can run operating system on top and include the concept of disk, are SimWattch [2], SoftWatt [1], and Mambo [10]. SimWattch is a complete-system simulator that estimates performance and power consumption of an out-of-order issue superscalar microprocessors. It is based on Simics for performance simulation and on Wattch for power estimation. SimWattch takes the advantage of fast simulation in Simics by letting Simics fill an instruction trace at full speed into a FIFO queue and letting the instructions be consumed by Wattch at a slower pace. Wattch is employed as an architectural simulator that estimates CPU power consumption. One of the important steps is to convert the Simics instructions (SPARC-V9 instructions) to corresponding Wattch instructions (SimpleScalar?s PISA instructions). Even though SimWattch can run an unmodified OS and it includes multiple I/O devices, the power consumption reported includes only the processor. SoftWatt is based on SimOS. SimOS has three CPU models, namely, Embra, Mipsy, and MXS. Embra employs dynamic binary translation and provides a rough characterization of the workload. Mipsy provides emulation of a MIPS R4000-like architecture. It consists of a simple pipeline with blocking caches. MXS emulates a MIPS R10000-like superscalar architecture. SoftWatt use MXS to obtained detailed information about only the processor and disk, and use Mipsy simulator to run the operating system and obtain cache and memory system profiles. After the simulation, the simulation log files are fed into the analytical 104 power models to generate power values. Therefore, SoftWatt does not reflect the actual interaction between memory and disk since the memory system statistics and the disk statistics are the product derived from disjoint simulations. Also, there is a per-cycle power information loss due to post-processing approach. IBM?s Mambo is a complete-system simulator modeling PowerPC based systems. It provides building blocks for creating simulators that range from purely functional to timing- accurate. 
Functional versions support fast emulation of individual PowerPC instructions and the devices necessary for executing operating systems. Timing-accurate versions include the ability to account for device timing delays, and support the modeling of the PowerPC processor microarchitecture. While it has been used widely in IBM, it is not an open-source software available to the public. All of these simulators dedicate more details to the processor side than the memory side. For simplicity, some implement the memory as a constant time and constant energy per access. Some implement the memory as banks, but are not very detailed. Some, using two disjoint processor simulators produce disjoint levels of memory accesses, may cause discrepancy in the memory. SYSim is proposed to fill this gap in the memory system research community. It is intended to be an open-source, complete-system simulator that can demonstrate the systemic behaviors of entire memory hierarchy. It also includes both performance and energy models. We hope that SYSim will be a better option for the research community. 105 4.2. Magnetic Disk Drive Enhancements and Physical Improvements The performance of magnetic disk storage systems has improved only 10-15% per year, while the performance of processors has roughly doubled every two years. Over time, the performance gap between those two components continues to grow. To narrow the performance gap, more aggressive optimization of the storage system is required. The main reason for the slow growth in disk drive performance is the slow mechanical parts of storage devices. Since these slow mechanical parts can degrade the total system performance significantly, the importance of I/O optimization techniques has been widely acknowledged. As a result, many disk drive enhancements have been invented, including caching, write buffering, prefetching, request scheduling, and parallel I/O. Many efforts have also been made to improve the underlying technology of the disk drive?s physical characteristics. These physical characteristics, continuously improved over the years, include the tracks or bits per inch, average seek time, and rotational speed. These improvements are usually defined by using physical metrics, which is difficult to relate directly to the total system performance of real workloads. Therefore, using physical metrics, it is complicated to compare different physical improvements, since they are not directly related to the total system performance. Unfortunately, the relative effectiveness of these techniques and technology improvements is ambiguous since they have been investigated in isolation by different researchers using different methodologies. Some use discrete event simulation and analytical modeling. The others use trace-driven simulations. In some cases, the simulations are based on traces of real workloads and in others, randomly generated synthetic 106 workloads. For example, Zhu and Hu [75] evaluate a built-in disk cache using both real and synthetic workloads, and report the results in term of response time. Smith [81] evaluates a disk cache mechanism with real traces collected in real IBM mainframes on a disk cache simulator. He reports the results in terms of miss rate ratio. Huh and Chang [77] evaluate their RAID controller cache organization with a synthetic trace, and Varma and Jacobson [94] and Solworth and Orji [95] evaluate destage algorithms and write cache, respectively, with synthetic workloads. 
SPEC2000 is also used, for instance, to evaluate a energy-aware, compiler-controlled disk prefetching in [91]. The most widely-used benchmarks in disk drive research would be hplajw, cello, snake [99], and TPC [100]. These are workloads which are used to evaluate many aspects of large-scale disk systems. Since many of the techniques have not been evaluated with real workloads on the same basis, their actual effect is not known. To effectively optimize the system performance, the actual advantage of the disk enhancements and physical improvements must be analyzed both in isolation and in combination especially in system-level points of view. In this dissertation, we investigate how different techniques affect total system-level performance and power/energy consumption by using a complete-system simulation. The most similar work to ours is by Hsu and Smith [78]. They use a variety of workloads, including server and personal computer workloads, to systematically analyze the actual performance impact of various I/O optimization techniques by using trace-driven simulation. However, they evaluated each technique in isolation and neither in terms of total system performance nor in terms of power/energy consumption. Additionally, we focus on the I/O intensive phase on a single processing environment, widely used in personal computer systems. Therefore, we selectively studied the techniques aiming at improving the 107 disk systems for future personal computer workloads. For instance, unlike large-scale database systems with hundreds of RAID disks, we investigate the RAID system with only small number of disks, i.e. 4 and 8 disks. Also, we investigate the disk cache located inside individual disk drives, rather than the cache in RAID controllers or the cache in the file system. The reason is that we would like to make an impartial comparison between the disk systems equipped with selected disk enhancements and a single disk system, which is widely used in personal computers. Therefore, it would be unreasonable to make comparisons against sophisticated and complex techniques, which exist only to improve large-scale server applications. The selected disk enhancement techniques and the physical technology improvements we studied are described in detail as follows. 4.2.1.Disk Drive Enhancements 4.2.1.1. Disk Caching Caching is a general technique for improving performance in many levels of computer systems. It temporarily holds data that is likely to be utilized in faster memory. The faster memory is called the cache. In this section, the data refers to disk blocks requested from the storage system, and the faster memory refers to dynamic random access memory (DRAM) built into the disk drive. We refer to this as disk cache, which is different from file system cache and processor cache as described in Chapter 1. Disk cache is a general term introduced in [81] as a buffer used to hold portions of the disk address space contents, which can be placed at many levels along the data path. The disk cache as a small piece of 108 memory specifically built into a disk drive was first referred to as disk buffer in [83]. Today, the term disk buffer and disk cache are used to refer to the same entity. The disk cache?s specific function is to support I/O operations, which occur only through I/O requests via the operating system. Unlike the processor?s cache, disk cache does not connect to the processor directly. 
Therefore, it is not considered as a part of the memory hierarchy directly, but it is added as a special feature in a disk drive for performance optimization. When an I/O request is sent to the disk system, the request either finds the requested data in the disk cache or is sent to the physical disk mechanism. Like in the processor cache, the hit ratio is defined as the ratio of the number of requests satisfied by the cache to the number of total requests, and the miss ratio is the ratio of the number of requests sent to the physical disk mechanism to the number of total requests. The data can be brought to the cache in two ways, which are (1) they are fetched by reference, referred to as caching, or (2) they are anticipated to be referenced in the near future, referred to as prefetching. In ?Memory Systems? [102], Jacob et al. notes that disk caching is claimed to be a key factor in a disk drive?s performance whose importance is comparable to other basic drive attributes such as seek time, rotational speed, and data rate. Originally, disk cache was used as a speed matching buffer between the disk drive and the interface. The buffer is useful in two situations, one is because the disk drive and the interface operate at different speeds, and the other is when the host or the interface is not ready to receive the data. DRAM is usually used as this buffer memory. Since the data is already stored in the disk cache due to buffering, the function of disk cache extends to support caching and prefetching as well. When a request is sent to the disk system, the cache checks whether it holds the requested data. The data may remain in the cache since they 109 have been referred to by the previous requests. Instead of sending the request to the physical disk, the request can be satisfied by the cache. Therefore, the need to move the mechanical parts in the disk drive is eliminated and results in performance improvement. Therefore, the performance improvement depends on the amount of data to be reused in the cache. Today?s commercially available disk drives are generally equipped with a built-in cache as part of the drive controller electronics. The cache size ranges from 512KB for micro-drives to 16MB for the largest server drives, and is still growing. With ever-growing disk cache size, only a small fraction of the disk cache is used for speed matching buffer, while most of the disk cache is occupied by the caching data. In ?Memory Systems?[102], Jacob et al. pointed out that the effectiveness of a disk cache depends on two aspects: the disk cache organization and the disk cache algorithm. The first aspect, the disk cache organization, defines how the disk cache and its data are structured, including how the disk cache is organized and allocated, and how data are stored and identified in the disk cache memory. The second aspect of disk cache is the algorithms used, which defines the policies of how the cache is being utilized. The algorithm includes but is not limited to: ? destage policy--determine which piece of old data to destage/replace (destage is a process to write data from the cache to the disk.), ? prefetching policy--what data and how much data to prefetch ? scheduling policy--which requests should be processed first. Jacob et al. also suggested that these two aspects are related but orthogonal in operation. One aspect?s operation neither depends upon nor involves the other. However, some particular algorithms may be more suitable to be implemented with a specific type of 110 structure. 
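To make the first aspect (organization) concrete, here is a minimal sketch in C of a block-granularity disk cache organized as a flat array with LRU replacement, the replacement policy assumed in the following paragraph. It is illustrative only, with hypothetical names and sizes; it is not SYSim's code or any vendor's firmware.

    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINES 256            /* illustrative: 256 cached disk blocks */

    typedef struct {
        uint64_t lba;                  /* which disk block this line holds   */
        uint64_t last_use;             /* logical clock for LRU bookkeeping  */
        bool     valid;
    } cache_line_t;

    static cache_line_t cache[CACHE_LINES];
    static uint64_t     clock_ticks;

    /* Returns true on a hit; on a miss, evicts the least recently used line
     * and records the block as cached, mimicking fetch-by-reference.        */
    bool disk_cache_access(uint64_t lba)
    {
        int victim = 0;
        for (int i = 0; i < CACHE_LINES; i++) {
            if (cache[i].valid && cache[i].lba == lba) {
                cache[i].last_use = ++clock_ticks;   /* hit: refresh LRU state */
                return true;
            }
            if (!cache[i].valid ||
                cache[i].last_use < cache[victim].last_use)
                victim = i;                          /* track LRU or free line */
        }
        /* Miss: the request goes to the disk mechanism; cache the block. */
        cache[victim] = (cache_line_t){ .lba = lba, .last_use = ++clock_ticks,
                                        .valid = true };
        return false;
    }

Commercial drive caches are commonly organized into multi-block segments rather than single-block lines, which better suits the sequential streams that dominate disk traffic; the single-block array above is kept deliberately small for clarity.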
Since, in this dissertation, we are focusing on the I/O intensive phase of the execution consisting of mostly sequential requests served well with FCFS scheduling and LRU replacement policy, the second aspect will not be discussed in this dissertation. Zhu and Hu [75] have suggested that large disk built-in caches will not significantly benefit the overall system performance because all modern operating systems already use large file system caches to cache reads and writes. In our experiments, we investigated the disk cache including the effects of file system caching. As suggested by Przybylski [101], the reference stream missed by the L1 cache has low locality. Like the reference stream in the processor caches, the reference stream to the disk system missed the file system cache. As a result, the locality in the stream is low. To reflect a realistic stream to the disk system, it is important to include the file system cache effect. Therefore, the reference stream including file system caching effect in our experiment is more realistic than the stream in the experiments for trace-driven simulations. This method is hard to implement, but extremely valuable, so we ensure that the resulting reference stream represents realistic behavior of real systems. 4.2.1.2. Prefetching Prefetching is generally referred to the technique of acquiring the data before they are actually used in the system. Like caching, prefetching is implemented in many levels in memory hierarchy, including the disk systems. It involves two steps, which are predicting which data are likely to be used in the near future and fetching them before they are actually needed. The purpose of prefetching is to hide the stall time that the system has to wait if the 111 data are fetched on demand. Hsu and Smith [78] pointed out that the overall effectiveness of prefetching at improving performance depends on the following: 1. the accuracy of the prediction 2. the amount of extra resources (memory use, disk and data-path busy time, etc.) that is consumed by the prefetching process 3. the timeliness of the prefetch, i.e., whether the prefetch is completed before the blocks are actually needed There are several prediction schemes for prefetching on the host side to improve I/O performance in server applications, i.e. Database Management Systems in multiple parallel disk systems, or implemented in the operating systems. The host can prefetch in these systems more intelligently than multi device controllers or disk caches can, since applications can often better predict read requests. As examples given by Hsu and Smith [78], the prediction of the host-side prefetching is usually based on past access patterns [84, 85], system-generated plans [87, 88], user-disclosed hints [86], and even guidance from speculative execution [89]. Depending on the implementation, this information may be available to help with the prediction. However, only a few works explore the prefetching in disk cache [90, 81]. The disk drive prefetch generation is usually simply implemented as sequential prefetching. The reason for the disk drive to initiate prefetching is that the drive is aware of its own status, so it can avoid prefetching, which can be interfered from host requests. However, the drive does not know what data it stores, so it requires additional hints from the host, i.e. which data it should prefetch for complex prefetching schemes. 
Therefore, complex prefetching in the disk drive side is not a common practice and only prefetching schemes based upon the principle of locality of access are 112 implemented. Recently, since energy consumption has become major concern, a number of publications are proposing Energy-Aware data prefetching that is orchestrated by the compiler [91] or implemented with Flash Drives [92]. Since prefetching is based on speculation, the prefetched data may not be used. However, prefetching comes with cost. The cost of prefetching includes the use of the disk drive?s resources, i.e. the drive?s electronics and mechanical parts, to bring the prefetched data into the cache. Therefore, prefetching can occupy the disk drive?s resources and prevent the drive from doing other useful work. It may interfere with the user requests and may cause a user request to be delayed and to observe an increase in request response time. Therefore, a common strategy is to disable prefetch if there are I/O requests waiting to be serviced in the queue, and preempt any ongoing prefetch when a new I/O request arrives if the request requires the disk resources being used in the prefetch. If the new request can be serviced without requiring the other disk resources used in prefetching, such as a request hitting in the cache, the drive can continue prefetching. Besides the disk drive?s electronics and mechanical parts, prefetched data also consume space in the disk cache. If the prediction of prefetch is not accurate, prefetching may cause cache pollution just like in a processor cache. Cache pollution occurs when prefetched data are not used in the near future, but they cause other useful data to be thrown out from the cache since prefetched data contends for this space. This may result in degradation of total performance. Therefore, care should be taken when applying prefetch to the disk system. Most workloads generate disk request stream with high sequentiality. Therefore, simply sequential prefetching, especially on a cache miss, works well on all three criteria mentioned above, which are prediction accuracy, cost, and timeliness. In addition, a request for a large 113 chunk of sequential data can be served more effectively with sequential prefetching in many storage systems. To improve prefetching performance further, small sequential requests are merged into one large request, which can be served more efficiently by the storage system. Therefore, it is a common practice for modern storage systems to be equipped with sequential prefetching after a cache miss. We focus on such a prefetch in this dissertation due to the sequential nature of the disk requests in the I/O intensive phase. The prefetched data are managed as if they were fetched by reference in this dissertation. Unlike our prefetch scheme, the prefetched data in other systems can be allocated in a separate buffer or can be managed in the cache differently from data fetched by reference. The interested reader is referred to [85] for an evaluation of such alternatives. 4.2.1.3. Write Buffering The I/O subsystem is becoming a bottle-neck in the computer system due to the rapid growth in the processor speed and technology. Increasing the memory size for the file system cache and increasing built-in disk cache size will improve the caches? effectiveness to satisfy disk read requests. As a result, disk traffic will consist of mostly write traffic. 
There are many proposed solutions to this problem, including Log-Structured File Systems [93] in the operating system side and write-buffering in the disk drive side [95]. The term write buffering refers to the technique of temporarily holding written data in fast memory before the data are written into the physical storage permanently. Right after the data are accepted by the buffer, a write operation can be reported as completed. Unlike read caching, write commands to the write buffer do not require data in the cache. Therefore, the write commands are processed like a cache hit if there is available space in the cache 114 memory. The write command latency is composed of only the drive controller overhead and data transfer time. No mechanical time is involved. However, an amount of the dirty data is ?destaged?, written out to the disk media to free up space when the cache is mostly occupied, depending on the write-buffering policy. To prevent conflicts with user requests, destaging should be performed while the drive is idle. This avoids the interference with user requests which causes noticeable delay. In reality, the interference may be inevitable because the drive may be under a high usage with only minimal idle time. The situation is most likely to occur because disk writes usually come in bursts, so they easily fill up the buffer. In this case, destaging must take place while the drive is busy. Therefore, destaging adds more load to the drive in this case, which causes longer delays for user requests. As a result, write buffering may be only a technique to delay the disk writes instead of hiding them completely. Even though write buffering may only delay the disk writes, delaying those writes may be beneficial. The write requests in the write buffers can be scheduled, merged, or overwritten for better performance, i.e. fewer requests must be sent to the physical disk. Most systems implement a write buffer by allocating a part of DRAM memory under the control of the disk controller for it. Therefore, in server systems, write buffering is usually disabled for reliability reasons. To prevent data loss of data, the write buffer can be implemented with some form of nonvolatile storage (NVS), such as non-volatile RAM (NVRAM) [95, 79, 77], a disk cache disk (DCD) [76], NAND Flash memory [82], or MEMS-based storage [96]. In some environments, (e.g., UNIX file system, PC disks), periodically flushing the buffer contents to the physical disk, i.e. every 30 seconds, is considered sufficient. 115 In summary, there are three key benefits when utilizing write buffering. First, write buffering is used to hide the latency of writes by delaying the writes to until an appropriate time, i.e. without interfering with user requests. Second, by merging and overwriting the write data in the buffer, the write buffering can improve the performance by reducing the number of writes to the physical disk. Finally, by scheduling the writes to the physical disk, the write buffering helps the physical disk to perform more efficiently. While the write buffering technique performance can be affected by several destage parameters, such as high water mark, low water mark, and the size of write cache, in this dissertation we exclude those effects. We conducted a limit study of write buffering by assuming that the destage algorithm is perfect, so write buffering can completely hide the write latency. More details about destage algorithms can be found in [94] and [79]. 4.2.1.4. 
Parallel I/O Another widely used technique to improve I/O performance is to use parallel I/O. Parallel I/O simultaneously distributes multiple small requests to be serviced by several disks. A large request can also be serviced by multiple disks at the same time. Therefore, parallel I/O can improve both response time for individual requests and improve the throughput in for multiple requests. The two most common approaches are to distribute data among multiple disks are organizing the disks in a volume manner and organizing the disks into a stripe manner. The first approach, a volume manner, fills data into one disk until it is filled then moves to the next disk. On the other hand, the second approach, a stripe manner, divides data into small units called stripe units, and distributes the stripe units across the disks in a round-robin manner. Therefore, the volume manner can be considered as a stripe 116 manner with an entire disk is one stripe unit. Even though the stripe manner has an advantage over the volume manner for parallelism, striping data across multiple disks introduces low reliability to the disk system. As a result, redundant information is added to the striped disk system to improve reliability. This striped disk systems with redundant information are collectively called Redundant Arrays of Inexpensive Disks is RAID [80]. The following is a quick summary of the most commonly used RAID levels defined in [80]: ? RAID 0: Striped Set RAID 0 stripes data evenly across multiple disks without parity information for redundancy. RAID 0 was not included in the originally defined RAID levels because its reliability is not enhanced with redundancy. It was invented to improve parallelism in the disk systems for performance gains and to create a large logical disk space with multiple small disk drives. ? RAID 1: Mirrored Disks A traditional approach for improving reliability of magnetic disks is mirroring. As suggested with the name, mirrored disks duplicate all data to the mirrored disks. Therefore, a write actually is two writes to the data disk and the mirrored disk. Mirroring disks is considered the most expensive approach to improving reliability since all disks are duplicated. An optimized version of mirrored disks doubles the number of controllers for fault tolerance, so it allows reads occur in parallel to improve performance. ? RAID 2: Hamming Code for ECC RAID 2 imitates the DRAM bit-interleaving behaviors by bitwise striping the data across multiple disks with redundancy. The redundancy is accomplished by additional check 117 disks to detect and correct a single error. Multiple disks are required to detect and correct a single error. Multiple disks are used to identify the erroneous disk. Therefore, RAID 2 suffers from low usable space in the disk system because multiple disks are required for check disks. For a group size of 10 data disks, we need 4 check disks in total. ? RAID 3: Single Check Disk Per Group RAID 3 stripes data in the unit of byte with one dedicated parity disk. RAID 3 replaces multiple check disks in RAID 2 with a single parity disk. The principle is the disk controllers can detect that a disk failed, so multiple check disks in RAID 2 unnecessarily duplicates the function. If a disk has failed, the data on the failed disk can be reconstructed by the data on the remaining disks and the parity information. If the disk is the parity disk, the parity information can always be recalculated by the original data, and be stored easily in the replacement disk. 
This mechanism results in lowest reliability cost in RAID 3. So, the last two levels, RAID 4 and RAID 5, consider only improving the performance of small accesses, but the cost of reliability remains the same as RAID 3. ? RAID 4: Independent Reads/Writes RAID 4 stripes data in block-sized granularity across multiple disks with one dedicated parity disk. The only difference between RAID 3 and RAID 4 is RAID 4 stripes per block, rather than per byte. Consequently, the performance of RAID 4 is improved over RAID 3 because all reads/writes can now be serviced independently. The reason is a write uses 2 disks to perform 4 accesses?2 reads and 2 writes?while a small read involves only one read on one disk. ? RAID 5: No Single Check Disk 118 While level 4 RAID achieved high performance for reads due to parallelism, writes are still limited to one per group since every write must read and write the check disk. The RAID 5 distributes the data and checks information across all the disks?including the check disks. Therefore, RAID 5 supports multiple individual writes per group. These changes make RAID 5 satisfy the best performance for both reads and writes. RAID 5 can perform small read-modify-writes close to the speed per disk of RAID 1 while it performs large transfers per disk and retains the high ratio of usable storage capacity of RAID 3 and RAID 4. Spreading the data and parity across all disks even improves the performance of small reads, since there is one more disk per group that can perform read concurrently. In this dissertation, we implemented only RAID 5 since it is the most widely used RAID level. It also fits well with the concept of disk performance enhancements which we would like to compare against other techniques. Chen et al. [97] studied the striping effects in a RAID 5 disk array. As mentioned earlier, a striping unit is defined as the maximum amount of logically contiguous data that is stored on a single disk. A large striping unit makes a file span only on only a few disks. A small striping unit spans a file across more disks. They found that the optimal striping unit for the write-intensive workloads to be four times smaller than in the case of read-intensive workloads. The reason is the overhead of maintaining parity causes full-stripe writes (writes that span the entire error-correction group) to be more efficient than read-modify writes of reconstruct writes. The optimal striping unit for reads in RAID 5 varies inversely to the number of disks, but the optimal striping unit for writes varies with the number of disks. In conclusion, they derived general design rules for striping data in RAID 5 systems to be one half of average positioning time times disk transfer rate. 119 However, the application behavior characterizations (read/write intensive) vary with the system memory size. That is, a read-intensive application can become a write-intensive one by decreasing the memory and vice versa. We chose a fixed striping unit of 16KB as it performed well in both read and write intensive applications reported in [97]. 4.2.2.Disk Drive Physical Improvements The storage of a disk drive consists of a set of multiple rotating platters, on whose surfaces data is recorded. Each surface has its own head to perform read/write operations, but all heads are attached to a single set of mechanical arms moving as one. 
Therefore, there are multiple dimensions to the performance of a physical disk, for instance, what is the rate at which the platters rotate (RPM), how fast does the arm move (seek time), and how closely packed is the data (areal density). All dimensions affect the overall performance differently in terms of response time and throughput. Additionally, the effectiveness of a disk depends on the disk access order. As a result, only providing the numbers in the terms of those physical features is not an obvious indicator to reflect the effect of the disk technology improvement to real-world performance. In this dissertation, we also evaluate the improvement in underlying physical technology in terms of the actual system performance/power of real workloads. Hsu and Smith [78] stated that it is complicated to isolate and quantify the performance impact of the disk technology improvement in the different dimensions. Their reason is that disk technology improvements can affect the performance metrics, i.e. access time, in many dimensions. They gave the examples that the often-quoted 10% yearly improvement in the access time of disks is actually a result of a multi-dimensional combination including higher rotational speed, faster seek time due to 120 improvement in the disk arm mechanism, and smaller diameter disks with higher track density, which also reduce seek distance. In practice, the areal density reduces the seek time because the head is moved less to reach the destination track. Also, the areal density improves the internal data rate since, with the same rotational speed, the head can process more data when the data are packed more closely together. Higher areal density also improves the storage capacity per surface, and results in fewer disks for the data mapping mechanism. Observe that for various workloads in [78], the average response time and service time are projected to improve by approximately 15% per year because it takes into account the dramatic improvement in areal density and assumes that the workload and the number of disks used remain constant. In fact, our metric, the total system performance metric (CPI), has never been studied against the disk drive physical improvements. In this dissertation, we break down the improvements in disk technology into three major basic effects: ? Seek time reduction due to actuator improvement. ? Increase in rotational speed. ? Interface data rate 4.2.2.1. Seek Time Seek time is the time to move the read-write head from its current track to the destination track to service the next request. Seek time is one of the largest components of access time because it is dealing with moving mechanical parts. Therefore, seek time plays a significant role of a disk access time. With the recent rapid increase in areal density, the mass production of smaller and lighter drive disk platters is the major cause in seek time 121 improvement. Higher areal density translates into less distance for the head to seek, and the smaller disk drive platters with smaller and lighter actuators and arms can be easily moved in less time. Seek time is composed of two components, which are the travel time and the settle time. The settle time is the time the head requires to maintain to correct position on the track after arriving at the destination track. It includes the identification and confirmation of the correct destination track and the head is ready for data transfer. 
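This travel-plus-settle decomposition is often captured with a simple curve-fit model: short seeks are dominated by acceleration and grow roughly with the square root of the seek distance, while long seeks add a linear coast term. The sketch below is a generic textbook-style approximation with placeholder coefficients; it is not the seek model used by DiskSim or by this dissertation.

    #include <math.h>

    /* Illustrative seek-time model: settle time plus an acceleration-limited
     * (square-root) term for short seeks and a linear term for long seeks.
     * All coefficients below are placeholders, not measured values.         */
    double seek_time_ms(double cyl_distance)
    {
        const double settle_ms   = 0.5;    /* settle on the destination track */
        const double accel_coef  = 0.1;    /* sqrt-region coefficient         */
        const double linear_coef = 0.0008; /* coast-region slope              */
        const double crossover   = 400.0;  /* cylinders where coast dominates */

        if (cyl_distance <= 0.0)
            return 0.0;                    /* already on the target track     */
        if (cyl_distance < crossover)
            return settle_ms + accel_coef * sqrt(cyl_distance);
        return settle_ms + accel_coef * sqrt(crossover)
                         + linear_coef * (cyl_distance - crossover);
    }

A model of this shape makes the following point visible: as tpi grows and seek distances shrink, the square-root and linear terms shrink while the settle term does not.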
With rapid growth in tpi (tracks per inch), the travel time component in seek time decreases drastically, so settle time grows its significance. Typical average seek time for today?s server-class disk drives is about 3.5 milliseconds while for desktop drives is about 8 milliseconds. Mobile drives? seek time is typically slower in order to reduce power consumption. Historically, an 8% improvement in the average seek time translates roughly into only 3% improvement in the average response time [78]. 4.2.2.2. Rotational Speed After the head has reached the destination track and settled at the center of the track, the disk rotates to bring the destination sector to the head. Rotational latency is defined as the time the disk rotates to bring the destination sector to the position under the head. Since magnetic disk drive rotational speed is constant, the average rotational latency on the datasheet is simply calculated as one-half the time it takes the disk to finish one complete revolution. The rotational speed is inversely proportional to the rotational latency. The rotational speed is another great component of access time because it deals with moving mechanical parts. The rotational speed or RPM (revolution-per-minute) improves in discrete steps. The rotational speed has historically increased at 9% per year, which corresponds to 122 about a 5% improvement in average response time [78]. As portrayed by Jacob et al. [102], higher RPM is first introduced in the high-end server drives, and takes approximately 10 years to travel down to mobile drives. The new speed takes a few years to be adopted by the majority of server drives and then another few more years to be commonly used among desktop drives. Finally, the RPM speed takes another few more years to be introduced in mobile drives. Jacob et al. gives an example of the 7200-RPM disk drive, which was first introduced in server drives back in 1994. It was appeared in desktop drives in late 1990s, and the first 7200 RPM mobile drive was not available until 2003 - almost ten years after the first 7200 RPM drive was introduced in server systems. Today?s high-end server drives run at 15K RPM, with 10K RPM being the most common, while desktop drives are mostly 7200 RPM [78]. 4.2.2.3. Interface Data Rate Interface data rate is the rate that the data can be transferred between the disk drive and the host over the interface. The most recent disk drive interfaces, data rates, and the bus latency per sector are shown in Table 4.2 below. All these interfaces are substantially faster than the disk internal data rate. In our experiments, we varied the interface transfer time per sector from 0.64 microseconds to 1 millisecond. 123 Most interface data transfers can be overlapped with the internal data transfer, excluding the very last block case. The reason is that every sector of data must be in the drive buffer entirely for error checking and possible error correcting code (ECC) correction before it can be sent to the host. For writes, all of the interface data transfer from the host can be overlapped with the seek process, disk rotation, and the disk drive internal data transfer if necessary. 
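Two of the quantities above reduce to one-line calculations: average rotational latency is half a revolution, and the per-sector bus latency in Table 4.2 below is simply the sector size divided by the interface data rate. The short, illustrative snippet below (not part of SYSim) reproduces the commonly quoted 4.17 ms for a 7200 RPM drive and the 3.85 microseconds per 512-byte sector listed for ATA-7 at 133 MB/s.

    #include <stdio.h>

    /* Average rotational latency: half of one revolution, in milliseconds. */
    static double avg_rotational_latency_ms(double rpm)
    {
        return 0.5 * (60.0 * 1000.0) / rpm;
    }

    /* Bus latency for one 512-byte sector, in microseconds, given the
     * interface data rate in MB/s (decimal megabytes, as in Table 4.2). */
    static double bus_latency_us_per_sector(double rate_mb_per_s)
    {
        return 512.0 / (rate_mb_per_s * 1e6) * 1e6;
    }

    int main(void)
    {
        printf("7200 RPM -> %.2f ms average rotational latency\n",
               avg_rotational_latency_ms(7200.0));      /* ~4.17 ms             */
        printf("ATA-7 at 133 MB/s -> %.2f us per 512B sector\n",
               bus_latency_us_per_sector(133.0));       /* ~3.85 us (Table 4.2) */
        return 0;
    }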
Table 4.2: Latest disk interfaces and their data rates.

  Interface             Data rate (MB/s)   Bus latency per 512B sector (us)
  ATA-7                 133                3.85
  SCSI Ultra 320        320                1.6
  SATA and SAS          300                1.7
  future SATA and SAS   600                0.85
  FC                    200                2.56
  future FC             400                1.28

CHAPTER 5: METHODOLOGY

SYSim is a model of an entire memory hierarchy that includes both performance models and energy models for the cache, DRAM, and disk. The SYSim project incorporates several simulators, one for each component of the system. Figure 5.1 shows the SYSim architecture and its components. Bochs [11], a Pentium emulator, is used as the CPU model to generate the memory accesses and I/O interrupts. The cache model comes from Wattch [19]. The authors of Wattch integrate Cacti [12] to obtain the cache configuration with the best timing behavior for the cache.

Figure 5.1: SYSim architecture, including the CPU with OS from Bochs [11], caches with the power model from Wattch [19] and Cacti [12], DRAMsim from the University of Maryland [17] with the power model from Micron [18], the disk with the power model from DRPM [15] and DiskSim [14], and all the interfaces (requests, data, I/O requests, and DMA) between them.

After a miss from the last level of the caches, SYSim accesses the DRAM. The DRAM simulator from the University of Maryland [17] is integrated into the system to provide the timing behavior and also the power consumption. The simulator is implemented in a very detailed way, as it has the concept of channels, ranks, and banks. As part of this dissertation, a power model [18] was incorporated into the DRAM simulator in order to generate the instantaneous power consumption.

Bochs has a basic model of a disk, but considers neither timing nor power statistics: the disk model in Bochs is responsible only for reading and writing data from/to the disk image. Therefore, we integrate a modified DiskSim [14] simulator, as used in the DRPM paper [15], into the system. The DRPM version of DiskSim is used only for collecting timing and power consumption statistics. We obtained a disk image from the Bochs website which has Redhat Linux 6.0 installed. Finally, a set of programs from the SPECINT2000 benchmark suite [13] are compiled and installed on this OS-ready disk image.

5.1. The Processor Simulator: Bochs

Bochs is a highly portable open-source PC emulator written in object-oriented C++, which runs on most popular platforms. It includes emulation of the Intel x86 CPU, common I/O devices, and a customized BIOS. The typical use of Bochs is to provide complete x86 PC emulation, including the x86 processor, hardware devices, and memory. This allows operating systems and software to run within the emulator on a workstation. Currently, Bochs can be compiled to emulate a 386, 486, Pentium, Pentium Pro, or AMD64 CPU, including optional MMX, SSE, SSE2, and 3DNow! instructions. In addition, Bochs is able to run most operating systems inside the emulation, including Linux, Windows 95, DOS, and Windows NT 4. For the SYSim project, we compiled Bochs for the Pentium model, since it is the most current released configuration of Bochs reported to be stable. We also obtained a disk image of Redhat Linux version 6.0 from the Bochs website. As the objective of the project is to construct a simulator focused exclusively on the memory system, the CPU simulator is viewed as a black box that generates memory access requests to the L1 cache and I/O interrupts to the disk.
It can be considered as a trace generator that can reflect the changes in memory organizations and the memory latency in the hierarchy, but it is more valuable than a trace generator because the requests respond to the timing in the memory hierarchy, and it reflects the effects of the unmodified operating system. 127 5.2. The Cache Simulator: Wattch Wattch [19] is an architecture-level microprocessor power estimation tool. The power models are integrated into the SimpleScalar architectural simulator [28]. The cache model in Wattch is implemented as an array structure. The power model of the cache is based on the number of rows, columns, and the number of read/write ports. These parameters affect the size and number of decoders, the number of wordlines, and the number of bitlines. In addition, these parameters are used to estimate the length of the pre-decode wires as well as the lengths of the array?s wordlines and bitlines. The wordline and bitline capacitance are computed in the same way. The wordline capacitance includes the capacitance of the wordline driver, the gate capacitance of the cell access transistor multiplied with the number of bitlines, and the capacitance of the wordline?s metal wire. The bitline capacitance includes the diffusion capacitance of the pre-charge transistor, the diffusion capacitance of the cell access transistor multiplied by the number of word lines, and the metal capacitance of the bitline. The number of ports also affects the power consumption due to additional transistor connection on wordlines, two additional bitlines, and longer wire on both wordlines and bitlines. Wattch authors estimate the physical implementations for cache structures using the help of the Cacti tools [12]. Cacti takes the cache size, block size and associativity as inputs, and chooses the organization that gives the smallest access time. Cacti models each component of the cache in transistor level considering the technology dependence parameters. Cacti authors compare the results with Hspice model. Wattch consider three different options for clock gating to disable unused resources in multi-ported cache. 128 1. All-or-nothing approach. The full modeled power will be consumed if any accesses occur in a given cycle, and zero power consumption otherwise. 2. Scaled linearly. If only a portion of a cache?s ports are accessed, the power is scaled linearly with the number of accessed port(s). 3. Scaled linearly with 10 per cent. It is the same as the second option except that unused units dissipate 10% of their maximum power, rather than drawing zero power. Since the amount of clock gating in current processors falls somewhere between these styles, SYSim calculates all three clock gating styles on the fly. Then, it output all three power consumption numbers corresponding to the clock gating styles into the output file. So, the user has the freedom to choose which clock gating style he prefers. 129 5.3. The DRAM Simulator and Its Power Model The DRAM simulator from University of Maryland [17] is integrated into SYSim. The DRAM simulator is considered an extremely detailed DRAM simulator, and it is extremely configurable. Every parameter can be set, for examples, type of the DRAM, the DRAM configuration (i.e., number of channels, ranks, banks, rows, and columns), the operating frequency, refresh policy, address mapping policy, close/open page policy, all timing parameters, etc. It includes the bus interface unit, the memory controller unit, the DRAM DIMMs, and DRAM devices. 
It has the concept of an individual state for each channel, rank, and bank. Each memory access request is transformed into a transaction consisting of a combination of row-activate, column read/write, and precharge commands. In order to generate these commands, the DRAM simulator considers the current state of each bank. It also takes the timing specifications from the manufacturer's datasheet into account, and generates the timing diagram for the command bus and data bus. It can simulate a wide variety of DRAM types, i.e. SDRAM, DDR SDRAM, DDR2 SDRAM, DRDRAM, and fully-buffered DIMMs. The DRAM simulator has been carefully validated against real hardware and against three different detailed DRAM simulators used in published DRAM studies. The accuracy demonstrated exceeds that of any other simulator.

To generate the power consumption, the power model by Micron [18] has been added to the DRAM simulator. The modified DRAM simulator writes the power numbers and their related statistics to an output file every epoch. Currently, the DRAM simulator incorporates a power model for DDR and DDR2 SDRAM. Basically, calculating the power amounts to calculating the average power over one activation-to-activation cycle. That is, we calculate the power in each DRAM state and then multiply it by the fraction of time the device spends in that state with respect to one activation-to-activation cycle. For simplicity, we consider the power model for DDR SDRAM first, and then make some extensions to cover the DDR2 case. The power consumption of DDR SDRAM is calculated as follows.

Table 5.1: The definitions of the symbols in the DRAM datasheet.(a)

  Symbol   Units   Parameter / Condition
  IDD0     mA      Operating current: one bank; active-precharge; tRC = tRC MIN; tCK = tCK MIN
  IDD2P    mA      Precharge power-down standby current: all banks idle; power-down mode; tCK = tCK MIN; CKE = LOW
  IDD2F    mA      Idle standby current: CS# = HIGH; all banks idle; tCK = tCK MIN; CKE = HIGH
  IDD3P    mA      Active power-down standby current: one bank; power-down mode; tCK = tCK MIN; CKE = LOW
  IDD3N    mA      Active standby current: CS# = HIGH; one bank; tCK = tCK MIN; CKE = HIGH
  IDD4R    mA      Operating current: burst = 2; READs; continuous burst; one bank active; tCK = tCK MIN; IOUT = 0 mA
  IDD4W    mA      Operating current: burst = 2; WRITEs; continuous burst; one bank active; tCK = tCK MIN
  IDD5     mA      Auto refresh current; tRC = 15.625 µs

  (a) Data sheet assumptions: IDD is dependent on output loading and cycle rates. Specified values are obtained with minimum cycle time at CL = 2 for -75Z and -8, and CL = 2.5 for -75, with the outputs open; 0°C ≤ TA ≤ 70°C; VDDQ = VDD = 2.5 V ± 0.2 V. CKE must be active (HIGH) during the entire time a REFRESH command is executed; that is, from the time the AUTO REFRESH command is registered, CKE must be active at each rising clock edge until tREF later.

The calculation involves parameters extracted from a DDR SDRAM data sheet; Table 5.1 shows the definitions of the IDD values from a data sheet. In order to calculate the power, two states are defined. When data is held in any of the sense amplifiers, the DRAM is said to be in the "active state"; after all banks of the DDR SDRAM have been restored to the memory array, it is said to be in the "precharge state". Additionally, CKE, the device clock-enable signal, is considered. In order to send commands, or to read or write data to the DDR SDRAM, CKE must be HIGH.
If CKE is LOW, the DDR SDRAM clock and input buffers are turned off, and the device is in power-down mode. From the definitions of the active/precharge states and CKE above, a DRAM device can be in one of four states, with the corresponding power terms:

1. Active Standby Power:
     p(ACTstby) = IDD3N × VDD × (1 - BNKpre) × (1 - CKEloACT)
2. Active Power-down Power:
     p(ACTpdn) = IDD3P × VDD × (1 - BNKpre) × CKEloACT
3. Precharge Standby Power:
     p(PREstby) = IDD2F × VDD × BNKpre × (1 - CKEloPRE)
4. Precharge Power-down Power:
     p(PREpdn) = IDD2P × VDD × BNKpre × CKEloPRE

where
• the IDD values are defined in the data sheet and VDD is the maximum supply voltage of the device,
• BNKpre is the fraction of time the DRAM device is in the precharge state (all banks of the DRAM precharged) compared with the actual activation-to-activation cycle time,
• CKEloPRE is the fraction of time the DRAM stays in the precharge state with CKE low, compared with the time it stays in the precharge state, and
• CKEloACT is the fraction of time the DRAM stays in the active state with CKE low, compared with the time it stays in the active state.

Figure 5.2 shows the four states of a DRAM device during an activation-to-activation period with a read burst [18]. We assume that the CKE signal goes low as soon as there is no activity in the DRAM device. In addition, when the DRAM device is in the Active Standby state, commands can be sent to the device, and the activity corresponding to these commands increases the current drawn by the device. For example, during the Active Standby state in Figure 5.2, the current increases due to the activation command and then drops back to IDD3N. During the read process, the current is also pulled up. Finally, the DQ termination increases the current during the data transfer out of the device.

Figure 5.2: Read current with I/O power included [18]. The four states of a DRAM device are shown: Active Standby, Active Power-Down, Precharge Standby, and Precharge Power-Down. We assume the other banks in the DRAM device are precharged. The DRAM device is in the Active state when data is held in any of the sense amplifiers after an activation command, and in the Precharge state when all banks are precharged. The device is in Power-Down mode if the CKE signal is low, and in Standby mode otherwise.

Therefore, we have four more power terms within the Active Standby state:

1. Activate Power:
     p(ACT) = (IDD0 - IDD3N) × (tRC / tACT) × VDD
2. Write Power:
     p(WR) = (IDD4W - IDD3N) × WRpercent × VDD
3. Read Power:
     p(RD) = (IDD4R - IDD3N) × RDpercent × VDD
4. Termination Power:
     p(DQ) = p(perDQ) × (numDQ + numDQS) × RDpercent

where
• tRC is the shortest activation-to-activation cycle time specified in the data sheet,
• tACT is the actual activation-to-activation cycle time in the real system,
• WRpercent is the fraction of time that data to be written occupies the data pins, compared with the actual activation-to-activation cycle time,
• RDpercent is the fraction of time that read data occupies the data pins, compared with the actual activation-to-activation cycle time,
• p(perDQ) is the power of each DQ; it depends on the termination scheme, and in this case we use p(perDQ) = 6.88 mW for DDR SDRAM, and
• numDQ and numDQS are the numbers of DQ and DQS pins in the device, respectively.

And the Refresh Power is:

     p(REF) = (IDD5 - IDD2P) × VDD

Notice that IDD3N is deducted from these terms since it is already included in p(ACTstby). Also, in the current version of the DRAM simulator, we simulate a refresh command as a row activate command followed by a precharge command.
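To make the background terms concrete before the voltage and frequency scaling is applied below, the short example here plugs hypothetical values into the four background-state formulas above; the currents, voltage, and residency fractions are placeholders, not values from a particular datasheet or from SYSim's configuration.

    #include <stdio.h>

    /* Background (non-operation) DRAM power from the four state formulas
     * above. All numbers are placeholders, not from a real datasheet.    */
    int main(void)
    {
        /* Hypothetical datasheet currents (A) and maximum supply voltage (V). */
        const double IDD3N = 0.035, IDD3P = 0.012;
        const double IDD2F = 0.025, IDD2P = 0.006;
        const double VDD   = 2.6;

        /* Hypothetical residencies for one activation-to-activation window:
         * 40% of the time all banks are precharged; CKE is low for 30% of
         * the active time and 70% of the precharge time.                    */
        const double BNKpre = 0.40, CKEloACT = 0.30, CKEloPRE = 0.70;

        double p_actstby = IDD3N * VDD * (1 - BNKpre) * (1 - CKEloACT);
        double p_actpdn  = IDD3P * VDD * (1 - BNKpre) * CKEloACT;
        double p_prestby = IDD2F * VDD * BNKpre * (1 - CKEloPRE);
        double p_prepdn  = IDD2P * VDD * BNKpre * CKEloPRE;

        printf("background power = %.1f mW\n",
               1000.0 * (p_actstby + p_actpdn + p_prestby + p_prepdn));
        return 0;
    }

Extending this bookkeeping with the per-operation and refresh terms above gives the per-epoch power that the simulator reports.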
Then we scale the voltage and frequency to those at which we actually operate. As a result, we obtain:

P(PREpdn) = p(PREpdn) × (useVDD² / maxVDD²)
P(ACTpdn) = p(ACTpdn) × (useVDD² / maxVDD²)
P(PREstby) = p(PREstby) × (usefreq / specfreq) × (useVDD² / maxVDD²)
P(ACTstby) = p(ACTstby) × (usefreq / specfreq) × (useVDD² / maxVDD²)
P(ACT) = p(ACT) × (useVDD² / maxVDD²)
P(WR) = p(WR) × (usefreq / specfreq) × (useVDD² / maxVDD²)
P(RD) = p(RD) × (usefreq / specfreq) × (useVDD² / maxVDD²)
P(DQ) = p(DQ) × (usefreq / specfreq)
P(REF) = p(REF) × (useVDD² / maxVDD²)

Finally, we sum everything up for the total power:

P(TOT) = P(PREpdn) + P(PREstby) + P(ACTpdn) + P(ACTstby) + P(ACT) + P(WR) + P(RD) + P(DQ) + P(REF)

In the case of DDR2 SDRAM, most of the calculations remain the same except p(ACT), p(REF), and the I/O and termination power. For DDR2 SDRAM, p(ACT) before the voltage/frequency scaling is:

p(ACT) = (IDD0 − (IDD3N × tRAS + IDD2N × (tRC − tRAS)) / tRC) × VDD

Then we scale it the same way as in the DDR SDRAM case. The refresh power p(REF) is:

p(REF) = (IDD5 − IDD3N) × VDD × (tRFCmin / tREFI)

In the power model for DDR2 SDRAM, the simulator supports two cases: (1) the one-rank case, and (2) the multiple-rank case with at most four ranks. For the one-rank case, the termination powers are:

Write Termination Power: p(termW) = p(dqW) × (numDQ + numDQS + 1) × WRpercent
Read Termination Power: p(DQ) = p(dqR) × (numDQ + numDQS) × RDpercent

The read and write termination powers to other ranks are zero:

p(termRoth) = p(termWoth) = 0

where p(dqW) = 8.2 mW and p(dqR) = 1.1 mW. In the case of multiple ranks, the read and write termination powers are calculated the same way, but with p(dqW) = 0 and p(dqR) = 1.5 mW. However, the DRAM also needs to terminate signals from the other ranks. The termination powers from the other ranks are:

p(termRoth) = p(dqRDoth) × (numDQ + numDQS) × termRDsch
p(termWoth) = p(dqWRoth) × (numDQ + numDQS + 1) × termWRsch

where
• p(dqRDoth) is the termination power when terminating a read from another DRAM, and is equal to 13.1 mW;
• p(dqWRoth) is the termination power when terminating write data to another DRAM, and is equal to 14.6 mW;
• termRDsch is the fraction of time that reads are terminated from another DRAM;
• termWRsch is the fraction of time that writes are terminated to another DRAM.

Finally, we sum everything to obtain the total power of the DDR2 SDRAM:

P(TOT) = P(PREpdn) + P(PREstby) + P(ACTpdn) + P(ACTstby) + P(ACT) + P(WR) + P(RD) + P(DQ) + P(REF) + p(termW) + p(termWoth) + p(termRoth)

During the simulation, SYSim collects statistics in each epoch. At the end of the epoch, SYSim calculates the total power of each DRAM chip and multiplies it by the number of chips in a rank to generate the per-rank power. In an oracle fashion, SYSim switches the device to power-down mode as soon as possible. Time during which the DRAM simulator is not called is also accounted as power-down time; the device is switched to either precharge or active power-down mode, depending on the state of the banks in the device.
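To make the bookkeeping above concrete, the following sketch assembles the per-epoch DDR SDRAM power from the state fractions and datasheet currents just described. It is a minimal illustration, not SYSim's actual code: the IDD values follow Table 5.2, while the pin counts, timings, clock frequency, and state fractions in the example call are hypothetical.

```python
# Minimal sketch of the Micron-style DDR SDRAM power model described above.
# IDD values are in amperes, voltages in volts, times in seconds.

def ddr_power(idd, vdd_max, vdd_use, f_spec, f_use,
              bnk_pre, cke_lo_act, cke_lo_pre,
              t_rc, t_act, rd_pct, wr_pct,
              p_per_dq=6.88e-3, num_dq=8, num_dqs=1):
    v = (vdd_use ** 2) / (vdd_max ** 2)   # useVDD^2 / maxVDD^2
    f = f_use / f_spec                    # usefreq / specfreq

    # Background power in the four CKE/bank states
    p_actstby = idd['IDD3N'] * vdd_max * (1 - bnk_pre) * (1 - cke_lo_act)
    p_actpdn  = idd['IDD3P'] * vdd_max * (1 - bnk_pre) * cke_lo_act
    p_prestby = idd['IDD2F'] * vdd_max * bnk_pre * (1 - cke_lo_pre)
    p_prepdn  = idd['IDD2P'] * vdd_max * bnk_pre * cke_lo_pre

    # Activity power: activate, write, read, DQ termination, refresh
    p_act = (idd['IDD0'] - idd['IDD3N']) * (t_rc / t_act) * vdd_max
    p_wr  = (idd['IDD4W'] - idd['IDD3N']) * wr_pct * vdd_max
    p_rd  = (idd['IDD4R'] - idd['IDD3N']) * rd_pct * vdd_max
    p_dq  = p_per_dq * (num_dq + num_dqs) * rd_pct
    p_ref = (idd['IDD5'] - idd['IDD2P']) * vdd_max

    # Voltage scaling applies everywhere except p(DQ); frequency scaling
    # applies to the standby, read, write, and DQ terms.
    return ((p_prepdn + p_actpdn + p_act + p_ref) * v
            + (p_prestby + p_actstby + p_wr + p_rd) * v * f
            + p_dq * f)

# Example with the IDD values of Table 5.2 and made-up state fractions/timings
idd = dict(IDD0=0.155, IDD2P=0.005, IDD2F=0.055, IDD3P=0.045,
           IDD3N=0.060, IDD4R=0.190, IDD4W=0.195, IDD5=0.011)
print(ddr_power(idd, vdd_max=2.6, vdd_use=2.5, f_spec=200e6, f_use=200e6,
                bnk_pre=0.7, cke_lo_act=0.3, cke_lo_pre=0.6,
                t_rc=60e-9, t_act=120e-9, rd_pct=0.10, wr_pct=0.05))
```

Per-rank power would then be this per-chip value multiplied by the number of chips in the rank, as described above.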
5.4. The Disk Simulator: DiskSim

DiskSim [14] is an efficient, accurate, highly configurable storage-system simulator. It is written in C and requires no special system software. It includes modules for many secondary-storage components of interest, including device drivers, buses, controllers, adapters, and disk drives. Some of the component modules are highly detailed (e.g., the disk module), and the individual components can be configured and interconnected in a variety of ways. DiskSim can be driven by externally provided I/O request traces or internally generated synthetic workloads. DiskSim has been used in a variety of published studies to understand modern storage-subsystem performance, to understand how storage performance relates to overall system performance, and to evaluate new storage-subsystem architectures. The disk module in DiskSim, which is extremely detailed, has been carefully validated against five different disk drives from three different manufacturers. The accuracy demonstrated exceeds that of any other disk simulator.

The DiskSim model that we used is taken from the DRPM paper [15]. The model includes power models to record the energy consumption of the disks when performing operations such as data transfers and seeks, or when just idling, as described by the TPM model in the DRPM paper. The model collects the latency of each state the disk stays in and multiplies that latency by the corresponding power value in Figure 5.3 to generate the total energy consumption. We modified the model, which originally reported energy only at the end of the simulation, to report power in every epoch on the fly.

DiskSim models the performance behavior of disk systems but does not actually save or restore data for each request. Therefore, we incorporated the DiskSim model on top of the disk model of Bochs. We modified Bochs' disk model to convert the I/O requests from the CPU into DiskSim requests and to place them in the DiskSim interrupt queue. DiskSim is then called to simulate the event and returns the latency. After Bochs updates the timing, the Bochs disk model reads (or writes) the data from the disk image and returns control to the CPU.

Figure 5.3: TPM Power Modes [15]. Idle: 22.3 W; Active: 39 W; Seek: 39 W; Standby: 4.15 W; Spinup: 34.8 W (26 s); Spindown: 4.15 W (15 s).
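To make the accounting concrete, here is a minimal sketch (not DiskSim's or SYSim's actual code) of the TPM-style energy bookkeeping described above: the time the disk spends in each state during an epoch is multiplied by that state's power from Figure 5.3 and accumulated. The state names and the epoch bookkeeping are assumptions for illustration.

```python
# Sketch of TPM-style disk energy accounting (power values from Figure 5.3 [15]).
TPM_POWER_W = {
    'idle': 22.3, 'active': 39.0, 'seek': 39.0,
    'standby': 4.15, 'spinup': 34.8, 'spindown': 4.15,
}

def epoch_energy(state_seconds):
    """state_seconds: dict mapping disk state name -> seconds spent in that state."""
    return sum(TPM_POWER_W[state] * t for state, t in state_seconds.items())

# e.g. an epoch with 0.6 s idle, 0.3 s seeking, 0.1 s transferring data
print(epoch_energy({'idle': 0.6, 'seek': 0.3, 'active': 0.1}))  # energy in joules
```

Dividing the accumulated energy by the epoch length gives the per-epoch power that the modified model reports.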
5.5. The Benchmarks: SPEC2000

The SPEC CPU2000 benchmarks are intended to exercise the CPU, the memory hierarchy, and the compilers. Since we intended to study the behavior of the entire memory hierarchy, a set of benchmarks from the SPEC CPU2000 suite was used in our experiments. Seven benchmarks from the SPEC2000 integer suite were selected: bzip2, gzip, gcc, mcf, parser, twolf, and vortex. Two benchmarks from the floating-point suite, ammp and mgrid, are also used in the experiments. They were compiled by gcc with static libraries on a Linux host system, and the binary files of the benchmarks and their input files were installed on a Red Hat Linux disk image. Inputs for all benchmarks are the reference inputs.

SPEC CPU2000 is the next-generation industry-standardized CPU-intensive benchmark suite. SPEC designed CPU2000 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware. The implementation resulted in source-code benchmarks developed from real user applications. These benchmarks measure the performance of the processor, memory, and compiler on the tested system. The data collected show that SPEC met its goals for memory footprint: most benchmarks are larger than common cache sizes, many are larger than 100MB, and none are larger than 200MB. Further details of the selected benchmarks can be found in the appendix of this dissertation.

5.6. Interactions

One of the main contributions of SYSim is to ensure the correct interactions between the components in the system at the epoch level. These interactions include:
• the interaction between the processor and the L1 cache
• the interaction between multiple levels of the caches
• the interaction between the last level of the caches and DRAM
• the interaction between the processor and the disk via I/O requests
• the interaction between the disk and DRAM via DMA.

Figure 5.4 shows the interaction among the processor, caches, and memory for a load instruction. The interactions of the processor, the L1 cache, and the multiple levels of caches are implemented as modeled in SimpleScalar/Wattch. When the processor fetches an instruction, or executes a load or a store, a memory request is generated. The page-frame portion of the requesting address is translated by the operating system via the TLB, while the index portion from the page offset is sent to the L1 instruction cache. Since we view the processor and the operating system as a black box that generates memory access requests and I/O interrupts, we assume that the TLB always translates the page-frame address, except on a page fault, which is handled by the operating system. With the translated physical page-frame address and the page offset, the L1 cache decodes the physical address into set, tag, and offset. The set is chosen, and the tag portion of the cache is accessed and compared. If it is a hit, the proper bytes of the block are furnished to the processor using the lower bits of the page offset, and the instruction-stream access is done. If it is a miss in the L1 cache, the L2 cache is accessed. As in the L1 cache, the physical address is decoded into tag, set, and offset portions, and the L2 cache accesses the block. If it is a hit, the proper bytes (the size of an L1 cache block) are sent to the L1 cache. The L1 cache then chooses a block to be replaced, depending on the replacement policy, writes the replaced block to the appropriate block in the L2 cache, and writes the missed block from the L2 cache into the chosen block in the L1 cache.

Figure 5.4: Abstract Illustration of a Load Instruction in a Processor-Memory System [17]. Part A (CPU clocking domain), searching on-chip for data: [A1] virtual-to-physical address translation (DTLB access); [A2] L1 D-cache access, proceeding on a miss to [A3] the L2 cache access; on an L2 miss the request goes to [A4] the Bus Interface Unit (BIU), which obtains the data from main memory. Part B (DRAM clocking domain), going off-chip for data: [B1] the BIU arbitrates for ownership of the address bus; [B2] the request is sent to the system controller; [B3] physical-to-memory address translation; [B4] memory request scheduling; [B5] memory address setup (RAS/CAS); [B6, B7] the DRAM device obtains the data and returns it through the read data buffer to the memory controller; [B8] the system controller returns the data to the CPU. Some steps are not required for some processor/system controllers.
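To illustrate the tag/set/offset decomposition described above, the following sketch splits a physical address using the L1 data-cache geometry of Table 5.2 (64kB, 64-byte lines, 4-way set associative). It is an illustration only, not SYSim's code; the function name and the example address are hypothetical.

```python
# Sketch: splitting a physical address into tag / set / offset for the
# L1 data cache of Table 5.2 (64 kB, 64 B lines, 4-way set associative).
CACHE_BYTES, LINE_BYTES, WAYS = 64 * 1024, 64, 4
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 256 sets

def decode(paddr):
    offset = paddr % LINE_BYTES                  # byte within the block
    set_index = (paddr // LINE_BYTES) % NUM_SETS # which set to probe
    tag = paddr // (LINE_BYTES * NUM_SETS)       # compared against stored tags
    return tag, set_index, offset

print(decode(0x123456))   # -> (tag, set, offset) for a hypothetical address
```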
On the other hand, if a miss occurs in the L2 cache, the request is sent either to the L3 cache, if one exists, or to the DRAM memory system. If the request must go to DRAM, it is placed in the Bus Interface Unit (BIU) request queue inside the DRAM simulator. After the appropriate BIU entry has been selected, the status of the BIU entry is marked as SCHEDULED, and a memory transaction is created in the memory transaction queue. The transaction is broken into a series of row-activate, column read/write, and precharge commands, depending on the status of the accessed bank. After the commands are issued, the DRAM returns the most critical data (the width of the memory bus) with respect to the DRAM timing specification, and the rest of the data is sent to the replaced block of the last-level cache until it is filled. Figure 5.4 shows how a processor and memory system, excluding disk effects, interact on a load instruction. In the actual implementation in SYSim, the integrated cache and DRAM models do not contain any data; the data are obtained from the memory array in the Bochs simulator. The integrated cache and DRAM models only return the latency for each request and update the status of the components. The returned latency includes only the time until the first chunk (critical word) of data returns, excluding the time to return the rest of the data.

On the operating-system side, when a page fault occurs, the operating system issues a command to transfer the new page from the disk to memory. As a disk request often involves block transfers, direct memory access (DMA) hardware is added to many computer systems to allow transfers of large numbers of words without intervention by the CPU. DMA is a specialized processor that transfers data between memory and an I/O device while the CPU goes on with other tasks; thus, it is external to the CPU and must act as a master on the bus. The CPU first sets up the DMA registers, which contain the source and destination memory addresses and the number of bytes to be transferred. Once the DMA transfer is complete, the controller interrupts the CPU. There may be multiple DMA devices in a computer system; for example, DMA is frequently part of the controller for an I/O device.

A newer protocol for the ATA/IDE interface is Ultra DMA. The key technological advance introduced to IDE/ATA by Ultra DMA was double-transition clocking. Before Ultra DMA, one transfer of data occurred on each clock cycle, triggered by the rising edge of the interface clock (or strobe). With Ultra DMA, data is transferred on both the rising and falling edges of the clock. Double-transition clocking, along with some other minor changes to the signaling technique to improve efficiency, allowed the data throughput of the interface to be doubled for any given clock speed.

The actual implementation of DMA and Ultra DMA relies on the timing in Bochs. Bochs already implements DMA and its interaction with memory, but no timing is updated. By inspection of Bochs, a disk request causes the disk to read a sector from the disk into the disk buffer; the data is then sent to memory 2 bytes at a time, as the IDE/ATA interface is two bytes (16 bits) wide. We need to consider the transfer configuration, i.e., the frequency of the bus, the type of the interface, and so on, to calculate the data-transfer latency. For example, if we wish to transfer a 512-byte sector, the latency is

latency = 512 B / (2 B per clock × BusFrequency) with DMA, or
latency = 512 B / (4 B per clock × BusFrequency) with Ultra DMA.
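As a worked example of these formulas, the snippet below evaluates the DMA and Ultra DMA sector-transfer latencies for an assumed 33 MHz interface clock; the actual bus frequency in the experiments is whatever the DiskSim bus-latency parameter encodes.

```python
# Worked example of the sector-transfer latency formulas above,
# assuming a hypothetical 33 MHz IDE/ATA interface clock.
SECTOR_BYTES = 512
BUS_HZ = 33e6

dma_latency  = SECTOR_BYTES / (2 * BUS_HZ)   # one 16-bit word per clock edge
udma_latency = SECTOR_BYTES / (4 * BUS_HZ)   # double-transition clocking
print(dma_latency * 1e6, udma_latency * 1e6) # ~7.8 us vs ~3.9 us per sector
```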
After DMA completes transferring the data to the memory controller, the controller generates a transaction corresponding to the data from DMA; the transaction is then scheduled to the DRAM system when appropriate. Finally, the memory pages corresponding to data residing in any caches are invalidated, in preparation for the data to be loaded into the caches later on.

In SYSim, we run all applications in single-user mode to make an accurate calculation of execution time; otherwise, the kernel would swap to other processes on a read() system call. Therefore, disk delay shows up as stall time. On the other hand, a write() system call returns to user code as soon as the data is transferred into kernel space: instead of writing through to the disk system directly and waiting for the write to finish, the operating system buffers the writes in memory and returns control to the user application immediately, issuing a long burst of buffered writes to the disk system periodically at a later time. As a result, disk read requests behave like blocking requests, and disk writes behave like non-blocking requests. Additionally, the way we implemented the interface is very straightforward, as DiskSim allows us to specify the bus latency for transferring one sector. Therefore, in the experiments, we only varied the bus latency to see the differences between the different types of interfaces.

5.7. Parameter and Benchmark Selections

All the parameters used in the experiments are shown in Table 5.2 below. We have chosen the parameters to suit the benchmarks, which are from SPEC2K, while still reflecting modern computer systems, as the same parameters are still used in recent publications. The cache structure consists of a level-1 instruction cache, a level-1 data cache, and a level-2 unified cache. We use 0.10-micron technology for the caches, as defined in Wattch. The memory type is DDR SDRAM with one channel of an eight-byte-wide bus. The DRAM parameters are set according to the datasheet of a 128Mb DDR SDRAM chip from the Micron website. Table 5.2 shows the base configuration for the CPU, caches, DRAM, and disk in our experiments; if not specified otherwise, the parameters are set according to the table.

For the benchmarks, since we focus on the entire memory hierarchy as affected by changes in parameter settings, we use a subset of the SPEC2K benchmarks. As a preliminary experiment in the next chapter shows, all benchmarks can be characterized into two categories. The first is the benchmarks that show memory page-swapping behavior due to insufficient memory; this type of benchmark has both disk reads and writes. The second is the benchmarks that have no page swapping and therefore exhibit only a series of consecutive disk reads. As a result, we choose only bzip2 and ammp with different main memory sizes to represent both categories of behavior. Since we focus on the I/O-intensive phase of the execution, we run only the first 500 million instructions, which are disk-intensive.
Table 5.2: Base Configuration for CPU, caches, memory, and disk

CPU Parameters
  CPU speed: 2GHz
  L1-Icache: 64kB; 64B linesize; 4-way with LRU repl.; lat = 1
  L1-Dcache: 64kB; 64B linesize; 4-way with LRU repl.; lat = 1
  L2-cache: 512kB; 128B linesize; 4-way with LRU repl.; lat = 6
  cache technology sizing: 0.10 micron

Memory Parameters
  memory type: DDR SDRAM
  memory data rate: 400 MHz
  memory channel count: 1
  memory channel width: 8 bytes
  memory rank count: 1
  memory bank count: 4
  memory row count: 4096 (for 128MB)
  memory column count: 1024 (for 128MB)
  DRAM chip density: 128 Mb (for 128MB)
  DRAM chip VDD: 2.6 V
  DRAM chip IDD0: 155 mA
  DRAM chip IDD2P: 5 mA
  DRAM chip IDD2F: 55 mA
  DRAM chip IDD3P: 45 mA
  DRAM chip IDD3N: 60 mA
  DRAM chip IDD4R: 190 mA
  DRAM chip IDD4W: 195 mA
  DRAM chip IDD5: 11 mA
  DRAM chip power per DQ: 6.88 mW

Disk Drive Parameters
  Disk RPM: 5400, 12000, 20000, with 4MB of disk cache
  RAID stripe size: 16KB
  Disk sector size: 512 bytes

The disk parameter files are taken from the DRPM paper. We choose 5400 RPM (or 5k) to represent yesterday's disk, 12000 RPM (or 12k) to represent today's disk, and 20000 RPM (or 20k) to represent tomorrow's disk. For the power consumption, however, after surveying several disk drives currently available on the market, we found that the idle-mode and active-mode power consumption of 22 Watts and 39 Watts specified in the DRPM paper is no longer reasonable. Figure 5.5 plots the RPM versus the idle power and the active power of 47 commercially available disk drives. The figure shows that the idle power values are in the range of 5-16 Watts and the active power values are in the range of 7-23 Watts, while the RPM is between 7,200 and 15,000; no disk drive has a power consumption over 25 Watts. The likely reason is that disk-drive technology has increasingly moved toward low power in the past few years, making the power consumption of real disk drives diverge from the figures reported a few years earlier in the DRPM paper. Therefore, more reasonable idle and active power values had to be chosen.

Figure 5.5: Idle and Active Power of 47 Commercially Available Disk Drives. The figure on the left plots disk idle power against RPM, and the figure on the right plots disk active power against RPM. The data point marked as DRPM is the idle and active power from the DRPM paper; clearly, those values are far higher than the power numbers of commercially available disks. The figure also shows the projected values used in this dissertation.

In this dissertation, since the disk speeds that we study are either obsolete (5,400 RPM) or not commercially available (12,000 RPM and 20,000 RPM), we employ the following technique to project the power consumption. Son and Kandemir [91] suggested a curve-fitting method to estimate the idle and active power of a disk drive for a particular RPM. They collected several pairs of RPM and power consumption from commercially available disk drives and fit them with a linear curve. From the power consumption of the multi-speed disk drive shown in their paper, we use their linear curve fit to project the idle power and active power for 5,400, 12,000, and 20,000 RPM. The equations used in the curve fit are:

Pidle = 0.51 × (RPM / 1000) + 2.5
Pactive = 0.73 × (RPM / 1000) + 2.5
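A small sketch of this projection, using the two linear fits above, reproduces the values adopted in Table 5.3; the function name is ours, not Son and Kandemir's.

```python
# Sketch of the linear RPM-to-power projection described above (Watts).
def projected_power(rpm):
    idle = 0.51 * (rpm / 1000.0) + 2.5
    active = 0.73 * (rpm / 1000.0) + 2.5
    return active, idle

for rpm in (5400, 12000, 20000):
    print(rpm, projected_power(rpm))
# -> (6.442, 5.254), (11.26, 8.62), (17.1, 12.7), matching Table 5.3
```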
As a result, we use the following power consumption values for the 5,400-, 12,000-, and 20,000-RPM disk drives in our experiments, as shown in Table 5.3:

Table 5.3: Disk Active and Idle Power Values
  RPM       Active Power (W)   Idle Power (W)
  5,400     6.442              5.254
  12,000    11.26              8.62
  20,000    17.1               12.7

Figure 5.6: Son and Kandemir's Disk Power Projection for the IBM Ultrastar 36Z15. The figure shows the linear relationship between power and RPM used in their experiments. This relationship provides more representative values for the active and idle power of a currently available disk drive.

For the RAID5 settings, we consider the configurations shown in Figure 5.7 below. We conducted the experiments with 2 configurations of a 4-disk RAID system: (1) 2 controllers with 2 disks each, or "2c x 2ds", and (2) 4 disks connected to a single controller, or "4ds". For the 8-disk RAID system, we have 3 configurations: (1) 2 controllers with 4 disks each, or "2c x 4ds", (2) 4 controllers with 2 disks each, or "4c x 2ds", and (3) one controller with 8 disks, or "8ds".

Figure 5.7: (1) RAID5 configuration for a 4-disk system: (a) 2 controllers with 2 disks each (2c x 2ds), and (b) one controller with 4 disks connected (4ds). (2) RAID5 configuration for an 8-disk system: (a) 2 controllers with 4 disks each (2c x 4ds), (b) 4 controllers with 2 disks each (4c x 2ds), and (c) one controller with 8 disks (8ds). Note that each disk also has its own controller. Due to space limitations, each bus is drawn as a two-headed arrow and the bus names are omitted.

5.8. SYSim and Real Systems Comparison

We compared the execution-time breakdown results obtained from SYSim to results obtained from a set of real systems. We ran SPEC's gzip on three different machines. The real-system configurations were chosen to be comparable to the SYSim configurations; due to limitations in hardware availability and compatibility, we compare SYSim with a set of available machines of comparable, but not identical, configuration. The first system has a 750MHz CPU with 96MB of system memory and runs Fedora Core 3. The execution-time breakdown for the first system is shown in Table 5.4. The second system is the same as the first, but with the system memory set to 128MB. The second system is comparable to a SYSim system configured with a 2GHz CPU and 128MB of system memory: even though the second system has a 750MHz CPU, that CPU is an out-of-order core, which is approximately comparable to a 2GHz in-order core in SYSim. Therefore, SYSim execution-time statistics are also shown at the end of Table 5.5 for comparison.
Table 5.4: Execution Time Breakdown for System #1: 750MHz CPU with 96MB of memory
  Run #            User (s)   Kernel (s)   I/O stall (s)   Total (s)
  1 (cold cache)   93.11      15.06        600.83          709
  2 (warm cache)   92.7       16.3         397.00          506
  3 (warm cache)   92.8       14.3         425.90          533
  4 (warm cache)   93.3       14.3         460.40          568
  5 (warm cache)   93.6       14.3         441.10          549

The first and second real systems are the same machine, differing only by 32MB of memory. However, their total execution times differ by a factor of about 3, and their I/O stall times by as much as a factor of 5. This result shows that the I/O effect exists in real systems. In Table 5.5, the execution-time breakdowns of the real system and SYSim are comparable.

Table 5.5: Execution Time Breakdown for System #2 (750MHz CPU with 128MB of memory) compared with a SYSim system (2GHz CPU with 128MB of memory)
  Run #            User (s)   Kernel (s)   I/O stall (s)   Total (s)
  1 (cold cache)   90.4       6.4          164.20          261
  2 (warm cache)   90.1       6            126.90          223
  3 (warm cache)   89.8       5.7          129.50          225
  4 (warm cache)   90.5       5.5          121.00          217
  5 (warm cache)   90.3       6.1          168.60          265
  SYSim run #1:    User and Kernel (s) = 27.8; I/O stall (s) = 135.2; Total (s) = 162.8

The third system has a 2.4GHz CPU with 1GB of system memory, also running Fedora Core 3. The third system is to be compared with a SYSim system configured with a 2GHz CPU and 512MB of system memory. Though the SYSim system is smaller in both processor frequency and memory size, our experimental results in the next chapter show that for SPEC's gzip, any memory size larger than 160MB makes no difference in total system performance; therefore, memory sizes of 1GB and 512MB perform similarly in this case. The execution-time statistics for both the actual system and the SYSim system are shown in Table 5.6. Notice that the execution-time breakdowns of the real system and the SYSim system are also very similar in this case.

Table 5.6: Execution Time Breakdown for System #3 (2.4GHz CPU with 1GB of memory) compared with a SYSim system (2GHz CPU with 512MB of memory)
  Run #            User (s)   Kernel (s)   I/O stall (s)   Total (s)
  1 (cold cache)   20         0.19         27.8            48
  2 (warm cache)   20         0.19         19.8            40
  3 (warm cache)   20         0.19         17.8            38
  4 (warm cache)   20.1       0.20         18.7            39
  5 (warm cache)   20         0.19         21.0            41.2
  SYSim run #1:    User and Kernel (s) = 27.8; I/O stall (s) = 33.1; Total (s) = 60.9

5.9. Sample Output

Figures 5.8 to 5.11 show the graphs generated from the sample output of SYSim. The system configuration is as described in Table 5.2, with 128MB of memory and a single 12k-RPM disk drive with a disk cache; the system ran gzip to completion. Figure 5.8 shows the sample output of cache accesses and total system CPI. The figure shows 4 graphs: (1) instruction cache accesses per 10 milliseconds, (2) data cache accesses per 10 ms, (3) level-2 unified cache accesses per 10 ms, and (4) the total system CPI per 10 ms along with the accumulated system CPI. In the last graph, a stretch with no data points means that no instructions are executed. Figure 5.9 shows the sample output of cache miss rates and disk accesses. It shows 4 graphs: (1) the miss rate of the instruction cache, (2) the miss rate of the data cache, (3) the miss rate of the level-2 unified cache, and (4) the disk accesses per 10 ms. Figure 5.10 shows the sample output of cache power and disk power dissipation.
The figure shows 4 graphs: (1) the instruction cache power, (2) the data cache power, (3) the level-2 unified cache power, and (4) the disk power per 10 ms. All power dissipation values are in Watts. Finally, Figure 5.11 shows the sample output of DRAM and disk accesses and power dissipation. The figure shows 4 graphs: (1) DRAM power, (2) DRAM accesses per 10 ms, (3) disk power, and (4) disk accesses per 10 ms. A stretch with no data points means that there are no accesses. All graphs share the same x-axis, which is the execution time, and each data point is the average value over a 10-millisecond epoch.

Figure 5.8: Sample Output of Cache Accesses and Total System CPI (gzip; 128MB of memory; run to completion). The figure shows instruction cache, data cache, and level-2 unified cache accesses per 10 ms, and the total system CPI per 10 ms along with the accumulated system CPI. A stretch with no data points means no instructions are executed.

Figure 5.9: Sample Output of Cache Miss Rates and Disk Accesses. The figure shows the miss rates of the instruction cache, the data cache, and the level-2 unified cache, and the disk accesses per 10 ms.

Figure 5.10: Sample Output of Cache Power and Disk Power Dissipation. The figure shows the instruction cache, data cache, level-2 unified cache, and disk power per 10 ms, all in Watts.

Figure 5.11: Sample Output of DRAM and Disk Accesses and Power. The figure shows DRAM power, DRAM accesses per 10 ms, disk power, and disk accesses per 10 ms. A stretch with no data points means there are no accesses.

CHAPTER 6: EXPERIMENTAL RESULTS

This chapter discusses the results from the experiments that used SYSim to investigate system-level behaviors during the I/O-intensive phase of an execution. Most applications spend a significant amount of time, if not most of it, in the I/O-intensive phase due to I/O activities, while the other components in the memory hierarchy generate very little activity during that phase. We conducted the experiments during the I/O-intensive phase, which tends to fall within the first 500 million instructions of the execution. This chapter presents the impact that variations in the system memory size and in the disk design space have on total system performance and power.
The experimental results are reported in terms of both total system performance and power/energy consumption.

6.1. I/O Intensive Phase

As we discussed in the introduction, an application tends to spend a significant amount of time in the I/O-intensive phase. Figure 6.1 shows the interaction of memory-hierarchy components during the entire execution of gzip on our complete-system simulator, SYSim, in a single-user environment. The system configuration used in this example is a 2-GHz Pentium processor, 128MB of main memory, and a 12k-RPM disk drive with a built-in disk cache; the rest of the system configuration is as described in Table 5.2. Figure 6.1 includes graphs comparing cache accesses and system CPI, all cache power, and DRAM and disk accesses/power. The system CPI is shown as both a 10ms-epoch average and a total accumulated average. Accesses and power of the level-1 instruction cache, the level-1 data cache, and the level-2 unified cache are all illustrated. All graphs use the same x-axis, which represents the execution time in seconds; the x-axis does not start at zero since the system boot time is excluded. Figure 6.1 presents the same data as Figures 5.8 to 5.11, the sample output in Chapter 5.

Figure 6.1: The System CPI. The figure shows the system CPI over the entire run of gzip. The system configuration is a 2-GHz processor with 128MB of memory and a 12k-RPM disk. The CPI graph shows two CPI values: the instantaneous CPI for every 10 ms and the accumulated average CPI. A stretch with no data points means no instructions are executed due to I/O latency. The portion of the execution where the accumulated CPI is over 100 is the I/O-intensive phase; the portion where it is below 100 is the computation phase.

The figure demonstrates different phases of execution: the I/O-intensive phase and the computation phase. During the I/O-intensive phase, the operating system reads the program and required data from the disk and allocates memory pages for them. The caches are mostly idle, and the DRAM is written into only sporadically, while the disk is actively accessed. From the figure, the I/O-intensive phase lasts from the start of the execution until the 140th second. Since the application was run in single-user mode, the disk-access delay causes stall time in the execution. One notices long periods of disk activity, when the disk power is at its maximum, followed by periods of disk bursts, e.g., from the 10th to the 50th second and from the 70th to the 110th second. These long disk-activity periods are the result of write bursts caused by write buffering performed by the file-system management.
Since a perfect disk-side write-buffering mechanism is not implemented in this system, the long write bursts have to be processed immediately to prevent data discrepancies in the event of a failure. The latency of this long period depends heavily on the memory page-swapping algorithm used by the operating system: more page swapping means more write data buffered in main memory for the swapped-out pages, and longer write bursts to be scheduled to the disk. In this configuration, due to the large memory footprint of the application (as much as 180MB [13]), the operating system swaps out numerous memory pages. The write data buffered in the file system are periodically sent to the disk. If these write bursts are scheduled before the disk reads in single-user mode, they prolong the execution time, since the reads have to wait for the write bursts to complete.

The second phase is the computation phase. During the computation phase, the caches and DRAM are accessed regularly while the processor reads instructions and data from the caches and executes them. In this phase, the disk is rarely accessed since most of the required code and data are already loaded in memory; in other applications there might be disk accesses due to a larger memory footprint. The figure also shows a number of disk accesses during the computation phase, but these accesses have no performance impact: they are periodic disk write bursts resulting from write buffering in the file system, and such bursts would not lower system performance unless reads are scheduled after them.

The last phase of execution is often an I/O output phase. After the computation, an application would output its results to I/O, which can be the screen or a file. However, since SPEC's version of gzip performs only reads from I/O for input and no file I/O for output, the figure does not demonstrate this I/O output phase.

As Figure 6.1 shows, the CPI value can vary dramatically, i.e., by many orders of magnitude, due to the I/O activities during the long I/O-intensive phase. The CPI finally drops to a single-digit number during the computation phase, as observed in previous studies. If the CPI is calculated over the entire execution, the average CPI is just under 10. However, because many researchers concentrate only on the computation phase, they report the final CPI number to be around 1. This misconception is mainly caused by excluding the I/O-intensive phase, and it leads to an inaccurate average CPI and an incorrect execution-time estimate. Therefore, the I/O-intensive phase is truly important to the entire execution of an application.

Now consider what happens if we increase the memory size to the point where there is no paging in the system. Figure 6.2 shows the results for the same system with 512MB of memory running gzip. The I/O-intensive phase is much shorter, but it remains a significant portion of the entire execution time. Although the I/O-intensive phase is much shorter, write bursts remain, e.g., at the 25th and 40th seconds. Therefore, even without memory paging in the system, the disk request stream is composed of both read and write requests. As a result, the problem remains even in a system equipped with disk prefetching, so simply prefetching data from the disk is not a solution.
Figures 6.3 to 6.11 show the executions during the I/O-intensive phases of all nine benchmarks used in the experiments: ammp, bzip2, gcc, gzip, mcf, mgrid, parser, twolf, and vortex. Again, all graphs are for a configuration with 128MB of memory and a single 12k-RPM disk equipped with a disk cache, and all results are shown for the first 500 million instructions. All benchmarks clearly demonstrate disk-intensive behavior during this part of the execution: the disk is actively accessed and prolongs the execution time due to its long latency. On the other hand, the number of cache and DRAM accesses is minimal compared with the number of cache and DRAM accesses during the computation phase. Though some applications, for example ammp and parser, actively access the caches and DRAM, the numbers of accesses are still not as high as during the computation phase. For mcf, system initialization ends after 10 seconds; during initialization only the disk is accessed, and following the I/O-intensive phase all memory components are actively accessed. Cache and DRAM accesses are scattered in time because of the long I/O latency between the accesses.

Figure 6.2: The interaction in the memory hierarchy in a system with 512MB of memory. The figure shows the interaction between all components in the memory hierarchy, including the level-1 instruction cache, level-1 data cache, level-2 unified cache, DRAM, and a disk drive. Notice that the initialization time drops from 140 seconds in Figure 6.1 to 40 seconds in this figure.

Figure 6.3: I/O intensive phase of ammp.
Figure 6.4: I/O intensive phase of bzip2.

Figure 6.5: I/O intensive phase of gcc.

Figure 6.6: I/O intensive phase of gzip.

Figure 6.7: I/O intensive phase of mcf.

Figure 6.8: I/O intensive phase of mgrid.
Figure 6.9: I/O intensive phase of parser.

Figure 6.10: I/O intensive phase of twolf.

Figure 6.11: I/O intensive phase of vortex.

Therefore, the disk latency has a significant impact on the total execution time for all applications. The disk dissipates close to its maximum power during the I/O-intensive phase, while the other memory components (level-1 caches, level-2 cache, and DRAM) dissipate marginal power. Even in the applications with regular accesses, the other memory components still dissipate little power because the long I/O latency spreads the accesses apart. The maximum instantaneous power dissipated by the level-1 caches is approximately 4 Watts, but the average power over the entire phase is approximately 0.4 to 1.2 Watts. Despite its large size, the level-2 cache dissipates very little power (0.4 Watts) due to the clock-gating style used in the cache: since the level-2 cache is accessed less frequently than the level-1 caches, only the accessed bank of the level-2 cache is active during an access, and the rest are inactive. This may cause a performance penalty, but the mechanism saves significant power. The DRAM dissipates only roughly 4 Watts maximum at any instant and approximately 0.2 to 1 Watt on average. Since the DRAM has low activity during the I/O-intensive phase, the DRAM power and energy depend mainly on the DRAM configuration: the power in the DRAM system is proportional to the number of DRAM chips.
In our experiments, the DRAM is set to only one rank with 8 chips, and its capacity is varied by changing the internal DRAM chip configuration. With no variation in the number of chips, the power dissipation of the DRAM is not drastically affected.

6.2. Memory Size and I/O Behaviors

It is widely accepted that the system memory capacity has a great impact on overall system performance. Different benchmarks require different amounts of memory, and different phases of the execution have different memory footprints. For our experiments, we executed nine different SPEC2000 benchmarks with different sizes of main memory and observed the total system performance in terms of CPI and the number of disk requests generated. The results are shown in Figure 6.12. The minimum and maximum memory sizes executed for each benchmark are as labeled. The minimum memory size for each benchmark is the smallest memory with which the system can run without a "not enough memory" error. The maximum memory size for each benchmark is the size at which the results no longer differ from those of a system with smaller memory. Note that the y-axes in both graphs are in log scale.

We observed that all of the SPEC2000 benchmarks used can be characterized into two categories. The first is the benchmarks that show memory page-swapping behavior because the memory cannot hold the entire memory footprint in the I/O-intensive phase, i.e., ammp, bzip2, gzip, and mgrid. This type of benchmark has both disk reads and writes, and the numbers of disk reads and writes depend heavily on the size of the memory. The second category is the benchmarks that fit into a small memory system, so the systems have no page swapping and exhibit only series of sequential disk reads; gcc, mcf, parser, and vortex fall into this category. This type of benchmark is not sensitive to the size of the memory as long as the system provides enough memory to run without crashing. One notices that in the latter type of application, for example gcc, a memory size of 16MB can hold both the operating system and the application.

Figure 6.12: Memory Size Exploration. Changing the system memory capacity has an exponential impact on the overall system performance (CPI); however, if the memory is big enough to hold the memory footprint of the benchmark, the memory size has no effect. The figure shows the changes in CPI as well as the changes in the number of disk requests over 9 SPEC2000 benchmarks. We run ammp, gcc, parser, twolf, and vortex over memory sizes of 16MB, 32MB, 64MB, 96MB, and 128MB; bzip2 and gzip over 96MB, 112MB, 128MB, 144MB, 160MB, 172MB, and 192MB; mcf over 80MB, 96MB, and 128MB; and mgrid over 32MB, 64MB, 96MB, and 128MB. The smallest memory size for each benchmark is the smallest memory with which the system can run without a "not enough memory" error.
Due to the paging behavior in those systems, in the subsequent experiments, we choose only bzip2 and ammp with different memory size to represent both categories of behaviors. Figure 6.13 to 6.15 show the disk reads and writes over time in the first category of applications for ammp, mgrid, gzip, and bzip2. Each graph in the figure represents the data on the system with different DRAM capacity: less DRAM capacity on top and more at the bottom. There are both disk reads and writes in the request streams. The less DRAM capacity provided in the system, the more requests to disk. The majority of the increased disk requests are disk writes. On the other hand, Figure 6.16 shows the disk reads and writes over time in the second category applications for parser and gcc. Obviously, the applications in the first category exhibit increasing number of writes while the DRAM capacity decreases because the operating system swaps many pages out to prepare for new pages reading in. The applications in the second category applications do not write to disk, despite the size of the DRAM. 178 Figure 6.13: ammp and mgrid Disk Activities. 179 Figure 6.14: gzip Disk Activities. 180 Figure 6.15: bzip2 Disk Activities. 181 5 50 500 0 1e+06 2e+06 LBN Disk Read Disk Requests for parser memory: 16MB, 32MB, 64MB, 96MB 5 50 500 0 1e+06 2e+06 LBN 5 50 500 0 1e+06 2e+06 LBN 5 50 500 0 1e+06 2e+06 LBN 5 50 500 0 1e+06 2e+06 LBN Disk Read Disk Requests for gcc memory: 16MB, 32MB, 64MB, 96MB 5 50 500 0 1e+06 2e+06 LBN 5 50 500 0 1e+06 2e+06 LBN 5 50 500 time(s) 0 1e+06 2e+06 LBN Figure 6.16: parser and gcc Disk Activities. 182 6.3. Power/Energy Consumption of the Disk due to Different Memory Size In the previous section, we learned that the memory size has a substantial impact on the overall system performance. However, not only is overall performance important, but the power dissipation and energy consumption are also main concerns nowadays. We setup an experiment for the power dissipation and energy consumption. We varied the size of the memory and analyzed the power dissipation and energy consumption of the DRAM and the disk. The experiment focuses on only single disk systems. The disk configuration is a 12k- RPM disk with 4MB disk cache. Figure 6.17 shows the power dissipation and the energy consumption of such systems. Compared to the disk, DRAM dissipate a small amount of power. Indispensably, the system requires enough DRAM capacity to hold the application footprint, so the disk is accessed less and dissipates much less power. On the other hand, the energy consumption is more important than the power dissipation in our experiments. For small size of the memory, the disk power dissipation remains relatively constant compared with the system with large memory, while the disk energy consumption is rapidly increasing. The reason is, though the power dissipation is limited by the maximum, the execution time is prolonged due to more pages swapping to the disk. The prolonged execution time causes the energy consumption to increase rapidly. Additionally, DRAM energy consumption becomes significant compared to the disk energy. The reason is the DRAM capacity is large enough to contain all needed memory pages, so the number of requests to the disk is reduced. Therefore, the disk consumes its minimum energy while the DRAM system is busiest. The ratio of the energy consumption of the disk over the energy of 183 the DRAM reduces from 100:1 to 10:1 when enough DRAM capacity is added in the system. 
However, the ratio may reduce even more if more DRAM is added excessively and costs only more energy without any performance benefits. Next, we conduct an experiment to characterize the power dissipation and the energy consumption of different RPM disks. Figure 6.18 shows the DRAM & Disk Power Dissipation and Energy Consumption with different RPM disks. The top graph shows the power dissipation of the DRAM and the disk in a single disk system. The bottom graph shows the energy consumption of the DRAM and the disk. We varied the memory size and study its impact on energy consumption/power dissipation. All results are show for bzip2. The DRAM power and energy for each memory size are always the same for all disk RPMs. Interestingly, the power dissipation of the disk are varied with the disk RPM. On the other hand, the energy consumption is more intriguing. Lower RPM disk does not always mean lower energy. Unlike the power dissipation, the lowest RPM actually consumes the most energy when the memory size is small, and consumes the least energy when the memory size is large enough to hold the benchmark data. The reason is when the memory size is small, the disk with lower RPM spend more time to execute the same set of instructions since there are more disk requests to the disk due to memory page swapping. The disks with 12k-RPM and 20k-RPM consume relatively the same energy, but the CPI of the latter is much better than the former. All disks consume relatively the same energy at 144MB, but the different system performances vary as much as a factor of 2.5. 184 96 112 128 144 160 176 192 DRAM capacity(MB) 0 2 4 6 8 10 12 Power (W ) Disk Power DRAM Power DRAM & Disk Power and Energy Consumption benchmark: bzip2, DISK:12k RPM with cache 6 4 8 5 . 7 9 3 2 3 9 . 3 4 1 3 6 6 . 7 0 8 4 4 . 8 2 3 1 6 . 9 8 3 1 7 . 0 3 3 1 6 . 7 2 0 1000 2000 3000 4000 5000 6000 7000 Energ y (J) Disk Energy DRAM Energy 1 1 5 7 6 . 2 3 1 . 9 2 8 . 4 2 8 . 7 2 8 . 7 2 8 . 6 6 4 6 7 . 9 3 2 2 4 . 6 1 3 5 3 . 6 8 3 4 . 0 3 1 2 . 5 3 1 2 . 5 3 1 2 . 2 96 112 128 144 160 176 192 DRAM capacity(MB) 10 100 1000 En ergy (J ) DISK DRAM DRAM & Disk Energy Consumption benchmark: bzip2, DISK:12k RPM with cache Figure 6.17: Power Dissipation and Energy Consumption of DRAM and a Disk. The configuration of the disk system is one 12k-RPM disk drive with 4MB of disk cache. We varied the memory size, but not the other memory parameters. The top graph shows both the power dissipation (dash line with left y-axis) and the energy consumption (solid line with right y-axis). The bottom graph shows only the energy consumption in log scale. 185 96 112 128 144 160 176 192 DRAM capacity(MB) 0 5 10 15 20 Power (W ) 20k-RPM DISK 12k-RPM DISK 5k-RPM DISK DRAM DRAM & Disk Power Consumption benchmark: bzip2, single disk with cache 10 100 1000 10000 CP I 5k-RPM CPI 12k-RPM CPI 20-k RPM CPI 96 112 128 144 160 176 192 Memory(MB) 0 2000 4000 6000 8000 10000 Energy (J) 20k-RPM Disk 12k-RPM Disk 5k-RPM Disk DRAM Energy DRAM & Disk Energy Consumption and CPI config: bzip2; a single disk with cache Figure 6.18: DRAM & Disk Power Dissipation and Energy Consumption. The top graph shows the power dissipation of the DRAM and the Disk in a single system. The bottom graph shows the energy consumption of the DRAM and the Disk and the total system CPI. We varied the memory size and ran bzip2. The DRAM power and energy for each memory size are always the same for all disk RPMs. 
Interestingly, the power dissipation of the disk varies with the RPM, and the lowest RPM means the lowest power. The energy consumption is more intriguing: the lowest-RPM disk actually consumes the most energy when the memory size is small, and the least energy when the memory size is large enough to hold the benchmark memory footprint. All disks consume roughly the same energy at 144MB, with different total system CPIs.

Lastly, Figure 6.19 shows the trade-offs between the energy consumption and the system CPI. Each line represents a configuration with a different-RPM disk. There are 7 data points on each line, representing memory sizes from 96MB to 192MB, corresponding to the memory sizes in the previous figure; on each line, the top-right data point is for 96MB. The inset graph shows the data points of the configurations with 144, 160, 176, and 192MB of memory. All lines move toward the origin as the memory increases. Among the different RPMs, the higher-RPM disks move toward the system optimum at a higher rate. Hence performance is related to DRAM capacity more strongly than to disk RPM. However, the lines stop at a certain CPI even as the memory keeps increasing. This suggests that, with a large system memory, a higher disk RPM only increases the system energy without any performance impact; using a fast-RPM disk is a bad design point in this case. As a result, the slowest-RPM disk is optimal.

Figure 6.19: Energy Consumption vs. CPI Trade-offs. The graph shows the trade-off plot combining the total system energy consumption and the total system CPI (memory sizes 96MB to 192MB, bzip2, a single disk with cache). Each line represents a configuration with a different-RPM disk (5k, 12k, 20k). There are 7 data points on each line, representing memory sizes from 96MB to 192MB, with the top-right data point for 96MB. The largest DRAM capacity is to the left; more DRAM capacity translates into improvements in both energy and performance. The inset graph shows the Pareto-optimal points.

6.4. Effects of Disk Physical Technology Improvement and Enhancements

6.4.1. Rotational Speed (RPM)

We conducted an experiment to explore the effects of the disk RPM on the overall system performance. For each benchmark, we used a single-disk system and varied the RPM and the disk cache. The disk cache setting is either 4MB with prefetching or no cache at all. The memory size is set to 128MB. Note that only bzip2 and gzip exhibit memory page swapping at this memory size. Figure 6.20 shows the CPI due to the RPM and disk cache variation; the CPI axis is in log scale.

First, we consider the benchmarks without page swapping, which issue only read requests to the disk system. Without disk caching and prefetching, the disk RPM benefit is very obvious: the faster, the better. However, once the disk RPM is fast enough, i.e. from 12k RPM to 20k RPM, the improvement from a faster disk tapers off; at this point, the latency from the other parts of the disk overshadows the benefit of a higher RPM. With disk caching and prefetching, there is no significant difference in CPI for the benchmarks without page swapping, because disk caching and prefetching can hide the latency of the disk perfectly.
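To see why the RPM benefit tapers, the following sketch decomposes a read's service time into seek, rotation, and transfer; the average rotational latency is half a revolution, 0.5*(60/RPM). The seek and transfer values are illustrative assumptions, not the parameters of the simulated drives.

# Sketch: average read service time as seek + rotation + transfer.
# The 0.5*(60/RPM) rotational-latency term is the standard average;
# the seek and transfer times are illustrative placeholders.

def avg_service_ms(rpm, seek_ms=4.0, transfer_ms=0.2):
    rotation_ms = 0.5 * (60.0 / rpm) * 1000.0  # half a revolution on average
    return seek_ms + rotation_ms + transfer_ms

for rpm in (5400, 12000, 20000):
    print(f"{rpm:>6} RPM: {avg_service_ms(rpm):5.2f} ms")

Going from 5.4k to 12k RPM removes roughly 3 ms of rotation per access, but going from 12k to 20k RPM removes only about 1 ms, while the seek term is untouched, which is consistent with the tapering seen in Figure 6.20.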
Second, for the benchmarks with both disk reads and disk writes, as exhibited by bzip2 and gzip, the disk RPM does matter both with and without a disk cache. The locality of the accesses decreases because the reads and writes access different areas of the disk. Even though either the reads or the writes may have high locality, disk caching and prefetching cannot hide all of their latency. The reason is that the disk maintains non-volatile storage semantics, so reads issued after any writes have to wait until those writes are processed.

Finally, to compare with our base case of a 12k-RPM single disk in Figure 6.4, Figure 6.21 and Figure 6.22 show the interaction of the memory hierarchy components in systems with a 5k-RPM single disk and with a 20k-RPM single disk, respectively. The 12k-RPM system spends approximately 140 seconds, while the 5k-RPM system spends over 250 seconds on the same set of instructions. Using a 20k-RPM disk improves the execution time to 110 seconds. The improvement in total execution time comes mainly from the benefit to the write bursts.

Figure 6.20: CPI due to the disk RPM and disk cache. The memory size is 128MB, and the disk system is a single-disk system. We vary the RPM and the existence of the disk cache; the disk cache configuration is either a 4MB disk cache with prefetching or no cache at all. For each benchmark, there are 6 bars, which represent the CPI of (1) a 5k-RPM disk with no cache, (2) a 5k-RPM disk with cache, (3) a 12k-RPM disk with no cache, (4) a 12k-RPM disk with cache, (5) a 20k-RPM disk with no cache, and (6) a 20k-RPM disk with cache. Note that at a memory size of 128MB, only bzip2 and gzip exhibit memory page swapping.

Figure 6.21: The interaction in the memory hierarchy for a system with a 5k-RPM disk drive (bzip2, 128MB of memory). The panels show cache accesses and power, DRAM and disk accesses and power, and the system CPI over time.

Figure 6.22: The interaction in the memory hierarchy for a system with a 20k-RPM disk drive.
The write bursts benefit from a higher-RPM disk directly, because a faster disk responds to writes faster; as a result, the write bursts are processed more quickly and the write-burst processing period shortens. The benefit to the reads, however, is not obvious. The reason is that the disk subsystem is already equipped with a disk cache, which absorbs the reads on cache hits without involving the disk's mechanical parts.

6.4.2. Prefetching

As shown in Figure 6.20, disk caching and prefetching can gain significant performance, both for the benchmarks with only reads (no page swapping) and for those with reads and writes (with page swapping). Furthermore, disk caching and prefetching can completely hide the rotational latency in the benchmarks with only disk reads. In this section, we conducted an experiment to identify the importance of disk caching and of prefetching separately. In the experiment, the system configuration is set to 112MB of memory running bzip2.

Figure 6.23 shows the CPI and the average disk response time for the experiment. The three bars in each group represent (1) a single-disk system, (2) a 4-disk RAID5 system, and (3) an 8-disk RAID5 system. The upper graph shows the CPI for each configuration, and the lower graph shows the average response time of the disk requests. Note that the CPI axis is in linear scale, but the disk average response time axis is in log scale. The height of each bar in the average response time graph is the absolute value, i.e. the response time for each request type is read off directly where the bar ends.

In the previous section, Figure 6.20 showed that, in the disk-read-dominated benchmarks, disk prefetching is more important than increasing the disk RPM. That is, rotational latency and bandwidth can be overcome by simple prefetching in an application with only disk reads. From Figure 6.23, disk caching alone has only marginal effects on both the CPI and the average disk response time. However, disk caching with prefetching has significant benefits: up to a factor of 4 in the case of 5400 RPM with 8 RAID disks, and a factor of 2 on average over all configurations.

Figure 6.23: The Effects of Disk Prefetching. The experiment tries to isolate the effects of prefetching and caching in the disk cache. The configuration is 112MB of memory running bzip2. The 3 bars in each group represent a single-disk system, a 4-disk RAID5 system, and an 8-disk RAID5 system. The upper graph shows the CPI of each configuration (no cache/no prefetch, cache without prefetch, cache with prefetch, for 5k, 12k, and 20k RPM), and the lower graph shows the average response time of the disk requests (read, write, and overall). Note that the CPI axis is in linear scale, but the disk average response time axis is in log scale. The height of each bar in the average response time graph is the absolute value.

Therefore, from this point on, we study only a disk cache with both the caching and the prefetching mechanisms implemented, and we refer to a disk cache that caches and prefetches simply as a "Disk Cache," as disk drive manufacturers do. A toy model of the read-absorbing behavior of such a cache is sketched below.
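The following toy model (not the SYSim disk-cache implementation) illustrates the read-absorbing behavior: on a miss, a single segment is filled with the requested sector plus the sectors that follow, so a sequential read burst touches the media only once per segment. The prefetch depth and the request stream are assumptions chosen for illustration.

# Sketch: why simple sequential prefetching absorbs read bursts.
# A toy single-segment disk cache that, on a miss, fetches the requested
# sector plus the next PREFETCH-1 sectors.

PREFETCH = 32  # sectors fetched per miss (one segment); illustrative depth

class ToyDiskCache:
    def __init__(self):
        self.segment = set()  # LBNs currently cached

    def read(self, lbn):
        if lbn in self.segment:
            return "hit"
        # Miss: go to the media, then prefetch the following sectors.
        self.segment = set(range(lbn, lbn + PREFETCH))
        return "miss"

cache = ToyDiskCache()
stream = list(range(1000, 1128))  # a sequential 128-sector read burst
misses = sum(cache.read(lbn) == "miss" for lbn in stream)
print(f"{misses} media accesses for {len(stream)} sequential reads")
# -> 4 media accesses; the seek and rotation of the other 124 reads are hidden.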
Since we focus on RAID disk systems in the next section, we note here another interesting behavior demonstrated in Figure 6.23: the RAID disk system tends to have a longer response time for disk writes due to parity calculations. This behavior will be discussed later. Despite the longer write response time, which lengthens the overall average response time, the overall performance is significantly improved.

To compare the behavior of the entire memory hierarchy in a system with disk prefetching, Figure 6.24 shows the interaction of the entire memory hierarchy in a system without a disk cache, Figure 6.25 shows the interaction in a system with a disk cache but no prefetching, and Figure 6.26 illustrates the interaction with both disk caching and prefetching enabled. All system configurations have 112MB of memory running bzip2 with a single disk drive. All three figures show the power and accesses of the caches, the DRAM, and the disk, as well as the total system CPI. The system without a disk cache and the system with a disk cache but no prefetching perform essentially the same: both complete the 500 million instructions in about 500 seconds. With both disk caching and prefetching, the system reduces the execution time to about 300 seconds. The contribution of disk caching and prefetching comes mainly from its read-absorbing behavior: the read bursts can be processed faster by the disk cache than by the mechanical parts of the disk, which shortens the processing time of the read bursts. The write bursts, however, still take as much time as in a system without a disk cache, because they overwhelm the capacity of the disk cache.

Figure 6.24: The interaction of the memory components in a system without a disk cache (bzip2, 112MB of memory, 12k-RPM single disk).

Figure 6.25: The interaction of the memory components in a system with a disk cache but no prefetching (bzip2, 112MB of memory, 12k-RPM single disk).
Figure 6.26: The interaction of the memory components in a system with disk caching and prefetching (bzip2, 112MB of memory, 12k-RPM single disk).

6.4.3. Parallel I/O: RAID5

In this section, we discuss the RAID disk system, focusing only on RAID5, the most popular RAID organization. We conducted experiments with a single-disk system, a 4-disk RAID system with 2 different organizations, and an 8-disk RAID system with 3 different organizations. Figure 6.27 shows the CPI and the average disk response time for all RAID5 configurations described in the previous chapter. The system configuration used in this experiment is 32MB of memory running ammp, with various RAID configurations built from 12k-RPM disks with the disk cache enabled. As illustrated in Figure 5.7, the label "xc x yds" refers to the RAID5 configuration with x controllers, each of which is connected to y disks. For comparison, the figure also shows the CPI and average response time of systems with 5400 RPM and 20k RPM disks: the first group is a single 5400 RPM disk with and without a disk cache, and the last two groups use 20k RPM disk(s). Unless a configuration is labeled "no$", each disk in the configuration has a 4MB disk cache. As before, the upper graph shows the CPI, and the lower graph shows the average disk response time.

Figure 6.27: RAID5 configuration. The figure shows the effects of the RAID5 configuration on the CPI and the average disk response time. The system configuration is 32MB of memory running ammp with 12k-RPM disks with disk cache; the label xc x yds means x controllers, each connected to y disks. The figure also shows results for different-RPM disks for comparison.

Consider first the 12k-RPM disk systems with the same number of disks. No matter how the disks are organized, the CPI and the average response time remain the same for a given number of disks. To explore the effects of increasing parallelism by increasing the number of disks, we move from 1 disk to 4 disks and from 4 disks to 8 disks. However, unlike Figure 6.23, increasing the number of disks from 1 to 4 seems not to have any
obvious benefit, compared to increasing from 4 to 8, in the 12k-RPM disk systems. A possible explanation is that the complexity of scheduling and parity calculation overshadows the benefit of having multiple disks in the 4-disk case for this application. Moving to the 20k-RPM disk systems, the benefit of a RAID system is even less obvious, even with 8 disks. Additionally, the performance of any 20k-RPM system is comparable to that of an 8-disk RAID system with 12k-RPM disks. Therefore, RAID in a high-RPM disk system may have only marginal benefits over a single high-RPM disk, and care should be taken in choosing the number of RAID disks in a uniprocessor system.

Regarding the average response time, even though the write response time in a RAID system is much higher than in a single-disk system, this does not translate directly into worse overall performance. The write response time in a RAID system is higher due to parity calculations, especially for benchmarks with small writes; the cost of a write in a RAID system is significant [80]. If the cost of a write is reduced or eliminated, the overall system performance will improve.

Figure 6.28 shows the CPI and the average response time of disk systems with different numbers of RAID disks and different RPMs. The system configuration is 112MB of memory running bzip2. The 3 bars in each group represent a single-disk system, a 4-disk system, and an 8-disk system. Note that the CPI is in linear scale and the average response time is in log scale. Again, as the RPM increases, the benefit of RAID diminishes. As mentioned before, the benefit of 4 RAID disks over a single disk is not very obvious in a fast disk system; in a slow disk system (i.e., 5400 RPM), RAID has more tangible benefits over a non-RAID system. Nevertheless, the combination of RAID, disk cache, and fast disks can improve the overall performance by up to a factor of 10.

Figure 6.28: Disk RAID5 Configuration with different RPMs. The figure shows the CPI and the average response time of the disk systems. The system configuration is 112MB of memory running bzip2. The 3 bars in each group represent a single-disk system, a 4-disk system, and an 8-disk system. Note that the CPI is in linear scale and the average response time is in log scale.

Figure 6.29 shows the effects of RAID disk systems with different configurations on the total system CPI for the benchmarks with only disk reads. The benchmarks are ammp and gcc with 128MB of memory, and every disk in the system is equipped with a 4MB disk cache. The RAID disk system has only minimal benefit over a single-disk system because the disk requests are sequential, so the caching and prefetching mechanism in the disk can hide most of the latency.
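Returning to the write cost noted above: the longer RAID5 write response times stem from the standard small-write update, in which the controller reads the old data and the old parity, recomputes the parity by XOR, and writes both back, i.e. four disk I/Os per logical write. The sketch below shows the usual update rule with illustrative byte strings; it is not the SYSim RAID model.

# Sketch: the RAID5 small-write penalty behind the longer write response
# times in Figures 6.23, 6.27, and 6.28. new_parity = old_parity XOR
# old_data XOR new_data, and the update costs four disk I/Os.

def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    new_parity = bytes(p ^ od ^ nd
                       for p, od, nd in zip(old_parity, old_data, new_data))
    io_ops = ["read old data", "read old parity",
              "write new data", "write new parity"]
    return new_parity, io_ops

new_parity, ops = raid5_small_write(b"\x0f\x0f", b"\xff\x00", b"\xf0\xf0")
print(new_parity.hex(), len(ops), "disk I/Os per small write")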
The interaction between the components of the memory hierarchy in the RAID disk systems is shown in Figure 6.30 and Figure 6.31; these two figures are to be compared against the system with a single 12k-RPM disk in Figure 6.26. Figure 6.30 shows the interaction over time for a system with a 4-disk RAID system, and Figure 6.31 shows the interaction for a system with an 8-disk RAID system. All system configurations are set to 112MB of memory running bzip2. Moving from a single-disk system to a 4-disk system, the execution time improves from approximately 325 seconds to approximately 275 seconds. Even better, an 8-disk system improves the execution time to only about 190 seconds. Both the read-burst and write-burst processing times improve, since the bursts can be serviced faster. Even though the average response time for a single write is higher, the total performance, including the benefit from parallelism within the write bursts, is also improved. However, the power consumption also increases proportionally with the number of disks.

Figure 6.29: RAID5 with no writes. The figure shows the effects of different RAID5 systems (1ds, 2gx2ds, 4ds, 2gx4ds, 4gx2ds, 8ds) on benchmarks with only disk reads. The benchmarks are ammp and gcc with 128MB of memory. The RAID disk system has only minimal benefits over a single-disk system.

Figure 6.30: The interaction between the memory components in the hierarchy of a system with a 4-disk RAID system.

Figure 6.31: The interaction between the memory components in the hierarchy of a system with an 8-disk RAID system.

6.4.4. Size of the Disk Cache

We conducted a set of experiments to identify the effects of the disk cache size on the overall system performance. The size of the disk cache relates directly to the cost of the disk drive, since the cost per GB of DRAM is currently about 50 times higher than that of disk storage [78]. A disk cache is composed of a set of multiple segments, and each segment can vary in size in units of a sector (512 bytes in general); the capacity arithmetic is sketched below.
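For reference, the capacity implied by a given organization is simply the number of segments times the segment size times the sector size. The small sketch below (with a hypothetical helper named cache_bytes) lists the configurations swept in Figures 6.32 and 6.33, assuming the common 512-byte sector.

# Sketch: disk-cache capacity as a function of its organization.

SECTOR_BYTES = 512

def cache_bytes(num_segments, sectors_per_segment):
    return num_segments * sectors_per_segment * SECTOR_BYTES

for nseg, ssize in [(1, 2), (1, 4), (1, 256), (1, 512), (16, 512), (24, 512)]:
    kb = cache_bytes(nseg, ssize) / 1024
    print(f"{nseg:>2} x {ssize:>3} sectors = {kb:8.1f} KB")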
Compared with a processor cache, a segment can be seen as a cache line, and the segment size is therefore analogous to the cache line size. As a result, to vary the disk cache size, we vary both the number of segments and the segment size.

Figure 6.32 shows the performance impact of varying the segment size. The figure shows the impact of different segment sizes with the same number of segments in the disk cache; in this case, there is only one segment. The system configuration is 128MB of memory with a 12k-RPM disk. There are 7 bars for each benchmark: (1) no cache, (2) 1 segment of 2 sectors, (3) 1 segment of 4 sectors, (4) 1 segment of 8 sectors, (5) 1 segment of 32 sectors, (6) 1 segment of 128 sectors, and (7) 1 segment of 256 sectors. The top graph shows the CPI, and the bottom graph shows the average response time of the disk requests.

Figure 6.32: The Effects of Disk Cache Size by Varying the Segment Size. The figure shows the effects of different segment sizes with the same number of segments (one) in the disk cache. The system configuration is 128MB of memory with a 12k-RPM disk. The 7 bars for each benchmark are (1) no cache, (2) 1 segment of 2 sectors, (3) 1 segment of 4 sectors, (4) 1 segment of 8 sectors, (5) 1 segment of 32 sectors, (6) 1 segment of 128 sectors, and (7) 1 segment of 256 sectors. Note that the CPI graph is in linear scale, and the average response time graph is in log scale.

Figure 6.33 shows the effects of disk cache size when varying the number of segments. The figure shows the effects of a different number of segments with the same segment size, 512 sectors, in the disk cache. The system configuration is 128MB of memory with a 12k-RPM disk. There are 5 bars for each benchmark: (1) no cache, (2) 1 segment of 512 sectors, (3) 2 segments of 512 sectors, (4) 16 segments of 512 sectors, and (5) 24 segments of 512 sectors. Note that the CPI graph is in linear scale, and the average response time graph is in log scale.

The effect of the disk cache size is limited to the presence of a cache of some reasonable size. That is, increasing the size of the disk cache, either through the segment size or through the number of segments, does not improve performance further once the disk cache is already large enough, as described in [75] and [78]. Those studies concluded that, with a reasonably sized file-system buffer cache controlled by the operating system, there is very little performance benefit in a large built-in disk cache. Our study agrees with those observations, but from the system-level point of view. Another interesting behavior is that, for the benchmarks with both disk reads and writes, i.e. bzip2 and gzip, the average response times stay the same while the overall CPIs improve due to the presence of the disk cache. This behavior will be explained in the next section.

Figure 6.34 shows the trade-offs between memory sizes and disk cache sizes.
The top graph shows the trade-offs between the memory sizes and the disk cache sizes under the assumption that the total capacity of memory plus disk cache remains the same. We also varied the number of disks in the RAID5 disk systems. In this experiment, the total capacity used in an ammp execution is 32MB. From left to right in each RAID configuration, the bars represent the CPI of (1) 32MB of memory with no disk cache, (2) 28MB of memory with 4MB of disk cache in total, (3) 24MB of memory with 8MB of disk cache in total, (4) 20MB of memory with 12MB of disk cache in total, (5) 16MB of memory with 16MB of disk cache in total, and (6) 8MB of memory with 24MB of disk cache in total. Each bar overlays 3 bars, for the 5k-RPM, 12k-RPM, and 20k-RPM disk systems. Note that the CPI is in log scale.

The bottom graph shows the effects of the disk cache size on RAID5 disk systems. The system is held at 32MB of memory running ammp while we varied the size of the disk cache on each disk drive. The bars in each RAID configuration represent (1) 32MB of memory with no disk cache, (2) 32MB of memory with 4MB of disk cache per disk, (3) 32MB of memory with 8MB of disk cache per disk, and (4) 32MB of memory with 256MB of disk cache per disk.

Figure 6.33: The Effects of Disk Cache Size by Varying the Number of Segments. The figure shows the effects of a different number of segments with the same segment size in the disk cache. The system configuration is 128MB of memory with a 12k-RPM disk. The 5 bars for each benchmark are (1) no cache, (2) 1 segment of 512 sectors, (3) 2 segments of 512 sectors, (4) 16 segments of 512 sectors, and (5) 24 segments of 512 sectors. Note that the CPI graph is in linear scale, and the average response time graph is in log scale.

Figure 6.34: The Trade-offs between Memory Sizes and Disk Cache Sizes. The top graph shows the trade-off between the memory size and the disk cache size under the assumption that the total capacity of the memory plus the disk cache remains the same; in this case the total is 32MB on an ammp execution. The bottom graph shows the effects of the disk cache size on a RAID5 disk system; the system also has 32MB of memory running ammp.

Unlike in Figure 6.33, the experimental results in Figure 6.34 suggest that increasing the disk cache size does improve performance, which is most easily seen in the top graph; increasing the disk cache size is not a step function in this case.
The reason is that the application in Figure 6.33 generates only minor write traffic to the disk, whereas the write traffic in Figure 6.34 increases as the memory size is reduced; increasing the disk cache is therefore beneficial for the write traffic. However, this holds only until the system reaches the memory limit at which the DRAM can no longer contain a significant portion of the memory footprint; at that point, the CPI increases rapidly. The bottom graph also suggests that, at a particular memory size, only a relatively small amount of disk cache is needed to improve the performance greatly; increasing the disk cache further does not improve performance. These observations also hold with multiple disks in the RAID disk systems.

6.4.5. Disk Cache Organization

In the previous section, we concluded that the size of the disk cache, whether varied by segment size or by the number of segments, does not have a significant effect on system performance: only a small disk cache is enough for disk caching and prefetching. This section explores the disk cache organization to see whether it has any more of an effect than the cache size. To answer this question, we conducted an experiment to identify the effects of different disk cache organizations. From Figure 6.32 and Figure 6.33, we learned that even a small disk cache, i.e. 1 segment of 4 sectors, can improve the performance. Therefore, we examined cache organizations around this 4-sector case. Figure 6.35 shows the effects of the disk cache organization around this case. The figure shows both the CPI (top graph) and the average disk response time (bottom graph). The 7 bars for each benchmark represent the disk system with (1) no cache, (2) 1 segment of 2 sectors, (3) 2 segments of 1 sector, (4) 1 segment of 4 sectors, (5) 2 segments of 2 sectors, (6) 4 segments of 1 sector, and (7) 2 segments of 4 sectors.

Figure 6.35: Disk Cache Organization. The figure shows both the CPI (top graph) and the average disk response time (bottom graph). The 7 bars in each group represent (1) no cache, (2) 1 segment of 2 sectors, (3) 2 segments of 1 sector, (4) 1 segment of 4 sectors, (5) 2 segments of 2 sectors, (6) 4 segments of 1 sector, and (7) 2 segments of 4 sectors.

From the experimental results, only the 1-segment-of-4-sectors and 2-segments-of-4-sectors cases improved the performance, and both gained the same improvement; the other cases showed no significant effect on system performance. Therefore, disk caches of the same total size but different organizations, i.e. 1 segment of 4 sectors, 2 segments of 2 sectors, and 4 segments of 1 sector, perform differently. The configurations exhibiting benefits are exactly those with a segment size of 4 sectors. In conclusion, the only cache organization parameter that matters is the size of the segment.
6.4.6. Bus Transmission Latency

As suggested in [35], the bus transmission latency in the DRAM system has a significant effect on the overall system performance. We conducted an experiment to identify the effects of the bus transmission latency in the disk system. Figure 6.36 shows the bus latency exploration: the graph shows the effects of the bus latency variation on the total system CPI. The system is configured with 32MB of memory running ammp with a 12k-RPM RAID disk system without a disk cache. The groups marked "R=0" represent the read bus latency (data from the disk) taking no time while the write bus latency (data to the disk) takes a varied amount of time to transfer one sector of data; the groups marked "W=0" represent the write bus latency taking no time while the read bus latency is varied. The groups marked "1 disk", "4 disks", and "8 disks" use a single-disk system, a 4-disk RAID system, and an 8-disk RAID system, respectively. The bus latency varies from 1 millisecond to 0.64 microseconds for the single-disk system, and from 1 millisecond to 1.28 microseconds for the 4-disk and 8-disk systems. These bus latency values are set according to the range of latencies of the latest disk interfaces shown in Table 4.2.

Figure 6.36: Bus Latency Exploration. The graph shows the effects of the bus latency variation on the total system CPI. The system configuration is 32MB of memory running ammp with a 12k-RPM RAID disk system without disk cache. The single-disk bus latency is varied over 1000us, 512us, 10.24us, 5.12us, 2.56us, 1.28us, and 0.64us; the 4-disk and 8-disk bus latency is varied over 1000us, 512us, 10.24us, 5.12us, 2.56us, and 1.28us.

Even though the bus latency has a significant effect on overall performance in the case of the DRAM system, the bus latency variation has no significant effect on the total system CPI in the disk system. The reason is that the bus latency in the DRAM system is comparable to the DRAM latency, whereas the bus latency in the disk system, on the order of microseconds, is insignificant compared to the disk latency, which is on the order of milliseconds.
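A quick calculation makes the scale difference plain. The sketch below compares the per-sector bus latencies swept in Figure 6.36 against a mechanical service time of a few milliseconds; the 6 ms figure is an illustrative stand-in for a 12k-RPM access, not a measured value.

# Sketch: per-sector bus latency as a fraction of an illustrative
# mechanical service time. Only the microsecond-range values correspond
# to realistic disk interfaces.

MECHANICAL_MS = 6.0  # illustrative seek + rotation + transfer
bus_latencies_us = [1000.0, 512.0, 10.24, 5.12, 2.56, 1.28, 0.64]

for lat_us in bus_latencies_us:
    share = (lat_us / 1000.0) / MECHANICAL_MS
    print(f"bus latency {lat_us:8.2f} us -> {share:7.2%} of the mechanical time")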
6.4.7. Perfect Write-Buffering

The I/O subsystem is becoming a bottleneck in the computer system due to the rapid growth in processor speed and technology. The disk cache has been used to fill this gap, but its benefit is limited to read operations, because write I/Os are usually committed to disk to maintain consistency and to allow crash recovery; disk writes therefore clog the disk system. Many publications [76, 77, 79, 82] suggest using write-buffering techniques. Such techniques aim to hide the disk write latency by writing to a buffer instead of writing to the disk immediately. To maintain the non-volatile property of the disk, the buffer must withstand any failures that could occur before the data are written to the disk; non-volatile RAM (NVRAM), a disk caching disk (DCD), or NAND Flash memory are possible options. All write-buffering publications measure these techniques against disk-subsystem metrics such as the disk request response time and the disk system throughput. We conducted an experiment to exhibit the limit of such techniques on overall system performance by modeling perfect write buffering, which completely hides the write latency. The system was configured with 112MB of memory running bzip2 with a choice of a 1, 4, or 8-disk RAID system, and we also varied the existence of the disk cache to isolate the effects.

Figure 6.37 shows the limited effects of write buffering, assuming that all writes to the disk system are buffered perfectly, eliminating the need to write to the disk immediately. The top graph shows the total system CPI, and the bottom graph shows the average disk response time. The CPI graph compares the perfect write-buffering technique with the system using normal disk reads and writes. In the average response time graph, the total average response time and the read response time are similar, since the write latency to the disk system is hidden perfectly. Our study shows that buffering the writes can improve the performance greatly, up to a factor of 10 in the case of a single 5k-RPM disk with disk caching. Additionally, buffering write requests decreases the need for many disks in a RAID system, since the 4-disk and 8-disk systems perform similarly. Note that this may not be true in a multiprocessor environment.

Figure 6.37: The Limit of the Write-Buffering Technique. The top graph shows the CPI of both the normal systems and the write-buffering systems, and the bottom graph shows the average disk response time of only the write-buffering systems. In the average response time graph, the total average response time and the read response time are the same, since the writes to the disk system are eliminated.

The graph also shows that if we can perfectly buffer write requests to the disk and provide a small disk cache, the performance improves by up to an order of magnitude. Furthermore, the CPIs of all disk RPMs remain the same, no matter how fast the disks are. In conclusion, the write-buffering technique can improve performance immensely without the cost of multiple fast and expensive disks. A toy timing model of this upper bound is sketched below.
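The sketch below gives the flavor of this upper bound: with perfect write buffering every write completes immediately and only the reads pay the disk service time. The request counts and per-request service times are illustrative assumptions, not values taken from the bzip2 traces.

# Sketch: an upper bound on write buffering. Writes cost nothing under
# perfect buffering; reads still pay the disk service time.

READ_MS, WRITE_MS = 6.0, 8.0  # illustrative per-request service times

def io_time_ms(n_reads, n_writes, perfect_wb):
    write_cost = 0.0 if perfect_wb else n_writes * WRITE_MS
    return n_reads * READ_MS + write_cost

n_reads, n_writes = 20_000, 60_000  # a write-heavy, paging-dominated mix
base = io_time_ms(n_reads, n_writes, perfect_wb=False)
wb = io_time_ms(n_reads, n_writes, perfect_wb=True)
print(f"no buffering: {base/1000:7.1f} s of disk time")
print(f"perfect WB  : {wb/1000:7.1f} s of disk time ({base/wb:.1f}x better)")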
To compare with our base case in Figure 6.26, Figure 6.38 shows the interaction of the system with a single disk with perfect write buffering, Figure 6.39 shows the interaction of the same configuration with a 4-disk RAID system, and Figure 6.40 shows the interaction of the configuration with an 8-disk RAID system. All system configurations have 112MB of memory running bzip2. The write buffering eliminates the need to write to the disks, so reads can be performed immediately. This improves the execution time from 325 seconds, for the system with a single disk with a disk cache, to 100 seconds, for the single-disk system with a disk cache and write buffering. The rest of the components in the hierarchy continue to dissipate low power.

Figure 6.38: The interaction of the memory components in the hierarchy in a single-disk system with perfect write buffering.

Figure 6.39: The interaction of the memory components in the hierarchy in a system with a 4-disk RAID disk subsystem along with perfect write buffering.

Figure 6.40: The interaction of the memory components in the hierarchy in a system with an 8-disk RAID disk subsystem along with perfect write buffering.

6.5. Total CPI vs. Disk Response Time

Most publications in the disk research community use the average disk response time and/or throughput as the metric for system performance. In particular, in a single-user environment, the user pays attention to only a single process's response.
Average disk response time is therefore the natural metric in this case. However, as noticed in the previous section, the total CPI does not track the total average disk response time. We conducted an experiment to identify the relationship between the total CPI and the disk response time: we ran several benchmarks on a system with 128MB of memory and a single disk, varying the RPM of the disk and the existence of the disk cache. Figure 6.41 shows the CPI and the average disk response time of these systems. The 6 bars in each group represent (1) a 5k-RPM disk without disk cache, (2) a 5k-RPM disk with disk cache, (3) a 12k-RPM disk without disk cache, (4) a 12k-RPM disk with disk cache, (5) a 20k-RPM disk without disk cache, and (6) a 20k-RPM disk with disk cache. The top graph shows the total CPI of the system, and the bottom graph shows the average disk response time for reads, for writes, and overall.

Figure 6.41: CPI vs. Disk Average Response Time. The 6 bars in each group represent (1) a 5k-RPM disk without disk cache, (2) a 5k-RPM disk with disk cache, (3) a 12k-RPM disk without disk cache, (4) a 12k-RPM disk with disk cache, (5) a 20k-RPM disk without disk cache, and (6) a 20k-RPM disk with disk cache. The top graph shows the total CPI of the system, and the bottom graph shows the average disk response time for reads, writes, and overall.

During the I/O-intensive phase, which consists of both disk reads and writes, the average CPI tracks only the average read response time, not the overall average read/write response time. This is true even for the benchmarks with both read and write activity, and even for write-intensive benchmarks such as bzip2 and gzip. Therefore, the total average disk response time may not be an accurate metric for evaluating a disk technique as a proxy for the total performance of a system; a better proxy for the total system performance is the read response time.
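One way to read this result is that page-fault reads stall the processor synchronously, while buffered writes largely do not, so a write-heavy mix can inflate the overall average response time without adding much CPU-visible stall. The sketch below illustrates the arithmetic with made-up counts and latencies; the asynchronous-write assumption here is ours, used only for illustration.

# Sketch: overall average response time vs. what the CPI actually sees.

n_reads, t_read_ms = 10_000, 5.0
n_writes, t_write_ms = 40_000, 60.0  # e.g. parity-heavy RAID5 writes

overall_avg = (n_reads * t_read_ms + n_writes * t_write_ms) / (n_reads + n_writes)
read_avg = t_read_ms
cpu_visible_stall_s = n_reads * t_read_ms / 1000.0  # writes assumed asynchronous

print(f"overall average response time   : {overall_avg:5.1f} ms")
print(f"read average response time      : {read_avg:5.1f} ms")
print(f"stall time the CPI actually sees: {cpu_visible_stall_s:.0f} s")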
6.6. The CPI Breakdown

Figure 6.42 shows the system CPI breakdown for two benchmarks, twolf and bzip2; both experimental configurations use 128MB of memory with different disk systems. The top graph is for twolf, which issues no disk write requests, and the bottom graph is for bzip2, which has both reads and writes. The graphs break the CPI into (1) the processor, caches, and DRAM, (2) the controller computation, which includes queuing, scheduling, and parity calculation, and (3) the disk mechanism, which includes seek, rotation, and transfer. We experimented with 6 RAID5 configurations. Each RAID configuration contains 9 bars, divided into 3 groups: the first group uses 5400 RPM disks, the second uses 12k RPM disks, and the last uses 20k RPM disks. Each group consists of 3 bars, representing normal seek time, half seek time, and zero seek time. "Half seek time" means the computed seek times were scaled down by half, and "zero seek time" means the seek times of all accesses are assumed to be zero. Note that the graphs have different y-axes.

For twolf, the CPI remains the same for all configurations because the benchmark exhibits only disk reads. The RAID systems improve performance by a small margin due to less queuing and scheduling time. The disk systems rarely use the disk mechanism, since most of the requests hit the disk cache thanks to their sequential nature; in this case, the CPI portion of the processor, caches, and DRAM is significant. In the case of bzip2, there are both disk read and write requests. Interestingly, the 4-disk systems have a bigger portion of controller-computation CPI, which is due to the complexity of the scheduling and the parity calculations. However, due to the parallelism of multiple disks, the overall performance of the 4-disk system is still better than that of the single-disk system with the same configuration. The 8-disk systems amortize the complexity of the disk controller computation well enough to reduce the queuing and scheduling CPI.

Figure 6.42: System CPI Breakdown. The figure shows the CPI breakdown for two benchmarks, twolf and bzip2, both with 128MB of memory. The top graph is for twolf, which has no disk write requests, and the bottom graph is for bzip2, which has both reads and writes. The graphs show the CPI breakdown for (1) the processor, caches, and DRAM, (2) the disk controller computation (queuing, scheduling, and parity calculation), and (3) the disk mechanism (seek, rotation, and transfer). We ran the experiment on 6 RAID5 configurations; each has 9 bars, divided into 3 groups for 5400 RPM, 12k RPM, and 20k RPM disks, and each group consists of 3 bars for normal seek time, half seek time, and zero seek time.

In both benchmarks, varying the seek time has no effect on the CPI, due to the largely sequential nature of the requests. This seems to contradict the claim that seek time is very significant to performance; the reason is that seek time matters for access streams with little sequentiality, as in multiprocessor systems, not in the uniprocessor systems of our experiments. In conclusion, the CPI portion spent in the processor, caches, and DRAM is only a secondary effect in comparison with the disk parameter effects. The DRAM and cache CPI portions are not significant compared to the disk CPI portion because the DRAM and cache access times are insignificant compared to the disk access time. Additionally, most DRAM and cache enhancements affect at most less than 2x their CPI portions, whereas the disk parameter settings can change the total system performance by over an order of magnitude.
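The Amdahl-style arithmetic behind this conclusion is easy to sketch. With an illustrative CPI split (not values taken from Figure 6.42), a 2x improvement in the processor/cache/DRAM portion barely moves the total, while a 10x improvement in the disk portion nearly does:

# Sketch: Amdahl's-law view of the CPI breakdown, with an illustrative
# split of a total CPI of 1000 during the I/O-intensive phase.

cpi_cpu_dram, cpi_disk = 60.0, 940.0

total_base = cpi_cpu_dram + cpi_disk
total_fast_mem = cpi_cpu_dram / 2 + cpi_disk     # 2x better caches/DRAM
total_fast_dsk = cpi_cpu_dram + cpi_disk / 10    # 10x better disk subsystem

print(f"baseline          : {total_base:7.1f}")
print(f"2x memory system  : {total_fast_mem:7.1f}  ({total_base/total_fast_mem:.2f}x)")
print(f"10x disk subsystem: {total_fast_dsk:7.1f}  ({total_base/total_fast_dsk:.2f}x)")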
6.7. Power/Energy Consumption

This section discusses the power dissipation and the energy consumption of the system as functions of the memory size and the disk enhancements discussed in the previous sections. First, the energy and power consumption of the system with different memory capacities are illustrated in Figure 6.43, which corresponds to the performance graphs in Figure 6.12. The top graph shows the total power dissipated in the memory system, including caches, DRAM, and disk; the middle graph shows the energy consumption; and the bottom graph shows the Pareto plot of the CPI and energy. The power and energy of nine SPEC benchmarks are reported.

Figure 6.43: Power and Energy Consumption in the system with different memory sizes. This figure corresponds to the performance graphs in Figure 6.12. The top graph shows the total power dissipated in the memory system, including caches, DRAM, and disk; the middle graph shows the energy consumption; and the bottom graph shows the Pareto plot of the CPI and energy (12k-RPM single disk, memory sizes from 16MB to 192MB). The power and energy of nine SPEC benchmarks are reported.

While the total power dissipation of the systems remains near its maximum for all memory sizes, the total energy consumed can differ by two orders of magnitude. The total power of the benchmarks demonstrating memory page swapping should be lower as the DRAM capacity increases; however, since the disk power dominates the total system power, and the difference between the active power (11.26W) and idle power (8.62W) of the disk is marginal, the power of the systems with small memory capacity is not much higher than that of the systems with large memory. Additionally, even with a large memory capacity, the disk is still the key component, as it is accessed actively during the I/O-intensive phase. The energy consumption of those systems, on the other hand, differs dramatically. Compared with Figure 6.12, the energy consumption tracks the CPI and the number of disk requests closely, and can reach as high as two orders of magnitude above the energy of a system with large memory. Interestingly, when plotting the CPI and energy trade-offs, all benchmarks end up with the same relationship between the CPI and the energy consumption. This reflects the realistic behavior of our simulator: identical hardware configurations that report the same total performance should consume the same amount of energy, no matter what type of application the system is running, because the overall performance and the total system energy account for all activities in all components of the system.

The power and energy of the systems with different disk RPMs and disk caches are illustrated in Figure 6.44. The graphs show the power and energy corresponding to the experimental results in Figure 6.20. The system memory is 128MB, running nine different benchmarks.
The top graph shows the power dissipation, and the middle graph shows the energy consumption. The bottom graph shows the CPI and total energy trade-offs of the systems with different disk RPMs and with or without a disk cache: the lines connect the data points of the same disk cache configuration across the different RPMs (5k, 12k, 20k), with dashed lines for the no-cache configurations and solid lines for the with-cache configurations.

Figure 6.44: Power and Energy Consumption for the systems with different RPMs and the presence of a disk cache. The top graph shows the power dissipation, and the middle graph shows the energy consumption. The bottom graph shows the CPI and total energy trade-offs of the systems with different disk RPMs and the presence of a disk cache; the lines connect the data points of the same disk cache configuration across the RPMs (5k, 12k, 20k), with dashed lines for the no-cache configurations and solid lines for the with-cache configurations.

Again, the power dissipation remains the same among the systems with the same features, i.e. the same disk rotational speed in this case, and the power increases with the disk rotational speed, because a higher-RPM disk dissipates more power. Unlike among systems with the same disk RPM, the energy consumption does not track the system CPI. The reason is that the power varies with the rotational speed; therefore, systems with the same CPI but equipped with different-RPM disks consume different amounts of energy. Interestingly, the systems without a disk cache prefer the 12k-RPM disk over the other rotational speeds. The systems implementing a disk cache prefer a lower RPM for the benchmarks with only read requests, since the requests are mostly serviced by the disk cache and the disk's mechanical parts are mostly idle. Additionally, the disks with a disk cache consume about the same amount of energy when the request stream is a mix of reads and writes, such as bzip2 and gzip, because the slower disks compensate for their slowness with lower power. As in Figure 6.43, the CPI and energy relationships of all systems lie on the same projected band, with different slopes due to the different disk RPMs. Moreover, we now have an interesting Pareto plot: more than one optimal point is exhibited for the benchmarks with both reads and writes. For those benchmarks, regardless of the disk cache, both 12k and 20k RPM are optimal points.
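The Pareto points in Figures 6.19 and 6.44 can be extracted mechanically: a design point is optimal if no other point is at least as good in both CPI and energy. The sketch below applies that dominance rule to made-up (CPI, energy) numbers; only the rule itself, not the data, comes from the text.

# Sketch: extracting Pareto-optimal (CPI, energy) design points.
# Lower is better in both dimensions; the sample data are illustrative.

def pareto_front(points):
    """Return points not dominated in both CPI and energy."""
    front = []
    for label, cpi, energy in points:
        dominated = any(c <= cpi and e <= energy and (c, e) != (cpi, energy)
                        for _, c, e in points)
        if not dominated:
            front.append((label, cpi, energy))
    return front

designs = [("5k  w$", 900, 400), ("12k w$", 250, 300), ("20k w$", 240, 550)]
print(pareto_front(designs))  # the 12k and 20k points survive; the 5k point is dominated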
Figure 6.45 shows the power and energy consumption of the systems as a function of disk caching and prefetching. The figure shows the power and energy consumption corresponding to Figure 6.23. The system configuration is 112MB of memory running bzip2; the other settings, such as the number of RAID disks, the disk RPM, and the presence of the disk cache and prefetching, were varied. In contrast to Figure 6.23, where more RAID disks yield better performance, the power and energy increase proportionally with the number of disks. The same pattern repeats here: the power dissipation in systems with similar features remains the same across all disk caching/prefetching organizations, while the energy of those systems differs. The system with better performance in Figure 6.23, i.e., the one with both disk caching and prefetching, consumes less energy. Considering only energy, the systems prefer the 12k-RPM disks over the other RPMs. The reason is that the 12k-RPM disk systems, with lower active and idle power, perform comparably to the 20k-RPM disk systems, so the 20k-RPM systems consume more energy; the 5k-RPM disk systems perform much more slowly despite their lower power, so their total energy is higher.

Figure 6.45: Power and energy consumption of the system with disk caching/prefetching, corresponding to Figure 6.22. The memory is 112MB running bzip2. The number of RAID disks (1, 4, 8), the disk RPM (5k, 12k, 20k), and the presence of the disk cache and prefetching were varied.

The results for the systems with perfect write buffering are shown in Figure 6.46. The figure shows the power and energy corresponding to Figure 6.37. As in the previous case, the power dissipation in systems with the same disk-system configuration remains the same across the different disk caching and write-buffering choices, while the energy consumption differs with the system performance. Regardless of RPM, the systems with both disk caching and write buffering prefer a lower-RPM disk system because of its lower power. This is expected, because disk caching and write buffering eliminate the need to wait for the disk's mechanical parts; the system therefore no longer requires high-RPM disks to improve performance, and slower but lower-powered disks are more energy efficient at this point.

Figure 6.46: Power and energy consumption for caching and perfect write buffering, corresponding to Figure 6.36. The memory is 112MB running bzip2. The number of RAID disks, the disk RPM, and the presence of disk caching/prefetching and write buffering were varied.
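The number-of-disks and RPM trade-offs in Figures 6.45 and 6.46 can be summarized with a deliberately simple model. The sketch below is not SYSim's power model; it only assumes, as the results above suggest, that array power grows roughly linearly with the number of disks while the speedup from adding disks is sublinear. The per-RPM power values, the single-disk runtimes, and the square-root speedup are hypothetical.

    # Toy model of a RAID array's energy scaling (illustrative assumptions only).
    DISK_POWER_W = {"5k": 6.0, "12k": 11.0, "20k": 16.0}      # hypothetical per-disk power
    BASE_TIME_S  = {"5k": 400.0, "12k": 120.0, "20k": 100.0}  # hypothetical 1-disk runtimes

    def array_energy(rpm, n_disks, speedup_exp=0.5):
        time = BASE_TIME_S[rpm] / (n_disks ** speedup_exp)    # assumed sublinear speedup
        power = n_disks * DISK_POWER_W[rpm]                   # power scales with #disks
        return power * time                                   # joules

    for rpm in ("5k", "12k", "20k"):
        print(rpm, [round(array_energy(rpm, n)) for n in (1, 4, 8)])
    # Under these assumptions, adding disks always raises energy (power grows faster
    # than time shrinks), and 12k RPM is the energy sweet spot: nearly as fast as 20k
    # but cheaper in power, while 5k is too slow for its low power to pay off.

This is only meant to show that the qualitative conclusions above do not hinge on the exact power numbers: any sublinear speedup combined with linear power scaling yields the same ordering.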
The energy and power results for the system with a constant total of DRAM and disk cache capacity are also included in Figure 6.47. The energy and power reported correspond to the top graph in Figure 6.34. The graph shows the energy and power trade-offs between the memory size and the disk cache size under the assumption that the combined capacity of the memory and the disk cache remains the same; in this case the total is 32MB, running ammp. The bars represent the system power, and the lines represent the system energy. The total power remains the same across all systems with the same number of disks and the same disk RPM. The energy consumption, on the other hand, tracks the CPI shown in Figure 6.34, except that the last two data points in each RAID configuration, (16,16) and (8,24), increase rapidly because the memory is no longer large enough for the application's footprint. All disk RPMs consume relatively the same amount of energy in this case, typically within 50% of each other.

Figure 6.47: Power and energy consumption for the system with a constant sum of memory size and disk cache. The bars show power and the lines show energy for (memory, disk cache) splits of (32,0), (28,4), (24,8), (20,12), (16,16), and (8,24) MB, across several RAID configurations, running ammp.

The energy and power for the systems with different disk cache sizes are shown in Figure 6.48. The system memory is 32MB running ammp, and the disk cache size varies over no cache, 4MB, 8MB, and 256MB. The figure corresponds to the bottom graph of Figure 6.34. The power remains the same across systems with the same number of RAID disks. The energy consumption tracks the system CPI for configurations with the same number of RAID disks and the same RPM. However, the systems with faster-RPM disks consume more energy while being superior in performance.

Figure 6.48: Power and energy consumption for the system with different sizes of disk cache (0, 4, 8, and 256MB), with 32MB of memory, running ammp.

Finally, we conducted an experiment to investigate the trade-off between the power consumption and the performance of several disk technology improvements and enhancements. Figure 6.49 shows the trade-off chart of the power dissipation and energy consumption versus the CPI. The top graph plots the power dissipation versus the CPI, and the bottom graph plots the energy consumption versus the CPI. The system configuration is 128MB running bzip2. We varied the number of RAID disks (one, four, and eight) with 5k-, 12k-, and 20k-RPM disks, and we also varied the existence of the disk cache.
Figure 6.49: Power dissipation/energy consumption versus CPI trade-offs. The top graph is power dissipation versus CPI, and the bottom graph is energy consumption versus CPI. The system configuration is 128MB running bzip2. We varied the number of RAID disks to one (the lowest data point on each line), four (the middle data point), and eight (the highest data point) with 5k-, 12k-, and 20k-RPM disks, and we also varied the existence of the disk cache. The dashed lines are for the disk systems with the write-elimination technique and therefore mark the limit of energy/power saving.

The dashed lines are for the disk systems with the perfect write-buffering technique, marked as "WB"; they can therefore be considered the limit of the energy/power saving achievable by write buffering. For the power dissipation, except for the 5k-RPM configuration with no disk cache, all other data points are clustered in the region of CPI 500-1000. We can conclude that techniques such as increasing the RPM and adding disk caching and prefetching improve the performance only by a factor of 2. The write-buffering technique can likewise improve the performance by a factor of 2, and the combination of write buffering and caching/prefetching can improve the performance greatly without requiring multiple fast disks. However, the power dissipation remains the same among the systems with the same number of disks.

The energy consumption graph, on the other hand, gives us more insight. Unlike the power dissipation, the systems with the same number of disks do not consume the same amount of energy. For example, the system with eight 20k-RPM RAID disks and no disk cache consumes more energy and performs worse than the same configuration with disk caching. The system with 5k-RPM disks consumes more energy than systems with other RPMs even when those contain more disks. Nevertheless, the write-buffering technique on a slow disk system, in conjunction with a disk cache, produces the optimal result in this case.

To sum up, a system with N RAID disks does not directly improve the performance by a factor of N, while it typically consumes N times more energy and power. Increasing the RPM of an already fast disk system gains no obvious benefit and only increases the energy consumption, which grows with the number of disks. Using a low-RPM disk does not save energy in most cases. The disk enhancements, i.e., disk caching/prefetching and write buffering, can improve the performance by a factor of 2 while reducing the energy at approximately the same rate. More care should be put into such disk enhancements than into simply increasing the disk bandwidth parameters, such as the number of RAID disks and the RPM.

6.8. The System Bandwidth

To sum up, Figure 6.50 shows how the total system bandwidth of configurations with different disk systems compares to the total system performance. The figure shows the CPI versus the system bandwidth, which is calculated by multiplying the number of disks, the rotation rate (revolutions per second), the number of sectors per cylinder (1024), and the sector size (512 bytes). We varied the disk RPM and the existence of the disk cache, prefetching, and the write-buffering technique. We also varied the number of disks in the RAID5 disk system. Each line connects systems with the same disk RPM; therefore, there are three data points on each line, representing the 1-disk, 4-disk, and 8-disk systems, respectively from left to right.
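As a quick check on the horizontal axis, each configuration's bandwidth follows directly from the formula above. The sketch below computes it; the RPM and disk-count combinations are simply the ones varied in this experiment, and the results are the idealized media bandwidths implied by the formula, not measured throughput.

    # System bandwidth as defined for Figure 6.50:
    #   #disks x (RPM / 60) x sectors_per_cylinder x sector_size
    SECTORS_PER_CYL = 1024
    SECTOR_SIZE = 512          # bytes

    def system_bandwidth_mb_s(n_disks, rpm):
        bytes_per_s = n_disks * (rpm / 60.0) * SECTORS_PER_CYL * SECTOR_SIZE
        return bytes_per_s / 1e6

    for rpm in (5400, 12000, 20000):
        for n in (1, 4, 8):
            print(f"{n} disk(s) @ {rpm} RPM: {system_bandwidth_mb_s(n, rpm):.0f} MB/s")
    # A single 12k-RPM disk works out to ~105 MB/s and eight 20k-RPM disks to
    # ~1400 MB/s, consistent with the 0-2000 MB/s range of the x-axis.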
The top graph shows only the configurations with caching and prefetching, which are already implemented in today's disk drives. The bottom graph shows the same data along with the perfect write-caching configurations, represented as dotted lines. We ran bzip2 on all system configurations with 112MB of memory.

Figure 6.50: The system bandwidth. The figure shows the CPI versus the system bandwidth, calculated by multiplying the number of disks, the revolutions per second, the number of sectors per cylinder, and the sector size. We varied the disk RPM, the existence of the disk cache, prefetching, and the write-buffering technique, as well as the number of disks in the RAID5 disk system. Each line connects systems with the same disk RPM; the three data points on each line represent the 1-disk, 4-disk, and 8-disk systems, respectively from left to right. The top graph shows only the configurations already implemented in today's disk drives; the bottom graph adds the configurations with perfect write caching as dotted lines.

Interestingly, the total performance of systems with the same system bandwidth can vary by over an order of magnitude, depending on which enhancements have been applied. In some cases, disk systems with comparable bandwidth that employ the same techniques can differ in total performance by as much as a factor of 2 due to their different configurations; for example, the 8-disk 5k-RPM system without a disk cache and the 4-disk 12k-RPM system without a disk cache exhibit this behavior. On the other hand, with different enhancements, carelessly choosing a configuration only to increase the bandwidth may cause the system to perform worse. For example, choosing a system of eight 20k-RPM disks with no cache over a system of four 12k-RPM disks with a cache will not benefit the system in the way the raw bandwidth suggests. Another trend demonstrated in the graph is that increasing only the system bandwidth does not directly translate into an improvement in total system performance. When the bandwidth is low, increasing it improves the performance significantly, except in the cases employing all of the disk caching/prefetching and write-buffering techniques. We noticed that as we continue increasing the system bandwidth without applying further enhancements, the CPI exhibits essentially no improvement. As a result, new enhancements for disk systems are required to improve the total system performance.

6.9. Configuration Comparison

In this dissertation, we tried to answer this question: What is the best solution, in terms of both total system performance and power/energy consumption, for a single-processing system whose I/O-intensive phase occupies a significant portion of the entire execution time?
And, from our experiments with SYSim, we arrived at two answers:

• increasing the memory size, and/or
• using a single-disk system with a disk cache and significant attention paid to write buffering.

The interactions between all components in the entire memory hierarchy and the system CPI for both solutions are shown in the final figures. To compare with the interaction in Figure 1.1, which is shown again here as Figure 6.51, Figure 6.52 shows the interaction when the memory size is increased to 512MB, and Figure 6.53 shows the interaction for 128MB of memory with perfect disk write buffering. Both systems run gzip with a 12k-RPM disk drive with a small disk cache (4MB). The figures show the interaction between all components in the memory hierarchy, including the level-1 instruction cache, the level-1 data cache, the level-2 unified cache, DRAM, and a disk drive. Notice that the initialization time is reduced from 140 seconds in Figure 1.1 to 40 seconds in Figure 6.52 and to 48 seconds in Figure 6.53.

The first solution solves the problem only when the memory is always big enough to hold the application's memory footprint. Moreover, the energy consumption of the first solution may increase significantly with next-generation DRAM, i.e., an FBDIMM system. In contrast, the second solution is less sensitive to the application characteristics because of the I/O-latency-hiding nature of the approach. One might suggest using a RAID disk system to improve parallelism in the disk system. However, as shown in the RAID studies, using RAID in single-user mode does not improve the performance enough to justify its costs: the energy and power consumption of a RAID system is proportional to the number of disks, while RAID performance does not scale directly with the number of disks. Figure 6.54 shows the interaction in a system with an 8-disk RAID system equipped with a cache and a write buffer. The execution time of the RAID system improves by only 5 seconds, less than a 7% improvement over a single disk with a write buffer, while the user has to pay the cost of 8 disks. As a result, RAID is not recommended for a single-process environment during the I/O-intensive phase.

Figure 6.51: The interaction in the memory hierarchy in our base configuration with 128MB of memory. The figure shows the system CPI over the entire run of gzip, along with the cache, DRAM, and disk accesses and power per 10ms interval. The system configuration is a 2-GHz processor with 128MB of memory and a 12k-RPM disk. The CPI graph shows two CPI values: the instantaneous CPI for each 10ms interval and the accumulated average CPI. A span with no data points means no instructions are executed due to the I/O latency. The portion of the execution where the accumulated CPI is above 100 is the I/O-intensive phase, and the portion where it is below 100 is the computation phase.
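The phase boundary used throughout these figures can be stated directly in code. The sketch below is a minimal illustration of the rule in the caption above (a cumulative average CPI above 100 marks the I/O-intensive phase); the per-window cycle and instruction counts are hypothetical placeholders rather than data from the gzip run, although 2e7 cycles per 10ms window does correspond to the 2-GHz clock.

    # Classify 10ms windows as I/O-intensive vs. computation using the
    # cumulative-average-CPI rule from Figure 6.51 (threshold = 100).
    CPI_THRESHOLD = 100

    def classify_windows(samples):
        """samples: list of (cycles, instructions) per 10ms window (hypothetical)."""
        phases, total_cycles, total_insts = [], 0, 0
        for cycles, insts in samples:
            total_cycles += cycles
            total_insts += insts
            cum_cpi = total_cycles / max(total_insts, 1)
            phases.append("I/O-intensive" if cum_cpi > CPI_THRESHOLD else "computation")
        return phases

    trace = [(2e7, 1e4), (2e7, 5e3), (2e7, 2e5), (2e7, 1.5e7), (2e7, 1.8e7)]
    print(classify_windows(trace))
    # Early windows, where the disk stalls the processor, have huge CPIs and are
    # classified as I/O-intensive; later compute-bound windows fall below 100.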
Figure 6.52: The interaction in the memory hierarchy in a system with 512MB of memory. The figure shows the interaction between all components in the memory hierarchy, including the level-1 instruction cache, the level-1 data cache, the level-2 unified cache, DRAM, and a disk drive, while running gzip. Notice that the initialization time is reduced from 140 seconds in Figure 1.1 to 40 seconds in this figure.

Figure 6.53: The interaction in the memory hierarchy in a system with 128MB of memory and a disk drive with perfect write buffering. The figure shows the interaction between all components in the memory hierarchy, including the level-1 instruction cache, the level-1 data cache, the level-2 unified cache, DRAM, and a disk drive, while running gzip. Notice that the initialization time is reduced from 140 seconds in Figure 1.1 to 48 seconds in this figure.

Figure 6.54: The interaction in the memory hierarchy in a system with the same configuration but with an 8-disk RAID system and write buffering. The figure shows the interaction between all components in the memory hierarchy, including the level-1 instruction cache, the level-1 data cache, the level-2 unified cache, DRAM, and a disk drive. Notice that the initialization time is reduced from 140 seconds in Figure 1.1 to less than 40 seconds in this figure.
CHAPTER 7: CONCLUSIONS

Most studies focus on the computation phase, during which the most repeated instructions are executed; the argument for focusing on the computation phase is to make the most repeated case fast. We followed a different path. The entire execution consists of an I/O-intensive phase as well as the main computation phase, and we have shown that a program can spend a significant amount of time in the I/O-intensive phase due to the I/O latency, especially in the single-processing environment typical of personal computers. Therefore, I/O is exposed as a significant component of total execution time. To obtain a system with more balanced phases, we need a better understanding of the effects of I/O configurations on the entire system, which requires investigating the I/O effects at the full-system scale. The total system execution time can be improved by an order of magnitude by the previously mentioned disk-system enhancements, i.e., the disk caching/prefetching and write-buffering techniques.

Memory performance and power are now key challenges in system design. Relative to the processor, memory accesses become slower and consume more power with increasing memory size. Most of the total power consumption of these systems is dominated by the memory hierarchy. Hence, memory power and access time significantly affect total power and performance for computations with large storage requirements, and memory becomes the main bottleneck.

Disks have been widely used as secondary, non-volatile storage and as the lowest level of the virtual memory hierarchy, and they are accepted as an indispensable part of the general-purpose computer system. So far, however, no studies have demonstrated the complete picture of the virtual memory hierarchy including the disk, in part because no proper tools have been available in the public domain for such studies. Therefore, we created SYSim, an open-source complete-system simulator aimed at complete memory hierarchy studies. SYSim focuses on demonstrating the detailed interactions in the entire memory hierarchy, and it can collect statistics on both performance and power consumption as the simulation runs. With SYSim, we conducted extensive complete-system experiments, exploring the disk drive design space, including several disk drive enhancements and technology improvements, during the I/O-intensive phase. The experimental results are reported in terms of total system performance (CPI) and power/energy consumption for many SPEC CPU2000 benchmark applications. We observed the following behaviors:

• The disk research community uses average response time, which includes both disk read and write response times, as the metric for disk system performance. However, we found that, during the I/O-intensive phase, the average CPI tracks only the average disk read response time and not the overall average disk response time. This behavior holds for disk request streams with any ratio of reads and writes.
Therefore, the average read response time is a better representative metric for measuring disk system performance and relating it to the overall system performance.

• The effect of the disk cache size is limited to the presence of a cache of some sufficient size. That is, with constant DRAM capacity, increasing the size of the disk cache does not yield better performance once the disk cache is already large enough, and the disk cache organization has no impact on the performance; a single disk cache segment is sufficient in our case. This behavior agrees with the disk-level simulation results in [78]. However, a larger disk cache does benefit increasing write traffic.

• In read-dominated applications, disk prefetching is more important than increasing the disk RPM. That is, rotational latency and bandwidth limits can be overcome by a simple prefetching mechanism. In such applications, the request stream is often a stream of sequential reads, so requests mostly hit the prefetched data in the disk cache. When a request hits the cache, there is no need to access the physical disk and move its mechanical parts, which are the major cause of long I/O latency.

• In applications with both disk reads and writes, the disk RPM matters. Because the disk maintains the semantics of non-volatile storage, an incoming write must be written to the disk immediately if there is insufficient space in the cache. Therefore, if a long write burst is scheduled before a read, the read has to wait for the write burst to complete before it can be serviced, even if the read is a cache hit. The waiting time can be very long, since the write burst has to move the disk's mechanical parts. As a result, the RPM affects the read response time, which represents the overall system performance. The experiments show that techniques that eliminate the writes can improve the performance significantly in this case.

• The cost of writing in a RAID system is significant, as RAID systems are known to suffer from small writes [80]. A disk write in a RAID system requires a parity calculation, so it takes longer than a write in a single-disk system. If the cost of a write is reduced, for example by implementing a write-buffer mechanism, the overall system performance can improve by potentially an order of magnitude.

• Individual DRAM chips dissipate little power, but a system must have a substantial amount of DRAM to reduce disk traffic and thus prevent the disk from dissipating significant power. Since the total system performance is related more strongly to DRAM capacity than to disk RPM, and an active disk dissipates more power than individual active DRAM chips, it is wiser to increase the DRAM capacity than to increase the disk RPM. However, once there is enough DRAM in the system, the total DRAM power can become significant and approach that of the disk system. In that case a higher disk RPM only increases system energy without a performance benefit, so using a high-RPM disk with a sufficiently large DRAM capacity is a bad design point.

• The energy consumption is more significant than the power dissipation. While the power stays roughly constant across systems with similar features, the energy consumed can change significantly with different disk parameters, because the I/O latency resulting from different disk parameters substantially prolongs the program execution time.
The difference in energy between different systems can be as much as a factor of 10.

• In systems with high-RPM disks, techniques aimed at increasing the system bandwidth alone, such as increasing the number of RAID disks or the RPM, fail to improve the total system performance directly. In some cases, the systems with higher bandwidth perform worse than the systems with lower bandwidth. To improve the total system performance significantly further, the disk enhancement techniques are required in systems with fast disks.

APPENDIX: SPEC CPU2000

SPEC CPU2000 [13] is the industry-standard CPU-intensive benchmark suite. SPEC designed CPU2000 to provide a comparative measure of computation-intensive performance across the widest practical range of hardware. The suite is composed of source-code benchmarks developed from real user applications. The benchmarks aim to measure the performance of the processor, memory, and compiler on the tested system, and they are intended to exercise the CPU, the memory hierarchy, and the compilers. The data collected show that SPEC CPU2000 met its goals for memory footprint: most benchmarks are larger than common cache sizes, many are larger than 100MB, and none are larger than 200MB. This section provides details about the subset of the SPEC CPU2000 suite used in the experiments in this dissertation: seven benchmarks from the integer suite and two benchmarks from the floating-point suite.

A.1. CFP2000 (Floating Point Suite of SPEC CPU2000)

A.1.1. 188.ammp

188.ammp is classified as a computational chemistry program. It models large systems of molecules usually associated with biology. The benchmark runs molecular dynamics on a protein-inhibitor complex embedded in water. The energy is approximately calculated by a classical potential or "force field". The protein in the complex is HIV protease complexed with the inhibitor indinavir. There are 9582 atoms in the water and the protein, making the benchmark representative of a typical large simulation. The 188.ammp benchmark is derived from published work on understanding drug resistance in HIV. The problem traces how the atoms move from initial coordinates and initial velocities. The output is the energy of the final configuration of atoms. The benchmark is written in C.

A.1.2. 172.mgrid

172.mgrid is a multi-grid solver, a 3D potential field program. The 172.mgrid benchmark demonstrates the capabilities of a simple multigrid solver in computing a three-dimensional potential field. The benchmark was adapted by SPEC from the NAS Parallel Benchmarks, with modifications for portability and a different workload, as follows:

1. It solves only a constant-coefficient equation, and only on a uniform cubical grid.
2. It solves only a single equation, representing a scalar field rather than a vector field.

The output includes echoes of some of the inputs and the smoothed approximate inverse. The main part of the output is from the smoothed approximate inverse; however, only a small portion of the smoothed output is printed, enough to assure that all work is being done and to check intermediate results for accuracy. Additionally, the L2 norm and infinity norm are used as a checksum of the output. The benchmark is written in Fortran 77.
A.2. CINT2000 (Integer Component of SPEC CPU2000)

A.2.1. 164.gzip

164.gzip (GNU zip) is a popular data compression program written by Jean-Loup Gailly for the GNU project. The 164.gzip benchmark uses Lempel-Ziv coding (LZ77) as its compression algorithm. However, SPEC's version of gzip performs only read I/O for input and no file I/O for output; all compression and decompression are computed entirely in memory. This is to help isolate the work done to the CPU and memory subsystem. The reference workload of 164.gzip includes five components: a large TIFF image, a web server log, a program binary, random data, and a source tar file. The random data is selected to test gzip's worst-case behavior; the rest of the workload components were selected as a realistic, representative set of general inputs that gzip might encounter regularly. Every input set is compressed and decompressed at several different blocking factors, or compression levels, and the end result of the process is compared against the original data after each step. The output files include a brief outline of the benchmark activities during execution. Output sizes for each compression and decompression are included to facilitate validation. To validate, the results of decompression are compared against the input data to ensure that they match. The benchmark is written in C.

A.2.2. 176.gcc

176.gcc is a C-language optimizing compiler. The 176.gcc benchmark is based on gcc Version 2.7.2.2 from GNU and generates code for a Motorola 88100 processor. The benchmark executes as a compiler with multiple optimization flags enabled. Unlike GNU gcc, 176.gcc has its inlining heuristics altered slightly, so that more code is inlined than would be typical on a Unix system in 1997, in the expectation that this would be more typical of compiler usage in 2002. The change was made so that 176.gcc would spend more time analyzing its source code inputs and use more memory; without it, 176.gcc would have done less analysis and required more input workloads to achieve the run-time specification for SPECint2000. There are five input workloads included in 176.gcc, all of which are preprocessed C code (.i files). First, integrate.i and expr.i come from the source files of gcc itself. 166.i is produced by concatenating the Fortran source files of a SPECint2000 candidate benchmark, using the f2c translator to produce C code, and then preprocessing. 200.i is produced with the same method from a previous version of the SPECfp2000 benchmark 200.sixtrack, and, finally, scilab.i is produced with the same method from a version of the Scilab program. All output files are 88100 assembly code files. The code of 176.gcc is in C. The known portability issues for the 176.gcc benchmark are as follows:

1. The code requires knowledge of the endianness of the host it runs on. The default for 176.gcc is little endian; to run correctly on a big-endian machine, the flag HOST_WORDS_BIG_ENDIAN must be defined when the benchmark is compiled (e.g., -DHOST_WORDS_BIG_ENDIAN).
2. Some of the optimizations 176.gcc performs require platform-dependent calculation of floating-point constants. These calculations form an insignificant amount of the computation time, but they depend on the IEEE floating-point format to produce a correct result.
3. 176.gcc is not an ANSI C program; it uses GNU extensions.
4. 176.gcc is inherently a 32-bit program.
SPEC has successfully ported 176.gcc to many 64-bit UNIX implementations. However, the use of high optimization levels on a 64-bit system, in conjunction with inlining of procedures from different source files, may reveal some 64-bit portability issues with 176.gcc.
5. SPEC has changed 176.gcc slightly in order to build properly with newer versions of GCC. If an old gcc (~2.6 or older) is used to build 176.gcc, the __OLDANDBUGGY__GNUC__ flag should be defined.

A.2.3. 181.mcf

181.mcf is a combinatorial optimization / single-depot vehicle scheduling benchmark, derived from a program used for single-depot vehicle scheduling in public mass transportation. The benchmark is written in C, and this version uses almost entirely integer arithmetic. The program is designed to solve single-depot vehicle scheduling (sub-)problems occurring in the planning process of public transportation companies. It takes into account one single depot and a homogeneous vehicle fleet. Based on a line plan and service frequencies, so-called timetabled trips with fixed departure/arrival locations and times are derived. Each of these timetabled trips has to be served by exactly one vehicle. The links between these trips are called dead-head trips. Additionally, there are pull-out and pull-in trips for leaving and entering the depot, respectively. Cost coefficients are provided for all dead-head, pull-out, and pull-in trips. The goal is to schedule all timetabled trips such that the number of necessary vehicles is minimized and, secondarily, the operational costs among all minimal-fleet solutions are also minimized. For simplicity, the benchmark assumes that each pull-out and pull-in trip is defined implicitly with a duration of 15 minutes and a cost coefficient of 15. For the single-depot case considered, the problem can be formulated as a large-scale minimum-cost flow problem, which is solved with a network simplex algorithm accelerated with column generation. The main calculation of 181.mcf is the network simplex code "MCF Version 1.2 -- A network simplex implementation", which is embedded in the column generation process. The network simplex algorithm is a specialized version of the well-known simplex algorithm for network flow problems: the linear algebra of the general algorithm is replaced by simple network operations, such as finding cycles or modifying spanning trees, that can be performed very rapidly. The main work of this network simplex implementation is pointer and integer arithmetic. The input file includes the following:

• the number of timetabled and dead-head trips,
• for each timetabled trip, its starting and ending times,
• for each dead-head trip, its starting and ending timetabled trips and its cost.

The worst-case execution time is pseudo-polynomial in the number of timetabled and dead-head trips and in the magnitude of the maximal cost coefficient; however, the expected execution time is a low-order polynomial. The benchmark memory footprint is approximately 100 and 190 megabytes for 32-bit and 64-bit architectures, respectively. The benchmark generates two output files, inp.out and mcf.out. The inp.out file consists of log information and a checksum, while the mcf.out file contains check output values describing an optimal schedule computed by the program.

A.2.4. 197.parser

197.parser is a word processing program.
The Link Grammar Parser is a syntactic parser of English based on link grammar, an original theory of English syntax. The program assigns a syntactic structure to a given sentence; the structure consists of a set of labeled links connecting pairs of words. The parser includes a dictionary of about 60,000 word forms, covering a wide variety of syntactic constructions, including many rare and idiomatic ones. The parser is robust: it can skip over portions of a sentence that it cannot understand and assign structure to the rest, and it can handle unknown vocabulary and make intelligent guesses from context about the syntactic categories of unknown words. The input is a sequence of proposed sentences, one per line, and is sensitive to punctuation and case. The output is an analysis of each input sentence, consisting of a set of links capturing the grammatical structure of the sentence, a labelling of each word with an appropriate part-of-speech tag, and a judgement of the grammaticality of the input sentence. Words in square brackets are those the parser has determined to be superfluous. The parser is written in ANSI C.

A.2.5. 255.vortex

255.vortex is a database program. The benchmark is a subset of a full object-oriented database program called VORTEx, which stands for "Virtual Object Runtime EXpository". It is a single-user object-oriented database transaction benchmark that exercises a system kernel coded in integer C. The VORTEx benchmark is a derivative of a full OODBMS that has been customized to conform to SPEC CINT2000 guidelines. Transactions operated on the database are translated through a schema, whose function is to provide the information necessary to map the internally stored data blocks to a model viewable in the context of the application. The benchmark schema is pre-configured to manipulate three different databases: a mailing list, a parts list, and geometric data. Both little-endian and big-endian binaries for the schema are provided in the benchmark. The 255.vortex benchmark builds and manipulates three separate but inter-related databases based on the schema. The size of the database is scalable but has been restricted to about 200 Mbytes per the CINT2000 guidelines. This version of the VORTEx benchmark has also been modified to prevent committing transactions, in order to remove input/output activity from the benchmark. The workload of VORTEx has been modeled to reflect general object-oriented database benchmarks, with modifications to vary the mix of transactions. The 255.vortex benchmark executes three times with different sequences of transactions; each time, a different combination of database insert, delete, and lookup transactions is used to simulate different database usage patterns. The benchmark thus uses three different workloads, simulating different dataset sizes and access patterns. Each run, one for each workload, produces one output file. Each output file (vortex1.out, vortex2.out, and vortex3.out) is a log of all transactions occurring during the execution of the benchmark, including creating entries in the database, deleting entries, and entry lookups. 255.vortex is written in C.

A.2.6. 256.bzip2

256.bzip2 is a compression program. The 256.bzip2 benchmark is a derivative of Julian Seward's bzip2 version 0.1.
Like the SPEC version of gzip, the only difference between bzip2 0.1 and 256.bzip2 is that SPEC's version of bzip2 performs no file I/O other than reading the input. All compression and decompression occur entirely in memory, to help isolate the work done to only the CPU and memory subsystem. The output files consist of a brief outline of what the benchmark is doing during its execution. Output sizes for each compression and decompression are printed to facilitate validation. To validate the execution, the results of decompression are compared against the input data to ensure that they match. The 256.bzip2 benchmark is written in ANSI C.

A.2.7. 300.twolf

TimberWolfSC is a placement and global-routing application package used to create the lithography artwork needed for the production of microchips. Specifically, it determines the placement and global connections for the standard cells that constitute the microchip; a standard cell is usually a group of transistors. The placement problem is a permutation problem, meaning that an exhaustive exploration of the state space would take execution time proportional to the factorial of the input size. For example, to solve a problem with 70 cells, a brute-force algorithm would take time on the order of 70 factorial, which is an unacceptable amount of time even on the world's fastest computer. Instead, the TimberWolfSC program implements simulated annealing as a heuristic to find relatively optimal solutions for the row-based standard-cell design style. In this implementation, transistors are grouped together to form standard cells, and these standard cells are placed in rows, each of which shares power and ground connections by abutment. The simulated annealing algorithm has found relatively optimal solutions for a large group of placement problems. After the placement step, the global router interconnects the microchip design; it is implemented with a constructive algorithm followed by iterative improvement. The basic simulated annealing algorithm has been widely used in many applications since its introduction in 1983. The SPEC suite version is the most numerically intensive version. More recent versions have reduced runtimes by intelligent reductions of the search space, but the solution search strategy and cost functions remain the same as in those later versions. SPEC has customized this version of TimberWolfSC so that it captures the flavor of many implementations of simulated annealing. The submitted version spends most of its time in the inner-loop calculations and, with this behavior, often incurs cache misses while traversing memory. In fact, the execution of small jobs on this version is similar to later simulated annealing versions executing on large jobs; this is intended to ensure the applicability of the benchmark to future versions of the program running large instances. The submitted version should be extremely compute-intensive, yet realistic for future problems. Three test problems are provided for the SPEC 300.twolf benchmark. The first is a small synchronous circuit that is placed and routed as a subchip. The second test circuit is the MCNC primary one benchmark circuit, one of the most frequently executed benchmark circuits. The third test case is a structured circuit found in the MCNC benchmark suite. In all test problems, the TimberWolf program is required to determine the positions of the standard cells and the interconnection of the netlist.
Additionally, the global router must add extra cells, called feedthrus, to complete the route if not enough space is present between two adjacent standard cells. The input files are composed of the block description file, the netlist file, the net weighting file, and the parameter file. The block description file describes the number and position of the rows where standard cells are to be placed; a valid placement is one in which all of the cells are placed within the specified rows without any overlap between cells. The netlist file describes the standard cells and the connection network between cells; at this point, the physical locations of these connections have not yet been determined. Two output files are created for each test circuit: the placement file and the global routing file. The benchmark is written in C.

References

[1] S. Gurumurthi, A. Sivasubramaniam, M.J. Irwin, N. Vijaykrishnan, M. Kandemir, T. Li, L.K. John, "Using Complete Machine Simulation for Software Power Estimation: The SoftWatt Approach," In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA-8), Cambridge, MA, pages 141-150, February 2002.
[2] Jianwei Chen, Michel Dubois, and P. Stenstrom, "SimWattch: An Approach to Integrate Complete-System with User-Level Performance/Power Simulators," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2003), March 2003.
[3] H. W. Cain, K. M. Lepak, B. A. Schwartz, and M. H. Lipasti, "Precise and Accurate Processor Simulation," In Proceedings of the Fifth Workshop on Computer Architecture Evaluation Using Commercial Workloads, pages 13-22, Feb. 2002.
[4] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, "Complete Computer System Simulation: The SimOS Approach," IEEE Parallel and Distributed Technology: Systems and Applications, 3(4):34-43, 1995.
[5] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform," IEEE Computer, 35(2):50-58, Feb. 2002.
[6] R. C. Bedichek, "Some Efficient Architecture Simulation Techniques," Winter 1990 USENIX Conference, pages 53-63, Jan. 1990.
[7] P. S. Magnusson, "A Design For Efficient Simulation of a Multiprocessor," First International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 69-78, Jan. 1993.
[8] R. C. Bedichek, "Talisman: Fast and accurate multicomputer simulation," In Proceedings of the 1995 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 14-24, May 1995.
[9] Carl J. Mauer, Mark D. Hill, and David A. Wood, "Full System Timing-First Simulation," In Proceedings of the 2002 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, June 2002.
[10] P. Bohrer, M. Elnozahy, A. Gheith, C. Lefurgy, T. Nakra, J. Peterson, R. Rajamony, R. Rockhold, H. Shafi, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and Lixin Zhang, "Mambo -- A Full System Simulator for the PowerPC Architecture," ACM SIGMETRICS Performance Evaluation Review, Volume 31, Number 4, March 2004.
[11] The Bochs IA-32 Emulator Project. http://bochs.sourceforge.net
[12] S. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-chip Caches," In WRL Research Report 93/5, DEC Western Research Laboratory, 1994.
[13] Systems Performance Evaluation Cooperative. SPEC Benchmarks. http://www.spec.org.
[14] G. Ganger, B. Worthington, and Y.
Patt, ?The DiskSim Simulation Environment Version 2.0 Reference Manual,? http://www.ece.cmu.edu/ ganger/disksim/. [15]S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, H. Franke, ?DRPM: Dynamic Speed Control for Power Management in Server Class Disks,? In the Proceedings of the International Symposium on Computer Architecture (ISCA), pages 169-179, June, 2003. [16]IBM Hard Disk Drive - Ultrastar 36ZX. http://www.storage.ibm.com/ hdd/ultra/ ul36zx.htm. [17]David T. Wang, ?Modern DRAM Memory systems: Performance Analysis and Scheduling Algorithm,? Ph.D. Dissertation, Electrical and Computer Engineering, University of Maryland at College Park, 2005. [18]Jeff Janzen, The Micron System-Power Calculator. http://www.micron.com/products/dram/syscalc.html [19]D. Brooks, V. Tiwari, and M. Martonosi, ?Wattch: A framework for architectural-level power analysis and optimizations,? In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000. [20]W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin, ?The Design and Use of SimplePower: A cycle-accurate energy estimation tool,? In Proceedings of the Design Automation Conference (DAC), June 2000. [21]G. Cai and C.H. Lim, ?Architectural Level Power/Performance Optimization and Dynamic Power Estimation,? in Proceedings of Cool Chips Tutorial, in conjunction with MICRO32, Nov. 1999, pp. 90-113. [22]K. Baynes, C. Collins, E. Fiterman, B. Ganesh, P. Kohout, C. Smit, T. Zhang, and B. Jacob, ?The performance and energy consumption of embedded real-time operating systems,? IEEE Transactions on Computers, vol. 52, no. 11, pp. 1454-1469. November 2003. [23]T. L. Cignetti, K. Komarov, and C. S. Ellis, ?Energy Estimation Tools for the Palm,? In Proceedings of ACM MSWiM 2000: Modeling, Analysis and Simulation of Wireless and Modile Systems, August 2000. [24]M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno and A. Sangiovanni-Vincentelli, ?Efficient Power Estimation Techniques for HW/SW Systems,? In Proceedings of the 264 IEEE VOLTA'99 International Workshop on Low Power Design , pp. 191-199, Como, Italy, March 4-5, 1999. [25]J.R. Lorc, ?A complete picture of the energy consumption of a portable computer,? Master's thesis, University of California, Berkeley, December 1995. [26]Robert P. Dick , Ganesh Lakshminarayana , Anand Raghunathan , Niraj K. Jha, ?Power analysis of embedded operating systems,? Proceedings of the 37th conference on Design automation, p.312-315, June 05-09, 2000, Los Angeles, California, United States [27]Tajana ?imunic , Luca Benini , Giovanni De Micheli, ?Cycle-accurate simulation of energy consumption in embedded systems,? Proceedings of the 36th ACM/IEEE conference on Design automation, p.867-872, June 21-25, 1999, New Orleans, Louisiana, United States [28]D. Burger and T. M. Austin. ?The SimpleScalar Tool Set, Version 2.0,? Computer Architecture News, pages 13-25, June 1997. [29]Mirko Loghi, Massimo Poncino, Luca Benini, ?Cycle-accurate power analysis for multiprocessor systems-on-a-chip,? Proceedings of the 14th ACM Great Lakes symposium on VLSI, April 2004. [30]Giovanni Beltrame, Gianluca Palermo, Donatella Sciuto, Cristina Silvano, ?Plug-in of power models in the StepNP exploration platform: analysis of power/performance trade- offs,? Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, September 2004 [31]Gilberto Contreras, Margaret Martonosi, Jinzhan Peng, Roy Ju and Guei-Yuan Lueh. ?XTREM: A Power Simulator for the Intel XScale Core,? 
The 2004 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'04), June 2004. [32]R. Y. Chen, M. J. Irwin, and R. S. Bajwa, ?Architecture-level power estimation and design experiments,? ACM Transactions on Design Automation of Electronic Systems, 2001. [33]N. An, S. Gurumurthi, A. Sivasubramaniam, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, ?Energy-Performance Trade-Offs for Spatial Access Methods on Memory Resident Data,? In The VLDB Journal,11(3):179-197, November, 2002. [34]Tajana Simunic, Luca Benini, Giovanni De Micheli, ?Energy-efficient design of battery- powered embedded systems,? International Symposium on Low Power Electronics and Design, Stanford University, 212-17, August, 1999. 265 [35]Vinodh Cuppu and Bruce Jacob, ?Concurrency, latency, or system overhead: Which has the largest impact on uniprocessor DRAM-system performance?,? In Proc. 28th International Symposium on Computer Architecture (ISCA 2001), pp. 62-71, Goteborg Sweden, June 2001. [36]Luca Benini, Giovanni de Micheli, ?System-level power optimization: techniques and tools,? ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 5 Issue 2 , April 2000. [37]D. Lidsky and J. Rabaey, ?Low-power design of memory intensive functions,? Proceedings of the IEEE Symposium on Low Power Electronics (Sept.), IEEE Computer Society Press, Los Alamitos, CA, 16-17, 1994. [38]F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle, ?Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design,? Kluwer Academic, Dordrecht, Netherlands, 1998a. [39]J. L. Hennessy and D. A. Patterson, ?Computer Architecture: A Quantitative Approach,? Morgan Kaufmann, Second edition, 1996, pp. 487. [40]M. Kandemir , N. Vijaykrishnan , M. J. Irwin , W. Ye, ?Influence of compiler optimizations on system power,? Proceedings of the 37th conference on Design automation, p.304-307, June 05-09, 2000, Los Angeles, California. [41]F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle, ?Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design,? Kluwer Academic, Dordrecht, Netherlands, 1998. [42]F. Catthoor, S. Wuytack, E. De Greef, F. Franssen, L. Nachtergaele, and H. De Man, ?System-level transformations for low-power data transfer and storage,? In Low-Power CMOS Design, R. Chandrakasan and R. Brodersen, Eds. IEEE Press, Piscataway, NJ, 1998. [43]C.-l. Su and A. M. Despain, ?Cache design trade-offs for power and performance optimization: a case study,? In Proceedings of the 1995 International Symposium on Low Power Design (ISLPD-95, Dana Point, CA, Apr. 23?26), M. Pedram, R. Brodersen, and K. Keutzer, Eds. ACM Press, New York, NY, pp. 63?68, 1995. [44]M. B. Kamble and K. Ghose, ?Analytical energy dissipation models for low-power caches,? In Proceedings of the 1997 International Symposium on Low Power Electronics and Design (ISLPED ?97, Monterey, CA, Aug. 18?20), B. Barton, M. Pedram, A. Chandrakasan, and S. Kiaei, Eds. ACM Press, New York, NY, pp. 143?148, 1997. 266 [45]W. Shiue and C. Chakrabarti, ?Memory exploration for low power, embedded systems,? In Proceedings of the Conference on Design Automation (June), pp. 140?145, 1999. [46]Luca Benini , Alberto Macii , Enrico Macii , Massimo Poncino, ?Synthesis of application-specific memories for power optimization in embedded systems,? 
Proceedings of the 37th conference on Design automation, p.300-303, June 05-09, 2000, Los Angeles, California, United States. [47]Peter Grun , Nikil Dutt , Alex Nicolau, ?APEX: access pattern based memory architecture exploration,? Proceedings of the 14th international symposium on Systems synthesis, September 30-October 03, 2001, Montreal, P.Q., Canada. [48]A. H. Farrahi, G. E. T?llez, and M. Sarrafzadeh, ?Memory segmentation to exploit sleep mode operation,? In Proceedings of the 32nd ACM/IEEE Conference on Design Automation (DAC ?95, San Francisco, CA, June 12?16), B. T. Preas, Ed. ACM Press, New York, NY, pp. 36?41, 1995. [49]A. H. Farrahi and M. Sarrafzadeh, ?System partitioning to maximize sleep time,? In Proceedings of the 1995 IEEE/ACM International Conference on Computer-Aided Design (ICCAD-95, San Jose, CA, Nov. 5?9), R. Rudell, Ed. IEEE Computer Society Press, Los Alamitos, CA, 452?455, 1995. [50]Luca Benini , Alberto Macii , Massimo Poncino, ?A recursive algorithm for low-power memory partitioning,? Proceedings of the 2000 international symposium on Low power electronics and design, p.78-83, July 25-27, 2000, Rapallo, Italy. [51]Hsien-Hsin S. Lee , Gary S. Tyson, ?Region-based caching: an energy-delay efficient memory architecture for embedded processors,? In Proceedings of the international conference on Compilers, architectures, and synthesis for embedded systems, p.120-127, November 17-19, 2000, San Jose, California, United States. [52]J. Kin, M. Gupta, and W.h. Mangione-smith, ?The filter cache: an energy efficient memory structure,? In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 30, Research Triangle Park, NC, Dec. 1?3), M. Smotherman and T. Conte, Eds. IEEE Computer Society Press, Los Alamitos, CA, pp. 184?193, 1997. [53]P. Grun , N. Dutt , A. Nicolau, ?Access pattern based local memory customization for low power embedded systems,? Proceedings of the conference on Design, automation and test in Europe, p.778-784, March 2001, Munich, Germany. [54]Jayaprakash Pisharath , Alok Choudhary, ?An integrated approach to reducing power dissipation in memory hierarchies,? Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems, October 08-11, 2002, Greenoble, France. 267 [55]Afzal Malik , Bill Moyer , Roger Zhou, ?Embedded cache architecture with programmable write buffer support for power and performance flexibility,? Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems, October 08-11, 2002, Greenoble, France. [56]Chuanjun Zhang , Frank Vahid , Jun Yang , Walid Najjar, ?A way-halting cache for low- energy high-performance systems,? Proceedings of the 2004 international symposium on Low power electronics and design, August 09-11, 2004, Newport Beach, California, USA. [57]Rui Min , Wen-Ben Jone , Yiming Hu, ?Location cache: a low-power L2 cache system,? Proceedings of the 2004 international symposium on Low power electronics and design, August 09-11, 2004, Newport Beach, California, USA. [58]S. Liao, S. Devadas and K. Keutzer, ?Code density optimization for embedded DSP processors using data compression techniques,? In IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 17, 7 (July), pp. 601?608, 1998. [59]H. Lekatsas and W. Wolf, ?Code compression for embedded systems,? In Proceedings of the 35th Annual Conference on Design Automation (DAC ?98, San Francisco, CA, June 15?19), B. R. Chawla, R. E. Bryant, and J. M. 
[60] Luca Benini, Alberto Macii, Enrico Macii, and Massimo Poncino, "Selective instruction compression for memory energy reduction in embedded systems," Proceedings of the 1999 International Symposium on Low Power Electronics and Design, pp. 206-211, August 16-17, 1999, San Diego, California, United States.
[61] S. Segars, K. Clarke, and L. Goudge, "Embedded control problems, Thumb and the ARM7TDMI," IEEE Micro 15, 5 (Dec.), pp. 22-30, 1995.
[62] Wen-Tsong Shiue and Chaitali Chakrabarti, "Memory Design and Exploration for Low Power, Embedded Systems," Journal of VLSI Signal Processing Systems, v.29 n.3, pp. 167-178, November 2001.
[63] S. Wuytack, F. Catthoor, and H. De Man, "Transforming set data types to power optimal data structures," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 15, 6 (June), pp. 619-629, 1997.
[64] J. L. Da Silva, F. Catthoor, D. Verkest, and H. De Man, "Power exploration for dynamic data types through virtual memory management refinement," In Proceedings of the 1998 International Symposium on Low Power Electronics and Design (ISLPED '98, Monterey, CA, Aug. 10-12), A. Chandrakasan and S. Kiaei, Eds. ACM Press, New York, NY, pp. 311-316, 1998.
[65] C. Gebotys, "Low energy memory and register allocation using network flow," In Proceedings of the 34th Conference on Design Automation (DAC '97, Anaheim, CA, June), pp. 435-440, 1997.
[66] Alvin R. Lebeck, Xiaobo Fan, Heng Zeng, and Carla Ellis, "Power aware page allocation," ACM SIGOPS Operating Systems Review, v.34 n.5, pp. 105-116, Dec. 2000.
[67] Xiaobo Fan, Carla Ellis, and Alvin Lebeck, "Memory controller policies for DRAM power management," Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pp. 129-134, August 2001, Huntington Beach, California, United States.
[68] Nam Sung Kim, Krisztián Flautner, David Blaauw, and Trevor Mudge, "Drowsy instruction caches: leakage power reduction using dynamic voltage scaling and cache sub-bank prediction," In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, November 18-22, 2002, Istanbul, Turkey.
[69] V. Delaluz, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Energy-oriented compiler optimizations for partitioned memory architectures," Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, pp. 138-147, November 17-19, 2000, San Jose, California, United States.
[70] Mahmut Kandemir, Ugur Sezer, and Victor Delaluz, "Improving memory energy using access pattern classification," In Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, November 04-08, 2001, San Jose, California.
[71] P. Panda and N. Dutt, "Low-power memory mapping through reducing address bus activity," IEEE Trans. Very Large Scale Integr. Syst. 7, 3 (Sept.), pp. 309-320, 1999.
[72] Preeti R. Panda and Nikil D. Dutt, "Reducing Address Bus Transitions for Low Power Memory Mapping," Proceedings of the 1996 European Conference on Design and Test, p. 63, March 11-14, 1996.
[73] Wei-Chung Cheng and Massoud Pedram, "Low power techniques for address encoding and memory allocation," Proceedings of the 2001 Conference on Asia South Pacific Design Automation, pp. 245-250, January 2001, Yokohama, Japan.
[74] Naehyuck Chang, Kwanho Kim, and Jinsung Cho, "Bus encoding for low-power high-performance memory systems," Proceedings of the 37th Conference on Design Automation, pp. 800-805, June 05-09, 2000, Los Angeles, California, United States.
[75] Yingwu Zhu and Yiming Hu, "Can Large Disk Built-in Caches Really Improve System Performance?," in Proceedings of ACM SIGMETRICS 2002 (extended abstract), Marina Del Rey, California, June 15-19, 2002, pp. 284-285.
[76] Yiming Hu and Qing Yang, "DCD – disk caching disk: a new approach for boosting I/O performance," Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 169-178, May 22-24, 1996, Philadelphia, Pennsylvania, United States.
[77] Jung-ho Huh and Tae-mu Chang, "Hierarchical disk cache management in RAID 5 controller," Journal of Computing Sciences in Colleges, Volume 19, Issue 2, December 2003, pp. 47-59.
[78] W. Hsu and A. J. Smith, "The performance impact of I/O optimizations and disk improvements," IBM Journal of Research and Development, v.48 n.2, pp. 255-289, March 2004.
[79] P. Biswas, K. K. Ramakrishnan, and D. Towsley, "Trace driven analysis of write caching policies for disks," ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 13-23, 1993.
[80] D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. Int'l Conf. Management of Data, ACM, 1989, pp. 109-116.
[81] A. J. Smith, "Disk Cache: Miss Ratio Analysis and Design Considerations," Proceedings of the 5th Annual Symposium on Computer Architecture, Apr. 1985, pp. 242-248.
[82] "SAMSUNG Teams with Microsoft to Develop First Hybrid HDD with NAND Flash Memory," http://www.samsung.com/Products/HardDiskDrive/news/HardDiskDrive_20050425_0000117556.htm, Apr 25, 2005.
[83] A. J. Smith, "On the effectiveness of buffered and multiple arm disks," In Proceedings of the 5th Annual Symposium on Computer Architecture (April 03-05, 1978), ISCA '78, ACM Press, New York, NY, pp. 242-248.
[84] A. J. Smith, "Sequentiality and Prefetching in Database Systems," ACM Trans. Database Syst. 3, No. 3, pp. 223-247, September 1978.
[85] W. W. Hsu, A. J. Smith, and H. C. Young, "I/O Reference Behavior of Production Database Workloads and the TPC Benchmarks – An Analysis at the Logical Level," ACM Trans. Database Syst. 26, No. 1, pp. 96-143, March 2001.
[86] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka, "Informed Prefetching and Caching," Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, CO, December 1995, pp. 79-95.
[87] L. Haas, W. Chang, G. Lohman, M. McPherson, P. Wilms, G. Lapis, B. Lindsay, H. Pirahesh, M. Carey, and E. Shekita, "Starburst Mid-Flight: As the Dust Clears," IEEE Trans. Knowledge & Data Eng. 2, No. 1, pp. 143-160, March 1990.
[88] J. Z. Teng and R. A. Gumaer, "Managing IBM Database 2 Buffers to Maximize Performance," IBM Syst. J. 23, No. 2, pp. 211-218, 1984.
[89] F. Chang and G. A. Gibson, "Automatic I/O Hint Generation Through Speculative Execution," Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, LA, February 1999, pp. 1-14.
[90] V. Soloviev, "Prefetching in segmented disk cache for multi-disk systems," In Proceedings of the Fourth Workshop on I/O in Parallel and Distributed Systems: Part of the Federated Computing Research Conference (Philadelphia, Pennsylvania, United States, May 27, 1996), IOPADS '96, ACM Press, New York, NY, pp. 69-82.
[91] S. W. Son and M. Kandemir, "Energy-aware data prefetching for multi-speed disks," In Proceedings of the 3rd Conference on Computing Frontiers (Ischia, Italy, May 03-05, 2006), CF '06, ACM Press, New York, NY, pp. 105-114.
[92] F. Chen, S. Jiang, and X. Zhang, "SmartSaver: turning flash drive into a disk energy saver for mobile computers," In Proceedings of the 2006 International Symposium on Low Power Electronics and Design (Tegernsee, Bavaria, Germany, October 04-06, 2006), ISLPED '06, ACM Press, New York, NY, pp. 412-417.
Zhang, "SmartSaver: turning flash drive into a disk energy saver for mobile computers", In Proceedings of the 2006 international Symposium on Low Power Electronics and Design (Tegernsee, Bavaria, Germany, October 04 - 06, 2006). ISLPED '06. ACM Press, New York, NY, 412-417. [93]M. Rosenblum and J. K. Ousterhout, "The design and implementation of a log-structured file system", ACM Trans. Comput. Syst. 10, 1 (Feb. 1992), 26-52. [94]A. Varma and Q. Jacobson, "Destage Algorithms for Disk Arrays with Nonvolatile Caches", IEEE Trans. Comput. 47, 2 (Feb. 1998), 228-235. [95]J. A. Solworth and C. U. Orji, "Write-only disk caches", In Proceedings of the 1990 ACM SIGMOD international Conference on Management of Data (Atlantic City, New Jersey, United States, May 23 - 26, 1990). SIGMOD '90. ACM Press, New York, NY, 123-132. [96]B. Hong, F. Wang, S. A. Brandt, D. D. Long, and T. J. Schwarz, "Using MEMS-based storage in computer systems---MEMS storage architectures", Trans. Storage 2, 1 (Feb. 2006), 1-21. [97]P. M. Chen and E. K. Lee, "Striping in a RAID level 5 disk array", In Proceedings of the 1995 ACM SIGMETRICS Joint international Conference on Measurement and Modeling of Computer Systems (Ottawa, Ontario, Canada, May 15 - 19, 1995). B. D. Gaither, Ed. SIGMETRICS '95/PERFORMANCE '95. ACM Press, New York, NY, 136- 145. [98]Hitachi Global Storage Technologies--HDD Technology Overview Charts, http://www.hitachigst.com/hdd/technolo/overview/storagetechchart.html 271 [99]C. Ruemmler and J Wilkes, ?UNIX disk access patterns?, Proceedings of Winter 1993 USENIX (San Diego, CA, 25--29 January 1993), pages 405--20, January 1993. [100]TCP: Transaction Processing Performance Council, http://www.tpc.org/default.asp. [101]Steven A. Przybylski, ?Cache and Memory Hierarchy Design, A performancedirected approach?, Morgan Kaufmann Publishers, Inc, 1990. [102]Bruce Jacob, Spencer Ng, David Wang, Aamer Jaleel, and Samuel Rodriguez, ?Memory Systems: Cache, DRAM, Disk ?A Holistic Approach to Design?, Morgan Kaufmann Publishers, Inc., to be published in 2007.