ABSTRACT
Title of Thesis:  TERPS: THE EMBEDDED RELIABLE PROCESSING SYSTEM
Amol Vishwas Gole, Master of Science, 2003
Thesis directed by: Professor Bruce L. Jacob
Department of Electrical and Computer Engineering
Electromagnetic Interference (EMI) can have an adverse effect on commercial 
electronics. As feature sizes of integrated circuits become smaller, their susceptibility to 
EMI increases. In light of this, integrated circuits will face substantial problems in the 
future either from electromagnetic disturbances or intentionally generated EMI from a 
malicious source. 
The Embedded Reliable Processing System (TERPS) is a fault tolerant system 
architecture which can significantly reduce the threat of EMI in computer systems. TERPS 
employs a checkpoint and rollback recovery mechanism tied with a multi-phase commit 
protocol and 3D IC technology. This enables it to recover from substantial EMI without 
having to shut down or reboot. In the face of such EMI, only a loss in performance dictated 
by the strength and duration of the interference and the frequency of checkpointing will be 
seen. 
Various conditions in which chips can fail under the influence of EMI are described. 
The checkpoint and rollback recovery mechanism and the resulting TERPS architecture are 
specified. A thorough evaluation of the design correctness is provided. The technique is 
implemented in Verilog HDL using a 16-bit, 5-stage pipelined processor to show proof of 
concept. The performance overhead is calculated for different checkpointing intervals and 
is shown to be very reasonable (5-6% for checkpointing every 128 CPU cycles).
  
TERPS: THE EMBEDDED RELIABLE PROCESSING SYSTEM
by
Amol Vishwas Gole
Thesis submitted to the Faculty of the Graduate School of the 
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Master of Science
2003
Advisory Committee:
Professor Bruce L. Jacob, Chair/Advisor
Professor Virgil D. Gligor
Professor Manoj Franklin
 
© Copyright by 
Amol Vishwas Gole
2003
   
DEDICATION
To my beloved parents, Lata and Vishwas
and
all my family and friends
   
ACKNOWLEDGEMENTS
I am grateful to my advisor, Dr. Bruce Jacob, for his direction and encouragement for 
the past two wonderful years here at the University of Maryland. He has not only been a 
great academic advisor to me, but has afforded invaluable guidance and insight to life and 
its many little quirks. I am sure the knowledge I have gained as a research assistant 
and student under Dr. Jacob's supervision will help me throughout my professional career. 
This work would not have been possible without him. Further, I would like to express my 
gratitude towards Dr. Franklin and Dr. Gligor for agreeing to be on my committee. 
I would also like to thank Dr. Declaris and Dr. Gansman for their support and the 
experience I gained as a teaching assistant under them. I would like to especially thank Dr. 
Declaris for always believing in me and his precious guidance. I would also like to thank 
the University, the teachers, and the staff for making my Masters Degree a reality. 
Working with Cagdas, Sam, and Xia has been a great experience and I would like to 
particularly thank them for helping me with this work. I am grateful to Sada and IyerB for 
lending their time and ideas when it really counted. I would like to thank my roommates 
and friends, Mukul, Spawgi, Anibha, Potti, Chandesaab, Arindam, Hyma, and Priya for all 
the support, laughter, dabbas, and putting up with the "Gole Factor" these last few years. I 
would like to especially thank Spawgi, Potti, and Priya for being very understanding and 
caring during these last few months. I couldn't have done it without you guys and I hope 
our friendship lasts forever.
Finally I am indebted to my beloved parents, Lata and Vishwas, my sister and brother-
in-law, Tina and Vikram, and all my relatives for believing in me and supporting me 
throughout especially during the difficult times. 
   
 TABLE OF CONTENTS
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Effect of EMI on Integrated Circuits
1.2 TERPS Architecture
Chapter 2 Related Work
Chapter 3 TERPS Architecture
3.1 Checkpointing
3.2 Rollback Recovery
Chapter 4 Correctness of Design
4.1 Resuming to a Consistent State
4.1.1 System State and Rollback
4.1.2 Precise Checkpointing
4.1.3 Multi-phase Commit
4.2 Re-execution of instructions
Chapter 5 Implementation
5.1 Basic Processor Architecture
5.2 Implementation
5.2.1 Logical Verification
5.3 Safe Storage Implementation
Chapter 6 Results
6.1 Performance Analysis
6.1.1 Performance with the Memory Controller on-chip
Chapter 7 Conclusions and Future Work
References
   
LIST OF FIGURES
Figure 1.1: Reduction in feature size over the years
Figure 1.2: Order of concern for our system-level approach
Figure 3.1: TERPS Architecture
Figure 3.2: Long latency EMI detection can cause failure of TERPS checkpoint rollback mechanism
Figure 3.3: Checkpoint rollback mechanism with two safe storage banks
Figure 3.4: Checkpointing and rollback recovery using the checkpoint latch, write buffers and safe storage
Figure 3.5: Rollback recovery details
Figure 4.1: RF writes do not change permanent state
Figure 4.2: Store instructions that commit early may change permanent state
Figure 4.3: Multi-phase commit
Figure 4.4: Importance of saving store data in the safe storage
Figure 5.1: Detailed block diagram of the TERPS Processor Architecture
Figure 5.2: Cadence NC Verilog
Figure 5.3: The Design Browser
Figure 5.4: The Waveform view
Figure 5.5: 3 possible SRAM memory cell implementations
Figure 6.1: Performance Overhead due to checkpointing
Figure 6.2: Performance overhead due to checkpointing with the memory controller on-chip
Figure 7.1: Photomicrographs of chips fabricated via MOSIS
 
LIST OF TABLES
 
Table 5.1: Instruction Set Architecture
Table 5.2: Features of different SRAM topologies
Table 6.1: Write buffer size
     
Chapter 1
Introduction
Electromagnetic Interference (EMI) broadly refers to any type of interference that can 
potentially disrupt, degrade or otherwise interfere with the functioning of electronic 
systems. Current high performance ICs like microprocessors are fabricated with very small 
feature size, are clocked at frequencies well into the GHz range, and operate at reduced 
voltage levels. Though these characteristics have improved the capabilities and 
performance of chips, they have increased the susceptibility of high-performance chips to 
EMI. Hence, there is a growing concern over the electromagnetic compatibility of ICs in 
hostile EMI environments, especially those created by intentionally generated EMI from a 
malicious source. The Embedded Reliable Processing System (TERPS) is a system 
architecture-based approach which uses a checkpoint rollback recovery protocol to 
improve the reliability of microprocessor systems under such extreme operating 
conditions. 
1.1 Effect of EMI on Integrated Circuits
Typical sources of EMI or radio frequency interference (RFI) are overhead high 
voltage lines, lightning events, radar devices, powerful radio transmitters, wireless network 
devices, and GSM (Global System for Mobile Communications) bursts. Until recently, 
intentionally generated EMI was a lesser concern. In August 1999, the International Union 
of Radio Science addressed the subject of criminal EMI and EM terrorism, which is 
defined as "the intentional malicious generation of electromagnetic energy to induce noise 
or high-level disturbances into electrical or electronic systems with the intention to disrupt, 
confuse, or damage these systems for criminal or terrorist reasons" [1]. In general, 
electronic systems are designed more for reduced emissions than for RFI tolerance and 
hence they can easily fall prey to intentional EMI. Medical equipment inside 
ambulances has been reported to shut down at field strengths of 20 V/m due to 
unintentional interference [1]; the threat of intentional interference with field strengths 
of 100 to 200 V/m, which can be produced by off-the-shelf equipment from Radio Shack [1], 
is therefore quite severe. Moreover, experts claim a suitcase-sized threat is widely 
available over the internet [1]. This introduces serious risks for military equipment, 
safety-related automotive systems, and medical equipment because they rely heavily on 
embedded systems, which are highly susceptible to EMI. As a result, the industry and the research community 
are both paying attention to designing systems which not only have low emission 
characteristics but also low susceptibility to EMI. Such electromagnetic pollution imposes 
new challenges in the design of integrated circuits. 
The feature size of ICs has been shrinking rapidly over the years (fig. 1.1) in accordance 
with Moore's Law. The electrical charge involved in transistor switching decreases with 
the decrease in IC feature size. Correspondingly, the energy required to disturb the 
switching process decreases, making it easier to disturb the circuit with increasingly lower 
EMI signal levels. As the switching speeds of microprocessors increase and supply 
voltages scale down, noise margins shrink, and the tolerance for disturbances such as 
those induced by EMI drops drastically, placing greater demands on 
signal integrity. Moreover, parasitic effects inside integrated circuits have dramatically 
increased, making signal integrity a prominent issue [2].
Electronic systems can couple EMI through cables, PCB traces, bonding 
interconnects, and even internal metal chip signals like power, ground, and data lines that 
behave as receiving antennas [3]. EMI that is coupled by the system can induce currents 
(mA) which cause various disturbances. Signal rectification due to interference is caused 
by the inherent nonlinear behavior of electronic devices. This is said to be the primary 
upset mechanism for integrated circuits under RFI [4]. In addition to signal rectification, 
inter-modulation, cross-modulation and other disturbances are immediate effects of 
interference [2]. When interpreted as a system signal or superimposed on one, these 
disturbances, if powerful enough, can cause malfunctioning or spurious state changes on 
logic devices. 
[Figure 1.1 plot: "Reduction in Feature Size". Lithography feature size (µm, logarithmic axis from 0.01 to 10) versus year, 1970 to 2010.]
Figure 1.1: Reduction in feature size over the years. There has been a rapid reduction 
in IC feature size over the last few decades in accordance with Moore's Law. As the 
feature size reduces, the susceptibility of ICs to EMI increases.
Source: Intel and ITRS
     
The power levels and frequency range for which circuits are more susceptible to 
intentional EMI have been studied recently. Previous studies observed changes on the I-V 
characteristics of diodes, BJTs, and MOSFETs under RFI [5]. Susceptibility levels of a 
microcontroller and a DSP chip have been measured for RF interference up to 400 MHz, 
and data corruption was observed on the communication path between the microcontroller 
and RAM [6]. The same study showed that 20 dBm RF interference at 350 MHz is 
enough to trigger the reset pin of a voltage regulator. Another study investigated the effects 
of RF interference on the input ports of a 0.7 µm CMOS IC with frequencies in the 20 MHz to 
1 GHz range and power levels up to 15 dBm [3]. The authors observed dynamic failures in the 
form of variations in input pad propagation delay and static failures when pad output 
signals were misinterpreted as they strayed out of the high or low voltage levels. Thus, 
even less powerful RFI can cause propagation and crosstalk-induced delays on wires and 
can deteriorate signal integrity. 
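The power levels quoted above are in dBm, a logarithmic unit referenced to 1 mW. As a small worked illustration (the function names are ours and are not part of the cited studies), the standard conversion P(mW) = 10^(dBm/10) shows how little absolute power these levels represent:

```python
import math


def dbm_to_milliwatts(power_dbm: float) -> float:
    """Convert a power level in dBm to milliwatts: P(mW) = 10 ** (dBm / 10)."""
    return 10.0 ** (power_dbm / 10.0)


def milliwatts_to_dbm(power_mw: float) -> float:
    """Inverse conversion: dBm = 10 * log10(P / 1 mW)."""
    return 10.0 * math.log10(power_mw)


if __name__ == "__main__":
    # The two power levels mentioned in the studies cited above.
    for level_dbm in (15.0, 20.0):
        print(f"{level_dbm:.0f} dBm = {dbm_to_milliwatts(level_dbm):.1f} mW")
```

Under this conversion, 20 dBm corresponds to 100 mW and 15 dBm to roughly 32 mW.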
Though electronic equipment can be protected to a certain degree by shielding, 
filters on PCBs, and filtered connectors, these measures are often expensive due to 
post-production costs and infeasible for volume applications [5], because the equipment 
has to be designed specifically for different working environments. Hence there is a 
pressing need to design robust ICs. 
1.2 TERPS Architecture
This thesis introduces a fault tolerant system architecture, called TERPS, that can 
significantly reduce the threat of intentional EMI. In contrast to chip level approaches (e.g. 
radiation hardening) or circuit level approaches (e.g. self-checking logic), we investigate a 
system-level approach where multi-phase commit protocols are used in conjunction with a 
safe storage chip, which holds backups of system state and is more EMI resistant than the 
CPU and memory controller chips. The resulting system significantly reduces the 
susceptibility of its processing components to EMI induced transient faults. 
The protection offered to a system's processing components by the TERPS mechanism 
is discussed here. Fig. 1.2 outlines the major components of a computer system along with a 
safe storage memory, a part of the TERPS mechanism. The CPU is connected directly to 
the safe storage via an ECC-protected dedicated bus, which handles the checkpoint 
rollback traffic. The memory controller arbitrates the communication between the CPU 
and the DRAM system. The CPU, memory controller and safe storage constitute the 
sphere of protection and, with the DRAMs themselves, represent the area of highest risk 
for EMI effects. To increase the reliability of the memory system, the DRAM may be ECC 
protected. The processing system is also connected to the I/O system, which represents an 
area of slightly reduced risk for EMI effects. A transient fault is more likely to disturb an 
in-process computation or memory request than an in-process I/O request because I/O 
requests are far less frequent than computations and memory transactions. In addition, the 
processor and memory controller operate at much higher speeds and with tighter timing 
margins than the I/O system. Hence, we will consider the effects of EMI on processors and 
memory systems to be of the first order while those on I/O of the second order. Future 
work will be directed towards incorporating the I/O system into the sphere of protection as 
well. 
[Figure 1.2 diagram. Labeled blocks: CPU, Safe Storage, Memory Controller, DRAM arrays, I/O Controller. Annotations: 1st Order EMI effect; 2nd Order EMI effect; Current Scope of immunity offered by TERPS.]
Figure 1.2: Order of concern for our system-level approach. The protection scheme 
we propose in this study will primarily cover the processing components of a general 
computing system. The CPU and memory system are more susceptible to EMI effects 
as compared to I/O. Therefore our main concern in this study is protecting the 
processing elements as shown in the figure. Future work will be directed towards I/O 
transactions. 
Resistance to intentional EMI within TERPS stems from a hardware-based checkpoint 
and rollback recovery mechanism that works in conjunction with RF detection methods 
[7][8]. This mechanism allows the CPU to roll back to a previously known valid state and 
thus protects against a virtually unlimited number of faults anywhere in the area under 
consideration. Many embedded systems can tolerate only a limited number of 
simultaneous faults and either fail silently or require reboot if more faults occur. This can 
have devastating effects if such faults occur during critical conditions, for example if the 
guidance system fails while directing a missile or if a pacemaker is affected by a wireless 
device. The TERPS architecture allows recovery from such faults without having to reboot 
or shut down and without any human assistance. 
To roll back to a safe state, TERPS maintains snapshots, or checkpoints, of the required 
system state at predetermined intervals. The processor state is saved into the safe storage 
memory chip, which is designed to be more resistant to EMI than the CPU and memory 
controller by employing circuit, device, and process-level techniques that trade off circuit 
performance for noise tolerance. Memory instructions are handled by a series of write 
buffers that provide a multi-phase commit protocol to the DRAM system. On EMI 
detection, the system is rolled back to a correct state using the safe storage. 
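The interplay of checkpoints, write buffers, and rollback can be illustrated with a toy software model. This is only a sketch under simplifying assumptions of ours: a single write buffer, a single safe storage snapshot, and stores committed to memory in one step at each checkpoint. The class and method names are hypothetical, and the actual TERPS hardware, with its multi-phase commit, is described in Chapter 3.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCheckpointedCpu:
    """Toy model: registers are checkpointed, stores are buffered until the next checkpoint."""

    registers: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)        # stands in for the DRAM system
    write_buffer: list = field(default_factory=list)  # uncommitted (address, value) stores
    safe_storage: dict = field(default_factory=dict)  # last checkpointed register state

    def store(self, address, value):
        # Stores are held back so that memory only ever contains checkpointed data.
        self.write_buffer.append((address, value))

    def load(self, address):
        # The newest buffered store wins; otherwise fall back to memory.
        for buffered_address, value in reversed(self.write_buffer):
            if buffered_address == address:
                return value
        return self.memory.get(address, 0)

    def checkpoint(self):
        # Commit buffered stores in program order, then snapshot the registers.
        for address, value in self.write_buffer:
            self.memory[address] = value
        self.write_buffer.clear()
        self.safe_storage = dict(self.registers)

    def rollback(self):
        # Discard all work done since the last checkpoint.
        self.write_buffer.clear()
        self.registers = dict(self.safe_storage)


if __name__ == "__main__":
    cpu = ToyCheckpointedCpu()
    cpu.registers["r1"] = 5
    cpu.store(0x10, 5)
    cpu.checkpoint()
    cpu.registers["r1"] = 99   # work done after the checkpoint
    cpu.store(0x10, 99)
    cpu.rollback()             # simulated EMI detection
    print(cpu.registers["r1"], cpu.load(0x10))
```

In this model the cost of a fault is only the work done since the last checkpoint, which is why the checkpointing interval governs the performance loss.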
As even the minimum state required for rollback is large, it would take many cycles to 
perform a checkpoint or recovery due to constraints of chip-to-chip bandwidth. To significantly 
reduce the impact of checkpointing on performance, high bandwidth solutions like 3D IC 
packaging [9], optical interconnect [10], or RAMBUS Yellowstone [11] technologies can 
be used. For our physical prototype, we will be implementing 3D IC technology. 
Through the application of TERPS, the processing system has improved from one with 
many points of failure in the presence of unintentional/intentional EMI to one in which 
only the safe storage itself and the CPU-resident control logic that handles the checkpoint 
and rollback mechanisms may pose problems.
This thesis provides a detailed description of the TERPS mechanism and a proof of its 
correctness, specifies implementation considerations, shows that the performance overhead 
due to checkpointing is minimal, and describes our physical prototype system, built on 
0.5 µm and 0.25 µm processes via MOSIS.
     
Chapter 2
Related Work
Reliability has always been an important component in the design of high performance 
processors. A characteristic of a highly reliable system is a low failure rate. A failure 
occurs when the behavior of a system deviates from that which is specified for it [12]. 
Hardware component failures, communication faults, timing problems, human error, etc. 
are just a few of the types of faults that occur in systems [12]. Smaller feature sizes, 
reduced voltage levels, higher processing speed, and increasingly complex designs have 
enhanced the functionality of digital systems but have also made them prone to hardware 
related faults during execution. A study by Randell et al. [12] provides insight into 
reliability issues including types of faults, fault tolerance techniques, and examples of fault 
tolerant systems. They classify hardware component failures by duration (fault is 
permanent or transient), extent (effect is localized or distributed), and value (creates fixed 
or varying erroneous results). 
Peercy and Banerjee discuss fault detection in detail [13]. Fault detection requires 
redundancy in either space, time, information, or algorithm. Space redundancy is usually 
some form of n-modular redundancy (nMR) or complementary logic. As chip area is 
expensive, an alternative is time redundancy where the same circuit is used for the same 
functionality at two different times. The major drawbacks are that permanent faults cannot 
be detected and the throughput is lowered. Watchdog timers are also used to guarantee that 
a processor is making forward progress. Information redundancy is in the form of 
concurrent error detecting codes like parity check codes, Berger code, and M-out-of-N 
code. Algorithm-based fault tolerance introduces some information or time redundancy 
into an aspect of the function being performed by the VLSI circuitry. Manoj Franklin's 
study [14] investigates ways to implement redundancy techniques for superscalar 
processors. Underutilized resources available on the system are used to incorporate 
hardware, information, or time redundancy to detect errors in the functional units. REESE 
[15], which is a method of soft error detection in microprocessors, detects transient faults 
using time redundancy and adds a small number of extra functional units to keep the 
execution overhead low. A special form of space and time redundancy is observed in the 
DIVA architecture [16][17]. The core processor is augmented with a small and simple checker 
processor which is functionally the same, only less powerful. If any results from the core 
processor are incorrect due to a fault of some kind, the checker will be able to detect and 
fix the errant result. It then flushes the core processor state and restarts it after the errant 
instruction. This is an elegant solution for a whole range of faults that also 
reduces the burden of verification. However, it cannot be applied to EMI-induced faults, as 
these may occur anywhere in the chip; i.e., the checker processor, the RF, and the clock 
network may all go bad, leaving no valid state to restart from. 
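The three basic forms of redundancy discussed above can each be shown in a few lines. The following is a minimal software sketch with hypothetical function names of our own; it illustrates the ideas only and does not reproduce any of the cited hardware designs.

```python
from collections import Counter


def majority_vote(results):
    """Space redundancy (nMR): return the value produced by a strict majority of modules."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority: too many modules disagree")
    return value


def run_twice_and_compare(operation, *operands):
    """Time redundancy: run the same operation twice; a mismatch signals a transient fault."""
    first = operation(*operands)
    second = operation(*operands)
    return first, first == second


def even_parity_bit(word):
    """Information redundancy: the bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


if __name__ == "__main__":
    print(majority_vote([7, 7, 3]))                         # one faulty module is outvoted
    print(run_twice_and_compare(lambda a, b: a + b, 2, 3))  # results agree, no fault seen
    print(even_parity_bit(0b1011))                          # three 1 bits, so parity bit is 1
```

Note that the time redundancy check cannot expose a permanent fault, since both runs would produce the same wrong result, which matches the drawback noted above.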
In general, a system can be designed to be fault tolerant by some form of redundancy 
and error recovery algorithms. Once an error is detected, fault tolerant techniques use some 
form of forward or backward error recovery [12]. Forward error recovery is dependent on 
having identified the fault, or at least all its consequences. Such schemes attempt to make 
use of the erroneous system state to make further progress (e.g. Error Correcting Codes). 
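As a concrete instance of forward error recovery with an error correcting code, the sketch below uses the classic Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword without rolling back. The choice of this particular code is ours for illustration; the thesis text does not specify which ECC is used.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position), 0 meaning no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome bits spell out the 1-based position of the flipped bit.
    error_position = s1 + 2 * s2 + 4 * s4
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a transient single-bit upset at position 5
    print(hamming74_correct(word))
```

The erroneous codeword itself carries enough information to make progress, which is the defining property of forward recovery; two or more flipped bits would exceed what this code can correct.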
Backward error recovery techniques require establishing recovery points at which 
the state is saved (checkpointing) in a safe location and can be later reinstated 
(rollback-recovery). Checkpointing and rollback-recovery have long been a commonly applied 
fault tolerance approach in the development of highly reliable processing systems. 
Depending on how much time is allowable for recovery procedures and how much loss of 
work is acceptable, checkpointing and rollback-recovery is implemented in software or 
hardware. 
Some of the earliest and most significant systems to adopt checkpointing and 
rollback recovery were intended for space applications in which a high degree of fault 
tolerance was essential. The Jet Propulsion Laboratory Self Testing and Repairing (JPL-STAR) 
computer [18] was a general purpose fault tolerant computer developed for a 
spacecraft guidance, control and data acquisition system which would be used on long 
unmanned space missions. Upon error detection by redundant units, error recovery was 
initiated by backward error recovery in software. The programs established recovery 
points and decided on the state that needed to be checkpointed. File systems, database systems, 
and distributed systems also rely on checkpointing and rollback to establish fault tolerance. 
Koo and Toueg [19] present a distributed algorithm to create consistent checkpoints, as 
well as a rollback-recovery algorithm for distributed systems. They identify the "domino 
effect" and "livelocks" problems associated with checkpoint creation and rollback-recovery in 
distributed systems and then show how their algorithm solves these problems by tolerating 
failures during their execution and forcing a minimal number of processes to roll back after 
a failure. Chandy and Ramamoorthy [20] discuss optimum checkpointing strategies in 
order to have shorter recovery times but still not affect performance significantly. The 
rollback points are tailor-made for a particular program according to their algorithm. 
Upadhyaya and Saluja [21] later modified Chandy and Ramamoorthy's algorithm to insert 
rollback points in programs with multiple retries and also added a watchdog processor for 
error detection. The watchdog processor is implemented in place of a software error 
detection solution to ensure low error latency. K. Shin et al. [22] developed models to 
evaluate the behavior of checkpointing of real-time tasks. Using these models they 
determined optimal intercheckpoint intervals and an optimal number of checkpoints for a 
task by minimizing the mean task completion time subject to a specified confidence in 
execution results. For their realistic model, which includes imperfect coverages of both the 
on-line detection mechanism and the acceptance test, they observed that if a task 
requires a high probability of correct execution results, checkpointing must be done more 
frequently towards the end of the task, since the task has to pass all the acceptance tests 
near the end of the task. 
In this research, we focus on environments where error rates are high and real-time 
constraints prohibit significant delays for recovery. These constraints motivated us to use a 
hardware-assisted backward error recovery scheme - instruction retry - for TERPS. 
Instruction retry is used for rapid recovery from transient faults and is seen in many 
systems including the IBM 4341 processor [23], C.fast [24], the IBM ES/9000 Model 900 
[25], and in the UCLA Mirror Processor [26]. In single instruction retry, the state of the 
processor is checkpointed at each instruction boundary, and upon error detection, the state 
is rolled back to the previous instruction state. However, this requires immediate error 
detection. In the IBM 4341 processor [23], an instruction is retried by restoring state 
information that is continuously saved and removed by hardware. If the instruction is to be 
aborted, the "machine check interrupt process" is provided with a damage report. Tsao et 
al. introduce C.fast, a VLSI fault tolerant processor [24] in which shadow registers that 
contain state of the previous instruction are attached to every state register on the chip. 
When an error is detected during the execution of an instruction, the processor is able to 
retry the same instruction immediately. 
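Single instruction retry with shadow state can be modeled in software as follows. This is a loose sketch of the idea only: the class, the function, and the `error_detected` callback are hypothetical stand-ins of ours for the hardware shadow registers and concurrent checkers, not the C.fast design itself.

```python
class ShadowedRegisterFile:
    """Each register has a shadow copy holding its value before the current instruction."""

    def __init__(self, names):
        self.current = {name: 0 for name in names}
        self.shadow = dict(self.current)

    def begin_instruction(self):
        # Instruction boundary: the current state becomes the retry point.
        self.shadow = dict(self.current)

    def undo_instruction(self):
        # Error detected: restore the state from before the instruction.
        self.current = dict(self.shadow)


def execute_with_retry(register_file, instruction, error_detected, max_retries=3):
    """Run `instruction`; if `error_detected()` is true afterwards, undo it and retry."""
    for attempt in range(max_retries + 1):
        register_file.begin_instruction()
        instruction(register_file.current)
        if not error_detected():
            return attempt  # number of retries that were needed
        register_file.undo_instruction()
    raise RuntimeError("fault persisted across all retries")


if __name__ == "__main__":
    registers = ShadowedRegisterFile(["r1"])
    faults = iter([True, False])  # first attempt is hit by a transient fault

    def increment(regs):
        regs["r1"] += 1

    retries = execute_with_retry(registers, increment, lambda: next(faults))
    print(retries, registers.current["r1"])
```

The sketch only works because the error is reported before the next instruction begins, which is the immediate detection requirement discussed above.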
However, the concurrent error detection required for single instruction retry demands 
checkers and isolation circuits in communication paths between different modules of the 
system. These systems can incur significant performance penalties due to the delays in 
checking. To avoid this performance loss, error checking can be done in parallel. The side 
effect is that the error signal is delayed and recovery becomes more complicated. Multiple 
instruction retry - rolling back multiple instructions - is called for in response to a delayed 
error signal. Multiple instruction retry schemes can either employ full checkpointing or 
incremental checkpointing [27]. In full checkpointing, which is employed by TERPS, 
snapshots of the system state are established at regular or predetermined intervals, and the 
system can roll back to this saved state on error detection. In contrast, incremental 
checkpointing preserves system state alterations in a sliding-window-like manner; error 
detection initiates recovery by undoing the system state changes one instruction at a time, 
back to an instruction previous to the one in which the error occurred. The Model 900 [25] 
uses a form of incremental checkpointing by postponing the remapping of physical registers 
until the error detection latency has been exceeded for the data contained in the physical 
register. Checkpoints of the system state are made at variable intervals. Though the 
processor has an out-of-order model, in-order completion is maintained by storing the 
results of instructions that finished out-of-order in temporary registers. If one of the 
processors fails due to some fault, its processing state is rolled back to a consistent error 
free state by purging the pipeline and temporary registers. Micro rollback is another 
interesting incremental checkpointing based multiple instruction retry concept which was 
introduced by Tamir et al. [28][29]. Micro rollback is the process of backing up a system 
several cycles in response to a delayed error signal. In micro rollback each module must 
save the state required to properly recover. In the UCLA Mirror Processor (MP) [26] 
system two mirror processor chips operate in lock-step, comparing external signals and a 
signature of internal signals every clock cycle. On error detection, both processors either 
recover using micro rollback or, in certain cases, erroneous state is corrected by copying a 
value from the fault-free processor to the faulty processor. The MP was designed to 
recover from single transient faults (with support for some multiple faults also) which are 
detected by having 2 processors, i.e. 2-modular redundancy. The MP works to recover as 
soon as an error is detected to prevent the spread of erroneous information throughout the 
system, i.e. error confinement. TERPS does a system-level recovery and prevents errors 
from spreading throughout the system as the state is never completely committed until it is 
safe to do so. Unlike the MP, TERPS does not take checkpoints at every clock cycle and 
does not recover to exactly the clock cycle before the error. But TERPS is similar to the 
MP in that it also uses write buffers to support the rollback mechanism when encountering 
store instructions. 
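The incremental, sliding-window style of checkpointing used by micro rollback can be sketched as a bounded undo log. The model below is a simplification of ours (one register write per instruction, a hypothetical `UndoWindow` class) and is not the Mirror Processor design; it is meant only to contrast with the full snapshots taken by TERPS.

```python
from collections import deque


class UndoWindow:
    """Keeps the last `depth` register writes so they can be undone one at a time."""

    def __init__(self, depth):
        self.registers = {}
        self.history = deque(maxlen=depth)  # oldest entries fall out of the window

    def write(self, name, value):
        # Record the old value (None if the register was unset) before overwriting it.
        self.history.append((name, self.registers.get(name)))
        self.registers[name] = value

    def rollback(self, instructions):
        # Undo the most recent writes, newest first.
        if instructions > len(self.history):
            raise ValueError("error latency exceeds the rollback window")
        for _ in range(instructions):
            name, old_value = self.history.pop()
            if old_value is None:
                self.registers.pop(name, None)
            else:
                self.registers[name] = old_value


if __name__ == "__main__":
    window = UndoWindow(depth=4)
    window.write("r1", 1)
    window.write("r2", 2)
    window.write("r1", 3)
    window.rollback(2)  # a delayed error signal arrives two instructions late
    print(window.registers)
```

The window depth bounds the error detection latency that can be tolerated, whereas a full checkpoint scheme bounds it by the checkpoint interval instead.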
The aforementioned hardware-based instruction retry schemes employ some form of 
data redundancy to eliminate rollback data hazards leading to hardware overhead. 
Compiler-based multiple instruction retry techniques [30] have been developed to reduce 
hardware costs by using data flow transformations to remove the anti-dependencies that 
result from multiple instruction rollback. However, compiler-assisted instruction retry [27][31], 
which utilizes a read buffer to eliminate one kind of rollback data hazard and compiler 
techniques to eliminate the remaining hazards, shows better performance as compared to 
the compiler-only instruction rollback scheme by exploiting the unique characteristics of 
different hazard types. 
Instruction retry has the disadvantage that changes have to be made in the processor 
design. Bowen and Pradhan introduced a scheme that supports checkpointing and rollback 
recovery at a higher level; checkpoint and rollback was embedded directly into the 
translation lookaside buffer (TLB) [32]. In this scheme, a backup copy of a memory page 
is made just before it is modified. This requires large checkpointing intervals to minimize 
the overhead due to page manipulations and modification of the TLB. Cache-Aided 
Rollback Error Recovery (CARER)[33] is a cache-based checkpointing proposal wherein 
the replacement policy of the regular cache is modified such that it prevents the 
replacement of dirty data thereby keeping a checkpoint state in memory. When the 
deletion of some of the dirty blocks becomes unavoidable, an external interrupt occurs, or 
an I/O instruction is executed, a checkpoint is established by saving the processor state in 
internal backup registers and marking all the dirty blocks as unchangeable. When an error 
is detected, the processor recovers by restoring its saved state and marking all cache 
blocks except the unchangeable ones invalid. TERPS also employs a similar approach 
where the write buffers act as a cache and hold the store instruction data to prevent it 
from being committed to memory. However, TERPS does not use the modified 
replacement policy used by CARER to save state as it stores the checkpointed state in an 
external safe storage memory. An excellent performance study on cache-based recovery 
schemes is presented by Janssens and Fuchs [34]. They state that although the average 
overhead of cache-based recovery schemes is quite small, the performance is not 
predictable as compared to a system without recovery capability, due to the lack of control 
over, and the variability of, the checkpoint frequency of different programs and caches; checkpoint 
frequency will vary according to the I/O behavior and the program's interaction with the 
memory. TERPS has a constant checkpoint frequency and it is shown that the performance 
impact is predictable across different programs. This is crucial for real-time systems where 
a predictable recovery behavior would assist a scheduler to schedule programs to meet 
their deadlines even in the presence of a fault.
Support for checkpointing and rollback recovery in shared memory multiprocessor 
environments has also been proposed [35][36][37]. Wu et al. [35] present a cache-based 
checkpointing and recovery algorithm to maintain a consistent checkpoint state. The use of 
checkpoint identifiers and recovery stacks along with private caches was shown to reduce 
performance degradation due to increased write-backs. In the ReVive scheme [36], 
complex checkpoint and rollback functions are performed in software, while hardware 
operations are limited to the directory controllers of the machine to reduce costs. During a 
global checkpoint, the caches are flushed to memory and a two-phase commit protocol is 
performed. Therefore the main memory contains the checkpoint state. Changes to the 
checkpoint state in the memory are logged by the home directory controller and are used to 
restore the memory state upon error detection. ReVive performs recovery from a wide 
range of failures without any hardware modification to the processors or caches. SafetyNet 
[37] is a fault tolerant solution which maintains multiple, globally consistent checkpoints 
of a shared memory multiprocessor and minimizes performance overhead by pipelining 
checkpoint validation with subsequent parallel execution. The current uni-processor 
TERPS form can be extended to a multi-processor environment utilizing architectures 
similar to SafetyNet [37] as it also can sustain long latency error detection mechanisms.
Checkpointing and rollback was proposed by Hwu and Patt for branch mis-prediction 
and exception handling in out-of-order processors [38]. They proposed cost-effective 
algorithms for performing checkpoint repair which incur very little overhead in time. 
Smith and Pleszkun introduced novel structures for implementing precise exceptions in 
pipelined processors [39]. When an exception occurs, the process state must be saved such 
that it reflects the sequential architectural model. Primarily, the saved state must reflect the 
following conditions: (i) All instructions preceding the instruction indicated by the saved 
program counter have been executed and have modified the process state. (ii) All 
instructions following the instruction indicated by the saved program counter are 
unexecuted and have not modified process state. (iii) The saved program counter points to 
the interrupted instruction. One can recognize that the concepts of precise exception 
handling in pipelined processors can be used to ensure that during checkpointing, a precise 
state is saved. 
TERPS has been developed borrowing the checkpointing and rollback concepts 
applied in the software and hardware of many fault tolerant systems and the conditions for 
precise exceptions in pipelined machines for providing a precise rollback state. Fault 
tolerant architectures that have been proposed previously have mainly concentrated on 
protecting systems from single error transient faults while TERPS has been designed 
keeping in mind that EMI induced faults may occur everywhere in the system. This 
disparity is the main reason behind the differences in contemporary fault tolerant designs 
and TERPS. 
       
Chapter 3
TERPS Architecture 
As stated earlier, there is a growing concern over the electromagnetic compatibility of 
ICs in hostile EMI environments, especially those created by intentionally generated EMI 
from a malicious source. EMI can couple through various parts of a system and, if 
powerful enough, can cause misinterpretation of data, clock edges and even the power and 
ground references. This can result in failures in many sections of the chip at the same time. 
Related work has aimed at tolerating a single error or a limited number of faults and hence 
is not directly applicable as a solution to this problem. TERPS is a system architecture-
based fault tolerance approach that addresses the issues related to EMI induced faults 
with little performance overhead. It allows recovery from such faults without having to 
reboot or shutdown and without any human or even software assistance. A description of 
how the architecture efficiently implements the hardware-based checkpoint rollback 
recovery mechanism is provided in detail in this chapter. 
3.1 Checkpointing
The minimal process state required to return to any point of execution varies from 
processor to processor, but in general it comprises the program counter, the register file, 
and a window of memory transactions. For precise checkpointing the saved process state 
must be consistent with the sequential architectural model. The issues dealt with here are 
similar to those addressed by Smith and Pleszkun [39].
First, a system overview of the various elements of the TERPS architecture is 
given in fig. 3.1. In addition to the CPU chip and memory system, a special safe 
storage chip is added to the basic system architecture. The CPU is connected directly 
to the safe storage via a dedicated bus to handle the checkpoint rollback traffic. This bus 
may be ECC-protected to protect against single error transient faults. The memory 
controller arbitrates the communication between the CPU and the DRAM system. The 
CPU, memory controller and safe storage constitute the sphere of protection offered by 
TERPS currently and, with the DRAMs themselves, represent the area of highest risk for 
EMI effects. To implement the mechanism, the processing system has a checkpoint latch 
and a series of write buffers. This processing system is also connected to the I/O system 
(fig. 1.2), which as explained earlier represents an area of slightly reduced risk for EMI 
effects. Future work will be directed towards incorporating the I/O system into the sphere 
of protection as well. 

Figure 3.1: TERPS Architecture. 
[Block diagram; labels: CPU chip, Processor Core, Write Buffers (WB0-2), Checkpoint Latch, 
CPU_clk, Memory Controller, DRAM Array, Safe Storage Chip, Buffer, SS A, SS B, ss_clk, 
chkpt_rollback bus, state latched on CPU side, state latched on SS side.]

In order for the safe storage to be less susceptible to EMI, it applies circuit, device, and 
process-level techniques that trade off circuit performance for noise tolerance and hence 
operates at a frequency much lower than the CPU. The safe storage clock (ss_clk) is 
stepped down from the CPU clock (CPU_clk) and is given a duty cycle designed to 
maximize setup and hold times available to the safe storage. For the purposes of the 
discussion let us take the period of the ss_clk to be N times longer than that of the 
CPU_clk, i.e. $T_{ss\_clk} = N \times T_{CPU\_clk}$. Due to this speed mismatch and differences in process technology, 
the process of checkpointing is not a straightforward one. The safe storage must latch a 
value from the CPU at a clock speed dictated by its technology's characteristics; otherwise its 
setup and hold times might be violated if, for example, the data is held valid on the bus for 
a time equal to the period of CPU_clk and that time is less than the setup and hold times 
required by the safe storage. Hence when a precise checkpoint is taken at the CPU side, the 
process state is first stored in a checkpoint latch. If no fault is detected, the safe storage will 
read the state from the checkpoint latch at every positive edge of the ss_clk. It is important 
to note that EMI detection will not be concurrent and will probably take a few CPU clock 
cycles. This leads to problems when a fault happens just before the safe storage reads the 
state from the checkpoint latch, and the fault is detected only after this action. The saved 
state in the safe storage may be polluted and the system would not be able to recover from 
that state. This problem is depicted in fig 3.2. 
Therefore, when recovery is necessary, we have to roll back to an older valid 
checkpoint. To satisfy this condition, the safe storage has two banks and stores the 
checkpointed state in safe storage A and safe storage B alternately, 
allowing it to maintain the last two checkpoints. This modified checkpoint rollback 
mechanism is illustrated in fig. 3.3. This design is compatible with an EMI detection 
circuit which can report the fault within at most N CPU clock cycles. Hence checkpointing 
is done every N CPU cycles. 
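The alternating use of the two banks can be sketched in a few lines of Python. This is only a behavioural toy model, not the Verilog implementation of the thesis: the class name SafeStorage, the method ss_clk_edge and the string labels for the checkpointed states are hypothetical, and the handling of the select line is collapsed into a single step per ss_clk edge.

```python
class SafeStorage:
    """Toy model of the dual-bank safe storage (SS A and SS B)."""

    def __init__(self):
        self.banks = [None, None]  # banks[0] = SS A, banks[1] = SS B
        self.ss_sel = 0            # bank used at the next ss_clk edge

    def ss_clk_edge(self, latch_state, fault_detected):
        """Model one positive edge of ss_clk.

        No fault: the checkpoint latch is written to the selected bank.
        Fault: the latch is not read; the selected bank, which holds the
        older checkpoint, is returned as the rollback state. In both cases
        the select line then toggles, so after a recovery the next
        checkpoint overwrites the newer, possibly corrupted bank.
        """
        rollback_state = None
        if fault_detected:
            rollback_state = self.banks[self.ss_sel]
        else:
            self.banks[self.ss_sel] = latch_state
        self.ss_sel ^= 1
        return rollback_state


ss = SafeStorage()
ss.ss_clk_edge("R", fault_detected=False)             # R is written to SS A
ss.ss_clk_edge("S", fault_detected=False)             # S is written to SS B
recovered = ss.ss_clk_edge("U", fault_detected=True)  # U is not latched
ss.ss_clk_edge("P", fault_detected=False)             # first checkpoint after recovery
print(recovered, ss.banks)
```

In this sketch the fault reported after S was stored causes a rollback to R, and the first checkpoint taken after recovery replaces S rather than R.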
Figure 3.2: Long latency EMI detection can cause failure of TERPS checkpoint 
rollback mechanism. The TERPS checkpoint rollback recovery mechanism can be 
explained using the safe storage clock (ss_clk) as a reference. At point R the state is 
checkpointed at the CPU and written to the safe storage at point Q as shown. Then at 
point S a new checkpoint is made by the CPU and this state is stored in the safe storage 
at point T overwriting the last checkpointed state R. If a fault actually occurs just before 
point T and was detected only afterwards due to the long latency EMI detection, the state 
saved in the safe storage may be corrupt which is indicated by S*. When the system 
initiates recovery at point C, it will reinstate the bad state S* into the system and 
recovery will correspondingly fail. To operate correctly, the system should be able to 
rollback to state R.
[Timing diagram against ss_clk showing points R, Q, S, T and C, the safe storage contents 
(R, then S*), the point where the fault actually occurs, the point where it is detected, 
and the rollback.]

Store instructions must be prevented from writing their data to permanent storage 
before it is known whether the store data is error-free or not. By delaying the stores from 
committing, load instructions that are re-executed after a recovery will not read the wrong 
data. A multi-phase commit protocol has been employed to delay the store data by 
directing it through a series of three write buffers and the memory controller before they 
are actually written to memory. As we are rolling back to the older checkpoint, we need 3 
write buffers to ensure that no write instruction is committed to permanent state until it is 
safe to do so. Justification for using three write buffers is provided in the following chapter 
where correctness of the design is addressed. The interaction of the checkpoint latch, write 
buffers, and safe storage is shown in fig 3.4. During every checkpoint interval on the CPU 
side, stores write to the first write buffer, WB0. On average about 30% of all instructions 
are memory transactions and about one-third of those are stores [40]. Therefore the size of 
the write buffer can be roughly decided by the frequency of checkpointing, e.g. if a 
checkpoint is made every 128 CPU cycles, then the write buffer size can be set around 12 
entries. In the worst case, if the write buffer becomes full, the pipeline is stalled until the 
next checkpoint. During a checkpoint the contents of WB0, i.e. the store instructions that 
were executed in this last checkpoint interval, are written to the checkpoint latch. Also the 
data in WB2 is sent to the memory controller atomically, which may take a few cycles 
depending on the checkpointing frequency and width of the frontside bus. On completing 
this transaction, WB2 will be overwritten by WB1 and WB1 by WB0. WB0 is ready for 
the store data that will follow. The memory controller begins writing stores to the DRAM 
and normal execution resumes. 

Figure 3.3: Checkpoint rollback mechanism with two safe storage banks. At point R 
the state is checkpointed at the CPU and written to the safe storage A (SS A) at point Q. 
Then at point S a new checkpoint is made by the CPU and this state is stored in the safe 
storage B (SS B) at point T. The CPU initiates another checkpoint at point U. Note that 
since a fault is detected sometime before point C, the checkpoint made at U is not latched 
in the SS A. At this point the safe storage contains checkpoints made at points R and S in 
SS A and SS B respectively. As the fault detected before point C may have occurred in 
the interval Q and T due to delays in the detection circuit, the system recovers to point R 
and not S. 
[Timing diagram against ss_clk showing points R, Q, S, T, U and C, the contents of SS A 
and SS B, the fault detection, and the rollback.]

Figure 3.4: Checkpointing and rollback recovery using the checkpoint latch, write 
buffers and safe storage. For explanation purposes, the state of the 3 write buffers at 
different checkpoint intervals is indicated by I, J, K, etc. At point R, the processor is 
stalled, and a checkpoint is taken over a time interval Δt_c during which the PC, RF and 
WB0 (K) are written to the checkpoint latch, WB2 (I) is sent to the memory controller, 
and then the write buffers are prepared for the next checkpoint interval as shown. At this 
point checkpointing is done and normal execution resumes. At the next positive edge of 
the ss_clk, i.e. at point Q, the safe storage reads the state checkpointed at R from the 
checkpoint latch. By this point the memory controller has finished updating the DRAM 
too. This checkpointing operation is repeated until a fault is detected. In the figure a fault 
is detected between times T and C. On detection, the system goes into recovery mode. 
Rollback is accomplished by loading the state from the safe storage back to the CPU. 
Now the system goes back to normal mode of operation.
[Timing diagram against ss_clk showing the contents of WB0, WB1 and WB2 in each 
checkpoint interval, the contents of SS A and SS B, the checkpoint interval, the detection 
window, the rollback window, and execution after rollback. Legend:
    R: Processor stalled and checkpoint is initiated
    Δt_c: {PC,RF,K} @ checkpoint latch; I @ MC; WB2<=WB1, WB1<=WB0
    t_s/Q: {PC,RF,K} @ SSA; I @ DRAM
    S: {PC,RF,L} @ checkpoint latch; J @ MC
    T: {PC,RF,L} @ SSB; J @ DRAM
    U: {PC,RF,M} @ checkpoint latch; K @ MC
    C: {PC,RF,M} not latched; K @ DRAM
    On Rollback: {PC,RF,K} @ SSA to {PC,RF,WB0} @ CPU; WB1,WB2 flushed]

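The write buffer rotation and the sizing estimate described above can be sketched as follows. This is a minimal Python model under stated assumptions: the names WriteBuffers and estimated_stores_per_interval are hypothetical, stores are represented as (address, value) pairs, and the sizing estimate assumes one instruction completes per CPU cycle, which the text does not state explicitly.

```python
class WriteBuffers:
    """Toy model of the three write buffers WB0, WB1 and WB2."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.wb = [[], [], []]  # wb[0] = WB0, wb[1] = WB1, wb[2] = WB2

    def store(self, addr, value):
        """Buffer a store in WB0. False means WB0 is full and the pipeline must stall."""
        if len(self.wb[0]) >= self.capacity:
            return False
        self.wb[0].append((addr, value))
        return True

    def checkpoint(self):
        """Return (WB0 copy for the checkpoint latch, WB2 contents for the memory controller)."""
        to_latch = list(self.wb[0])
        to_memory = self.wb[2]
        # WB2 <= WB1, WB1 <= WB0, and WB0 is emptied for the next interval.
        self.wb = [[], self.wb[0], self.wb[1]]
        return to_latch, to_memory


def estimated_stores_per_interval(cycles, mem_fraction=0.30, store_fraction=1 / 3):
    """Rough store count per checkpoint interval, assuming one instruction per cycle."""
    return cycles * mem_fraction * store_fraction


bufs = WriteBuffers(capacity=12)
bufs.store(0x10, 1)
latch1, mem1 = bufs.checkpoint()
bufs.store(0x20, 2)
latch2, mem2 = bufs.checkpoint()
bufs.store(0x30, 3)
latch3, mem3 = bufs.checkpoint()
print(mem1, mem2, mem3, estimated_stores_per_interval(128))
```

In this model a store buffered before the first checkpoint is handed to the memory controller only at the third checkpoint, and the estimate for a 128-cycle interval comes out at roughly 12.8 stores, in line with the buffer size of around 12 entries quoted above.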
Note that in the TERPS architecture an instruction is declared committed when its 
result is out of the safe storage and hence in permanent state. An instruction whose 
results are reflected in the older saved checkpoint is ready to commit only after the 
newer saved checkpoint is known to be valid, so that the system will be 
able to roll back to it in case of a fault. This defines a rollback window, which is the 
minimum lifetime of an instruction, i.e. any instruction checkpointed at a given rollback 
point cannot be committed before it is out of the rollback window. For example, in fig. 3.4, 
instructions checkpointed at rollback point R can be committed to permanent state only 
after commit point C if there is no fault detected. 
3.2 Rollback Recovery
When the EMI detection circuit indicates a fault, the pipeline is stalled until the rising 
edge of safe storage clock to prevent the system from executing instructions that may be 
faulty. At this point the system goes into recovery mode and the pipeline is immediately 
flushed to remove the corrupted state. The safe storage is prevented from reading the 
checkpoint latch, as it would during normal operation, so it does not save the state that 
could have been polluted. Instead, after sufficient bus turn-around-time, the safe storage 
output buffers are enabled to provide the valid state to the CPU. As the safe storage is 
running at a much slower clock, the CPU will wait until the safe storage is able to drive its 
output buffers. Once ready, the CPU latches the data from the safe storage and normal 
operation is resumed. A detailed timing diagram of the rollback recovery procedure is 
shown in fig. 3.5. It can be seen that the rollback penalty is four checkpoint intervals. Note 
that no matter where within a certain detection window a fault is detected, the system will 
always recover to the same rollback point corresponding to that detection window. 
Figure 3.5: Rollback recovery details. A timing diagram of the recovery procedure is 
highlighted with the aid of the Recovery mode signal, safe storage select signal, and 
checkpoint rollback bus. When a fault is detected, the system goes into Recovery mode at 
the next rising edge of the ss_clk. After the bus turn around time, the safe storage puts the 
rollback state onto the checkpoint bus and it is read by the checkpoint latch. Again the 
bus is turned around and normal operation is resumed utilizing the saved state. At the 
beginning of every checkpoint interval the safe storage select (ss_sel) line, which selects 
the bank of the safe storage to write to or read from, is toggled. But when the system is in 
Recovery mode, it is not toggled as it is already pointing to the bank with the older state, 
from which the processor will read instead of write during recovery. After recovery, 
ss_sel is toggled as usual, hence causing the system to overwrite the newer saved state 
(S). This is convenient as the newer state (S) may be corrupted due to the issues discussed 
previously with regard to delay in detection.
[Timing diagram showing ss_clk, the Recovery mode signal, ss_sel and the chkpt_rollback 
bus, together with the contents of SS A and SS B, the detection window, the rollback 
penalty, and the bus turn around time (TAT).]
   
Chapter 4
Correctness of Design
The principles of rolling back are similar to those of handling a branch misprediction 
or an exception in an out-of-order pipelined processor where some instructions have to be 
removed and execution is restarted from another point. In the case of checkpoint and 
rollback recovery, when a fault is detected, some instructions are removed and execution 
restarts from a point the system had passed through in the past. In both cases, the system 
should give the appearance that there was no break in the flow of execution, i.e. rolling back 
should be transparent. Thus, the basic objective is to add some form of support to recover 
to a precisely correct system state while creating the impression that nothing went wrong. 
Also, for any checkpoint rollback mechanism, care should be taken to ensure that re-
executing instructions, and hence writing and reading results twice, does not affect the 
correctness of computation. 
Therefore, for any checkpoint rollback recovery mechanism to function properly, it is 
necessary and sufficient to satisfy the following conditions: 
1. The system resumes execution to a consistent valid state after rollback recovery. 
2. Re-execution of instructions does not affect correctness of computation. 
These conditions are sufficient because they ensure that the system will continue 
execution in a transparent manner. This chapter describes how TERPS 
attempts to satisfy them, with various examples and counterexamples. 
4.1 Resuming to a Consistent State
4.1.1 System State and Rollback
In order to resume execution to a consistent and valid state, a consistent and valid state 
must be saved during a checkpoint in the first place.
The entire state of a processing system is so large that it is difficult to quantify. It 
consists of the pipeline registers, control data, memory, etc. But there is a subset of this 
state which is sufficient to restart execution from, and it is important to identify this state to 
do efficient and valid checkpointing. Though this state will vary from architecture to 
architecture, for the purposes of discussion, a general idea of necessary state is given. The 
basic operation of a processing system is to fetch an instruction and execute it based on 
what kind of an instruction it is. Therefore it is absolutely necessary to save the address of 
the instruction you may want to restart from so it can be fetched again. This is stored in the 
program counter or PC. Instructions are generally of 3 types: ALU, memory and I/O. As 
I/O semantics are complicated, I/O instruction issues are not discussed at this point. ALU 
instructions read operands from and write results to the register file (RF). Memory instructions 
read from or write to the memory and the RF. Thus, in general, the state required to be saved 
during a checkpoint, in order to restart from an intermediate point, should consist of the 
PC, RF, and memory. This state is also called the rollback state.
The memory is usually quite large and it would be difficult to checkpoint the entire 
memory. But store instructions are not that frequent, and the state changes made by stores 
within a checkpoint interval can be saved. This assumes that the memory is relatively fault 
tolerant. In TERPS, support for saving changes to the memory system within a checkpoint 
interval is provided in the form of a write buffer (WB0) as explained previously. 
4.1.2 Precise Checkpointing
After identifying the information that needs to be saved during a checkpoint, the 
checkpointing mechanism must save the information such that it forms a consistent state. 
In a sequential (un-pipelined) machine, instructions are processed one-by-one, one 
finishing before the next starts. For any architecture, the rollback state must be precise, i.e., 
the rollback state should reflect the sequential architectural model. This is similar to 
establishing precise interrupts in pipelined processors [39]. If the rollback state is 
imprecise, it may leave the system in an irrecoverable state.
For precise checkpointing the following conditions should be satisfied:
1. The state changes by all instructions preceding the instruction indicated by the 
checkpointed PC are reflected in the rollback state. 
2. The state changes by all instructions following and including the instruction indicated 
by the checkpointed PC are not reflected in the rollback state.
It is trivial to satisfy these requirements for a sequential architecture. Fulfilling these 
conditions for an in-order pipelined processor is also quite straightforward. The 
checkpoint mechanism should stall the pipeline and then checkpoint by saving the PC of 
the next-to-complete instruction, the Register File (RF), and the writes to the memory 
system in that checkpoint interval (WB0). Checkpointing the PC of the next-to-complete 
instruction ensures that the instructions preceding it would have already completed and 
their results would be reflected in the rollback state. Store instructions may write to the 
memory system before they reach the next-to-complete stage in the pipe depending on the 
design. This would lead to an inconsistent rollback state. Stalling the pipeline prevents 
these memory writes from changing the state before a checkpoint is established, and hence 
satisfies condition 2. For an out-of-order pipelined processor, the techniques 
implemented by Sohi and Vajapeyam [41] to establish a precise interrupt can be used to 
determine a precise checkpoint. 
4.1.3 Multi-phase Commit
Even though the rollback state is precise, it cannot be guaranteed that the system will 
roll back to a valid state. In TERPS, checkpointing is a 2-step process. First the rollback 
state is saved in the checkpoint latch at the CPU and then it is read into the safe storage. As 
explained in the previous chapter, the rollback state may be corrupted due to delays in EMI 
detection. To prevent rolling back to a corrupted state, TERPS maintains the older rollback 
state in the safe storage too, which is known to be error free. This state is used to roll back 
to a valid state. The instructions in this older state should not be committed to permanent 
unrecoverable state until it is known that the newer rollback state saved is error free. If this 
condition is not supported then the system is vulnerable to recovering to an invalid state. 
TERPS is outfitted with a dual-bank safe storage to preserve the last two rollback states. 
When EMI is not incident, the recent checkpointed state overwrites the bank containing 
the older checkpoint when it is read into the safe storage. It is safe to overwrite the older 
rollback state as the other rollback state is known to be good at this point if a fault did not 
occur. 
Instructions that read from the RF after a recovery see a valid RF state because during 
a checkpoint the entire RF is saved. On recovery, the entire RF is overwritten by the 
rollback state undoing all the writes of instructions that wrote to it after the checkpoint. 
Thus, the state of the RF after recovery is precise. This is shown in fig. 4.1. 
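The effect of saving and restoring the entire RF can be illustrated with a short Python sketch. It is an illustration only: the RollbackState record and the helpers take_checkpoint and recover are hypothetical names, the register values are arbitrary, and the RF is modelled as a plain list of eight registers.

```python
from dataclasses import dataclass, field


@dataclass
class RollbackState:
    pc: int                                   # PC of the next-to-complete instruction
    rf: list                                  # copy of the entire register file
    wb0: list = field(default_factory=list)   # stores of the last checkpoint interval


def take_checkpoint(pc, rf, wb0):
    """Copy the state so that later RF writes cannot alter the checkpoint."""
    return RollbackState(pc, list(rf), list(wb0))


def recover(state):
    """Return the PC and a fresh copy of the RF held in the rollback state."""
    return state.pc, list(state.rf)


rf = [0, 5, 7, 42, 0, 0, 0, 0]        # r1 = 5, r2 = 7, r3 = 42 (arbitrary values)
saved = take_checkpoint(pc=100, rf=rf, wb0=[])

first_store_value = rf[3]             # (i) sw addrA, r3 reads r3
rf[3] = rf[1] + rf[2]                 # (j) add r3, r1, r2 overwrites r3

pc, rf = recover(saved)               # fault detected: the entire RF is reloaded
second_store_value = rf[3]            # (i) is re-executed and reads r3 again
print(first_store_value, second_store_value)
```

Because the whole RF is reloaded, the re-executed store reads the same value of r3 as it did the first time, even though instruction j overwrote r3 before the fault was detected.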
Figure 4.1: RF writes do not change permanent state. In the instruction stream on the 
left, instructions above are fetched before the instructions below. The RF in different 
checkpoint intervals is represented by W,X,Y, and Z. The arrow indicates that the result 
of instruction j is written to RF X. Instruction j writes to register R3 after instruction i 
reads from R3. But after recovery, the entire register file is loaded from safe storage bank 
A and does not reflect the change made by instruction j stored in safe storage bank B.
[Instruction stream:
    1st checkpoint
    ....
    (i) sw addrA, r3
    ....
    (j) add r3, r1, r2
    ....
    2nd checkpoint
    ....
    3rd checkpoint
    ....
    Fault Detected
    Initiate Recovery
Timeline against ss_clk, with the RF states W, X, Y and Z in successive checkpoint 
intervals: V @ SSB before the 1st checkpoint; 1st checkpoint (rollback point), W @ SSA; 
2nd checkpoint, X @ SSB; 3rd checkpoint, Y not latched; on rollback, SSA reloaded.]

However, a dual-banked safe storage is not sufficient for memory instructions because 
unlike the RF, the entire memory is not saved in the rollback state during a checkpoint as 
explained previously. Only the store data for the checkpoint interval before the rollback 
point is recovered from the safe storage. Hence, load instructions that are re-executed after 
a recovery may not see a consistent state of the memory if store instructions executing after 
the rollback point are committed to permanent state. An example is illustrated in fig. 4.2 
where, in an instruction sequence between two checkpoints, a load instruction reads from 
an address location that a store instruction succeeding it writes to. If the stores are 
committed to permanent state too early, the load instruction may read the wrong data. 
Architectural support to delay such stores from writing to permanent unrecoverable state 
before it is safe to do so is called for. 
Figure 4.2: Store instructions that commit early may change permanent state. The 
figure illustrates a scenario with one write buffer where writes are committed to the 
memory system just after they are checkpointed. The load instructions i and j will 
incorrectly read the store data from instructions k and l after recovery as they were 
written to memory. 
[Instruction stream:
    ....
    1st checkpoint
    ....
    (i) lw r3, addrA
    (j) lw r5, addrB
    ....
    (k) sw addrA, r6
    (l) sw addrB, r3
    (m) add r3, r1, r2
    ....
    2nd checkpoint
    ....
    3rd checkpoint
    ....
    Fault Detected
    Initiate Recovery
Timeline against ss_clk, with WB0 holding P, Q, R and S in successive intervals: 
O @ SSB and O @ DRAM before the 1st checkpoint; 1st checkpoint, P @ SSA and P @ DRAM; 
2nd checkpoint, Q @ SSB and Q @ DRAM; 3rd checkpoint, R not latched and R @ DRAM; 
on rollback, SSA reloaded.]

In response to these requirements, TERPS employs a multi-phase commit protocol, 
supported by three write buffers and the dual-bank safe storage, to ensure that no 
instruction is permitted to commit to permanent unrecoverable state (i.e. the DRAM 
system) until it is safe to do so. From fig. 4.2, it is clear that store data must be delayed to 
memory so that on recovery, the state will be precise. To delay stores from writing their 
data to permanent state, some temporary write buffers should be inserted between the CPU 
and the memory system. Following the same example given in fig. 4.2, fig. 4.3 (a) 
describes the TERPS mechanism equipped with two write buffers instead of one. For the 
interval highlighted, the stores k and l write to addresses A and B after the loads i and j 
have read from the same addresses. These stores write to WB0, named Q. After the third 
checkpoint, a fault is detected and recovery is initiated. But at this point the instructions k 
and l in Q have already been committed to the DRAM system. Hence when the loads i and 
j are re-executed after recovery, they will incorrectly read the store data of the instructions 
k and l. Thus, two write buffers do not delay the commitment of the store data adequately. 
In fig. 4.3(b), three write buffers have been implemented. The additional third write buffer 
postpones the commitment of Q to the DRAM by one checkpoint interval, preventing 
instructions k and l from overwriting the values that i and j should read in case of a 
recovery. Thus, three write buffers are adequate to accomplish correct multi-phase commit. 

Figure 4.3: Multi-phase commit. This figure demonstrates how multi-phase commit is 
implemented to ensure all instructions following the checkpoint have not modified the 
process state before the commit point of their current checkpoint interval. In the 
instruction stream on the left, instructions above are fetched before the instructions 
below. (a) shows that two write buffers are insufficient whereas three write buffers, as 
shown in (b), are adequate. 
[Instruction stream identical to that of Figure 4.2. Timelines against ss_clk, with WB0 
holding P, Q, R and S in successive intervals:
    (a) two write buffers (WB0, WB1): O @ SSB and N @ DRAM before the 1st checkpoint; 
        1st checkpoint, P @ SSA and O @ DRAM; 2nd checkpoint, Q @ SSB and P @ DRAM; 
        3rd checkpoint, R not latched and Q @ DRAM; on rollback, SSA reloaded.
    (b) three write buffers (WB0, WB1, WB2): O @ SSB and M @ DRAM before the 1st checkpoint; 
        1st checkpoint, P @ SSA and N @ DRAM; 2nd checkpoint, Q @ SSB and O @ DRAM; 
        3rd checkpoint, R not latched and P @ DRAM; on rollback, SSA reloaded.]

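The argument for three write buffers can be checked with a small Python model of the example in fig. 4.2 and fig. 4.3. This is a sketch under stated assumptions: the function name dram_at_fault_detection is hypothetical, the store values 60 and 30 and the initial DRAM contents are arbitrary, the buffers start out empty, and the fault is assumed to be detected right after the third checkpoint so that the system rolls back to the first one.

```python
def dram_at_fault_detection(num_buffers, stores_per_interval, initial_dram):
    """Return the DRAM contents after the last checkpoint in stores_per_interval.

    At every checkpoint the oldest write buffer drains to DRAM and the
    remaining buffers shift by one position, as in the multi-phase commit.
    """
    dram = dict(initial_dram)
    buffers = [[] for _ in range(num_buffers)]   # buffers[0] = WB0
    for stores in stores_per_interval:
        buffers[0] = list(stores)                # stores of this interval fill WB0
        for addr, value in buffers[-1]:          # oldest buffer is committed to DRAM
            dram[addr] = value
        buffers = [[]] + buffers[:-1]            # shift; WB0 is emptied
    return dram


initial = {"A": 1, "B": 2}
P = []                                           # stores before the 1st checkpoint
Q = [("A", 60), ("B", 30)]                       # stores k and l, after loads i and j
R = []                                           # stores before the 3rd checkpoint

for n in (1, 2, 3):
    dram = dram_at_fault_detection(n, [P, Q, R], initial)
    # Loads i and j re-execute from the 1st checkpoint and must see A = 1, B = 2.
    print(n, "write buffer(s):", dram, "precise" if dram == initial else "polluted")
```

With one or two buffers the stores k and l have already reached DRAM when the fault is detected, so the re-executed loads would read 60 and 30. With three buffers the DRAM still holds the values that i and j originally read.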
Consider a situation where the write buffer WB0, which contains the store data for a 
particular checkpoint interval, is not saved into the rollback state. This case is shown in fig. 
4.4. The stores in P have already been written to the DRAM by the time recovery is 
initiated. After recovery, a load that reads data written by a store in P may receive its 
data from the DRAM. It would seem that these stores, which represent a significant overhead 
during checkpointing, do not need to be saved because they are present in the DRAM after a 
recovery. However, it is important to note that during the interval highlighted in fig. 4.4, 
store data is being sent from the memory controller to the DRAM. Concurrently, a fault is 
detected. EMI effects may corrupt the buffer in the memory controller or the data on 
the bus in transit to the DRAM, leaving polluted data in the DRAM. Saving the WB0 
contents is therefore necessary as a backup: after recovery, the multi-phase commit 
protocol will eventually overwrite the DRAM with valid data from this backup. 
Figure 4.4: Importance of saving store data in the safe storage.  Multi-phase commit 
implemented without saving the store data in the safe storage is shown. Loads k and l 
read data written by stores i and j. By the time the fault is detected, this store data (P) is 
written to the DRAM. But EMI might have corrupted it. After recovery, the loads k and l 
will again execute. They will correctly not read the store data due to stores m and n, but 
will read the corrupted data from the DRAM. Hence, on recovery it is necessary that a 
backup of the store data be brought back into the system.
[Figure 4.4 diagram, summarized.]
Instruction stream (program order, top to bottom):
    ....
    (i) sw addrA, r6
    (j) sw addrB, r3
    ....
    1st checkpoint
    ....
    (k) lw r3, addrA
    (l) lw r5, addrB
    ....
    (m) sw addrA, r6
    (n) sw addrB, r3
    ....
    2nd checkpoint
    ....
    3rd checkpoint
    ....
    Fault Detected, Initiate Recovery
Three write buffers (WB0, WB1, WB2), successive ss_clk intervals:
    M @ DRAM | N @ DRAM | O @ DRAM | P* @ DRAM (fault detected; P @ MC, MC => DRAM) | Rollback
If a fault is detected during recovery an invalid rollback state may be delivered to the 
CPU and the system will recover to an invalid state. TERPS handles this issue by just 
initiating recovery again using the same rollback state from the safe storage. 
4.2 Re-execution of instructions
Clearly, precise checkpointing and the multi-phase commit protocol work together to resume 
execution from a consistent and valid state. However, when instructions are re-executed, they 
write their results to the system registers and memory again. This may cause an event to 
reoccur, which may affect the correctness of computation. 
One principle that the memory portion relies on is the fact that the memory system can 
be read from or written to multiple times without side effects; reading from a given 
memory location multiple times is the same as reading from that location once; writing to 
a given memory location multiple times with the same value is the same as writing to that 
location once. The RF also follows the same behavior. Hence, re-executing ALU and 
memory instructions, provided we maintain in-order semantics for writes as discussed 
above, does not affect the correctness of computation.
However, the I/O system does not behave like the memory system in this regard: I/O 
reads and writes have side effects, and the last value written to an I/O location is not 
necessarily the value read back from that location. For instance, a processor may be 
outputting information to increment an external counter which displays the number of votes 
for an electoral candidate. On re-execution of an I/O instruction after a recovery, if an 
increment signal is re-sent, the counter would increment twice and show the incorrect 
number of votes! 
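The contrast between memory and I/O can be made concrete with a minimal Python sketch. The names run_segment, memory, and counter are illustrative only; the dictionary stands in for DRAM and the counter for the external vote display.

```python
def run_segment(memory, counter):
    """One checkpoint interval: a store followed by an I/O increment request."""
    memory[0x10] = 42         # store: writes the same value on every execution
    counter["votes"] += 1     # I/O write with a side effect on the device


memory, counter = {}, {"votes": 0}
run_segment(memory, counter)   # original execution
run_segment(memory, counter)   # re-execution after a rollback
print(memory[0x10], counter["votes"])  # 42 2
```

Re-executing the store leaves memory unchanged, while the replayed I/O request is counted a second time.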
We are currently developing support for I/O semantics in TERPS. One crude yet 
effective method is to checkpoint after every I/O request is executed. The CARER 
mechanism [33] implements a similar protocol. In TERPS, however, the frequency of 
checkpointing is dependent on the safe storage. The safe storage is slower because it is 
made from an older process technology for better fault tolerance. The checkpoint interval 
has to be long enough to meet the setup and hold times of the safe storage. Hence the 
system may have to stall after every I/O request until a checkpoint can be established. This 
would prove to be highly inefficient and its impact on performance would be significant if 
I/O requests occurred frequently. For a more efficient implementation we are developing a 
mechanism to support I/O semantics that incorporates the following characteristics:
1. Read and write buffers that are maintained by the I/O controller on a per-device basis 
and that are enabled or disabled by the operating system.
2. Read and write transactions that are identified by a monotonically increasing unique 
identifier.
3. A sliding window protocol between the CPU and the I/O controller to manage the 
buffer contents so that any transaction is in one or more of the following states: (i) 
buffered on the CPU, (ii) stored on the safe storage chip, (iii) buffered in the I/O 
controller, or (iv) committed to the I/O system and out of the window of vulnerability. 
We are currently modeling this mechanism in Verilog Hardware Description Language 
and expect to integrate it into our TERPS system in the future. 
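As an illustration of how the monotonically increasing identifiers in characteristic 2 could suppress replayed requests, consider the following Python sketch. It is a hypothetical simplification and not the mechanism under development: the class IOController, its fields, and the assumption that re-executed requests carry the same identifiers as the originals (for example, because the identifier counter is part of the rollback state) are all introduced here for illustration, and the sliding window itself is not modeled.

```python
class IOController:
    """Hypothetical I/O controller that commits each transaction at most once."""

    def __init__(self):
        self.last_committed_id = -1   # highest identifier already sent to the device
        self.device_log = []          # stands in for the external device

    def write(self, txn_id, value):
        """Forward a write to the device unless its identifier was already committed."""
        if txn_id <= self.last_committed_id:
            return False              # replay caused by re-execution; drop it
        self.device_log.append(value)
        self.last_committed_id = txn_id
        return True


controller = IOController()
controller.write(0, "increment")      # original execution
controller.write(1, "increment")
controller.write(0, "increment")      # rollback: the same requests are replayed
controller.write(1, "increment")
controller.write(2, "increment")      # execution continues past the fault
print(len(controller.device_log))     # 3
```

Under these assumptions the device sees three increments rather than five, so the vote counter of the earlier example would not be incremented twice.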
Chapter 5
Implementation 
5.1 Basic Processor Architecture
The TERPS mechanism is general in nature and can be tied to any instruction-set 
architecture. A microarchitecture's existing logic for exception handling can be used to 
generate a precise checkpoint; augmenting it with the write buffers and the checkpoint 
rollback control logic provides support for checkpoint and rollback recovery. 
For implementation purposes, we chose the RiSC-16, 5-stage pipelined architecture as 
the basic processor architecture. This architecture was selected because:
1. The author was familiar with the processor from the onset of development and the 
architecture is well documented.
2. The design is not dependent on any particular instruction set; hence it was preferable to 
use an existing instruction set.
3. A convincing "proof of concept" could be provided by this architecture, which, though 
simple in design, is general enough to solve complex problems. 
4. In order to have a successful physical prototype in an academic environment, the basic 
processor architecture had to be simple.
The 16-bit Ridiculously Simple Computer (RiSC-16) is a teaching ISA that is based 
on the Little Computer (LC-896) developed by Peter Chen at the University of Michigan. 
The RiSC-16 has an 8-entry register file, where, like the MIPS instruction-set architecture, 
by hardware convention, register 0 always contains the value 0. There are three 
machine-code instruction formats and a total of 8 instructions. The instruction set is given 
in table 5.1. The pipeline has 5 stages: fetch, decode, execute, memory, and writeback. It is 
similar to the 5-stage DLX/MIPS pipeline that is described in Hennessy and Patterson 
[40], and it fixes a few minor oversights, such as the lack of forwarding to store data, the 
lack of forwarding to comparison logic in decode, and the implementation of the 
1-instruction delay slot. This pipeline adds forwarding for store data and eliminates branch 
delay slots. As in the DLX/MIPS, branches are predicted not taken, though implementations 
of more sophisticated branch prediction are certainly possible. 
Table 5.1: Instruction Set Architecture
Assembly-Code Format       Meaning
add regA, regB, regC       R[regA] <- R[regB] + R[regC]
addi regA, regB, immed     R[regA] <- R[regB] + immed
nand regA, regB, regC      R[regA] <- ~(R[regB] & R[regC])
lui regA, immed            R[regA] <- immed & 0xffc0
sw regA, regB, immed       R[regA] -> Mem[ R[regB] + immed ]
lw regA, regB, immed       R[regA] <- Mem[ R[regB] + immed ]
beq regA, regB, immed      if ( R[regA] == R[regB] ) {
                               PC <- PC + 1 + immed
                               (if label, PC <- label)
                           }
jalr regA, regB            PC <- R[regB], R[regA] <- PC + 1
PSEUDO-INSTRUCTIONS:
nop                        do nothing
halt                       stop machine & print state
lli regA, immed            R[regA] <- R[regA] + (immed & 0x3f)
movi regA, immed           R[regA] <- immed
.fill immed                initialized data with value immed
.space immed               zero-filled data array of size immed
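The register-transfer semantics listed in table 5.1 can be sketched as a small interpreter. The Python code below is an illustrative, assembly-level model only: it works on already decoded instructions rather than machine code, it does not model the pipeline, and the function name step and the tuple layout of the program are assumptions made for this example.

```python
MASK = 0xFFFF  # 16-bit datapath


def step(op, a, b, c, regs, mem, pc):
    """Execute one decoded RiSC-16 instruction and return the next PC.

    a, b, c are regA, regB and either regC or the immediate, following
    the assembly format of table 5.1.
    """
    next_pc = (pc + 1) & MASK
    if op == "add":
        regs[a] = (regs[b] + regs[c]) & MASK
    elif op == "addi":
        regs[a] = (regs[b] + c) & MASK
    elif op == "nand":
        regs[a] = ~(regs[b] & regs[c]) & MASK
    elif op == "lui":
        regs[a] = b & 0xFFC0          # lui regA, immed: b holds the immediate
    elif op == "sw":
        mem[(regs[b] + c) & MASK] = regs[a]
    elif op == "lw":
        regs[a] = mem.get((regs[b] + c) & MASK, 0)
    elif op == "beq":
        if regs[a] == regs[b]:
            next_pc = (pc + 1 + c) & MASK
    elif op == "jalr":
        target = regs[b]              # read the target before regA is written
        regs[a] = next_pc
        next_pc = target
    else:
        raise ValueError(f"unknown opcode: {op}")
    regs[0] = 0                       # register 0 always contains the value 0
    return next_pc


regs, mem, pc = [0] * 8, {}, 0
program = [
    ("addi", 1, 0, 5),     # r1 = 5
    ("addi", 2, 0, 7),     # r2 = 7
    ("add", 3, 1, 2),      # r3 = r1 + r2
    ("sw", 3, 0, 20),      # Mem[20] = r3
    ("lw", 4, 0, 20),      # r4 = Mem[20]
]
while pc < len(program):
    op, a, b, c = program[pc]
    pc = step(op, a, b, c, regs, mem, pc)
print(regs[3], mem[20], regs[4])  # 12 12 12
```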
5.2 Implementation
The TERPS architecture is modeled in Verilog Hardware Description Language 
(HDL), in which the modules are described by their logical behavior in a form suitable for 
synthesis. To guarantee the correctness of our mechanism at the behavioral level, a test 
bench was written as a stimulus to simulate the behavior of the entire system. All 
simulations were run in NC-Verilog, which is a logic verification tool from the Cadence suite. 
To support checkpoint and rollback recovery, three write buffers and a checkpoint latch 
were added to the pipeline, and a separate safe storage module was developed to 
interface with the processor core module. A detailed block diagram of the TERPS 
architecture is given in fig. 5.1. The fault detection signal is generated by a comparator 
circuit on the CPU chip for ease of development and testing.
Some of the important control mechanisms added to the existing control logic and 
structure of the original pipeline for checkpointing and rollback were:
• checkpoint counter: This counter is responsible for the synchronization of the 
checkpoint rollback mechanism. It is important for controlling many other signals and 
their timing. 
• chkpt: This signal indicates that a checkpoint is underway. The processor is stalled 
during this time. It takes 7 CPU cycles to do a checkpoint in our implementation: 6 
cycles for transferring WB2 to the memory controller over a 64-bit front-side bus 
and 1 cycle for shifting the write buffer contents. 
Figure 5.1: Detailed block diagram of the TERPS Processor Architecture. The 
RiSC-16, 5-stage pipeline modified to support checkpoint and rollback. The shaded 
boxes represent clocked registers; solid lines represent data paths and buses; and dotted 
lines represent control paths. A pipeline register is labelled with the two stages that it 
divides; for example, the pipeline register that divides the instruction fetch (IF) and 
instruction decode (ID) stages is called the IF/ID register. The prominent features added 
are the 3 write buffers (WB 0-2), the 512-bit checkpoint latch, the memory controller, 
checkpoint counter, detector latch, and checkpoint rollback control logic. 
[Block diagram: the RiSC-16 pipeline stages (fetch, decode, execute, memory, writeback) 
with the added write buffers WB0-2, memory controller (MC), DRAM, 64-bit front-side 
bus, 512-bit checkpoint latch on the checkpoint rollback bus, checkpoint counter, detector 
latch, and the control signals Chkpt, Rollback, SS_CLK, SS_SEL, SS_out_en, R_mode, 
R_stomp, M_stall, and detector_R.]
• detector_R: This signal goes high when a fault is registered. It is held high until 
recovery is finished. It is used for timing and control purposes. 
• RMODE: This signal, which puts the system into the recovery mode of operation, 
goes high just before the safe storage is supposed to latch the data from the CPU. The 
processor is stalled during RMODE. 
• RMODE_stomp: When the system goes into recovery mode, RMODE_stomp 
flushes the pipeline, removing all "faulty" state. 
• M_stall: This stalls the pipeline until the next checkpoint when there is a write request 
but the write buffer, WB0, is full. The write buffers have 12 entries each. 
• ss_sel: Used for selecting which bank of the safe storage the checkpointed rollback 
state will be written to or read from. 
• ss_out_en: This enables the output buffers of the safe storage so that it can output 
rollback state information onto the checkpoint rollback bus. 
The checkpoint latch contains the entire rollback state: 7 registers from the RF (RF0 is 
always 0), the precise checkpoint PC, and the write buffer WB0. This 512-bit latch is 
connected to a bi-directional checkpoint rollback bus which communicates with the safe 
storage. To prevent the array of output buffers from drawing a large amount of current at 
the same time, the bus is logically divided into sixteen 32-bit sections that are enabled in a 
staggered manner. 
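The widths and cycle counts quoted above are mutually consistent if each write-buffer entry holds one 16-bit address and one 16-bit data word. That entry width is an assumption made for this check and is not stated explicitly in the text; the constant names in the following Python sketch are likewise illustrative.

```python
WORD_BITS = 16               # RiSC-16 register, PC and address width
REGS_SAVED = 7               # RF1..RF7; RF0 is always 0
WB_ENTRIES = 12              # entries per write buffer in this implementation
ENTRY_BITS = 2 * WORD_BITS   # assumption: one address word plus one data word
FSB_BITS = 64                # front-side bus width

wb_bits = WB_ENTRIES * ENTRY_BITS
latch_bits = REGS_SAVED * WORD_BITS + WORD_BITS + wb_bits   # registers + PC + WB0
wb2_transfer_cycles = (wb_bits + FSB_BITS - 1) // FSB_BITS  # ceiling division
bus_sections = latch_bits // 32

print(latch_bits, wb2_transfer_cycles, bus_sections)  # 512 6 16
```

Under this assumption the rollback state fills the 512-bit latch exactly, draining WB2 takes the 6 cycles listed for the chkpt signal, and the bus splits into sixteen 32-bit sections.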
The precise checkpoint PC is the PC of the next-to-commit instruction. In the 5-stage 
RiSC-16 processor, the next-to-commit instruction is in the memory stage. If that 
instruction is a nop, then the next valid instruction in the pipe is selected. 
During a recovery, the checkpoint rollback bus has to be allowed a turn around time 
before the safe storage sends data on it. This turn around time has been safely set to 20 
CPU cycles. 
To execute the multi-phase commit protocol correctly, the memory hierarchy has been 
modified. Instead of directly accessing the main memory, a load instruction concurrently 
checks in the write buffers WB0, WB1, WB2, the memory controller buffer, and the main 
memory for data. If there are multiple matches for the same address, it gives priority in the 
following order: WB0, WB1, WB2, memory controller buffer, and lastly main memory. 
For simplicity, a 1-cycle memory access latency was assumed. 
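The priority order for loads can be sketched in Python as follows. This is a behavioral illustration only: the hardware checks all levels concurrently, whereas the loop below emulates the resulting priority, and the function name load and the use of dictionaries for each level are assumptions made for the example.

```python
def load(address, wb0, wb1, wb2, mc_buffer, dram):
    """Return the value for address from the highest-priority level that holds it."""
    for level in (wb0, wb1, wb2, mc_buffer, dram):   # priority order from the text
        if address in level:
            return level[address]
    raise KeyError(f"address {address:#06x} not present at any level")


dram = {0x0010: 1, 0x0020: 2}
wb2 = {0x0010: 7}
wb0 = {0x0010: 9}
print(load(0x0010, wb0, {}, wb2, {}, dram), load(0x0020, wb0, {}, wb2, {}, dram))  # 9 2
```

Giving WB0 the highest priority ensures that a load observes the most recent store to its address, even though that store has not yet been committed to the DRAM.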
The safe storage module contains 2 banks A and B to store the newer and older 
rollback states. Its slower clock (ss_clk) is generated by the CPU using the chkpt_counter. 
The ss_sel signal from the CPU selects the bank into which the incoming rollback state is 
gated during a checkpoint, or the bank from which the CPU reads data during a 
recovery. 
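The bank selection can be illustrated with a short Python model. It is a sketch under stated assumptions rather than the Verilog module: the class SafeStorage and its methods are hypothetical, and the alternation of banks and the choice of the bank reloaded on recovery follow the labels of fig. 4.3 (P @ SSA, Q @ SSB, R NOT LATCHED, SSA reloaded).

```python
class SafeStorage:
    """Two-bank rollback store; ss_sel names the bank used by the next transfer."""

    def __init__(self):
        self.banks = {"A": None, "B": None}
        self.ss_sel = "A"

    def checkpoint(self, rollback_state):
        """Latch a new rollback state, then point ss_sel at the other bank."""
        self.banks[self.ss_sel] = rollback_state
        self.ss_sel = "B" if self.ss_sel == "A" else "A"

    def recover(self):
        """Skip latching and return the older state held in the selected bank."""
        return self.banks[self.ss_sel]


ss = SafeStorage()
ss.checkpoint("P")       # 1st checkpoint goes to bank A
ss.checkpoint("Q")       # 2nd checkpoint goes to bank B
restored = ss.recover()  # fault at the 3rd checkpoint: R is not latched
print(restored)          # P
```

Because the bank that would have received the next checkpoint holds the older of the two saved states, reading it back without latching restores the older rollback state.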
Two working versions have been developed. One, which is fully synthesizable, yielded 
a physical prototype which was fabricated through MOSIS in 0.25 µm technology. It is a 
functionally limited version (the write buffer is not saved in the safe storage during a 
checkpoint) because it had to meet the constraints imposed on pin count by MOSIS. It 
checkpoints every 128 cycles and can operate at 100 MHz. It will be integrated with 2 safe 
storage chips using 3D-IC technology at the Laboratory of Physical Sciences (LPS). This 
prototype was developed as a "proof of concept" and to test our capabilities in actually 
fabricating a chip, which we have done successfully. The other version is fully 
functional, and its checkpoint interval can be varied from 64 to 512 cycles per 
checkpoint. We plan on fabricating it in the future utilizing the full capabilities of 3D-IC 
technology. The performance analysis is based on this fully functional version. 
5.2.1 Logical Verification
This section enumerates the steps taken to verify the logic of the design. NC Verilog is 
used to compile and run the Verilog code. A screen shot of the NC Verilog tool is shown 
in fig. 5.2. 
Figure 5.2: Cadence NC Verilog. In this screen shot, you can observe the interface of 
the NC Verilog tool. The RF, PC, pipeline registers, and certain control signals are 
displayed by the simulation that is running. 
After running the simulation, signals of particular interest can be selected using the 
Design Browser and viewed in a timing diagram using the SimVision tool, which is 
bundled with NC Verilog. A screen shot of the Design Browser is shown in fig. 5.3. Various 
signals throughout the hierarchical modular structure of the Verilog code can be selected. 
After selecting the various registers and wires in the Design Browser, the timing 
diagram can be viewed in the SimVision waveform view. This is shown in fig. 5.4. 
Figure 5.3: The Design Browser. This screen shot shows the interface of the Design 
Browser. On the left the various modules can be viewed in their hierarchical tree 
structure. On the right the various registers and wires are available for selection to be 
viewed in the timing diagram. 
The checkpoint interval has been illustrated in the timing diagram. The RMODE signal 
is high when the system is in recovery mode. It can be observed from the timing diagram 
that a checkpoint is taken at the beginning of every checkpoint interval except when the 
system is in recovery mode. This is indicated by the "chkpt" control signal. The recovery 
penalty is shown to be 4 checkpoint intervals as explained previously. When a fault is 
detected, the system goes into recovery mode and the pipeline is flushed using the 
"RMODE_stomp" control signal. The "ss_sel" line indicates which bank of the safe storage 
will be written to or read from. The bank, A or B, that is selected by the "ss_sel" line is 
marked in the diagram. When in recovery mode, the safe storage does not latch the 
checkpointed state from the CPU (into bank B in this case) as it may be 
corrupted. The safe storage, after a bus turn around time (TAT), outputs the old rollback 
state in bank B to the CPU. This is also marked in the figure. The system resumes normal 
execution after recovery. 
Figure 5.4: The Waveform view. This screen shot shows the timing diagram of the 
simulation being run for selected signals. The signal names are displayed on the left hand 
side of the screen and from top to bottom are clk, ss_clk, chkpt, detector_out (the fault 
detection signal), detector_R, RMODE, RMODE_stomp, ss_out_en, ss_sel, data_bus_en, 
and the Checkpoint Rollback bus. The waveforms are displayed on the right side. The tool 
allows the user to zoom in and out, view the waveforms in motion, and place markers for 
debugging among many other features. 
[Waveform screen shot. Annotations mark the checkpoint interval, the rollback penalty, the 
CPU-to-SS and SS-to-CPU transfers, the bus turn around time (TAT), the selected bank 
(A or B), and the checkpoint that is not latched.]
This timing diagram verifies that the logic design implemented in Verilog conforms to 
the TERPS specifications. 
5.3 Safe Storage Implementation 
This section discusses how the safe storage should be implemented to achieve low 
susceptibility to EMI. For our prototype, we fabricated the safe storage with a 
0.5 µm feature size, which is an older technology. It operates at a much lower frequency 
(781.25 kHz) than the CPU chip (100 MHz). This frequency is set by the 
checkpoint interval, which is fixed at 128 CPU cycles for the prototype. 
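The quoted safe storage frequency follows from dividing the CPU clock by the checkpoint interval, as the short Python calculation below shows. The function name is illustrative, and the relation of one ss_clk period per checkpoint interval is inferred from the two frequencies given above.

```python
def safe_storage_clock_hz(cpu_clock_hz, cycles_per_checkpoint):
    """ss_clk completes one period per checkpoint interval."""
    return cpu_clock_hz / cycles_per_checkpoint


print(safe_storage_clock_hz(100e6, 128) / 1e3, "kHz")  # 781.25 kHz
```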
The safe storage is a memory that is specially designed to have significantly more EMI 
tolerance than the processor. Most of the design techniques that can be used trade off speed 
and/or die area to achieve better EMI tolerance. As high performance CPUs require both 
speed and die area, the tradeoffs make it difficult for these techniques to be applied to a 
processor and maintain its high performance.
Better EMI tolerance can be achieved using a variety of circuit, device, and 
process-level techniques. Most of these are orthogonal to each other and may be used or left 
out depending on the level of tolerance required by the system and the willingness of the 
designer to accept the necessary tradeoffs.
The safe storage is implemented as a static RAM that uses cross-coupled inverters as 
memory cells as opposed to a DRAM using a capacitor as the storage element. The 
presence of the regenerative feedback on the inverter circuit makes it perform better as a 
bistable circuit as compared to capacitor-based DRAMs. 
The SRAM topologies shown in Fig. 5.5 can be compared based on their cell size, 
static power consumption and (more importantly for this work) the static noise margin 
(SNM). The SNM of a memory cell gives the voltage change required at the 
inverter inputs to cause the cell to change state. It is a good measure of the amount of 
spurious signal needed at the memory cell inputs to corrupt its state. Table 5.2 summarizes 
the features of each configuration. 
The SNM of different memory cell configurations has been extensively studied (a very 
good example is Seevinck [42]). These studies show that the 6T configuration almost 
always has higher SNM. The 4T configuration can approach or even equal the 6T SNM but 
at the expense of both its size and DC power consumption. This makes the 6T memory cell 
the best choice if higher EMI tolerance is needed.
Figure 5.5: 3 possible SRAM memory cell implementations. Fig. (a) shows the 
conventional six-transistor (6T) cell, fig. (b) shows the four-transistor (4T) cell, and fig. 
(c) shows a four-transistor loadless (4TLL) memory cell configuration.
[Circuit schematics (a), (b), and (c), each with bit lines (BL) and word lines (WL).]
The soft-error rate (SER) of SRAMs in the presence of alpha particles has also been 
widely studied [43][44]. It has been shown that maximizing the stored charge in the 
memory cell (output nodes of the inverters in Fig. 5.5) makes it harder for alpha-particles 
to erroneously cause state changes in the memory cell, resulting in better SER. The most 
common way to increase this stored charge is to increase the parasitic capacitance of the 
cell output nodes so that more charge is stored for a given supply voltage. This capacitance 
can be increased using device-level techniques that enlarge the cell area to increase the 
parasitic diffusion capacitances. Hence, higher capacitance is achieved at the expense of a 
larger cell area. Process-level techniques can also be used, in which grounded polysilicon 
layers are added to increase overlap capacitance or the PMOS loads are fabricated 
completely in polysilicon. In this case, higher capacitance is achieved in exchange for 
process complexity.
Techniques to improve SER also improve EMI tolerance. Achieving better SER by 
increasing the charge stored in the memory cell results in better EMI tolerance because 
larger EMI signal powers are required to induce a voltage in the system that is large 
enough to exceed the cell's SNM and corrupt the cell's state.
Table 5.2: Features of different SRAM topologies
Topology   Size     DC Power Consumption      SNM
6T         Big      Very Minimal              High
4T         Medium   Potentially Significant   Low-high
4TLL       Small    Very Minimal              Low-medium
The same principle can be applied to the entire safe-storage system and not just the 
storage cells. Using transistors with larger areas, powered by a higher supply voltage, 
results in increased charge stored within the system. This increased charge requires 
larger amounts of EMI to push around. Since the safe-storage area needs larger transistors 
and higher supply voltages to increase the stored charge, it is fabricated using a larger 
feature size process that is about two or more process generations older than the one used 
for the CPU. This exemplifies the tradeoffs between speed and EMI tolerance needed to 
implement the system.
Using the previous techniques, the circuitry within the safe-storage can be made to 
tolerate higher levels of EMI. Care also has to be taken to ensure that a specific subset of the 
communication between the CPU and the safe-storage is reliable. One way this could be 
done is to use differential signaling between the safe-storage and the CPU. Common-mode 
noise caused by the EMI will be cancelled and, with proper care, induced differential-mode 
noise will be minimal. EMI coupling must be minimized to accomplish this goal. 
Interconnect lengths must be minimized, along with current loop areas (that function as 
antennas) formed by the interconnect. This can be accomplished by using differential 
signal interconnects placed very close to each other.
Achieving all of this is facilitated by the 3D integration technology used by the system. 
This relaxes the pin limitations imposed by packaging constraints in conventional systems 
and makes additional input/output pads available to the designer, with the added benefit 
that inter-die interconnects are considerably shorter because of the chip-stacking. This 
makes possible the use of the short, very wide, differential buses needed for 
EMI-tolerant communication. An additional benefit of 3D chip integration is the 
possibility of using die-level shielding mechanisms to protect the safe-storage core from 
EMI. Our group's efforts in 3D integration are described in a recent article [45].
As a summary of this section, the safe-storage will use six-transistor memory cells to 
maximize storage stability. A better EMI tolerance can be achieved by increasing the 
amount of stored charge within the system. This can be done by using additional grounded 
polysilicon layers to increase signal overlap capacitances, by increasing transistor sizes to 
increase diffusion capacitances, and increasing the supply voltage. The safe storage can be 
fabricated using a process technology that is approximately two generations older than the 
CPU. Lastly, 3D chip integration is used to interconnect the safe-storage and the CPU 
together. This technology removes pin limitations imposed by package constraints and 
makes possible the use of a short, very-wide differential bus. 3D integration also makes 
possible the use of various chip-level shielding schemes to further protect the safe-storage 
from EMI.
Chapter 6
Results
6.1 Performance Analysis
This section presents the effect of the checkpoint rollback recovery mechanism and the 
multi-phase commit protocol on the overall performance of a processor. During the normal 
execution of instructions, the interaction with the checkpointing mechanism is limited to 
the write buffers and hence its impact on performance is low. The overhead is primarily 
due to the time taken to establish a checkpoint and how frequently a checkpoint is taken. 
As checkpointing is done in a periodic fashion, performance is similar for different 
benchmarks when TERPS is operating at a particular checkpointing time interval. 
However, memory intensive benchmarks may slow down forward progress significantly if 
they regularly fill up the write buffer quickly and therefore stall the machine. Hence it is 
important to select the proper write buffer size. In general, 30% of instructions are memory 
instructions and 10% of these are stores [40]. We have chosen 4 different checkpointing 
time intervals (64, 128, 256, 512) to show the impact of checkpointing on performance and 
the corresponding write buffer size is shown in table 6.1. 
The benchmarks used were:
1. Laplace: Uses numerical methods to approximate Laplace's equation by averaging.
2. Vector Addition: Adds 2 vectors of size 10,000. Memory intensive. 
3. Sample: Implements various basic functions which are seen in many programs like 
summation, factorial, etc. 
4. Horner: Implements Horner's method for evaluating a polynomial and compares it 
with another less efficient method.
The C compiler for RiSC-16 microprocessor (ver. 1.50) developed by Afshin Sepehri 
and Bruce Jacob [46] was used to compile Sample and Horner. 
The performance impact due to these benchmarks is shown in fig. 6.1. All results are 
with a 64-bit frontside bus. For a particular checkpointing time interval, the performance 
impact is nearly the same across the benchmarks. This is because the write buffers 
did not fill up often, even in the case of the memory-intensive Vector Addition benchmark, 
and hence checkpointing, in this scenario, does nothing further to slow the computational 
speed of the pipeline. These results support the criteria for selecting the size of the write 
buffers. 
Table 6.1: Write buffer size 
CPU cycles per checkpoint   Write buffer size (entries)
64                          8
128                         12
256                         24
512                         48
The checkpointing mechanism stalls the pipeline during a checkpoint and takes a 
checkpoint independent of the state of the system. So the performance overhead is mainly 
due to the stalling of the pipeline during a checkpoint and how frequently a checkpoint is 
made. 
During a checkpoint, the rollback state is saved into the checkpoint latch, the store data 
in the last write buffer WB2 is transferred over the 64-bit frontside bus to the memory 
controller, and then the write buffer WB2 will be overwritten by WB1 and WB1 by WB0. 
Figure 6.1: Performance Overhead due to checkpointing. Four benchmarks were run 
on the TERPS system at four different checkpointing time intervals. 
[Bar chart: Performance Overhead (64-bit frontside bus). X-axis: Checkpointing Time 
Interval (CPU cycles/checkpoint) at 64, 128, 256, and 512; Y-axis: Overhead (%), 0 to 10; 
series: Laplace, VectorAdd, Sample, Horner.]
The overhead of stalling the pipeline and performing a checkpoint is prominent when 
checkpointing is done more frequently, as can be observed from the chart. It would seem 
that as the checkpointing interval is increased the performance would improve drastically. 
However, when checkpointing is done less frequently the size of the write buffers has to 
increase to accommodate more store data. Correspondingly, the time required to establish 
a checkpoint will increase as it takes more cycles to write the store data in the write buffer 
WB2 to the memory controller over the frontside bus. This effect is reflected in the 
performance overhead. It can be seen that for the cases where a checkpoint is taken every 
128, 256, and 512 CPU cycles, the overhead remains at around 5-6%. Checkpointing 
around every 128 CPU cycles, for the current configuration, seems to be Pareto-optimal. 
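The flattening of the overhead can be reproduced with a first-order estimate. The Python sketch below is not the simulation used for the reported results: it assumes that each write-buffer entry occupies 32 bits (one 16-bit address and one 16-bit data word), that a checkpoint costs the WB2 transfer cycles plus one shift cycle, and that the overhead is that stall divided by the checkpoint interval. Stalls caused by a full write buffer are ignored, and the function name is illustrative.

```python
def checkpoint_overhead(interval_cycles, wb_entries, entry_bits=32, fsb_bits=64):
    """First-order stall overhead: checkpoint cycles divided by the interval."""
    transfer = (wb_entries * entry_bits + fsb_bits - 1) // fsb_bits  # drain WB2
    stall = transfer + 1                                             # plus buffer shift
    return stall / interval_cycles


# Write buffer sizes taken from table 6.1.
for interval, entries in [(64, 8), (128, 12), (256, 24), (512, 48)]:
    print(interval, f"{100 * checkpoint_overhead(interval, entries):.1f}%")
```

Under these assumptions the estimate is about 7.8% at 64 cycles and 5.5%, 5.1%, and 4.9% at 128, 256, and 512 cycles, which is in line with the 5-6% range reported above: the write buffers grow almost in proportion to the interval, so the transfer time over the 64-bit frontside bus grows with them.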
6.1.1 Performance with the Memory Controller on-chip
From this analysis it is quite clear that the frontside bus limits the improvement in 
performance that would otherwise be observed when increasing the checkpointing interval. To 
achieve better performance with larger checkpointing intervals, the width of the frontside 
bus should be increased. However, this increases the cost drastically as the number of pins 
increases correspondingly. To overcome the constraints on pin count and still have a large 
frontside bus, the memory controller should be integrated onto the CPU chip [47][48]. The 
width of the frontside bus can be very large in this case as the bus is on-chip. The effect of 
checkpointing is quite minimal with this configuration for all frequencies of 
checkpointing, as seen in fig. 6.2, and almost insignificant for the case where 
checkpointing is done every 512 CPU cycles. 
The cost of moving the memory controller on chip may be high as die area would 
increase. System level redesign may also be costly and time consuming. Hence, such a 
step should be avoided when the extra performance overhead incurred with the memory 
controller off-chip, which is quite low to begin with, is acceptable. In the case of critical 
real-time systems, where performance may be an important issue, such a cost may be 
deemed appropriate.
Figure 6.2: Performance overhead due to checkpointing with the memory 
controller on-chip. 
[Bar chart: Performance Overhead with MC on chip. X-axis: Checkpointing Time Interval 
(CPU cycles/checkpoint) at 64, 128, 256, and 512; Y-axis: Overhead (%), 0 to 4; series: 
Laplace, VectorAdd, Sample, Horner.]
Chapter 7
Conclusions and Future Work
In this thesis, the threat of intentional EMI to electronic systems was addressed by 
introducing a fault tolerant architecture, TERPS (The Embedded Reliable Processing 
System). It can significantly lower the susceptibility of a processing system against EMI-
induced transient faults by restricting the area of vulnerability to a small section of a CPU 
and a safe storage device that uses technology which is relatively much more EMI-
tolerant. The system provides this increased resistance to EMI by transparently performing 
checkpoint rollback recovery operations between the CPU and safe storage, and by 
instituting a multi-phase commit protocol between the CPU and memory controller. 
TERPS can recover from a system wide failure scenario (i.e.one in which nearly every 
transistor on a CPU is affected), while most checkpoint rollback recovery techniques 
recover from single error event faults. 
The TERPS mechanism occupies a region of the design space between schemes that 
rely primarily on redundant hardware (e.g. n-modular redundancy) and schemes that rely 
primarily on redundant computation (e.g. redundant execution-in-place). TERPS 
represents a trade-off of a moderate hardware overhead (the extra safe storage chip and 
write buffers) and a minimal performance overhead. 
Starting from a situation where everything could go wrong, the TERPS mechanism has 
reduced the scope of vulnerability to only the safe storage and the control logic that 
executes the checkpoint mechanism itself. A comprehensive discussion of safe storage fault 
tolerance was provided. The control logic for the checkpoint mechanism can be made more 
fault tolerant by employing techniques based on differential signaling. 
The correctness of the design was argued by stating the necessary and sufficient 
conditions for the checkpoint rollback recovery mechanism to work and then showing how 
TERPS satisfies them. Furthermore, our implementation, developed in Verilog, was 
functionally verified using industry-standard logic verification tools. We have also built a 
physical prototype system on 0.5 µm and 0.25 µm processes through MOSIS. 
Photomicrographs of the fabricated chips are shown in fig. 7.1.
Figure 7.1: Photomicrographs of chips fabricated via MOSIS. (a) Processor core 
fabricated via MOSIS at TSMC in a 0.25 µm feature size with a die area of 10.89 mm², a 
pad count of 100, and in an MQFP package. (b) Safe storage chip fabricated via MOSIS 
at AMI in a 0.5 µm feature size with a die area of 5.29 mm², a pad count of 84, and in a 
PLCC package. 
The performance impact of checkpointing (i.e., the cost of stalling during checkpoints 
and buffer overflows) has been kept minimal (~6% for checkpointing every 128 CPU 
cycles) with the aid of high-bandwidth, low-latency enabling technologies like 3D 
integration. Two memory controller configurations were described: one with the memory 
controller off-chip and another with the memory controller on-chip. The latter shows lower 
overhead (~2% for checkpointing every 128 CPU cycles) than the former, but at added 
cost.
Fabricating a 3D integrated chip is important for proving the feasibility of the system, 
and this will be completed soon. In the near future, we plan to expand the sphere of 
protection offered by TERPS by bringing more system elements (e.g., I/O) into the 
TERPS mechanism. 
If the RF detection latency can be accurately determined, the rollback penalty can be 
reduced in certain instances by recovering to the newer rollback state in the safe storage. 
Checkpointing mechanisms in which the checkpointing interval can be varied dynamically 
to improve performance should also be explored. This may be useful when a system moves 
between locations with different EMI levels. In an environment with low EMI levels, the 
checkpoint interval can be large to improve performance, while in a harsh EMI 
environment, the rate of checkpointing can be increased to reduce the rollback penalty. 
Another dynamic checkpointing mechanism could initiate a checkpoint whenever the write 
buffer is full, thereby removing the performance penalty of write-buffer-related stalls. 
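The two dynamic checkpointing ideas above can be sketched as a simple policy. This is a 
hypothetical illustration, not a mechanism evaluated in this thesis: the halve-on-fault, 
double-when-quiet rule, the function names, and the parameters are all assumptions, and 
the 64 and 512 cycle bounds are chosen only to match the range of intervals evaluated in 
Chapter 6.

```python
def next_interval(current, faults_in_window, min_interval=64, max_interval=512):
    """Pick the next checkpoint interval in CPU cycles (assumed policy).

    Halve the interval if any EMI fault was detected in the last observation
    window (harsh environment, smaller rollback penalty); otherwise double it
    (quiet environment, lower checkpointing overhead). Clamp to the bounds.
    """
    proposed = current // 2 if faults_in_window > 0 else current * 2
    return max(min_interval, min(max_interval, proposed))


def should_checkpoint(cycles_since_last, interval, buffer_used, buffer_capacity):
    """Checkpoint when the interval expires or when the write buffer is full."""
    return cycles_since_last >= interval or buffer_used >= buffer_capacity


interval = 128
interval = next_interval(interval, faults_in_window=0)   # quiet window: lengthen
interval = next_interval(interval, faults_in_window=0)   # still quiet: lengthen again
relaxed = interval
interval = next_interval(interval, faults_in_window=3)   # faults seen: shorten
```

Triggering a checkpoint on a full write buffer replaces a stall with a checkpoint that 
would have been needed eventually, at the price of checkpoints that no longer occur at a 
fixed period.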
58
REFERENCES
[1] J. Bethune, S.S. Conroy, ?Newsline: The New Cold War: Defending Against 
Criminal EMI?, Compliance Engineering, May-June 2001. Available:
http://www.ce-mag.com/archive/01/05/news.html
[2] E. Sicard, C.Marot, J. Y. Fourniols, M. Ramdani, ?Electromagnetic Compatibility 
for Integrated Circuits?, Techniques l'ing?nieur/Techniques for Engineers, 2003, to 
be published.
[3] F. Fiori, S. Benelli, G. Gaidano, V. Pozzolo, ?Investigation on VLSI?s Input Ports 
Susceptibility to Conducted RF Interference?, IEEE International Symposium on 
Electromagnetic Compatibility, 18-22 Aug. 1997, pp. 326 -329.
[4] D. J. Kenneally, D.S. Koellen, S. Epshtein, ?RF Upset Susceptibility of CMOS and 
Low Power Schottky D-Type, Flip-Flops?, IEEE National Symposium on 
Electromagnetic Compatibility, 23-25 May 1989, pp. 190 -195.
[5] F. Fiori, ?Integrated Circuit Susceptibility to Conducted RF Interference?, 
Compliance Engineering 17, no. 8 (2000), pp. 40-49.
[6] S. Baffreau, S. Bendhia, M. Ramdani, E.Sicard, ?Characterisation of 
Microcontroller Susceptibility to Radio Frequency Interference?, Proc. of the 
Fourth IEEE International Caracas Conference on Devices, Circuits and Systems, 
17-19 April 2002, pp. I031-1 -I031-5.
[7] K. M. Strohm, J. B?chler, E. Kasper, J. F. Luy, P. Russer, ``Millimeter Wave 
Transmitter and Receiver Circuits on High Resistivity Silcon'', IEE Colloquium on 
59
Microwave and Millimeter Wave Monolithic Integrated Circuits ,  
(London,England), 11 Nov. 1988, Digest No: 1988/117, pp. 11/1-11/4 
[8] V. Milanovic, M. Gaitan, J.C. Marshall, M.E. Zaghloul, ?CMOS foundry 
implementation of Schottky diodes for RF detection?, Electron Devices, IEEE 
Transactions on, Volume 43, Issue 12 ,Dec. 1996, pp. 2210 -2214.
[9] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, ?3-D ICs: A Novel Chip 
Design for Improving Deep Submicrometer Interconnect Performance and 
Systems-on-Chip Integration ?, Proceedings of the IEEE, Special Issue, 
Interconnections- Addressing The Next Challenge of IC Technology, Vol. 89, No. 
5, pp. 602-633, May 2001. (INVITED)
[10] N. Savage, ?Linking with Light?, IEEE Spectrum, vol. 39, issue 8, Aug. 2002, pp. 
32-36.
[11] P. N. Glaskowsky, ?Rambus Rolls Out Yellowstone, FlexPhase Circuit Design to 
Speed DRAMs and More?, Microprocessor Report, July 15, 2002
[12] B.  Randell, P.  Lee, P. C. Treleaven, ?Reliability Issues in Computing System 
Design?, ACM Computing Surveys, vol.10 , issue 2, June 1978, pp. 123 - 165.  
[13] M. Peercy, P. Banerjee, ?Fault Tolerant VLSI Systems?, Proceedings of the IEEE, 
Vol. 81, No. 5, May 1993, pp. 745-758.
[14] M. Franklin, ?Incorporating Fault Tolerance in Superscalar Processors?, Proc. 3rd 
International Conference on High Performance Computing, 19-22 Dec. 1996, pp. 
301 -306.
60
[15] J. B. Nickel, A.K. Somani, ?REESE: A Method of Soft Error Detection in 
Microprocessors?, Proc. The International Conference on Dependable Systems and 
Networks, 1-4 July 2001, pp. 401 -410.
[16] C. Weaver, T.Austin, ?A Fault Tolerant Approach to Microprocessor Design?, Proc. 
The International Conference on Dependable Systems and Networks, 1-4 July 
2001, pp. 411-420.
[17] T. M. Austin, ?DIVA: A Reliable Substrate for Deep Submicron Microarchitecture 
Design?, MICRO-32. Proc. 32nd Annual International Symposium on 
Microarchitecture, 16-18, Nov. 1999, pp. 196 -207.
[18] A. Avizieniz et al., ?The STAR (self testing and repairing) computer: an 
investigation of the theory and practice of fault tolerant computer design?, IEEE 
Trans. on Comp. C-20, 11, Nov. 1971, 1312-1321.
[19] R. Koo and S. Toueg, ?Checkpointing and Rollback-Recovery for Distributed 
Systems?, IEEE Trans. on Software Engin. SE-13,(1), Jan. 1987, pp. 23-31.
[20] K. M. Chandy and C. V. Ramamoorthy,?Rollback and Recovery Strategies for 
Computer Programs?, IEEE Transactions on Computers, Vol. C-21, pp. 546-556, 
June 1972.
[21] J. S. Upadhyaya and K. K. Saluja,?A Watchdog Processor Based General Rollback 
Technique with Multiple Retries?, IEEE Transactions on Software Engineering, 
Vol. SE-12, pp. 87-95, Jan 1986.
[22] K. G. Shin, T. H. Lin, and Y. H. Lee, ?Optimal Checkpointing of Real-Time Tasks?, 
61
IEEE Transactions on Computers, Vol. C-36, pp. 1328-1341, Nov 1987.
[23] M. L. Ciacelli, ?Fault Handling on the IBM 4341 Processor?, 11th Fault-Tolerant 
Computing Symposium, Portland, Maine, June 1981, pp. 9-12. 
[24] M.M. Tsao et al. ?The Design of C.fast: A Single Chip Fault Tolerant 
Microprocessor?, Proc. 12th Int. FTCS, June 1982, pp. 63-69.
[25] L. Spainhower et al., ?Design for fault-tolerance in system ES/9000 model 900,? 
Proc. 22th Int.Symp. on Fault-Tolerant Computing, July 1992, pp. 38?47.
[26] Y. Tamir, M. Liang, T. Lai, and M. Tremblay, ?The UCLA Mirror Processor: A 
Building Block for Self-Checking Self-Repairing Computing Nodes?, Proc. 21st 
Int'l Symp. Fault-Tolerant Computing, June 1991, pp. 178-185.
[27] N. J. Alewine, S.-K. Chen, W.K. Fuchs, W.-M. W. Hwu ?Compiler-Assisted 
Multiple Instruction Rollback Recovery Using A Read Buffer?, IEEE Transactions 
on Computers, Volume: 44 Issue: 9 , Sept. 1995, pp. 1096 -1107.
[28] Yuval Tamir, Marc Tremblay, and David A. Rennels, ?The Implementation and 
Application of Micro Rollback in Fault-Tolerant VLSI Systems,? 18th Fault-
Tolerant Computing Symposium, Tokyo, Japan, June 1988, pp. 234-239. 
[29] Y. Tamir, M. Tremblay, ?High Performance Fault-Tolerant VLSI Systems using 
Micro Rollback?, IEEE Transactions on Computers, Volume: 39 Issue: 4 , April 
1990, pp. 548 -554.
[30] C.-C.J. Li, S.-K. Chen, W. K. Fuchs, W.-m.W. Hwu, ?Compiler-based Multiple 
Instruction Retry?, IEEE Transactions on Computers, Volume: 44 Issue: 1 , Jan. 
62
1995, pp. 35 -46.
[31] Shyh-kwei Chen and  W. Kent Fuchs, ?Compiler-Assisted Multiple Instruction 
Word Retry for VLIW Architectures?, IEEE Transactions on Parallel and 
Distributed Systems, vol. 12(12),  Dec. 2001, pp. 1293-1304.  
[32] N. S. Bowen, D. J. Pradhan, ?Virtual checkpoints: Architecture and performance,? 
IEEE Trans. on Computers, vol. 41, no. 5, May 1992, pp 516?525.
[33] D. B. Hunt and P. N. Marinos, ?A general purpose cache-aided error recovery 
(CARER) technique?, Proc. 17th Int. Symp. on Fault-Tolerant Computing, 1987, 
pp. 170?175.
[34] B. Janssens and W. Fuchs, ?The performance of cache-based error recovery in 
multiprocessors?, IEEE Transactions on Parallel and Distributed Systems, 5(10), 
Oct.1994, pp.1033-1043,. 
[35] K.-L. Wu, W. K. Fuchs, and J. H. Patel, ?Error recovery in shared memory 
multiprocessors using private caches,? IEEE Trans. on Parallel and Distributed 
Systems, vol. 1, no. 2, Apr. 1990, pp. 231?240.
[36] M. Prvulovic , J. Torrellas, Z. Zhang. ?ReVive: Cost-Effective Architectural 
Support for Rollback Recovery in Shared-Memory Multiprocessors?, Proc. of the 
29th Annual International Symposium on Computer Architecture, May 2002, 
pp.111-122.
[37] D. J. Sorin , M. M. K. Martin, M. D. Hill, and D. A. Wood, ?SafetyNet : Improving 
the Availability of Shared Memory Multiprocessors with Global Checkpoint/
63
Recovery?, Proc. of the 29th Annual International Symposium on Computer 
Architecture, May 2002, pp. 123-134.
[38] W.-M. W. Hwu, Y.N. Patt, ?Checkpoint Repair for Out-of-order Execution 
Machines?, Proc. of the 14th annual International Symposium on Computer 
Architecture, June 1987, pp. 18-26.
[39] J. E. Smith and A. R. Pleszkun, ?Implementing Precise Interrupts in Pipelined 
Processors?, IEEE Trans. Computers, C-37(5), May 1988, pp.562--573. 
[40] J. L. Hennessy and D. A. Patterson, ?Computer Architecture A Quantitative 
Approach?, Morgan Kaufmann Publishers Inc., 2 nd edition, 1996. ISBN 1-55860-
329-8. 
[41] G. S. Sohi , S. Vajapeyam, ?Instruction Issue Logic for High-performance, 
Interruptable Pipelined Processors?, Proceedings of The 14th Annual International 
Symposium on Computer Architecture, June 02-05, 1987, pp. 27-34.
[42] E.Seevinck, F.J.List and J.Lohstroh, "Static-noise margin analysis of MOS SRAM 
cells," IEEE J.Solid-State Circuits, vol SC-22, no.5, Oct.1987, pp.748-754.
[43] K.Ishibashi et al, "An alpha-immune, 2-V supply voltage SRAM using a 
polysilicon PMOS load cell," IEEE J.Solid-State Circuits, vol SC-25, no.1, 
Feb.1990, pp. 55-60.
[44] H.Sato et al, "A 500-MHz pipeline burst SRAM with improved SER immunity," 
IEEE J.Solid-State Circuits, vol. SC-34, no.11, Nov.1999, pp. 1571-1579.
[45] G. Metze, M. Khbeis, N. Goldsman, B. Jacob.  "Heterogeneous Integration: 3D 
64
Integration of Components with Different Electrical Functionality or Material 
Systems."  NSA Tech Trend, to appear.
[46] A. Sepehri, B. Jacob, ?C Compiler for RiSC-16 Microprocessor?, Available: http:/
/www.ece.umd.edu/~afshin/rcc-report.htm
[47] Chetana N. Keltcher, ?The AMD Hammer Processor Core?. 
[48] Douglas Sanders, ?Designing a PC with DECchip 21066?, Proc. of IEEE 
Compcon, 1994, pp.414-417.
65