Dual-Processor Design of Energy Efficient Fault-Tolerant System?
Shaoxiong Hua
Synopsys Inc.
700 E. Middlefield Road
Mountain View, CA 94043
huas@synopsys.com
Pushkin R. Pari
Intel Technology India Pvt. Ltd.
Bangalore, India
pushkin.r.pari@intel.com
Gang Qu
Dept. of ECE and UMIACS
University of Maryland
College Park, MD 20742, USA
gangqu@eng.umd.edu
Abstract
A popular approach to guarantee fault tolerance in
safety-critical applications is to run the application on
two processors. A checkpoint is inserted at the comple-
tion of the primary copy. If there is no fault, the sec-
ondary processor terminates its execution. Otherwise,
should the fault occur, the second processor continues
and completes the application before its deadline. In
this paper, we study the energy efficiency of such dual-
processor system. Specifically, we first derive an opti-
mal static voltage scaling policy for single periodic task.
We then extend it to multiple periodic tasks based on
worst case execution time (WCET) analysis. Finally,
we discuss how to further reduce system?s energy con-
sumption at run time by taking advantage of the actual
execution time which is less than the WCET. Simula-
tion on real-life benchmark applications shows that our
technique can save up to 80% energy while still provid-
ing fault tolerance.
1 Introduction
Real-time embedded systems are widely used in
safety-critical applications, such as avionics platform
and flight control system. Faults in such applications
may cause catastrophic effect. Therefore, these sys-
tems must be designed not only to reduce the fault
rate but also to provide the capability to tolerate fault
should it occurs. Fault tolerance is normally realized
on multi-processor system via temporal redundancy or
physical redundancy depending on the availability of
the task?s laxity. A task?s laxity is defined as the dif-
ference between its deadline and execution time. In
temporalredundancyapproach,the task?slaxityislong
enough so that the system can restart the execution ei-
?This work is partially supported by grant CNS0615222 from
the National Science Foundation.
ther in the same processor or another one.In physical
redundancy approach, the task does not have sufficient
laxity and multiple (typically two) processors will be
executing identical copies of the tasks simultaneously.
Tsuchiya et al. [11] proposed several techniques that
allow the two copies of the task to begin execution at
different times in order to reduce the overlap time be-
tween them. Specifically, one processor executes the
primary copy of the task and the other one executes
the backup copy as late as possible while meeting the
deadline. Manimaran and Murthy [7] developed a dy-
namic scheduling algorithm based on distance concept,
backup overloading, and resource reclaiming to im-
provethe multiprocessorsystem?s performancein fault-
tolerant. More recently, Rashid et al. [9] described a
multiprocessor architecture with one lead processor ex-
ecuting the primary copy and two checkers executing,
at a slower speed, the backup copy.
Meanwhile, low energy consumption has become one
of the major design objectives for embedded real-time
systems, especially battery operated portable devices.
Low power/energy consumption can also improve run-
time reliability of the circuit and therefore reduce the
fault rate. Dynamic voltage scaling (DVS) technique
variesthe processor?soperating voltageand hence clock
frequency at run-time based on the workload and other
factors. For fault-tolerant system, slack will also arise
when fault does not occur. One can exploit this type
of slack to reduce the energy consumption. Melhem et
al. [8] presented uniform and non-uniform checkpoint
placement policies for aperiodic as well as periodic
tasks (by using earliest deadline first, EDF, scheduling)
that allow a real-time system to recover from failure
and minimize the energy consumption during failure-
free operation. Unsal et al. [12] introduced non-EDF
scheduling heuristics and obtained more energy reduc-
tion. Zhang and Chakrabarty proposed an adaptive
checkpointing scheme which is combined with a dy-
namic voltage scheduling scheme to achieve power re-
duction and increase the likelihood of timely task com-
pletion in the presence of faults [14].
In this paper, we study the energy efficiency of dual-
processor system (unlike multiprocessor systems [7, 9])
that executes two copies of each task in different pro-
cessors. The highlight of our work is an optimal DVS
static power management scheme for single periodic
task (unlike many previous work on the scheduling of
a set of tasks [7, 12] or the placement of checkpoints
[8, 14]) assuming the voltage can be changed contin-
uously [13]. We also propose static/dynamic power
management algorithms for multiple periodic tasks.
2 Problem Formulation
We consider a real-timefault-tolerantdual-processor
system, which supports DVS, executing multiple peri-
odic tasks without any deadline missing.
Application Model The dual-processor system pro-
cesses multiple periodic tasks. Each periodic task can
be characterizedby (Ci,Di,Ti), where Ci is the WCET
at the reference voltage Vmax that supports the max-
imum speed fmax, Di is the deadline and Ti is the
period with Ti ? Di. The total density of the tasks
can be calculated by summationtextni=1 CiDi and the utilization is
defined as summationtextni=1 CiTi . We necessarily assume that the
density is less than or equal to 1 in order to guarantee
the schedulability of the task set.
DVS System Model The system has two processors
each of which has its private memory and supports dy-
namic speed and voltage changes. The system?s dy-
namic power consumption Pd = Cef ?V 2dd ?f, operating
speed (or clock frequency) f = k?(Vdd?Vt)2/Vdd where
Cef is the effective switching capacitance, Vdd is the
voltage, k is a constant and Vt is the threshold voltage.
When Vt is small enough to be negligible or can be pro-
portionally adjusted at run time, speed becomes linear
to Vdd [2]. Therefore, for a task that requires time e
at the highest (reference) voltage Vmax and the highest
speed fmax, if the processor completes the task at a re-
duced voltage and speed in time (e+l), the (dynamic)
energy consumption E can be expressed as
E = Pmax ?e?( ee + l)2 (1)
where Pmax = Cef ?V 2max?fmax is the power consump-
tion at the highest voltage and speed. Note that in this
paper we normalize Pmax to be 1.
Fault and Fault Recovery Model The dual-
processor system executes two copies (primary and
backup) of each task for fault tolerance and only needs
one of the copies to be completed successfully before
the deadline. One processor executes the primary copy
and the other one executes the backup copy. When
the processor completes the primary copy of the task
without fault, it will notify the other processor to stop
executing the backup copy.
For the fault-tolerant dual-processor system that ex-
ecutesmultiple periodictasks, basedonthe abovemod-
els, we consider the problem of when and at which volt-
age should two copies of each task be executed in order
to reduce the system?s total energy consumption while
completing at least one copy of each task successfully
without any deadline missing even when fault occurs.
3 Power Management Schemes
Static power management (SPM) scheme schedules
tasks in the o?ine fashion based on their WCET to
minimize the energy consumption. Dynamic power
management scheme leverages the slack between task?s
actual execution time and WCET to further reduce en-
ergy at run time.
3.1 SPM for Single Periodic Task
Assume that the system executes a single periodic
task with deadline D ? T, the period. Processor P1
will execute the primary copy of the task whose work-
load is e1 at the reference voltage (highest speed fmax),
whenever it is ready. Processor P2 executes the corre-
sponding backup copy whose required workload at the
highest speed is e2 ? e1 1. When P1 and P2 only run
at the highest speed and voltage, Tsuchiya et al. [11]
have proposed a scheduling policy which minimizes the
overlap length between the execution time slot of pri-
mary copy and backup copy (Figure 1 (a)). Let t1 be
the finishing time of the primary copy at P1 and t2 be
the start time of the backup copy at P2, this no power
management(NPM) method indicates that t1 = e1 and
t2 =
braceleftbigg e
1 : D ? e1 + e2
D ?e2 : D < e1 + e2 (2)
The expected energy consumption of the system per
period is (recall that Pmax = 1)
E =
braceleftbigg e
1 + p?e2 : D ? e1 + e2
2?e1 + e2 ?D + p?(D ?e1) : D < e1 + e2
(3)
where p ? [0,1] is the fault rate of P1.
1For static power management, e1 and e2 are the worst case
workloads of the task for P1and P2, respectively. And e1 is equal
to e2 when wedo voltage scheduling. However fordynamic power
management e1 and e2 are the remaining workloads of the task
and e2 may be greater than e1 as P2may startthe execution later
than P1. Note that the power management schemes introduced
in this section can also be used in dynamic power management.
(a) No power management 
fmax
maxf
t2
2 t1
t1
S max2.f S max3
.f
(b) Optimal power management
t t3
D
S max1.f
0
0
1
10
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
t
f D
0
t
f
Figure 1. Different static power management
schemes for single periodic task
When the dual-processor system, P1 and P2, sup-
ports dynamic voltage and frequency scaling, we pro-
pose the following optimal power management (OPM)
scheme (see Figure 1 (b) for a graphic illustration):
P1: executes from the arrival of the task (time 0) at a
fixed speed S1 ?fmax till the task?s completion at
time t1.
P2: waits till time t2 (can be 0) and executes at a fixed
speed S2?fmax to P1?s completion time t1. If there
is no fault, terminates; Otherwise, continues exe-
cuting at a fixed speed S3 ?fmax to its completion
at t3 ? D.
where Si ? [Smin,1] and Smin = fminfmax. The problem
becomes determining S1,S2,S3, and t2 such that
the total energy consumption is minimized.
After tedious calculation, we identify that the opti-
mal solution for this optimization problem either hap-
pens in the ?interior? or one of the following bound-
aries: S1 = 1, S2 = Smin, S3 = 1. We give the optimal
?interior? solution when none of the speed Si hits the
boundary Smin or 1.
S1 = k1S3e1S
3D ?e2
(4)
S2 = k2S3 (5)
S3 = e2 + e1k3D (6)
t2 = 0 (7)
where k1 = 1??p,k2 = ?p, and k3 = 3
radicalBig
( 1?p ?1)2.
3.2 SPM for Multiple Periodic Tasks
Now we consider the system which executes mul-
tiple periodic tasks by using EDF algorithm. Each
multiple periodic task is characterized by (Ci,Di,Ti),
where Ci is WCET, Di is the deadline and Ti is the
period. We can use the density of the tasks, which issummationtext
n
i=1
Ci
min(Di,Ti), to test its schedulability [5]. Here we
assume that Di ? Ti. When the density summationtextni=1 CiDi ? 1,
the system is schedulable and we can assign the exe-
cution slot Cisummationtextn
i=1
Ci
Di
to the i-th periodic task on both
processors without violating the schedulability. The
assigned time slot for each task may not be continuous
because it may be preempted by the higher-priority
task. In each time slot, we can employ the OPM algo-
rithm to find the execution speeds and start time for
the primary and backup copy of the corresponding task
in order to reduce the energy consumption.
Note that the OPM algorithm may give the primary
copy of the task a completion time earlier than its as-
signed time slot. Therefore although we only consider
the WCET of each task, when fault does not occur on
P1 there may still create slack, which is the time differ-
ence between the assigned time slot and the completion
time. The slack will be released so it or part of it can
be used by other tasks.
The static power management algorithm for multi-
ple periodic tasks is shown in Figure 2. Before the task
?n or ?o is scheduled to execute, we apply the slack
estimation algorithm proposed by Kim et al. [4] that
considers the unused times of completed tasks and re-
maining times for uncompleted tasks to estimate the
available time for the task (step 7). Then based on the
remaining and available time, we use our OPM algo-
rithm to obtain the appropriate voltage and speed for
both processors (step 8).
1. On the completion of current task ?c or the
preemption of ?c by a high-priority task ?n ;
2. update the unused times of completed tasks and
their remaining times for uncompleted tasks;
3. if ?c is completed
4. if the ready queue is not empty
5. select task ?o based on EDF;
6. if task ?n or ?o is available
7. estimate the slack for ?n or ?o;
8. use OPM algorithm to set voltage and speed;
9. start executing the task;
Figure 2. Static power management algo-
rithm for multiple periodic tasks.
3.3 Dynamic Power Management
The actual execution time of each task may deviate
from its WCET, sometimes in a very large amount,
and hence creates slack. Dynamic power management
exploits such slack for further energy reduction. The
simplicity and therefore low computation complexity
of the above optimal solution enables us to adopt it at
the run-time for dynamic power management.
Note that for single periodic task, the slack created
by the previous task instance can not be used by the
current task instance because their time slots do not
overlap. However for multiple periodic tasks, the slack
created from one task may be used by other tasks be-
cause their execution time may be overlapped due to
the preemption. Our dynamic power management al-
gorithm for multiple periodic tasks is almost the same
as Figure 2 except that there are additional slack aris-
ing from the difference between the actual task execu-
tion time and its WCET.
4 Simulation Results
The goal of our simulation includes the followings:
(1) validation of the optimality of our static power
management algorithm OPM for single periodic task,
(2) demonstration of the energy savings by our static
power management algorithm for multiple periodic
tasks, and (3) showing the energy efficiency of the pro-
posed dynamic power management algorithm.
4.1 Benchmarks and Simulation Setup
Table 1 gives the characteristics of a set of real-
life benchmarks for fault tolerant applications that we
have collected. These include Computerized Numerical
Control (CNC) machine controller applications [3], the
Avionics application [6], the Inertial Navigation Sys-
tem (INS) [1], and the videophone application [10].
Each benchmark consists of a set of tasks and the
number of tasks is listed in the second column. The
third and fourth columns give the range of these tasks?
WCETs and periods, respectively. The last column
shows the utilization.
Application # of WCETS Period Utili-
tasks (ms) (ms) zation
CNC 8 0.035 ? 0.72 2.4 ? 9.6 0.489
INS 6 1.18 ?100.28 2.5 ? 1250 0.736
Avionics 17 1 ? 9 25 ? 1000 0.850
Videophone 4 1.4 ? 50.4 40 ? 66.7 0.984
Table 1. The real-life benchmarks.
The synthetic applications are generated to have dif-
ferent utilization in (0,1]. The fault is randomly gen-
erated based on the given fault rate that falls in [0,1].
The task?s actual execution time in our simulation fol-
lows a discretized normal distribution with average be-
tween 0 and WCET (WCET is normalized to 1) and a
standard deviation between 0 and 0.25.
The dual-processorsystem schedules the tasks based
on EDF policy and the two processors will execute the
tasks in the same order. Both processors support dy-
namic voltage and frequency scaling and the frequency
changes from fmin = 0.3?fmax to fmax when the volt-
age changes in the corresponding range. For static
power management scheme, we consider one hyperpe-
riod (the least common multiple of task periods) based
on the task?s WCET. For dynamic power management,
we simulate 10,000 hyperperiods for each benchmark
with the above actual execution time model.
4.2 Results of the SPM Schemes
We first use a ?guided? exhaustive search to ver-
ify the optimality of the solution (the value of S1,S2,
and S3 for the dual-processor system) provided by our
OPM scheme. We set S1 to go from Smin = 0.3 to
Smax = 1.0 with an increment of 0.01. For each fixed
S1 of the primary processor, we can easily find the
optimal value of S2 and S3 for the secondary proces-
sor. This gives us the best static scheduling policy
for a fixed S1 . For a single task, we compute its en-
ergy consumption by NPM scheme and then simulate
its execution with this speed setting, obtain its energy
consumption, and calculate its saving over NPM.
0.1 0.2 0.3 0.4
0.5 0.6 0.7 0.8
0.9 1
0
0.2
0.4
0.6
0.8
1
0
10
20
30
40
50
60
70
80
90
100
Voltage RatioUtilization
Percentage Eneergy Savings
Figure 3. Optimality of OPM: the energy sav-
ings of different static scheduling policies
over NPM for a single periodic task.
Figure 3 reports the energy saving for different uti-
lization value when the fault rate is 0.001. The voltage
ratio R is defined as S1?SminSmax?Smin and it increases from
0 to 1 as S1 goes from Smin to Smax. For each fixed
value of utilization, we mark a * on the energy sav-
ing by the optimal voltage setting as shown in Figure
3. This optimal voltage setting coincides with the one
from our proposed OPM algorithm within the error of
voltage increment (which corresponds to a 0.01?Smax
increase in speed) we set earlier.
Figures 4(a) and 4(b) report the energy saving of
SPM schemes over NPM for different utilization and
fault rate. For single task, we see that the energy sav-
ing increases as the fault rate increases or the utiliza-
tion decreases (Figure 4(a)). When the utilization is
low, there are more slack which is in favor of the volt-
age scaling based OPM approach. When the fault rate
is high, the second processor often continues the exe-
cution after the first processor completes its execution.
The increase of energy saving in this case is due to the
optimal setting of the second processor?s speed (S2 and
S3) by OPM.
(a) The case of single periodic task.
(b) The case of multiple periodic task.
Figure 4. The energy saving of SPM schemes
over NPM for different utilization and fault
rate.
Figure 4(b) reveals that OPM?s performance on en-
ergy saving for multiple periodic tasks is similar except
when the utilization is high. In that case, there will
(always) be tasks available in the ready queue. Since
our SPM algorithm (Figure 2) uses all of the available
slack for the current task, there is almost zero slack
released after its completion even when the fault does
not occur. On the other hand, for NPM when fault
does not occur, there is some slack released by termi-
nating the execution of processor P2. When there is
slack available, NPM delays the execution of the task
in P2 in order to minimize the overlap time between
the execution of P1 and P2. Therefore when the fault
does not occur the slack will be accumulated and the
energy consumption will be reduced in NPM, but not
in our algorithm.
4.3 Results of the Dynamic Scheme
We apply our proposed dynamic power management
scheme to the set of real-life benchmarks shown in Ta-
ble 1. Figure 5 reports our energy savings over NPM
for different fault rate and different BCET/WCET ra-
tio. In almost all the cases, our algorithm is more en-
ergy efficient. For example, on the CNC benchmark
(Figure 5(b)) the energy saving can be as high as 80%.
Similar to the analysis of OPM, we see that the higher
the fault rate, the more energy saving by our dynamic
power management scheme.
In general, when the BCET/WCET ratio increases,
the task?s actual execution time becomes closer to
WCET with less variance. Therefore we expect less
energy saving. However, it is interesting to see that in
some applications (CNC and INS), significant amount
of energy saving is reported even when BCET and
WCET become the same. We suspect that the impact
of the BCET/WCET ratio to energy saving depends
on the structure of the specific application and it is
being investigated currently. In the Videophone ap-
plication, when the BCET is close to WCET or much
less than WCET, our algorithm performs slightly worse
than NPM. This is caused by the high utilization of this
application for the same reason as we have discussed
in the previous section 4.2.
5 Conclusions
We consider the dual-processor system that concur-
rently runs two copies of the same applications in dif-
ferent processorsin order to tolerate fault. We have de-
veloped voltage scheduling techniques on such a system
in order to reduce the total system energy consumption
while still tolerating the fault occurrence. Specifically,
we propose an optimal static voltage scaling scheme for
single periodic task, which can also be used for multiple
periodic tasks to saveenergy. In order to further reduce
the energy consumption, the dynamic power manage-
ment scheme will be applied to exploit the slack arising
from the variation of task?s execution time during run
time. The experiment results confirm our claims and
show that our techniques can save up to 80% energy
vs. no power management scheme while still providing
fault tolerance.
(a) Avionics (b) CNC
(c) Video Phone (d) INS
Figure 5. Dynamic power management scheme: energy savings over NPM on real-life benchmarks.
References
[1] A. Burns, K. Tindell, and A. Wellings. ?Effective analysis
for engineering real-time fixed priority schedulers?, IEEE
Tran. on Software Eng., Vol. 21, pp. 475-480, 1995.
[2] F. Gruian. ?System-Level Design Methods for Low-Energy
Architectures Containing Variable Voltage Processors?,
Proc. of the Workshop on Power-Aware Computing Sys-
tems, pp. 1-12, 2000.
[3] N. Kim, M. Ryu, S. Hong, M. Saksena, C. Choi, and H.
Shin. ?Visual Assessment of a Real-Time System Design: a
Case Study on a CNC controller?, Proc. of IEEE Real-Time
Systems Symposium, pp. 300-310, 1996.
[4] W. Kim, J. Kim, and S. L. Min. ?A Dynamic Voltage Scal-
ing Algorithm for Dynamic-Priority Hard Real-Time Sys-
tems UsingSlack TimeAnalysis?, Proc. of Design, Automa-
tion and Test in Europe, pp. 788-794, 2000.
[5] J. W. S. Liu. ?Real-Time Systems?, Prentice Hall, 2000.
[6] C. Locke, D. Vogel, and T. Mesler. ?Building a Predictable
Avionics Platform in Ada: a Case Study?, Proc. of IEEE
Real-Time Systems Symposium, pp. 181-189, 1991.
[7] G. Manimaran and C.S.R. Murthy. ?A Fault-Tolerant Dy-
namic Scheduling Algorithm for Multiprocessor Real-Time
Systems and Its Analysis?, IEEE Tran. on Parallel and
Distributed Systems, Vol. 9, No. 11, pp. 1137-1152, 1998.
[8] R. Melhem, D. Moss?e, and E. N. Elnozahy. ?The inter-
play of power management and fault recovery in real-time
systems?, IEEE Trans. on Computers, 53(2), pp. 217-231,
2004.
[9] M.W. Rashid, E.J. Tan, M.C. Huang, and D.H. Albonesi.
?Power-Efficient Error Tolerance in Chip Multiprocessors?,
IEEE Micro, pp. 60-70, Nov.-Dec. 2005.
[10] D. Shin, J. Kim, and S. Lee. ?Intra-Task Voltage Schedul-
ing for Low-Energy Hard Real-Time Applications?, IEEE
Design and Test of Computers, Vol. 18, No. 2, pp. 20-30,
2001.
[11] T. Tsuchiya, Y. Kakuda, and T. Kikuno. ?A New Fault-
Tolerant Scheduling Technique for Real-Time Multiproces-
sor Systems?, International Workshop on Real-Time Com-
puting Systems and Applications, pp. 197-202, 1995.
[12] O.S. Unsal, I. Koren, and C.M. Krishna. ?Towards Energy-
Aware Software-Based Fault Tolerance in Real-Time Sys-
tems?, International Symposium on Low Power Electronics
and Design, pp. 124-129, 2002.
[13] L. Yuan and G. Qu. ?Analysis of Energy Reduction on Dy-
namic Voltage Scaling-Enabled Systems,? IEEE Transac-
tions on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 24, No. 12, pp. 1827-1837, 2005.
[14] Y. Zhang, and K. Chakrabarty. ?Energy-Aware Adaptive
Checkpointing in Embedded Real-Time Systems?, Proc.
Design, Automation and Test in Europe Conference, pp.
918-923, 2003.