1702 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
Power Optimization of
Variable-Voltage Core-Based Systems
Inki Hong, Darko Kirovski, Gang Qu, Miodrag Potkonjak, and Mani B. Srivastava
Abstract? The growing class of portable systems, such as
personal computing and communication devices, has resulted in a
new set of system design requirements, mainly characterized by
dominant importance of power minimization and design reuse.
The energy efficiency of systems-on-a-chip (SOC) could be much
improved if one were to vary the supply voltage dynamically
at run time. We develop the design methodology for the low-
power core-based real-time SOC based on dynamically variable
voltage hardware. The key challenge is to develop effective
scheduling techniques that treat voltage as a variable to be
determined, in addition to the conventional task scheduling and
allocation. Our synthesis technique also addresses the selection of
the processor core and the determination of the instruction and
data cache size and configuration so as to fully exploit dynam-
ically variable voltage hardware, which results in significantly
lower power consumption for a set of target applications than
existing techniques. The highlight of the proposed approach is
the nonpreemptive scheduling heuristic, which results in solutions
very close to optimal ones for many test cases. The effectiveness of
the approach is demonstrated on a variety of modern industrial-
strength multimedia and communication applications.
Index Terms? High-level synthesis, scheduling, synthesis for
low power, system-on-a-chip (SOC).
I. INTRODUCTION
A. Motivation
T
HE growing class of portable systems, such as personal
computing and communication devices, demands data-
and computation-intensive functionalities with low power con-
sumption. For such systems, power consumption is the primary
design goal since the battery life is a primary constraint
on power, due to the fact that the battery technology has
not followed the progress pace of the semiconductor indus-
try. Recent advances in power-supply technology along with
custom and commercial CMOS chips that are capable of
operating reliably over a range of supply voltages make it
possible to create processor cores with supply voltage that
can be varied at run time according to application timing
constraints. The variable voltage processor core can be made
to operate at different optimal points along its power-versus-
speed curve in order to achieve much higher energy efficiency
than existing techniques for a wider class of applications. Such
Manuscript received May 22, 1998; revised May 14, 1999. This paper was
recommended by Associate Editor E. Macii.
I. Hong is with Synopsys Inc., Mountain View, CA 94043 USA.
D. Kirovski, G. Qu, and M. Potkonjak are with the Computer Science
Department, University of California, Los Angeles, CA 90095-1596 USA.
M. B. Srivastava is with the Department of Electrical Engineering, Univer-
sity of California, Los Angeles, CA 90095-1596 USA.
Publisher Item Identifier S 0278-0070(99)09915-7.
systems also require design flexibility, which results in the
need for implementation on programmable processor platform.
In fact, embedded software running on RISC and digital signal
processing (DSP) processor cores has emerged as a leading
implementation methodology for such applications as speech
coding, modem functionality, video compression, and commu-
nication protocol processing [3], [46]. Current semiconductor
technology allows the integration of programmable processors
and memory structures on a single die, which enables the
implementation of a system on a single chip. Similarly, the
exponential growth of both applications and implementation
technology has outpaced the design productivity of the tradi-
tional synthesis process. The shrinked time-to-market window
has exacerbated the situation. There is a wide consensus
that only through reuse of highly optimized cores can the
demands of the pending applications and the potential ultra-
large-scale integration be matched. Therefore, a low-power
core-based system-on-a-chip (SOC), consisting of a variable
voltage programmable processor core and a memory hierarchy,
attracted much attention from virtually all silicon vendors.
Our synthesis technique targets typical a modern
application-specific SOC, consisting of a variable voltage
processor core, an instruction and data cache, and a number
of optional hardware accelerators and control blocks. The
distribution of power dissipation by the components of an
application-specific SOC depends on the actual applications
running on the system. However, extensive studies indicate
that the power consumption of the processor and cache
cores accounts for significant portion of the overall power
consumption of the described SOC [2], [16]. Therefore,
in this paper, we focus on the power optimization of the
processor and cache cores.
Average power consumption in CMOS technology is given
by [6], where is the system clock frequency,
is the supply voltage, is the load capacitance, and
is the switching activity. The most effective way to reduce
power consumption of a processor core is to lower the supply-
voltage level, which exploits the quadratic dependence of
power on voltage [6]. Reducing the supply voltage, however,
increases circuit delay and decreases clock speed. The resulting
processor core consumes lower average power while meeting
the deadlines. Unfortunately, this technique is ineffective when
tight deadlines are present in systems. Another power opti-
mization technique for processor cores is the system shutdown
[46]. The system shutdown technique, though usable even in
the presence of tight deadlines, is inferior to the supply-voltage
reduction technique for the cases when both techniques can
0278?0070/99$10.00 ? 1999 IEEE
HONG et al.: POWER OPTIMIZATION OF VARIABLE-VOLTAGE CORE-BASED SYSTEMS 1703
TABLE I
THE CHARACTERISTICS OF THE TWO TASKS USED TO ILLUSTRATE THE
MOTIVATION FOR DYNAMICALLY VARIABLE-VOLTAGE APPROACH
be applied. The limitations of the techniques arise due to the
fact that systems are designed with a fixed supply voltage.
The supply-voltage reduction technique attempts to find a
single optimal voltage level for the entire processor operation,
while the system-shutdown technique makes a binary run-time
decision on whether to turn on or off the power supply.
The goal of the research presented in this paper is to develop
the design methodology for the low-power, core-based, real-
time SOC based on dynamically variable voltage hardware.
We consider real-time systems since almost all embedded
systems have to meet real-time constraints dictated by their
environment [3]. The key challenge is to develop effective
scheduling techniques that treat voltage as a variable to be
determined, in addition to the conventional task scheduling
and allocation. Our synthesis technique also addresses the
selection of the processor core and the determination of the
instruction and data cache size and configuration so as to
fully exploit dynamically variable voltage hardware, which
will result in significantly lower power consumption for a set
of target applications than existing techniques. This paper is
an extended version of [20].
B. Motivational Example
To illustrate the key point of the proposed dynamically
variable voltage approach, we consider a set of tasks as a
motivational example, shown in Table I. Two independent
computations and need to be executed on an
embedded processor core in the time interval [0, 20]. Each task
can be executed immediately after its arrival and is required
to be finished by its deadline time. Preemption is not allowed
due to the high context-switching cost.
Let the maximum supply-voltage level be
Power is normalized to its value at the reference point, i.e.,
Reducing supply voltage results in increased
circuit delay; and, to a good accuracy, the circuit delay is given
by where is the threshold voltage
and is a constant that depends on capacitance, mobility, and
transistor dimensions [6]. Assuming a typical value of 0.8
for the threshold voltage, the execution time is the
following function of supply voltage :
The power consumption, as mentioned previously, is given by
Since is inversely proportional to delay,
it follows that is the following function of supply
voltage :
The system consumes 1 W when no power reduction tech-
niques are applied. We now consider the application of shut-
down, supply-voltage reduction, both shutdown and supply-
voltage reduction, and dynamically variable voltage approach,
respectively.
With the shutdown technique, the system will operate at
Task is executed in the interval [0, 5]. Task
is executed in the interval [5, 10]. The processor can be shut
down for the interval [10, 20] and then be resumed for the next
task. The duty cycle of the processor is 50%, so the average
power consumption is 0.5 W.
With the supply-voltage reduction technique, the system will
operate at a lower but fixed supply-voltage level. Unfortu-
nately, the tight deadline on Task means that the supply
voltage cannot be lowered less than V since
otherwise Task will miss its deadline. Thus, the system
operates at V with V Task
is executed in the interval [0, 6]. Task is executed in the
interval [6, 12]. The average power consumption is 0.67 W.
Since the system can be shut down during the interval [12,
20], the average power consumption can be lowered to
W, which results in 20% power reduction,
compared to the shutdown technique.
With the variable voltage hardware, one can schedule the
two tasks such that Task is executed in the interval [0, 6] at
2.97 V with V W, and Task is executed in
the interval [6, 20] at 1.95 V with V W. The
average power consumption is
W, which is 44% lower than the shutdown technique and
30% lower than the supply-voltage reduction in combination
with the shutdown technique.
The preceding example illustrates that it is always better to
reduce voltage than to shut down when both techniques are
applicable, and that dynamically variable supply voltage helps
unleash this potential.
C. What Is New?
In this paper, we develop the first approach for power min-
imization of variable-voltage core-based application-specific
systems. The power minimization is addressed at several
synthesis tasks including task scheduling, instruction and data
cache size determination, and processor core selection. We
explore static scheduling algorithms that treat voltage as an
optimization degree of freedom for the applications with real-
time constraints. By selecting the most efficient voltage profile
in the presence of multiple timing constraints, our algorithms
result in much larger savings in energy than other techniques
such as the system shutdown and the supply-voltage reduction.
D. Paper Organization
The rest of this paper is organized in the following way.
Section II presents the related work in the field of low-
power system- and behavioral-level synthesis and variable-
voltage systems. Section III explains the necessary background
material on variable-voltage systems. The design flow of
the novel synthesis approach is presented in Section IV. In
Section V, the optimization problems are recognized and their
1704 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
computational complexities are established. In the same sec-
tion, the synthesis algorithms are proposed and explained in
detail. Section VI presents experimental data and discussion
to evaluate the effectiveness of the approach. This paper is
concluded in Section VIII.
II. RELATED WORK
Low-power design has emerged as an important area of
research with the recent growth of wireless communications
applications. We review the research results relevant to low-
power systems based on dynamically variable-voltage hard-
ware.
The prior research activity on technology, circuits, and
techniques for variable-voltage systems is of direct relevance
to our research. At the technology level, efficient dc?dc
converters that allow the output voltage to be rapidly changed
under external control have recently been developed. For ex-
ample, Namgoong et al. from Stanford University [38] recently
presented a high-efficiency dc?dc buck-converter switching
regulator with adjustable output supply voltage. The output
voltage ranges from 1.5 to 3.5 V while delivering a power
level of 10 to 200 mW, with an efficiency of 83?93%. The
time taken to reach steady state at the new voltage is under
6 ms/V. Stratakos et al. [47] have described an integrated
buck-converter circuit with greater than 90% efficiency. While
their design delivered a fixed 1.5 V from a 6-V battery, the
design is generalizable to produce variable supply voltages.
Suzuki et al. [49] reported a RISC processor with variable
supply voltage, which can vary from 0.4 to 2.9 V from a
3.3-V external power supply. Such supply-voltage chips are
crucial enablers for this research.
At the hardware design level, research work has been per-
formed on chips with dynamically variable supply voltage that
can be adjusted based on 1) process and temperature variations
and 2) processing load as measured by the number of data sam-
ples queued in the input (or output) buffer. The first adjustment
allows the minimum possible voltage for current process and
temperature conditions to be used for a given clock frequency,
whereas the second adjustment allows the computation to be
slowed down and the supply voltage correspondingly lowered
during periods of low workload. Dynamically adapting voltage
to operate at the point of lowest power consumption for given
temperature and process parameters was first suggested by
Macken et al. [36]. Nielsen et al. [39] extended the dynamic
voltage adaptation idea to take into account data-dependent
computation times in self-timed circuits. Recently, researchers
at the Massachusetts Institute of Technology [4] have extended
the idea of voltage adaptation based on data-dependent com-
putation time from [39] to synchronously clocked circuits. The
approach is useful when the computation time is data depen-
dent, which is often the case in signal processing, for example,
a forward error corrector chip for a digital cassette recorder
where the processing time depends on the number of errors
in the input sample. Buffering data and averaging processing
rate can yield power reductions of an order of magnitude
in some applications, but with increased latency [18], [39].
Because these approaches rely on run-time reactive approaches
to dynamic voltage adaptation, they work fine only where aver-
age throughput is the metric of performance. Therefore, these
approaches are inapplicable to hard real-time systems such as
embedded control applications with strict timing requirements.
There has been research on scheduling strategies for ad-
justing CPU speed so as to reduce power consumption. Most
of the existing work is in the context of a nonreal-time
workstation-like environment. Weiser et al. [52] proposed an
approach where time is divided into 10?50-ms intervals, and
the CPU clock speed (and voltage) is adjusted by the task-
level scheduler based on the processor utilization over the
preceding interval. The CPU speed is slightly lowered or
raised depending on whether the utilization has been low
or high, and is raised to the maximum level if the CPU is
falling behind. Energy is saved by trying to spread cycles
into what would otherwise have been idle time. Govil et
al. [17] enhanced [52] by proposing and comparing several
predictive and nonpredictive approaches for voltage changes
and concluded that smoothing helps more than prediction.
Yao et al. [54] described a minimum energy schedule and an
average rate heuristic for job scheduling with preemption for
independent processes with deadlines.
Chou and Borriello [8] presented a static nonpreemptive
software scheduling algorithm for reactive hard real-time sys-
tems in the context of hardware/software cosynthesis. Given
a constraint graph for tasks with several timing constraints,
their algorithm is guaranteed to find a feasible ordering if
one exists. The drawback of the algorithm is high complexity,
which is exponential in the worst case. Dave et al. [10] have
also proposed a real-time scheduling algorithm for software
scheduling, which allows both preemptive and nonpreemptive
scheduling.
Landman and Rabaey [31], [32] have performed the work
on macro- and microlevel power estimation. The work at
Princeton and Fujitsu on the estimation of power consumption
for programmable processors [50] is of particular relevance.
Evans and Franzon [12] and Su and Despain [48] have
proposed power estimation models for cache, specifically for
SRAM. Many researchers have proposed power-efficient cache
designs, e.g., [28]. Compiler-directed techniques for minimiz-
ing power consumed by cache have also received significant
attention in the research community [30], [40]. Panda et
al. [42] presented a technique for determining the best data
cache size for a given application. Panda et al. [41] presented
techniques to minimize data cache conflicts for direct-mapped
caches by analyzing the array access patterns. Kirovski et al.
[26] developed techniques to minimize instruction/data cache
conflicts for direct-mapped caches. The techniques proposed
in [26], [30], and [40]?[42] are complementary to our ap-
proach since they can be applied as a preprocessing step. The
techniques proposed in [26], [30], and [40] reduce the cache
conflicts, while those proposed in [41] and [42] reduce the size
of the cache or improve the performance of the cache, mainly
by optimizing memory and cache data assignment. While the
previous works concentrated on the memory and cache data
assignment, our work considers other complementary issues
such as processor selection, cache sizing, task scheduling, and
voltage scheduling simultaneously.
HONG et al.: POWER OPTIMIZATION OF VARIABLE-VOLTAGE CORE-BASED SYSTEMS 1705
One well-known technique that results in substantial power
reduction is reducing the voltage to reduce the switching
power, which is the dominant source of power dissipation
in a CMOS circuit and is proportional to , where is
the supply voltage [6]. Indeed, supply-voltage reduction of
VLSI chips to save power has long been used in memories
and watches, where the on-chip supply voltage is reduced
to a fixed value [36]. However, this reduction came with a
speed penalty due to increased gate delays. Chandrakasan et
al. [6], [5] have shown that the strategy of operating at a
fixed reduced voltage can be coupled with architectural-level
parallelism and pipelining to compensate for lower clock rate
due to voltage reduction so that the overall throughput remains
the same but the overall power is still lowered, although at the
cost of increased latency.
Several researchers have addressed the issue of power in
event-driven systems and proposed various techniques for
shutting down the system or parts of the system. Examples
include the predictive shutdown of software programmable
processor in a wireless X terminal [22], [46] and disk shut-
down [11]. At an implementation level, circuit techniques for
low-power hardware have been proposed [6]. Compilation
techniques for low-power software have emerged for both
general-purpose computation [50] and DSP computations [21].
Numerous behavioral synthesis research efforts have also
addressed power minimization; for example, see the survey
of [43].
Four research groups in particular have addressed the use
of multiple (in their software implementation, restricted to
two or three) different voltages for behavioral synthesis [7],
[23], [35], [44]. Interestingly, although they used the term
?variable voltage,? they actually addressed scheduling when
a fixed number of simultaneously available voltages are used.
Therefore, there is no similarity between our work and these
efforts beyond accidental similarity of terminology.
III. PRELIMINARIES
In this section, we describe the task model and hardware and
power models for processor cores, caches, and variable-voltage
hardware.
A. Computational and Timing Models
A set of independent tasks is to be executed on a SOC.
Each task is associated with the following parameters:
? is its arrival time;
? is its deadline;
? is its period;
? is its execution time at the nominal (maximum possi-
ble) voltage level .
We assume that all tasks are defined on semi-infinite or
very long streams of data. A task consumes one set of input
and produces one set of output at each iteration. The duration
of each iteration for a task is a period for the task. The
execution time of a task is a worst case execution time since all
tasks should always satisfy their timing constraints. Many real-
time systems can be modeled using our computational/timing
model. An example system is a set of avionics tasks that
runs on an avionics application-specific system [1], [45]. The
task set consists of 24 independent periodic tasks. Another
example is a multimedia server that provides multimedia data
to multiple users [37].
Without loss of generality, we assume that all tasks have
identical periods. When this is not the case, a simple pre-
processing step and application of the least common multiple
(LCM) theorem [33] transforms an arbitrary set of periods to
this design scenario in polynomial time. As a consequence,
when the LCM theorem is applied, there may be a need
to run several iterations of a given task within this overall
period. Since context switching overhead is usually high for
modern systems, we assume no task preemption. Note that
this preemption restriction actually does not affect any of
the proposed methods, since in all discussed design cases
nonpreemptive policies yield superior results in comparison
with preemptive policies.
B. Power Model
It is well known that there are three principal components
of power consumption in CMOS integrated circuits: switching
power, short-circuit power, and leakage power. The switching
power is given by [6], as indicated earlier.
is defined to be effective switched capacitance. In CMOS
technology, switching power dominates power consumption. It
is also known that reduced voltage operation comes at the
cost of reduced throughput [6]. The gate delay follows
the following formula: where
is a constant [6]. The maximum rate at which a circuit is
clocked monotonically decreases as the voltage is reduced.
From these equations together with the observation that the
speed is proportional to and inversely proportional to the
gate delay, the power-versus-speed curve can be derived, in
particular the normalized power at
as a function of the normalized speed at some
reference For example, assuming V
and V, the power-versus-speed curve follows the
following equation:
By varying the system can be made to operate at different
points along this curve.
From the equation, it is easy to see that the power is a
convex function of speed. It immediately follows that if there
is a single computation that needs to be executed by some
deadline, it is always better to run the computation at a constant
speed corresponding to the average speed required than to
change the speed (and correspondingly voltage) during the
computation. In the proposed scheduling heuristic, we use this
observation so that each task is executed at a single voltage.
However, the level of this single voltage may change from
task to task, and one of our goals is to determine this ?single?
voltage for each task.
We assume that the power consumption of software running
on a processor is proportional to the number of instructions.
1706 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
This assumption is based on the fact that for some modern
processors such as the Fujitsu SPARClite MB86934, a 32-bit
RISC microcontroller, power is directly proportional to the
number of operations, regardless of the mix of operations
being executed [51]. Tiwari and Lee [51] report that all the
operations including integer arithmetic logic unit instructions,
floating-point instructions, and load/store instructions incur
similar power consumption, whereas different instructions
draw different amounts of power consumption for other pro-
cessors such as Intel 486DX2 and Fujitsu DSP processor [34],
[50]. In many cases, the same instruction draws different
amount of currents depending on the context such as the
predecessor and successor instructions [50]. When the power
consumption depends on the mix of operations being executed,
as in the case of the Intel 486DX2 and Fujitsu DSP processor
[34], [50], s more detailed hardware power model may be
needed. Note that our approach does not rely on any specific
power model and thus can be applicable with any power
model.
C. Architecture and Hardware Model
Several factors combine to influence system performance:
instruction and data cache miss rates and penalty, processor
performance, and system clock speed. Power dissipation of
the system is estimated using processor power dissipation per
instruction and the number of executed instructions per task,
supply voltage, energy required for cache read, write, and off-
chip data access, as well as the profiling information about
the number of cache and off-chip accesses. The approach
[26], [27] that we use to realistically address the factors used
to estimate power consumption leverages on existing cache
and processor models. This enables the synthesis approach
to be rapidly updated and applied to environments with new
technologies.
The system performance is computed using the following
formula for cycles per instruction (CPI):
CPI
MIPS
Cache Miss Ratio
D-Cache Miss Ratio
Cache Miss Penalty
where Cache Miss Ratio was computed during the trace
driven simulation of the cache subsystem and or
is a system clock frequency, and MIPS stands for million
instructions per second. Cache Miss Penalty, and MIPS
are system parameters introduced in Section III. The system
energy consumption (SEC) is computed using the following
formula:
SEC
where is the total number of cycles required to execute
the application, is the number of cycles while the
CPU is idle waiting for data to be fetched from main memory
to cache (blocking caches are assumed for low hardware
cost), is the average energy consumed per cycle
for a particular processor core, and are the
number of cache accesses and energy consumed per access,
respectively, and and are the number of main
memory accesses (reads due to cache misses and write-
back updates) and energy consumed per memory access,
respectively. Since power consumed by the cache decoder
represents approximately 30% of the overall cache power
consumption [12], the power savings due to the decrease of
transitions in the cache decoder are accounted for by scaling
down 30% of the total cache power consumption with the
ratio of the number of transitions due to address change in the
optimized (relocated) and traditional basic block schedules.
The constants in the formulas have been either obtained from
the manufacturer?s datasheets or computed using our system
performance and power evaluation and simulation platform.
The above formulas have been adopted from [26] and [27].
Data on microprocessor cores have been extracted from
manufacturer?s datasheets as well as from the CPU Center
Info Web site.
1
We assume that the processor cores follow the
power model described in the preceding subsection.
We use CACTi [53] as a cache delay estimation tool with
respect to the main cache design choices: size, associativity,
and line size. The energy model is adopted from [12] and
[48]. The overall cache model is scaled with respect to actual
industrial implementations. Caches typically found in current
embedded systems range from 128 B to 32 KB. Although
larger caches correspond to higher hit rates, their power con-
sumption is proportionally higher, resulting in an interesting
design tradeoff [12], [28]. Higher cache associativity results
in significantly higher access time [53]. We use a recently
developed compiler strategy that efficiently minimizes the
number of cache conflicts for direct-mapped caches [26]. We
have experimented with two-way set-associative caches, which
did not dominate comparable direct-mapped caches in a single
case. Cache line size was also variable in our experimentations.
Its variations corresponded to the following tradeoff: larger
line size results in higher cache miss penalty delay and
higher power consumption by the sense amplifiers and column
decoders, while smaller line size results in larger cache decoder
power consumption. Extreme values result in significantly
increased access time. We estimated the cache miss penalty
based on the operating frequency of the system and external
bus width and clock for each system investigated. This penalty
ranged between four and 20 system clock cycles. Write-back
was adopted as opposed to write-through, since it is proven to
provide superior performance and especially power savings in
uniprocessor systems at increased hardware cost [24]. Each of
the processors considered is constrained by the Flynn?s limit
[13] and is able to issue at most a single instruction per clock
period. Thus, caches were designed to have a single access
port. Cache access delay and a power consumption model were
computed for a number of organizations and sizes, assuming
the feature size of 0.5 m and typical six transistors per CMOS
SRAM cell implementation. The nominal energy consumption
per single off-chip memory access, 98 nJ, is adopted from [14].
1
See http://infopad.eecs.berkeley.edu/CIC/.
HONG et al.: POWER OPTIMIZATION OF VARIABLE-VOLTAGE CORE-BASED SYSTEMS 1707
D. Variable-Voltage Hardware
The variable voltage is generated by the dc?dc switching
regulators in the power supply, which is essentially a feedback
control system. The time for the voltage to reach a steady state
at the new voltage is a strong function of the feedback loop
parameters and the LC output filter in the switching regulator.
Reference [38] reported efficient DC?DC switching regulators
with fast transition times. DC?DC converters with faster
transition times may be obtained with the cost of increased
power dissipation within the converter. Dual to the supply-
voltage variation is the accompanying variation of the clock
frequency. The clock frequency also takes time to stabilize at
the new value. The time and power overhead associated with
voltage switching is negligible [38] and is not considered in
the analysis. In addition, the computation itself can continue
even during the voltage and frequency transition.
E. Summary of Theoretical Results
The nonpreemptive scheduling of a set of independent
tasks with arbitrary arrival times and deadlines on a fixed
voltage processor is an NP-complete task [15]. Therefore, the
nonpreemptive scheduling of a set of independent tasks with
arbitrary arrival times and deadlines on a variable-voltage
processor is also an NP-complete task since the preceding
problem is its special case. Yao et al. [54] have provided the
optimal preemptive static scheduling algorithm for a set of
independent tasks with arbitrary arrival times and deadlines.
The running time of the algorithm is for
tasks. Since any optimal nonpreemptive scheduling solution
is always worse than or equal to the solution of its preemptive
version when there is no overhead for preemption, it can serve
as a lower bound for nonpreemptive scheduling solutions. This
lower bound plays a crucial role in speeding up our branch and
bound algorithm for the nonpreemptive scheduling problem.
IV. DESIGN METHODOLOGY FOR VARIABLE-VOLTAGE
SYSTEMS:GLOBAL DESIGN FLOW
In this section, we describe the global flow of the proposed
synthesis system and explain the function of each subtask
and how these subtasks are combined into a synthesis system.
The goal is to choose the configuration of processor, I-cache,
and D-cache and the variable-voltage task schedule with
minimum power consumption that satisfies the requirements of
multiple nonpreemptive tasks. To accurately predict the system
performance and power consumption for target applications,
we employ the approach that integrates the optimization,
simulation, modeling, and profiling tools. The synthesis tech-
nique considers each nondominated microprocessor core and
competitive cache configuration and selects the hardware setup
that requires minimal power consumption and satisfies the
individual performance requirements of all target applications.
The application-driven search for a low-power core and cache
system requires usage of trace-driven cache simulation for
each promising point considered in the design space. We attack
this problem by carefully scanning the design space using
search algorithms with sharp bounds and by providing power-
ful algorithmic performance and power estimation techniques.
We use the system performance and power evaluation and
simulation platform based on SHADE [9] and DINEROIII
[19]. SHADE is a tracing tool that allows users to define
custom trace analyzers and thus collect rich information on
run-time events. SHADE currently profiles only SPARC bina-
ries. The executable binary program is dynamically translated
into host machine code. The tool also provides a stream of
data to the translated code, which is directly executed to
simulate and trace the original application code. A custom
analyzer composed of approximately 2500 lines of C code is
linked to SHADE to control and analyze the generated trace
information. The analyzer sources relevant trace information
from SHADE and builds a control flow graph (CFG) cor-
responding to the dynamically executed code. The analysis
consists of two passes. The first pass determines the boundaries
of basic blocks, while the second pass constructs a CFG by
adding control flow information between basic blocks. We also
collect the frequencies of control transfers through each basic
block and branch temporal correlation [26]. Once the CFG is
obtained, an algorithm is employed to reposition application
basic blocks in such a way that instruction cache misses and
cache decoder switching activity are minimized [27]. Our
experimentation uses a basic block relocation lookup table
to simulate the relocation of basic blocks in main memory.
An entry in the basic block relocation table consists of
two elements: the original and optimized starting address of
the basic block. To simulate cache performance of a given
application and data stream, we use a trace-driven cache
simulator DINEROIII. Cache is described using a number of
qualitative and quantitative parameters such as instruction and
data cache size, replacement policy, associativity, etc.
The system optimization process is composed of a sequence
of activations of these tools. The SHADE analyzer traces
program and data memory references as well as the CFG.
The CFG is used to drive the code reposition module, which
produces a new application mapping table. Streams of refer-
ences are sent to a program that uses the basic block relocation
lookup table to map from the original address space into the
optimized address space. The remapped trace of addresses,
along with all unmodified data memory references, is sent to
DINEROIII for cache simulation. We use our custom program,
which computes performance and power consumption based
on the formulas given in Section III-C.
The synthesis search loop generates iteratively hardware
configurations. For each nonevaluated cache configuration, the
synthesis platform estimates its performance when a particular
application is executed. Then, the performance data are used
during the scheduling process for estimation of task run times.
If the scheduler finds a feasible schedule, it reports the power
consumption of this hardware configuration to the resource
allocation algorithm.
V. SYNTHESIS ALGORITHMS
The optimization problems encountered in the synthesis of a
low-power variable-voltage SOC and competitive optimization
algorithms are described in this section.
1708 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
Fig. 1. Pseudocode for the resource allocation procedure.
A. Resource Allocation
In this phase of the synthesis approach, a search is con-
ducted to find an energy-efficient system configuration. The
search algorithm is described using the pseudocode shown in
Fig. 1. Since performance and power evaluation of a single
processor, I- and D-cache configuration requires a trace-driven
simulation, the goal of our search technique is to reduce
the number of evaluated cache systems using sharp bounds
for cache system performance and power estimations. In
addition, a particular cache system is evaluated using trace-
driven simulation only once since the data retrieved from such
simulation can be used for overall system power consumption
estimation for different embedded processor cores with minor
additional computational expenses.
The algorithm excludes from further consideration processor
cores dominated by other processor cores. One processor
type dominates another if it consumes less power at same or
higher frequency and results in higher MIPS performance at
the same nominal power supply. The competitive processors
are then sorted in ascending order with respect to their
power consumption per instruction and clock frequency ratio.
Microprocessors, which seem to be more power efficient, are
therefore given priority in the search process.
The search for the most efficient cache configuration is
bounded with sharp bounds. A bound is determined by mea-
suring the number of conflict misses and comparing the
energy required to fetch the data from off-chip memory due
to measured conflict misses and the power that would have
been consumed by twice larger cache for the same number
of cache accesses assuming zero cache conflicts. Note that
we assume that cache power is a monotone function of size.
We terminate further increase of the cache structure when the
power consumption due to larger cache would be larger than
the energy consumed by the current best solution. Similarly,
another bound is defined at the point when the energy required
to fetch the data from off-chip memory, due to conflict
cache misses for twice smaller cache with the assumption of
zero-energy consumption per cache access, is larger than the
energy required for both fetching data from cache and off-chip
memory in the case of the current cache structure. We abort
further decrease of the cache structure if the amount of energy
required to bring the data due to additional cache misses from
off-chip memory is larger than the energy consumed by the
current best solution.
When evaluating competitive hardware configurations, the
target applications are scheduled with the variable-voltage task
scheduler using the predicted system performance and power
on the configuration considered. Before the nonpreemptive
task scheduling is performed, a more efficient optimal pre-
emptive scheduler [54] is applied to get a lower bound. If the
result is worse than the current best solution, the configuration
is worse than the current best one.
B. Task Scheduling
As described in Section III, the nonpreemptive scheduling
of a set of independent tasks with arbitrary arrival times and
deadlines on a variable-voltage processor is an NP-complete
task. We have developed an efficient and effective heuristic for
the general variable-voltage task-scheduling problem, which
leverages least constraining, most-constrained heuristic para-
digm in order to obtain competitive solutions. The algorithm
is described using the pseudocode shown in Fig. 2.
We first define the terms used to describe the algorithm.
The lifetime of a task is defined as the time between the
task?s arrival time and its deadline. The time partition and
time interval represent any contiguous time period between
two time points. In particular, the time partition is used to
refer to one of the two split time intervals of a task. The
start (finish) time of a task is the time when the task starts
(finishes) its execution.
The algorithm consists of two phases. In the first, all tasks
are scheduled at the nominal voltage. If a feasible schedule is
found in the first phase, in the second phase the supply voltages
are adjusted for low power. These two phases are repeated
times, and the best solution obtained from all the
iterations is chosen as the final solution. Each iteration of the
two phases generates different solutions since the objective
functions used in the first phase to guide the search strategy
are randomized by a random offset. The complexity of the
algorithm is for tasks.
In the first phase, the task scheduling at the nominal voltage
is done in the following way. Initially, the time period in which
the tasks appear is divided into time regions such that each
time region has its start or end time equal to a start or end
time of some task and there does not exist a task that has
either an arrival or deadline time within the time region. Then,
for each time region an objective function is
computed as
where the functions and
HONG et al.: POWER OPTIMIZATION OF VARIABLE-VOLTAGE CORE-BASED SYSTEMS 1709
Fig. 2. Pseudocode for the variable-voltage task-scheduling algorithm.
assume that all tasks are scheduled from their arrival until their
deadline regardless of time overlaps and return the maximum
and average voltage for all tasks alive at time region
respectively. Larger implies that the time region
is more constrained. The objective functions (or constraints)
of time regions are used to compute the constraints of tasks.
For each task an objective function is computed as
Larger implies that the task is more constrained.
The most constrained task which does not completely
include the lifetime of another tasks in its lifetime, is then
scheduled in the time interval equal to its run time at the
nominal voltage, which spans over a subset of time regions
with the lowest sum of objective functions among any feasible
subset of time regions. When comparing the subset of time
regions, we only consider the cases that the start time of
the task is only at the start time of each time region in
the task?s lifetime or the end time of the task is only at
the end time of each time region. For each time interval
[start, end] of all eligible time intervals, we compute
We schedule the task in the
time interval with the smallest Upon scheduling the task
other tasks? deadlines and arrival times are updated in the
following way.
? If the scheduled task?s lifetime is included in the lifetime
of some other task , and the task can be scheduled
at both partitions of its lifetime upon deleting the time
regions that contain the task then both partitions of
the task?s lifetime are evaluated for scheduling. Objective
functions are computed for all the time intervals with
their length equal to the task?s nominal voltage run time
such that the start time of the time interval is only at
the start time of each time region, or the end time of the
time interval is only at the end time of each time region,
shown in the lower part of Fig. 3. For each time interval,
we compute using the same formula in the above. The
partition that includes the time interval with the minimal
is selected since it is likely to least constrain other
tasks in the later steps.
? Otherwise, each remaining task?s arrival time and dead-
line is updated if its lifetime intersects with the lifetime
of the task
This process is repeated until all tasks are scheduled at the
system nominal voltage.
We explain the process using the example shown in Fig. 3.
Six tasks T[0], T[1], T[2], T[3], T[4] and T[5] are shown with
their arrival, deadline, and nominal voltage run times. The set
of tasks creates a set of nine time regions TR[0]?TR[8]. Based
on objective functions computed, task T[4] is selected as the
most constrained. Once this task is scheduled, the arrival times
and deadlines of tasks T[2] and T[3] are updated as shown (see
arrows in Fig. 3). However, task T[1] can be scheduled at both
TR[1, 2, 3] and TR[5, 6, 7]. There are five possible schedules
with respect to the existing time regions at TR[1, 2, 3] and
three (actually five, but only three distinct) at TR[5, 6, 7],
as shown in the bottom part of Fig. 3. The least constraining
schedule (marked in Fig. 3) is in the TR[5, 6, 7] partition since
during these time regions, T[1] can be scheduled in the time
interval, where there are no other alive tasks. Therefore, the
lifetime of task T[1] is reduced to TR[5, 6, 7]. Next, the task
T[3] is most constrained. The task T[3] is scheduled at the
start of TR[5], and the lifetime of the task t[1] is adjusted.
1710 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
Fig. 3. Scheduling tasks at nominal voltage.
Fig. 4. Adjusting the voltage of scheduled tasks for low power.
The task T[5] is now most constrained and is scheduled to be
finished at the end time of TR[8]. The lifetime of the task T[1]
is again adjusted. At this point, the lifetimes of the tasks T[0],
T[1], and T[2] do not have any intersections with other tasks?
lifetimes, so they are scheduled at their arrival times. Note that
the tasks can be also scheduled to be finished at their deadlines
since both schedules have the same value. When the two
schedules have the same value, we choose the one with
earlier start time in our implementation.
Once tasks are scheduled, the procedure that tunes the
supply voltage of each task is invoked. It iteratively traverses
the set of tasks and tries to lower the voltage of each task until
no energy dissipation improvement occurs. Whenever a task
has a neighboring task with a lifetime that intersects the
lifetime of and supply voltage lower than the border
between the tasks is moved according to Fig. 4. For example,
in Fig. 4, task T[i] has been assigned a supply voltage higher
than task T[i-1]. The new arrival time of T[i] and deadline
time of T[i-1] is moved to the arithmetic mean of the current
arrival/deadline time, and the arrival/deadline time as if T[i-1]
is supplied with T[i]?s voltage V[i]. When the arithmetic mean
is later (earlier) than the deadline (arrival time) of T[i], we
choose the deadline (arrival time) as the new schedule time.
For a fixed task-scheduling problem, this voltage adjustment
algorithm results in solutions arbitrarily close to the optimal
solution, given the task schedule at the nominal voltage.
HONG et al.: POWER OPTIMIZATION OF VARIABLE-VOLTAGE CORE-BASED SYSTEMS 1711
VI. EXPERIMENTAL RESULTS
In this section, we report the results of the conducted
experiments. First, we describe the benchmarks and its input
data used for the evaluation of our approach. Next, we analyze
the experimental results.
We used nine applications, shown in Table II, to demon-
strate the effectiveness of the approach. All nine applica-
tions are in the public domain and can be obtained via
the Internet.
2
JPEG software from the Independent JPEG
Group implements JPEG baseline, extended-sequential, and
progressive compression processes. We used integer discrete
cosine transform (DCT) for decoding a JPEG file to a PPM
file. Integer DCT and progressive compression process options
were used for compression. GSM software used in the exper-
imentation is an implementation of the European GSM 06.10
provisional standard for full-rate speech transcoding, prI-
ETS 300 036, which uses residual pulse excitation/long term
prediction coding at 13 kbit/s. A 16-bit linear pulse-code-
modulation data file is used to measure execution parameters
of the GSM encoder. The measurement of the GSM decoder
execution characteristics was done using the output of the
GSM encoder. EPIC (Efficient Pyramid Image Coder) imple-
ments lossy image compression and decompression utilities.
The compression algorithm is based on a critically sampled
(imperfect-reconstruction) dyadic wavelet decomposition and
a combined run-length/Huffman entropy coder. The remaining
benchmarks (Mipmap, Osdemo, and Texgen) use a three-
dimensional (3-D) graphic library called Mesa. The author of
the library claims that Mesa is a valid alternative to OpenGL
since it uses the OpenGL application programming interface.
Mipmap is a simple mipmap texture mapping example. Os-
demo is a demo program that draws a set of simple polygons
using the Mesa 3-D rendering pipe. Last, Texgen renders a
texture mapped version of the Utah teapot.
In order to evaluate the efficiency of our approach for
variable-voltage system synthesis, we conducted a set of ex-
periments. The processor cores are assumed operational at the
dynamically variable [0.8, 3.3]-V supply voltage. Performance
and power consumption data have been scaled according
to the performance-versus-voltage and power-versus-voltage
relationships, respectively. We experimented with our variable-
voltage scheduling algorithm by scheduling six different sets
of applications and timing constraints on a number of hardware
configurations. Each application set was defined using a set of
tasks Each task encompasses one exe-
cution of an application from Table II. For each application,
we established four different quality regimes, which resulted
in four different power-timing performances. The regime of
the application execution was not changed at run time. The
tasks were scheduled within the least common multiple of
their periods. The timing constraints (start and end time) were
determined pseudorandomly with a goal of achieving tight
schedules. The input traces of the task mix have been adopted
from state-of-the-art video stream modeling reports [29]. We
compared the obtained results (column new) with a lower
bound obtained using Yao?s preemptive scheduling algorithm
2
See http://www.cs.ucla.edu/ leec/mediabench/applications.html
TABLE II
ABRIEF DESCRIPTION OF THE APPLICATIONS USED IN THE
EXPERIMENTATION: IC?NUMBER OF DYNAMIC INSTRUCTIONS IN MILLIONS
with assumed zero preemption cost (column LB). We also
compared them with the results of a fixed voltage system,
where the fixed voltage used was the minimum voltage that can
satisfy all constraints (column FV). Last, to demonstrate the
effectiveness of variable voltage systems over multiple voltage
systems, we have performed an additional comparison with a
system supplied with three different optimized voltage supplies
(column 3VS). The results are presented in Table III. The val-
ues in Table III are SEC values defined in Section III-C. The
low-power processor-cache application-mix champion config-
urations are described in the first five columns of Table III.
The next six columns present the power consumption lower
bound, the result obtained using our algorithm, and the result
obtained by a fixed voltage system for six different sets of
applications extracted from Table II.
Note that our nonpreemptive scheduling algorithm resulted
in solutions that were only 1.47% higher on average than the
lower bound determined using Yao?s preemptive scheduling
algorithm over a series of more than 10 scheduling solutions.
In addition to the results presented in Table III, in order to
emphasize the advantage of relying on variable voltage while
scheduling, we report an average standard voltage deviation
per schedule to be 0.61 V over the same set of scheduling
solutions. The average maximal voltage per schedule was
3.12 V, while the average minimal voltage was 1.08 V with
respect to the nominal 3.3 V and threshold voltage V.
The importance of effective allocation for more complex
systems is emphasized in Table III, where for each application
mix and set of champion configurations, the average power
consumption among the champion configurations is presented
in rows 8, 16, and 24. As shown, the importance of having
allocation analysis is increasingly important as the system
complexity increases. While for the ten-task mix the best
configuration is only 3% better than the average, our allocation
technique finds a configuration for the case of a 50-task mix
such that 25% power savings are achieved compared to the
average configuration.
We have also applied the variable voltage synthesis strategy
to a problem of scheduling 24 avionics tasks [1] and schedul-
ing 104 tasks of an elbow manipulator [25]. Since the code for
both applications was not available, we did not use the code
relocation technique in the optimization process. However, the
power consumption was obtained by employing the scheduling
technique for variable voltage and a single instruction per cycle
execution model. The energy dissipation equaled:
1712 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
TABLE III
EVALUATION OF THE SCHEDULING ALGORITHM ON A SET OF ARCHITECTURES AND APPLICATION MIXES:
LB?LOWER BOUND, FV?FIXED VOLTAGE APPROACH, 3VS?THREE FIXED VOLTAGE SUPPLIES
?LB J, New J, 3VS J, and FV J
for the set of 24 avionics tasks;
?LB 107J, New J, 3VS J, and FV
J for the set of 104 tasks of the elbow manipulator.
The results were obtained respectively for the Yao?s pre-
emptive scheduling algorithm with assumed zero preemp-
tion cost (column LB), our scheduling heuristics for variable
voltage (column New), multiple voltage (three voltage sup-
plies?column 3VS), and fixed voltage systems (column FV).
VII. DISCUSSION
Variable-voltage cores involve overhead in terms of area,
timing, and power due to the variable-voltage supply hardware.
According to the results of Suzuki et al. [49], this overhead
is negligible for a processor core, which our research targets.
When this overhead outweighs the gains from power reduction,
other approaches such as using multiple voltages may be more
effective.
We currently consider only nonpreemptive scheduling poli-
cies. Since we target a static scheduling problem, both pre-
emptive and nonpreemptive scheduling policies can be used.
There are two advantages of using nonpreemptive scheduling
policies over preemptive ones. First, modern processors that
use deep pipelining and high clock speed have high preemption
overhead. Thus, the gains by using preemptive scheduling
policies do not usually justify the high overhead. Second,
preemptive scheduling algorithms require higher overhead for
run-time scheduling decisions. On the other hand, introduction
of preemption to a scheduling problem can greatly simplify the
problem. While a scheduling problem can be an optimization
problem of polynomial time complexity when preemption
is assumed, it might be an NP-complete problem when no
preemption is allowed [15]. Considering the high context
switching cost for preemption, however, the cost should be
taken into account in the preemptive scheduling problem,
which might reestablish its computational intractability. We
note that this overhead may not be critical for some modern
systems. For such systems, it might be better to use pre-
emptive scheduling over nonpreemptive scheduling. However,
we should also consider the overhead of changing voltage
levels in the preemptive scheduling problem since we may
need to change the voltage levels frequently and drastically
(e.g., from 3 to 1 V and vice versa). In the nonpreemptive
scheduling problem, such overhead can be considered negligi-
ble. Yao et al. [54] has proposed an optimal algorithm for
the preemptive scheduling problem, if we do not consider
the voltage change overhead. It would be an interesting
research problem to determine a better scheduling policy for
the given design scenario with a realistic preemption penalty
model. In our design cases, nonpreemptive policies yielded
superior results in comparison with preemptive policies when
considering the preemption penalty of context switch and
voltage changes.
In this research, we assume that the time taken to reach
steady state at the new voltage and power consumed by the
dc?dc converter is negligible. Namgoong et al. [38] reported
that the time taken to reach steady state at the new voltage
was under 6 ms/V. For some fine-grained applications, this
timing overhead may be significant and should be considered.
HONG et al.: POWER OPTIMIZATION OF VARIABLE-VOLTAGE CORE-BASED SYSTEMS 1713
Suzuki et al. [49] reported that the area overhead of the
dc?dc converter was under 1% of the chip size. Based on the
observation that power consumption is roughly proportional to
area, we estimate the power consumed by the converter will
be less than 1 mW for the processor. We believe this power
overhead is negligible.
Although many real-time application-specific systems can
be modeled using our computational, timing, and hardware
model, we can generalize our approach to handle more general
models. Possible directions are to handle dependency con-
straints and data-dependent task execution times. We should
also verify if our approach can be generalized for other
architectures such as superscalar and multiprocessor machines.
VIII. CONCLUSION
We developed the design methodology for the low-power
core-based real-time SOC-based on dynamically variable-
voltage hardware. The key contribution was to develop
effective scheduling techniques that treat voltage as a
variable to be determined, in addition to the conventional
task scheduling and allocation. Our synthesis technique
also addressed the selection of the processor core and the
determination of the instruction and data cache size and
configuration so as to fully exploit dynamically variable
voltage hardware, which resulted in significantly lower power
consumption for a set of target applications than existing
techniques. The highlight of the proposed approach was
the nonpreemptive scheduling heuristic, which resulted in
solutions very close to optimal ones for many test cases. The
effectiveness of the approach was demonstrated on a variety
of modern industrial-strength multimedia and communication
applications, such as MPEG and JPEG encoders and decoders.
REFERENCES
[1] J. A. Bannister and K. S. Trivedi, ?Task allocation in fault-tolerant
distributed system,? Acta Informatica, vol. 20, no. 3, pp. 261?281, 1983.
[2] T. D. Burd and R. W. Brodersen, ?Processor design for portable
systems,? J. VLSI Signal Process., vol. 13, nos. 2?3, pp. 203?221, 1996.
[3] R. Camposano and J. Wilberg, ?Embedded system design,? Design
Automation Embedded Syst., vol. 1, nos. 1?2, pp. 5?50, 1996.
[4] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, ?Data driven signal
processing: An approach for energy efficient computing,? in Proc. Int.
Symp. Low Power Electronics and Design, pp. 344?352, 1996.
[5] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R.
Brodersen, ?Optimizing power using transformations,? IEEE Trans.
Computer-Aided Design, vol. 14, no. 1, pp. 12?31, 1995.
[6] A. P. Chandrakasan, S. Sheng, and R. W. Broderson, ?Low-power
CMOS digital design,? IEEE J. Solid-State Circuits, vol. 27, no. 4, pp.
473?484, 1992.
[7] J.-M. Chang and M. Pedram, ?Energy minimization using multiple
supply voltages,? in Proc. Int. Symp. Low Power Electronics and Design,
1996, pp. 157?162.
[8] P. Chou and G. Borriello, ?Software scheduling in the co-synthesis of
reactive real-time systems,? in Proc. Design Automation Conf., 1994,
pp. 1?4.
[9] B. Cmelik and D. Keppel, ?Shade: A fast instruction-set simulator for
execution profiling,? in Proc. SIGMETRICS Conf. Measurement and
Modeling of Computer Systems, vol. 22, no. 1, 1994, pp. 128?137.
[10] B. P. Dave, G. Lakshminarayana, and N. K. Jha, ?COSYN: Hardware-
software co-synthesis of embedded systems,? in Proc. Design Automa-
tion Conf., 1997, pp. 703?708.
[11] F. Douglis, P. Krishnan, and B. Bershad, ?Adaptive disk spin-down
policies for mobile computers,? in Proc. USENIX Symp. Mobile and
Location-Independent Computing, 1995, pp. 121?137.
[12] R. J. Evans and P. D. Franzon, ?Energy consumption modeling and
optimization for SRAM?s,? IEEE J. Solid-State Circuits, vol. 30, no. 5,
pp. 571?579, 1995.
[13] M. J. Flynn, Computer Architecture: Pipelined and Parallel Processor
Design. Boston, MA: Jones and Bartlett, 1996.
[14] R. Fromm, S. Perissakis, N. Cardwell, C. Kozyrakis, B. McGaughy,
D. Patterson, T. Anderson, and K. Yelick, ?The energy efficiency of
IRAM architectures,? in Proc. Int. Symp. Computer Architecture, 1997,
pp. 327?337.
[15] M. R. Garey and D. S. Johnson, Computer and Intractability: A Guide
to the theory of NP-Completeness. New York: Freeman, 1979.
[16] R. Gonzalez and M. Horowitz, ?Energy dissipation in general purpose
microprocessors,? IEEE J. Solid-State Circuits, vol. 31, no. 9, pp.
1277?1284, 1996.
[17] K. Govil, E. Chan, and H. Wasserman, ?Comparing algorithms for
dynamic speed-setting of a low-power CPU,? in Proc. ACM Int. Conf.
Mobile Computing and Networking, 1995, pp. 13?25.
[18] V. Gutnik and A. Chandrakasan, ?An efficient controller for variable
supply-voltage low power processing,? in Proc. Symp. VLSI Circuits,
1996, pp. 158?159.
[19] M. D. Hill and A. J. Smith, ?Evaluating associativity in CPU caches,?
IEEE Trans. Comput., vol. 38, no. 12, pp. 1612?1630, 1989.
[20] I. Hong, D. Kirovski, G. Qu, M. Potkonjak, and M. Srivastava, ?Power
optimization of variable voltage core-based systems,? in Proc. Design
Automation Conf., 1998, pp. 176?181.
[21] I. Hong, M. Potkonjak, and R. Karri, ?Power optimization using
divide-and-conquer techniques for minimization of the number of oper-
ations,? in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 1997,
pp. 108?113.
[22] C. Hwang and A. C.-H. Wu, ?A predictive system shutdown method
for energy saving of event-driven computation,? in Proc. IEEE/ACM
Int. Conf. Computer-Aided Design, 1997, pp. 28?32.
[23] M. C. Johnson and K. Roy, ?Datapath scheduling with multiple supply
voltages and level converters,? ACMTrans. DesignAutomationElectron.
Syst., vol. 2, no. 3, pp. 227?248, 1997.
[24] N. P. Jouppi, ?Cache write policies and performance,? in Proc. Int.
Symp. Computer Architecture, 1993, pp. 191?201.
[25] H. Kasahara and S. Narita, ?Parallel processing of robot-arm con-
trol computation on a multimicroprocessor system,? IEEE J. Robot.
Automat., vol. RA-1, no. 2, pp. 104?113, 1985.
[26] D. Kirovski, C. Lee, W. Mangione-Smith, and M. Potkonjak,
?Application-driven synthesis of core-based systems,? in Proc.
IEEE/ACM Int. Conf. Computer-Aided Design, 1997, pp. 104?107.
[27] , ?Synthesis of power efficient systems-on-silicon,? in Proc. Asia
and South Pacific Design Automation Conf., 1998, pp. 557?562.
[28] U. Ko, P. T. Balsara, and A. K. Nanda, ?Energy optimization of
multilevel cache architectures for RISC and CISC processors,? IEEE
Trans. VLSI Syst., vol. 6, no. 2, pp. 299?308, 1998.
[29] M. Krunz and S. K. Tripathi, ?On the characterization of VBR MPEG
streams,? in Proc. ACM Int. Conf. Measurement and Modeling of Com-
puter Systems (SIGMETRICS 97), 1997, pp. 192?202.
[30] C. Kulkarni, F. Catthoor, and H. De Man, ?Code transformations for
low power caching in embedded multimedia processors,? in Proc. Int.
Parallel Processing Symp., 1998, pp. 292?297.
[31] P. E. Landman and J. M. Rabaey, ?Architectural power analysis: The
dual bit type method,? IEEETrans.VLSISyst., vol. 3, no. 2, pp. 173?187,
1995.
[32] , ?Activity-sensitive architectural power analysis,? IEEE Trans.
Computer-Aided Design, vol. 15, no. 6, pp. 571?587, 1996.
[33] E. L. Lawler and C. U. Martel, ?Scheduling periodically occurring tasks
on multiple processors,? Inform. Process. Lett., vol. 12, no. 1, pp. 9?12,
1981.
[34] M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, ?Power analysis
and minimization techniques for embedded DSP software,? IEEE Trans.
VLSI Syst., vol. 5, no. 1, pp. 123?135, 1997.
[35] Y.-R. Lin, C.-T. Hwang, and A. C.-H. Wu, ?Scheduling techniques for
variable voltage low power designs,? ACM Trans. Design Automation
Electron. Syst., vol. 2, no. 2, pp. 81?97, 1997.
[36] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey, ?A voltage
reduction technique for digital systems,? in Proc. IEEE Int. Solid-State
Circuits Conf., 1990, pp. 238?239.
[37] R. Muntz, J. R. Santos, and S. Berson, ?RIO: A real-time multimedia
object server,? Performance Eval. Rev., vol. 25, no. 2, pp. 29?35, 1997.
[38] W. Namgoong, M. Yu, and T. Meng, ?A high-efficiency variable-
voltage CMOS dynamic DC-DC switching regulator,? in Proc. IEEE
Int. Solid-State Circuits Conf., 1997, pp. 380?381.
[39] L. S. Nielsen, C. Niessen, J. Sparso, and K. van Berkel, ?Low-power
operation using self-timed circuits and adaptive scaling of the supply
voltage,? IEEE Trans. VLSI Syst., vol. 2, no. 4, pp. 391?397, 1994.
1714 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 18, NO. 12, DECEMBER 1999
[40] P. Panda and N. Dutt, ?Low power mapping of behavioral arrays to
multiple memories,? in Proc. IEEE Int. Symp. Low Power Electronics
and Design, 1996, pp. 289?292.
[41] P. Panda, N. Dutt, and A. Nicolau, ?Memory data organization for
improved cache performance in embedded processor applications,? in
Proc. Int. Symp. System-Level Synthesis, 1996, pp. 90?95.
[42] , ?Data cache sizing for embedded processor applications,? in
Design, Automation Test Eur., 1998, pp. 925?926.
[43] M. Pedram, ?Power minimization in IC design: Principles and appli-
cations,? ACM Trans. Design Automation Electron. Syst., vol. 1, no. 1,
pp. 3?56, 1996.
[44] S. Raje and M. Sarrafzadeh, ?Variable voltage scheduling,? in Proc. Int.
Symp. Low Power Design, 1995, pp. 9?14.
[45] R. S. Ratner, E. B. Shapiro, H. M. Zeidler, S. E. Wahlstrom, C. B. Clark,
and J. Goldberg, ?Design of a fault tolerant airborne digital computer,?
in Computational Requirements and Technology, vol. 2, SRI Final Rep.,
NASA Contract NAS1-10920, 1973.
[46] M. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, ?Predictive
system shutdown and other architectural techniques for energy efficient
programmable computation,? IEEE Trans. VLSI Syst., vol. 4, no. 1, pp.
42?55, 1996.
[47] A. J. Stratakos, S. R. Sanders, and R. W. Brodersen, ?A low-voltage
CMOS DC-DC converter for a portable battery-operated system,? in
Proc. Power Electronics Specialist Conf., 1994, vol. 1, pp. 619?626.
[48] C.-L. Su and A. M. Despain, ?Cache design trade-offs for power and
performance optimization: A case study,? in Proc. Int. Symp. Low Power
Design, 1995, pp. 63?68.
[49] K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y.
Watanabe, K. Matsuda, T. Maeda, and T. Kuroda, ?A 300 MIPS/W
RISC core processor with variable supply-voltage scheme in variable
threshold-voltage CMOS,? in Proc. IEEE Custom Integrated Circuits
Conf., 1997, pp. 587?590.
[50] V. Tiwari, S. Malik, and A. Wolfe, ?Power analysis of embedded
software: A first step toward software power minimization,? IEEE Trans.
VLSI Syst., vol. 2, no. 4, pp. 437?445, 1994.
[51] V. Tiwari and M. T.-C. Lee, ?Power analysis of a 32-bit embedded
microcontroller,? in Proc. Asia and South Pacific Design Automation
Conf., 1995, pp. 141?148.
[52] M. Weiser, B. Welch, A. Demers, and S. Shenker, ?Scheduling for
reduced CPU energy,? in Proc. USENIX Symp. Operating Systems
Design and Implementation, 1994, pp. 13?23.
[53] S. J. E. Wilton and N. P. Jouppi, ?CACTI: An enhanced cache access
and cycle time model,? IEEE J. Solid-State Circuits, vol. 31, no. 5, pp.
677?688, 1996.
[54] F. Yao, A. Demers, and S. Shenker, ?A scheduling model for reduced
CPU energy,? in IEEE Annu. Foundations Comput. Sci., pp. 374?382,
1995.
Inki Hong received the M.S. degree in computer science from Stanford
University, Stanford, CA, in 1994. He is currently pursuing the Ph.D. degree
in the Computer Science Department, University of California, Los Angeles.
He is with Synopsys, Inc., as an R&D Engineer in the Logic Synthesis
Department. His research interests include CAD of VLSI circuits, specializing
in logic synthesis and high-level and system-level synthesis of VLSI systems.
He has published more than 20 refereed papers.
Mr. Hong received the 1996 Chorafas Foundation Award and the 1997
Design Automation Conference Graduate Scholarship Award.
Darko Kirovski, for a biography, see p. 1326 of the September 1999 issue
of this TRANSACTIONS.
Gang Qu received the M.Sc. degree from the Computer Science Department,
University of California, Los Angeles, in 1997, where he currently is pursuing
the Ph.D. degree.
His research interests include intellectual property protection and core-based
system design.
Miodrag Potkonjak, for a biography, see p. 1326 of the September 1999
issue of this TRANSACTIONS.
Mani B. Srivastava received the M.S. and Ph.D. degrees from the University
of California, Berkeley.
He is an Associate Professor of electrical engineering at the University of
California, Los Angeles. He worked for several years in Networked Com-
puting Research at Bell Labs. His current research is on wireless multimedia
networking, networked embedded systems, reconfigurable systems, low-power
systems, and gigabit wireless with funding from NSF, and DARPA?s GloMO,
SenseIT, and NGI programs. He has received several patents and has published
extensively in wireless networking, low-power systems, and system-level
tools.
Dr. Srivastava?s recent awards include an Okawa Foundation Grant and an
NSF CAREER Award.