ABSTRACT

Title of dissertation: RUNTIME ADAPTATION IN EMBEDDED COMPUTING SYSTEMS USING MARKOV DECISION PROCESSES

Adrian E. Sapio, Doctor of Philosophy, 2019

Dissertation directed by: Professor Shuvra S. Bhattacharyya, Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies, University of Maryland, College Park

During the design and implementation of embedded computing systems (ECSs), engineers must make assumptions on how the system will be used after being built and deployed. Traditionally, these important decisions were made at design time for a fleet of ECSs prior to deployment. In contrast to this approach, this research explores and develops techniques to enable adaptation of ECSs at runtime to the environments and applications in which they operate. Adaptation is enabled such that the usage assumptions and performance optimization decisions can be made autonomously at runtime in the deployed system.

This thesis utilizes Markov Decision Processes (MDPs), a powerful and well established mathematical framework for decision making under uncertainty, to control computing systems at runtime. The resulting control is performed in ways that are more dynamic, robust and adaptable than alternatives in many scenarios.

The techniques developed in this thesis are first applied to a reconfigurable embedded digital signal processing system. In this effort, several challenges are encountered and resolved using novel approaches. Through extensive simulations and a prototype implementation, the robustness of the adaptation is demonstrated in comparison with the prior state of the art.

The thesis continues by developing an efficient algorithm for the conversion of MDP models to actionable control policies, a required step known as solving the MDP. The solver algorithm is developed in the context of ECSs that contain general purpose embedded GPUs (graphics processing units). The novel solver algorithm, Sparse Parallel Value Iteration (SPVI), makes use of the parallel processing capabilities provided by such GPUs, and also exploits the sparsity that typically exists in MDPs when they are used to model and control ECSs.

To extend the applicability of the runtime adaptation techniques to smaller and more strictly resource constrained ECSs, another solver, Sparse Value Iteration (SVI), is developed for use on microcontrollers. The method is explored in a detailed case study involving a cellular (LTE-M) connected sensor that adapts to varying communications profiles. The case study reveals that the proposed adaptation framework outperforms a competing approach based on Reinforcement Learning (RL) in terms of robustness and adaptation, while requiring comparable resources.

Finally, the thesis concludes by analyzing the various logistical challenges that exist when deploying MDPs on ECSs. In response to these challenges, the thesis contributes an open source software package to the engineering community. The package contains libraries of MDP solvers, parsers, datasets and reference solutions, which provide a comprehensive infrastructure for exploring the trade-offs among existing embedded MDP techniques, and for experimenting with novel approaches.

RUNTIME ADAPTATION IN EMBEDDED COMPUTING SYSTEMS USING MARKOV DECISION PROCESSES

by
Adrian E. Sapio

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2019

Advisory Committee:
Professor Shuvra Bhattacharyya, Chair/Advisor
Professor Marilyn Wolf
Professor Rajeev Barua
Professor Richard La
Professor Balakumar Balachandran

Table of Contents

1 Introduction
2 Reconfigurable Digital Channelizer Design Using Factored Markov Decision Processes
  2.1 Introduction
  2.2 Background and Related Work
  2.3 Reconfigurable Channelizer Design
    2.3.1 Polyphase DFT Filter Bank
    2.3.2 Tunable Polyphase Decimation Filter
    2.3.3 Summary of Processing States and Their Properties
  2.4 MDP-Based Channelizer Control
    2.4.1 Multiobjective Rewards
    2.4.2 MDP Solver and Policy
    2.4.3 Transition States
    2.4.4 Factorization
    2.4.5 Stability
  2.5 Results
    2.5.1 Comparison with Manually Generated Policies
    2.5.2 Comparison with mHARP
    2.5.3 Trade-offs in Modeling Transition States
  2.6 Conclusions and Future Work
3 Efficient Solving of Markov Decision Processes on GPUs using Parallelized Sparse Matrices
  3.1 Introduction
  3.2 Background and Related Work
    3.2.1 MDP Solvers
      3.2.1.1 POMDPs and Approximate Solvers
    3.2.2 Sparsity
  3.3 Method
    3.3.1 Parallelization
    3.3.2 Sparsity
    3.3.3 SPVI Algorithm
      3.3.3.1 State Transition
      3.3.3.2 Action Selection
      3.3.3.3 Stopping Criteria
  3.4 Applications
    3.4.1 Survey
    3.4.2 Multivariate State Spaces
    3.4.3 Sequential Decision Problems
    3.4.4 Case Study: Solar Powered Computing
      3.4.4.1 State Space and Action Space
      3.4.4.2 State Transition Matrix
    3.4.5 Case Study: MDP-Based Reconfigurable Channelizer System (MRCS)
  3.5 Results
    3.5.1 Experimental Setup
    3.5.2 Algorithm Comparison
      3.5.2.1 Thrust-VI
    3.5.3 Measurements
  3.6 Conclusion
4 Runtime Adaptation in Wireless Sensor Nodes Using Structured Learning
  4.1 Introduction
  4.2 Background and Related Work
    4.2.1 Reinforcement Learning
      4.2.1.1 Stability
  4.3 Survey of Compact MDP Models
  4.4 Method
    4.4.1 Structured Learning
    4.4.2 Temporal Difference Equations
    4.4.3 Sparse Value Iteration (SVI)
      4.4.3.1 State Transition
      4.4.3.2 Action Selection
      4.4.3.3 Stopping Criteria
  4.5 Application
    4.5.1 LTE-M
  4.6 MDP-Based Control
    4.6.1 State and Action Spaces
    4.6.2 Rewards
    4.6.3 Structured Learning Controller
      4.6.3.1 Sensing Application
      4.6.3.2 Packet Queue
      4.6.3.3 LTE-M Modem
    4.6.4 Q-Learning Controller
  4.7 Simulation
  4.8 Implementation
    4.8.1 Experimental Setup
    4.8.2 Matrix Format
    4.8.3 Measurements
      4.8.3.1 Memory Usage
      4.8.3.2 Computation
    4.8.4 Adaptation Overhead
  4.9 Conclusion
5 GEMBench: A Platform for Collaborative Development of GPU Accelerated Embedded Markov Decision Systems
  5.1 Introduction
  5.2 Survey
    5.2.1 Solver Implementations
    5.2.2 Datasets
    5.2.3 Benchmarking Platforms
    5.2.4 Dimensionality of Rewards
  5.3 GEMBench
    5.3.1 Selection Criteria
    5.3.2 Hardware and Operating System
    5.3.3 Solvers
    5.3.4 Datasets
    5.3.5 Measurements
  5.4 Experiments
  5.5 Conclusion
6 Conclusions and Future Work
Bibliography

Acknowledgments

I owe my gratitude to all the people who have made this thesis possible. First, I would like to thank my advisor Professor Shuvra Bhattacharyya, for giving me the opportunity to work on challenging and interesting projects over the past years. I would also like to thank Professor Marilyn Wolf, whose insight and guidance on various matters proved very helpful along the way. The deepest thanks of all go to my wife Mayrola, for without her support and sacrifice over a countless number of nights and weekends this work would never have been completed. Finally, I would like to thank my parents, for sending me off in the right direction and providing unconditional support.

This research was supported in part by the Computer and Network Systems Program of the U.S. National Science Foundation under Grant Nos. CNS1514425 and CNS151304.

Chapter 1
Introduction

This thesis presents techniques developed to achieve runtime adaptation in resource constrained embedded computing systems (ECSs). Adaptation is defined herein as strategic decision-making in the control of one or more of the following: algorithm selection (among a set of available algorithms), parameter selection from a set, and partial or complete reconfiguration of programmable digital logic from a set of available logic configurations. This work seeks to leverage the computational and reconfiguration capabilities of modern embedded computing platforms to develop systems that can better adapt to the environment in which they are operating. Adapting to the environment using an effective system-level reconfiguration framework (SLRF) can help these systems operate more effectively, e.g., with improved trade-offs among achievable data rate, latency, and energy efficiency.

During the design of embedded computing systems, engineers must make assumptions on how the system will be used after being built and deployed. These assumptions are typically used to balance multiple conflicting performance objectives, and they have a significant effect on how well the system ultimately performs. This research seeks to explore and develop techniques to help advanced systems better adapt at runtime to the environments and applications in which they operate. The goal is to embed novel algorithm implementations that can be deployed alongside the application functionality, with the purpose of enabling the system to continually reconsider and update its own operational trade-offs autonomously.

The framework includes multiobjective optimization at its core, embracing the multifaceted nature of embedded computing design, where making strategic trade-offs among conflicting goals is critical. Additionally, the framework allows accumulated knowledge and beliefs to be represented probabilistically, which is a crucial requirement since the real-world environments in which embedded computing systems operate are typically not deterministic.
The alternative to runtime adaptation is a system in which the runtime behavior is fully specified at design time, prior to deployment. The disadvantage of this approach is that it forces engineers to make assumptions about the environments in which their systems will be deployed, which is often not feasible, or not practical for widely deployed systems that encounter a wide range of environments. Consider, for example, a cellular connected device that might have one unit deployed in a location with very strong cellular coverage and another with weak cellular coverage. In this scenario, if the system performance could be optimized by a cellular communications modem configuration, an engineer might be forced to select one configuration for both installations that would be either better suited for one over the other, or a compromise between both that would be sub-optimal in either case. As thousands or even millions of embedded computing systems are deployed in ubiquitous ways throughout all areas of society, manual optimization of each individual unit in this manner becomes infeasible. If, on the other hand, a runtime adaptation framework is used, the systems can be programmed to self-optimize toward high-level goals such as maximizing battery life and maintaining a reliable communication link. In this way, they can autonomously make effective trade-off decisions for each individual installation, after being deployed and even in response to changing conditions over time. Environmental properties such as the quality of cellular coverage could have a short term, time-dependent variation, such as congestion during peak hours of the day. Environmental properties could also have a long term, time-dependent variation, such as improvement due to new cellular towers being installed, or congestion as a community grows and more users populate the coverage area.

The design of algorithms to control resource constrained computing systems effectively at runtime has been a topic of active research for at least 20 years. A good survey of early work in this area can be found in [1]. This survey reviews a wide range of techniques, including fixed threshold-based approaches, dynamic approaches including stochastic controllers using dynamic programming, as well as some guidance on how MDPs can be used in this context. Since the time period when that survey was written, a variety of approaches to this research problem have flourished. Some researchers have sought to formulate the design challenge as a constrained optimization problem [2]. Other researchers focused on modeling the dynamics of the system's energy consumption, and simplifying the control decisions to be simple threshold-based comparisons with respect to the energy budget (e.g., see [3, 4]). Another popular approach has been to model the system as linear in the context of feedback control systems and then use Model Predictive Control (MPC) theory to modulate a processing duty cycle (e.g., [5, 6]).

All of these approaches were shown to be successful in their respective case studies, but they share some common limitations when considered for use in other cases. For example, several of these approaches assume deterministic behavior from the system being controlled. These approaches model the behavior of the system in response to some actuation, and assume the system will always behave the same way.
However, many ECSs have some stochastic behavior, either due to complex unmodeled dynamics or due to being affected by external factors that are difficult to predict. A second common limitation is the assumption that the system being controlled can be modeled as a linear system. Computing systems often do not behave in linear ways, and attempts at formulating linear approximations to non-linear behavior are limited to only the simplest non-linearities, which significantly constrains the overall applicability and generality of this approach. A third common limitation is that the dynamics of the system being controlled often need to be well understood at design time. For many computing systems whose behaviors depend heavily on external factors, this can be an unrealistic assumption.

As efforts in this area progressed, the paradigms shifted from classical control systems theory to various forms of adaptive algorithms, and then to more generalized approaches that researchers have termed self-configuration, self-optimization and, most recently, self-awareness [7]. A wide ranging survey of these works, and an organization of them into these various self-X categories, can be found in [8]. In that work, researchers define self-awareness as "attributes in a system that enable it to monitor its own state and behavior, as well as the external environment, so as to adapt intelligently". Another definition can be found in [9]: "self-aware computing describes a novel paradigm for systems and applications that proactively maintain knowledge about their internal state and its environment and then use this knowledge to reason about behaviours".

One of the most promising directions for creating these self-awareness attributes in ECSs is the use of Markov Decision Processes (MDPs), a powerful and well established mathematical framework for decision making under uncertainty. The SLRF proposed in this thesis makes use of MDPs: they have shown success in this area because they are inherently capable of modeling stochastic behavior and non-linear responses, and they are also well equipped to deal with incomplete models and uncertainty. This allows decisions to be encapsulated into a policy-based framework, where the decision options and the criteria for making those decisions are defined, and the adaptation problem is shifted to an optimal policy design problem.

Following this approach, the design effort can be decomposed into at least the following sub-problems, which must be addressed in the design of the proposed form of embedded decision agent.

- Modeling the available system capabilities in terms of how they help to accomplish the high-level goals.
- Determining trade-offs associated with these capabilities.
- Modeling the effects that the exogenous environment has on these goals.
- Predicting the future behavior of the environment based on past statistics, and anticipating the effectiveness of alternative available functionalities in the future environment.

In this thesis, a methodology is presented that addresses these sub-problems, analyzes the effectiveness of the resulting adaptive systems, compares the performance to competing approaches, and explores numerous issues that arise in the process.

In Chapter 2, the methodology is applied to the control of digital filtering and decimation operations commonly found in the physical layer Digital Signal Processing (DSP) of wireless communication systems.
In this work, several challenges were encountered and resolved using novel approaches. These include the use of transition states to accurately model a wide range of the processing system dynamics, a scalarization approach to employ the MDP within a multiobjective optimization framework, and the factorization of system states into internal and external groupings for efficient representation and embedded computation. Additionally, the concept of factorization is applied to reduce the storage size of the MDP model and the execution time of the policy generation algorithms.

The limitation of the work in Chapter 2 is that the policy that is analyzed is an immutable policy that is generated once, offline; there is no runtime adaptation in this approach. However, the work serves to establish the feasibility of the framework as a viable tool for runtime adaptation of the stated task. Continuing in this spirit, the work presented in Chapter 3 begins addressing the challenges of deploying time-varying control policies under the MDP-based framework. The work shows that sparsity, a characteristic of MDPs that arises when they are used in this context, is an important concept to exploit for efficient and practical policy generation within embedded computing systems at runtime. Then, the parallel processing capabilities of modern embedded graphics processing units (GPUs), such as those used in smartphones, robotics and automobiles, are leveraged to generate control policies in faster and more efficient ways than the prior state of the art.

Chapter 4 extends the applicability of the runtime adaptation methodology to a wider class of embedded computing systems. This extension is achieved by developing a variation on the approach that is suitable for smaller microcontroller-based systems (without GPUs), and then presenting a detailed design and implementation of an adaptive controller in such a system. The controller is compared to Reinforcement Learning (RL) based techniques on a range of performance criteria. The comparison of the proposed method with RL-based techniques is generalized, exploring when one approach is better suited than the other, and also exploring the system aspects that affect this choice.

Following these developments, Chapter 5 contains detailed surveys and a reflection on the state of affairs encountered in the endeavors of Chapters 2-4. Limitations and hindrances that were found are identified, such as the lack of available benchmarking data and the widespread incompatibility between MDP file formats. In an attempt to remedy some of these issues for future research efforts, a common embedded MDP development platform is proposed by authoring an open-source software package and pairing it with a family of popular off-the-shelf development hardware boards. The software package has recently been released to the academic community, and feedback has been solicited from communities of interest.

Finally, Chapter 6 contains conclusions from this research and also discussion of areas for future work.

Chapter 2
Reconfigurable Digital Channelizer Design Using Factored Markov Decision Processes

2.1 Introduction

Digital channelizers are critical subsystems in wireless communication systems that are employed when a multiplexed signal contains information in different frequency subbands, and the application requires separating one input signal (containing multiple subbands) into one or more output signals (each containing a subset of the input subbands) [10].
This function is commonly required in cognitive radio systems [11, 12]. In this chapter, the reconfiguration capabilities of modern embedded platforms are leveraged to develop digital channelizers that can better adapt to the environment in which they are operating. The material in this chapter was published in preliminary form in [13, 14].

Adapting to the environment using an effective system-level reconfiguration framework (SLRF) can help these systems operate more effectively, e.g., with improved trade-offs among achievable data rate, latency, and energy efficiency. For this purpose, MDPs are applied in novel ways to make dynamic decisions on maintaining or adapting signal processing configurations during channelizer operation. An MDP-based SLRF is proposed to develop dynamic reconfiguration policies for use in stochastic environments in which adaptation of hardware/software configurations for digital channelizer processing is strategic.

While the SLRF techniques are developed in this chapter with a specialized focus on digital channelizer implementation, the results suggest that the underlying MDP techniques are applicable across many other types of embedded signal processing systems (ESIPs). Exploring the generalization of the SLRF for broader classes of ESIPs is therefore a useful direction for future work.

The MDP-based approach for digital channelizer design optimization results in increased robustness when used to periodically re-optimize the system policy specifically for the external environment in which it is being used. This periodic re-optimization can be done completely autonomously by an embedded signal processor, without any need for human-in-the-loop intervention. The information that these design optimization methods require is completely observable by the system at runtime.

The results in this chapter show that MDPs are useful tools for controlling resources in computing systems. Additionally, two innovations are introduced that significantly enhance the effectiveness of MDPs for channelizer design optimization.

First, a mechanism is added to address hardware/software codesign scenarios that involve multidimensional design objectives and constraints, which are commonly encountered in transceiver system design. This is done through a multidimensional framework for the definition of the MDP reward function.

Second, a concept called transition states is introduced to represent intermediate states (between distinct channelizer configurations) in the target system. The transition states are applied in scenarios where commanding a state change can result in one or more time steps (frames) where the system is in a non-productive transition mode. Since being in transition from one state to another can result in missing real-time deadlines for processing requests, the control policy must choose carefully when to command a transition, and only seek to do so when the end result will be a net positive for the system in the long run, in spite of any short-term negative effects due to the transition frames. Such incorporation of transition states within the SLRF extends its utility to a broader class of applications, including channelizers, where transitions between productive states must be taken into account for accurate assessment and optimization of dynamic reconfiguration control.

The work in this chapter continues by applying a methodology developed in [15] to transform an MDP into a factored MDP.
This concept addresses a problem that frequently occurs with MDPs: the number of possible states of the model can be extremely large. As detailed in [16], a major motivation behind factored representations is that some parts of this large state space generally do not depend on each other, and this independence can be exploited to derive a more compact representation of the global state. In this thesis, factorization serves to reduce the storage size of the MDP model and the execution time of the policy generation algorithms. Such advancements are critical enablers for the direction of this work: deploying the modeling framework and policy generation algorithms on embedded systems.

When the framework and algorithms are integrated with the application on an embedded platform, they can be used to perform periodic re-optimization of the reconfiguration policies in addition to applying the policies to manage system configurations. To be practical in resource constrained and power constrained embedded environments, the deployed implementations of the modeling framework and policy generation algorithms must be carefully optimized so that they consume minimal amounts of storage and impose minimal computational burden. The application of factored MDP techniques in this chapter is an important step towards these objectives.

Next, the findings of an expanded performance analysis of the proposed methodology are detailed. Specifically, a suite of competing control policies are described and compared objectively with the MDP-based techniques. The results show that the MDP-based techniques outperform the alternative schemes in nearly all cases.

Finally, a trade-off analysis of the costs and benefits of including transition states in the framework is presented. This exploration details and quantifies the design time modeling costs of transition states in both storage size and execution time. These costs are then contrasted with the benefits in the form of the run-time performance when transition states are included versus when they are not.

The remainder of the chapter is organized as follows. A cursory review of the history and development of channelizers and MDPs is presented in Section 2.2. In Section 2.3, the signal processing application and the algorithms involved are detailed. In Section 2.4, the MDP-based approach is introduced, along with an illustration of how it is applied to the signal processing application. That is followed by Section 2.5 with a summary of the simulations performed and the resulting data and observations that were made. The chapter concludes in Section 2.6 with a discussion of future work on the use of MDPs in channelizers.

2.2 Background and Related Work

A digital channelizer can be generalized as having the inputs and outputs shown in Figure 2.1. Without loss of generality, the inputs and outputs are represented as frame-based vector quantities, with time decomposed into fixed-width slots referred to as frames. The frame arrival rate is constant and the stream of incoming frames is never ending. A channelizer is often a subsystem of a larger signal processing system. The channelizer is commanded by higher-level elements of the larger signal processing system on a per-frame basis. These higher level elements determine which subchannels need to be produced and which do not. An example of such a channelizer framework can be found in the cognitive radio of [17].
In that application, a channelizer is used to isolate sub-bands within some wireless spectrum dynamically. This dynamic behavior involves consuming a wideband signal, and applying digital filters and rate-changing operations to produce an output that contains some subset of the input signal frequencies.

In Figure 2.1, for each frame n of data, x(n) is a length N complex vector of the wideband input signal. This data is presented to the channelizer alongside CR(n), a length N_C binary vector that provides the channelization request for that frame. The channelizer outputs N_C parallel output data vectors,

y_\ell^{(n)}, \quad \ell = 1, 2, \ldots, N_C.   (2.1)

Each of these vectors contains a channelized subset of the input.

Figure 2.1: Channelizer inputs and outputs (wideband input and channelization request in; channelized outputs out).

Good surveys of popular digital channelizer architectures to date are found in [10, 18, 19]. The most common architectures are based on the Cosine Modulated Filter Bank (CMFB), Discrete Fourier Transform Filter Bank (DFTFB) and Per-Channel Filter Bank (PCFB). Aside from these well-established architectures, several other interesting designs for application-specific channelizers can be found in [20, 21, 22, 23].

As illustrated in [11, 24], the channelizer is often one of the most computationally intensive and power consuming blocks of cognitive radio transceivers, mainly due to its need to run at the highest data rates. For this reason, several researchers have sought to create channelizer designs where the key parameters that control the processing (e.g., filter coefficients, data rates, and subchannel masks) are configurable at run-time [25, 26, 27]. This class of DSP systems is referred to as "reconfigurable channelizers", and the active body of DSP research is evidence of the importance of optimizing channelizer processing for exactly what is required, and nothing more. The goal is generally to improve efficiency by increasing processing productivity, while simultaneously decreasing energy consumption.

The body of prior work referenced above provides a number of efficient channelizer designs that can be flexibly configured for different trade-offs. However, this body of work does not address how or when the configurable parameters are changed, nor does it provide policies for changing them at run-time. In this chapter, MDP-based methods are developed to bridge this gap.

Other researchers have sought to use MDPs with similar goals. Wei et al. have demonstrated the effective use of an MDP to control the processing rate of a network router [28]. This work created a Markov model of only the external environment, not the system under control. In contrast, the SLRF proposed in this chapter incorporates Markov models of both the controlled system and the external environment, which provides a more comprehensive foundation for dynamic adaptation.

Hsieh et al. [3] devise a scheduling policy that selects among alternative implementations of common functions, such as FFTs. The alternative options accomplish functionally the same operation, but with different execution times, power demands, and hardware requirements. As in the SLRF proposed here, Hsieh's approach uses an algorithm to make reconfiguration decisions based on what requests are placed on the system at runtime.
However, in Hsieh's approach, these requests are converted to a time series signal, smoothed using a moving average filter, and then compared to thresholds in order to derive reconfiguration decisions. The designer must commit to a smoothing factor on the incoming requests, and effectively assume a priori some of the resulting dynamics of the system.

Compared to Hsieh's methods, the SLRF proposed here takes a very different approach by transforming both the system and the operating environment into stochastic models, which can then be reasoned upon within the framework of MDPs. In contrast to the approach of Hsieh, there are no a priori trade-offs on the smoothing of incoming requests. Furthermore, instead of condensing the observable data into one-dimensional signals, larger conditional probability tables are maintained. Thus, the algorithms in the SLRF can incorporate more knowledge into the decision framework. By incorporating historical transition probabilities, the MDP is able to infer in real-time whether a new request is likely to be the start of an event that should be acted upon, or is more likely a spurious request that is better ignored. This inference can be performed immediately and without the delay associated with the step response through a smoothing filter.

2.3 Reconfigurable Channelizer Design

In this section, a reconfigurable digital channelizer design is presented that forms the foundation for an MDP-based, adaptive channelization system. The system is detailed in Section 2.4, and demonstrated experimentally in Section 2.5. The channelizer system is implemented on the Silicon Labs EFM32GG, a small and low power ARM Cortex M3-based microcontroller. The processor runs on the EFM32 STK3700 development kit, which houses the CPU as well as sophisticated energy monitoring circuitry. For this hardware, a channelizer width of N_C = 8 subchannels is used in an illustrative experiment.

This particular channelizer system is developed with applicability to wireless sensor networks, which impose challenging constraints on energy consumption and resource utilization. However, with its foundation in MDP techniques, the design methodology is not specific to any particular domain of channelization applications. For example, the methodology can be adapted to large scale, high performance channelization scenarios that involve dozens or hundreds of subchannels and require the use of FPGAs or GPUs to run in real-time. Developing such adaptations for these additional classes of processing platforms is an interesting area for future investigation.

To examine the ability of the system to adapt to its environment, two separate use cases are considered, which are referred to as A and B. Additionally, multiple scenarios are created within those use cases by varying parameters of the application that are understood to be time-varying. Two separate channelizers are designed, one ideally suited for each use case, as detailed in Section 2.3.1 and Section 2.3.2. A reconfiguration policy is then derived using the SLRF with the decision-making authority to select which channelizer algorithm to use at any given time. Additionally, the algorithms contain configuration parameters, and the SLRF is given control of these parameters. This results in a unified controller for reconfiguration, dynamic power management, and runtime parameter optimization.

2.3.1 Polyphase DFT Filter Bank

Use Case A is the application in [17]. In that system, the requests are modeled as i.i.d.
(independent and identically distributed) Bernoulli across both the time and subchannel dimensions. These statistics for the requests mean that there is no opportunity to anticipate the request vector. For such an environment, a sensible option is a filter bank that outputs all subchannels at all times, in the most efficient way possible. For this, a polyphase implementation of the canonical Discrete Fourier Transform Filter Bank (DFTFB) described in [10] is used.

To implement this DSP block, a low pass filter is designed to be used as the "prototype" filter in the filter bank. The filter has a passband width of one eighth of the full spectrum, since there are eight equally spaced channels. The filter coefficients are chosen using the Equiripple FIR design method detailed in [29]. The prototype filter is then shifted in frequency, decomposed into its polyphase components E_m(z), and implemented into the DFTFB, as described in [10]. A block diagram of the derived DFTFB is shown in Figure 2.2.

Figure 2.2: DFTFB block diagram, M = N_C (decimators feeding polyphase branches E_0(z) through E_{M-1}(z), followed by an inverse DFT).

The resulting magnitude response for each of the 8 outputs is shown in Figure 2.3. As can be seen from the magnitude responses of the 8 channelized outputs, this filter bank can simultaneously channelize all of the subchannels, and thus no tunable parameters are required for this algorithm.

Figure 2.3: DFTFB magnitude responses (magnitude response in dB versus w in rad/sample).

In order to optimize for bursts of communication activity as well as idle time, the controller has the ability to put the DFTFB in and out of a sleep mode. The DFTFB remains resident in the current configuration, and can be gated on and off very quickly. The gating off of the DFTFB corresponds to its sleep mode.

2.3.2 Tunable Polyphase Decimation Filter

Use Case B is the Sequential Sensing application in [30], where a channelizer with the same inputs and outputs as Use Case A is required. However, the request statistics imposed on this channelizer are quite different from those in Use Case A. In Use Case B, the channelizer is requested to produce only one output subchannel at a time. One or more frames (usually multiple frames) elapse between requests for different subchannels.

Since only one channel is requested at any given time, only a tunable decimation of the input data is needed, i.e., to filter out the unwanted subchannels. For this, a polyphase implementation of an 8-to-1 decimation (DCM) filter and mixer as described in [31] is employed, shown in Figure 2.4. The operation shown suppresses all but one subchannel of the incoming signal, and then uses a complex mixer to shift the extracted channel down to be centered at DC. Once centered at DC, a simple decimation of samples gives the resulting output stream.

Figure 2.4: DCM block diagram, M = N_C.

The same filter coefficients used for the prototype low pass filter of the DFTFB can be used in the DCM. Such a DCM design produces the same frequency response per subchannel. Prior to implementation, the polyphase technique detailed in [31] is utilized to reduce the runtime processing requirements further without changing the resulting filtering operation. The resulting subsystem is referred to as a polyphase decimation filter. Unlike the DFTFB, this configuration does have tunable parameters: the filter coefficients and mixing frequency.
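As a concrete illustration of the shared prototype filter and the two processing paths described above, the following Python sketch designs an equiripple prototype, forms the DFTFB polyphase components, and implements a non-polyphase reference version of the DCM path. The tap count and band edges are illustrative assumptions rather than the design values used in this work, and the polyphase restructuring of [31] is omitted.

```python
import numpy as np
from scipy.signal import remez, lfilter

M = 8                      # number of subchannels (N_C)
NUM_TAPS = 12 * M          # assumed prototype length, chosen as a multiple of M

# Equiripple prototype low-pass filter shared by the DFTFB and the DCM.
# With M equally spaced channels across the full complex bandwidth, the
# passband edge sits near fs/(2M); these band edges are assumptions.
fs = 1.0
h = remez(NUM_TAPS, [0, 0.8 * fs / (2 * M), 1.4 * fs / (2 * M), fs / 2], [1, 0], fs=fs)

# Polyphase components for the DFTFB of Figure 2.2:
# branch m uses E_m(z) with coefficients h[m], h[m + M], h[m + 2M], ...
E = h.reshape(-1, M).T     # E[m] holds the taps of E_m(z)

def dcm_extract(x, m):
    """Reference (non-polyphase) DCM path of Figure 2.4: band-pass filter
    H_m(z), mix subchannel m down to DC, then decimate by M."""
    n = np.arange(NUM_TAPS)
    h_m = h * np.exp(2j * np.pi * m * n / M)            # prototype shifted to channel m
    y = lfilter(h_m, 1.0, x)                            # isolate subchannel m
    t = np.arange(len(x))
    return (y * np.exp(-2j * np.pi * m * t / M))[::M]   # mix to DC and decimate
```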
Using 8 parameter sets, this algorithm can be modified to select any of the 8 subchannels, effectively acting as an efficient low-pass, band-pass or high-pass decimation filter. Both the filter coefficients and the amount of frequency shifting are tunable, as shown in the block diagram (Figure 2.4). The signal is first passed through a digital filter H_m(z), whose coefficients are specific to each channel m. Then, the filter output is shifted in frequency by multiplying it with a complex sinusoidal signal, whose frequency is also specific to each channel m. The formula used to generate the sinusoidal frequency is the exponential shown in the block diagram. This configuration is also designed to be kept in a sleep mode during periods of idle user activity.

An important implication of using digital filtering algorithms based on Finite Impulse Response (FIR) filters and not Infinite Impulse Response (IIR) filters in the DFTFB and DCM processing configurations is that the resulting signal processing algorithms are unconditionally numerically stable. IIR-based designs by necessity require careful positioning of the transfer function's poles, as well as attention to how those pole positions are affected by the numerical implementation of the filter in terms of quantization effects, fixed point field widths and the dynamic range of intermediate calculations. A thorough analysis of these concepts can be found in [29].

2.3.3 Summary of Processing States and Their Properties

The MDP framework requires an enumeration of the states that the processing system can be in at any time. The experimental embedded system has 13 states, which fall into the categories listed in Table 2.1. The first row of the table covers the states when the system is in a sleep mode, with either the DCM or DFTFB ready to run. The distinction is made between these as two separate states to allow the model to capture any difference in the time that it may take to re-enable the resident, already initialized algorithm out of sleep mode compared to switching to the other algorithm. Further discussion of these delays will be presented in Section 2.4.3.

The last two states, whose labels are prefixed with "Trans.", are states of being in transition to the DFTFB or DCM, respectively. The time required by the processing system to transition between states is an important detail in this framework. The incorporation of transition states into the MDP is a novel contribution of this work that is intended to take such transition times into account (detailed in Section 2.4.3). This concept of transition states allows an SLRF to compute decision paths involving transitions that can take multiple time frames to complete.

The third column of the table shows the number of channels provided by the system while in each state. Note that while in transition, the system is consuming power but not producing any channelized data. The fourth column of the table shows the CPU power consumed by the system in each state. These measurements were performed at design time by putting the processor into test modes created for this purpose. Each test mode loaded a single configuration and executed it at the experimental application's frame rate. With the processor operating in such a test mode, the Silicon Labs EFM32GG development tools allowed the power consumption of the associated state at the associated frame rate to be measured.
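A minimal sketch of the resulting state enumeration follows; the identifier names are assumptions introduced here for illustration and are not taken from the implementation.

```python
# Illustrative enumeration of the 13 processing states grouped in Table 2.1:
# two sleep states (one per resident algorithm), eight DCM states (one per
# selectable subchannel), one DFTFB state, and two transition states.
SLEEP_STATES = ["SLEEP_DCM", "SLEEP_DFTFB"]
DCM_STATES = ["DCM_CH%d" % m for m in range(8)]
DFTFB_STATES = ["DFTFB"]
TRANSITION_STATES = ["TRANS_TO_DFTFB", "TRANS_TO_DCM"]

PROCESSING_STATES = SLEEP_STATES + DCM_STATES + DFTFB_STATES + TRANSITION_STATES
assert len(PROCESSING_STATES) == 13
```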
It is clear from Table 2.1 that the DFTFB is the most productive configuration (producing all 8 subchannels), while being the most power hungry in its ON state. It is also clear from the table that the DCM algorithm represents a less productive configuration (producing only 1 subchannel) compared to the DFTFB, but with the benefit of reduced power consumption. If only one channel is requested for an extended period of time, then a rational controller should select the DCM configuration over the DFTFB during that time in order to conserve power. This means the controller must balance the short term penalty of a non-productive transition with the long term benefit of the presumably more favorable new state.

State Category | Number of States | Number of Channels Produced | Average Power
SLEEP          | 2 | 0 | 5.36 uW
DCM            | 8 | 1 | 7.61 mW
DFTFB          | 1 | 8 | 17.92 mW
Trans. DFTFB   | 1 | 0 | 10.25 mW
Trans. DCM     | 1 | 0 | 10.25 mW

Table 2.1: Categories of processing states and their properties.

It can be seen from Table 2.1 that the number of channels affects the number of states, and thus the size of the MDP state space. This has significant implications on the resources required to host an MDP-based control policy on the target system, and ultimately on the scalability of this approach to channelizers with more than 8 channels. This concept will be explored in detail in Section 2.4.4.

2.4 MDP-Based Channelizer Control

In this section, an SLRF for modeling reconfigurable channelizers is developed with the goal of generating run-time control policies that can be steered in terms of multidimensional operational objectives, including latency, throughput, and energy efficiency. The procedure is to first create a Markovian model of the system, and then use an MDP solver to generate a control policy from the developed system model. It is important to note that the system and the environments in which it operates need not be Markovian or even stochastic in nature; the Markovian assumptions are made as approximations expressly for the purpose of arriving at the control policy. These assumptions are validated by evaluating the resulting control policy on the real system (not the model) in its intended use case.

The resulting MDP-based dynamically reconfigurable channelizer is illustrated by the block diagram shown in Figure 2.5. The key feature of this system is that the channelization requests do not have direct control over the processing system. Rather, the channelization requests go only to the MDP-generated run-time control policy, which decides when and how to act on each specific request. The policy determines the best action to take, with the objective of maximizing the long-term average performance rather than acting solely on the basis of an immediate reward. To make this determination, the policy uses models of the application and processing system characteristics. The policy may decide to reconfigure the processing system immediately if that is assessed as the best decision, or, counterintuitively, it may decide to ignore a request that it predicts is a spurious request that would not justify a reconfiguration event.

Figure 2.5: Dynamically reconfigurable channelizer (the channelization request CR feeds the runtime control policy, which issues control actions to the processing system and observes its state; outputs y_1 through y_{N_C}).

The key components of the MDP underlying the reconfigurable channelizer system are the 4-tuple (S, A, STMs, R), where the components of this 4-tuple are respectively referred to as the system state space, action space, state transition matrices (STMs), and reward function.
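As a sketch of how these four components can be held in memory in a dense, unfactored layout, consider the following; the state and action counts match the example application described later in this section, and Section 2.4.4 addresses how the dominant STM storage cost can be reduced.

```python
import numpy as np

# Dense, unfactored layout of the 4-tuple (S, A, STMs, R). The sizes match
# the example application described later in this section (3328 states,
# 13 actions); Section 2.4.4 shows how the dominant STM cost is reduced.
NUM_STATES, NUM_ACTIONS = 3328, 13

S = np.arange(NUM_STATES)                                # enumerated system states
A = np.arange(NUM_ACTIONS)                               # enumerated actions
STM = np.zeros((NUM_ACTIONS, NUM_STATES, NUM_STATES))    # one stochastic matrix per action
R = np.zeros((NUM_STATES, NUM_ACTIONS))                  # scalar reward R(s, a)
```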
In some contexts a scalar discount factor and a probability distribution for the starting state are also included in this list, sometimes making it a 5-tuple or 6-tuple. However, in this discussion the 4-tuple definition is sufficient.

The state space S is defined by enumerating all possible states of the external requests imposed on the processing system (channelization requests), as well as a list of modes that the processing system can be in at any time (reconfiguration states), which were detailed in Section 2.3. The combination (product) of these two subspaces (external requests and processing modes) yields the state space of the channelizer system.

For the action space A, the MDP policy is given control over the reconfiguration decision, as well as selected parameter values within particular configurations. As a result, the action space consists of all the possible configurations and parameter values that can be commanded.

The STMs are a set of stochastic matrices that define the probability of the next state given the existing state, conditioned on a given action. There is one STM defined for each action. These matrices are obtained by multiplying together the independent statistics of the external channelization requests with the conditional statistics of the processing system's state transitions. The statistics of the channelization requests used to generate the STMs are given by the following equations:

P(CR_j \mid i) = \begin{cases} P_0(CR_j), & i = i_0 \\ P_1(CR_j), & i \neq i_0 \end{cases}   (2.2)

P_0(CR_j) = p_{start} \, P_D(CR_j) + (1 - p_{start}) \, \mathbf{1}_{\{j = i_0\}}   (2.3)

P_1(CR_j) = p_{stop} \, \mathbf{1}_{\{j = i_0\}} + (1 - p_{stop}) \, P_D(CR_j)   (2.4)

P_D(CR_j) = \rho^{\eta(j)} (1 - \rho)^{N_C - \eta(j)}   (2.5)

where i_0 is the state in which no processing requests are incoming (representing periods of inactivity), \eta(j) is the number of requested subchannels in the CR state j, \rho is a parameter used to simulate various levels of communication activity, and p_{start}, p_{stop} are used to simulate the system entering and exiting periods of inactivity. The statistics of the processing system used to generate the STMs are detailed in Section 2.4.3.
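The following sketch builds the channelization-request transition matrix directly from Equations 2.2 through 2.5, assuming that i_0 corresponds to the all-zeros request vector; it is an illustration of the model construction, not the code used in this work.

```python
import numpy as np
from itertools import product

N_C = 8
cr_states = [np.array(bits) for bits in product([0, 1], repeat=N_C)]
IDLE = 0   # index i_0: the all-zeros request vector (no subchannels requested)

def request_transition_matrix(rho, p_start, p_stop):
    """Channelization-request transition matrix from Equations 2.2-2.5.
    rho sets the level of communication activity; p_start and p_stop model
    entering and exiting periods of inactivity."""
    eta = np.array([s.sum() for s in cr_states])          # eta(j): requested subchannels
    P_D = rho ** eta * (1.0 - rho) ** (N_C - eta)          # Eq. 2.5
    P0 = p_start * P_D
    P0[IDLE] += 1.0 - p_start                              # Eq. 2.3
    P1 = (1.0 - p_stop) * P_D
    P1[IDLE] += p_stop                                     # Eq. 2.4
    T = np.tile(P1, (len(cr_states), 1))                   # Eq. 2.2: rows with i != i_0
    T[IDLE, :] = P0                                        # Eq. 2.2: the i = i_0 row
    return T
```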
2.4.1 Multiobjective Rewards

For the reward function R, the following methodology is used for incorporating multidimensional design objectives into an MDP-based channelizer design framework. Given a set X = \{x_1, x_2, \ldots, x_{N_R}\} of N_R evaluation functions for key performance metrics, a reward function R : S \times A \to \mathbb{R} is defined in terms of these metrics for each action in each state. Here, \mathbb{R} denotes the set of real numbers.

Each evaluation function x_i : S \times A \to \mathbb{R} is used to estimate system performance in terms of a specific implementation concern, such as average energy consumption, latency, or throughput. These estimation functions can be formulated at design time by using knowledge of the system and its available configurations, or measured at runtime by supporting instrumentation. The result of each evaluation function x_i is transformed by a mapping g_i : \mathbb{R} \to [0, 1], which is defined at design time for each metric. These transformations are introduced to normalize the performance metrics in order to allow them to be combined into the single scalar output of R. This kind of transformation and combination follows the scalarization approach to multiobjective optimization, as described in [32].

The combination of the transformed results of the evaluation functions is performed through a set of weights \{r_1, r_2, \ldots, r_{N_R}\}, one corresponding to each metric, such that

r_i \in [0, 1] \text{ for each } i, \quad \text{and} \quad \sum_{i=1}^{N_R} r_i = 1.   (2.6)

Determining these weights is a design-time aspect of the SLRF. The weights are determined once and then continually used to steer any executions of the solver to seek policies that achieve the desired prioritization of metrics in consideration of the observed external environment statistics.

Once the evaluation functions X, transformations \{g_i\}, and combination weights are determined, the reward function can be evaluated using Equation 2.7 for any given s \in S and a \in A:

R(s, a) = \sum_{i=1}^{N_R} r_i \, g_i(x_i(s, a))   (2.7)

In the experiments, the rewards are defined as follows. First, g_1 is defined as the normalized rate of successful channelization requests. This can be expressed as (\tau_r - \tau_d)/N_C, where \tau_r represents the total number of channelization requests input to the system during a given time interval \tau, and \tau_d represents the number of dropped requests (i.e., requests where there is a failure to produce the desired channel) during \tau.

g_2 is defined based on a formulation in [28] for the normalized power savings of an electronic system. Specifically, in order to normalize power consumption and treat it as a form of savings, the power consumption (x_2) is measured in each state and the minimum and maximum possible values are recorded. Then the power measurements are transformed relative to the maximum and minimum power that the system consumes over all of the possible states (g_2). The result is shown in Equation 2.8 and Equation 2.9:

g_2(x_2(s, a)) = \frac{x_{2,MAX} - x_2(s, a)}{x_{2,MAX} - x_{2,MIN}},   (2.8)

where x_2(s, a) \equiv \text{Power Consumed}(s, a), and

x_{2,MAX} = \max_{s', a'} \{x_2(s', a')\}, \quad x_{2,MIN} = \min_{s', a'} \{x_2(s', a')\}.   (2.9)

Note that this definition is consistent with the convention previously defined: the most power hungry state has g_2 = 0 (and thus is the least rewarded), while the least power hungry state has g_2 = 1 (and thus is the most rewarded). The combination of reward functions g_1 and g_2 effectively steers the MDP to find policies that are most productive at channelizing the incoming signal as per the channelization requests, while consuming as little power as possible on average.

2.4.2 MDP Solver and Policy

With the definitions and rewards described above, an off-the-shelf MDP solver can be employed to generate a policy that simultaneously seeks to maximize the rate of successful channelization requests while consuming the least energy possible, taking into account both the physical characteristics of the processing system and the independent statistics of the operating environment at the current time. In these experiments, the open source MATLAB solver MDPSOLVE [33] is used.

The resulting control policy has the form f : S \to A, i.e., a mapping from states into actions. This mapping can be implemented as a function or as a simple lookup table that is invoked or accessed once per frame, respectively. To execute the controller, the incoming request is combined with the current processing system state. The result is then used as an index to look up the operations involved in the next optimal control action.

In this example application, the total number of states is 3328 and the total number of actions is 13. For these quantities, the action can be encoded into 4 bits, and thus 2 encoded actions can be packed into 1 byte of storage. The result is a policy that can be packed into 1.6 kB. For the prototype hardware implementation, it was feasible to simply store the policy as a lookup table in RAM and index it to look up the next action.
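A sketch of this packed lookup-table representation is shown below; the helper names are illustrative, and the packing convention (low nibble for even state indices) is an assumption rather than a detail documented here.

```python
import numpy as np

NUM_STATES, NUM_ACTIONS = 3328, 13       # 4-bit action codes, two per byte

def pack_policy(actions):
    """Pack one 4-bit action code per state, two per byte: 3328 states fit in
    1664 bytes, the roughly 1.6 kB figure quoted above. `actions` holds
    integers in 0..12, one per state."""
    a = np.asarray(actions, dtype=np.uint8)
    return (a[0::2] | (a[1::2] << 4)).astype(np.uint8)

def lookup_action(packed, state_index):
    """Per-frame policy lookup f(s): fetch the byte, then select the nibble
    (low nibble for even state indices, high nibble for odd ones)."""
    byte = int(packed[state_index // 2])
    return byte & 0x0F if state_index % 2 == 0 else byte >> 4
```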
2.4.3 Transition States

In this design context, the processing system is typically a deterministic, controllable machine, such as a general purpose processor (GPP), programmable digital signal processor (PDSP), field programmable gate array (FPGA) or graphics processing unit (GPU). The proposed framework assumes that this type of processing system can be modified or reconfigured through the action decision of the MDP. By definition, in MDP frameworks the system is assumed to transition probabilistically from one state to another as a result of an action decision. This abstract probabilistic transition viewpoint is not immediately amenable to modeling the transitions of a deterministic processing machine. Rather, the resulting state changes in the processing system are better described as changes that are guaranteed to occur but can take some fixed or variable amount of time to complete. Additionally, a change may take longer than a single frame to complete. Some examples of the types of operations typically encountered in this context that must be accounted for are: (1) computation of the schedule for a dataflow graph before being able to execute it, (2) allocation of memory from an operating system heap when initializing algorithms, (3) the block copy of code or data from a slower, larger long-term storage to a smaller, faster location (e.g., a page fault), (4) the block copy of code from non-executable regions to executable regions (e.g., overlays), and (5) dynamic full or partial reconfiguration (DPR) of FPGA regions, to name a few.

To assign the required state transition probabilities in this context, suppose that the processing system receives action w in frame n while in state sp(n) = u, and that this state/action pair is known to deterministically transition the processing system to a new state v in an amount of time denoted as T_{u,v|w}, which need not be an exact multiple of the frame period T_F. If T_{u,v|w} < T_F, then the conditional State Transition Matrix for the processing system (SP_STM) is trivially computed by

SP\_STM_{i,j|w} = \begin{cases} 1, & j = v \\ 0, & \text{otherwise} \end{cases}   (2.10)

This represents a guaranteed (i.e., with probability 1) transition of the processing system to state v that completes before the start of the next frame. If, on the other hand, this transition takes longer than T_F, a new processing system state m is needed, which is defined as the state of being in transition from sp = u to sp = v. In this case, the conditional SP_STM matrix is calculated by

SP\_STM_{i,j|w} = \begin{cases} 1, & i = j = v \\ 1, & i = u, \; j = m \\ 1 - c, & i = j = m \\ c, & i = m, \; j = v \\ 0, & \text{otherwise} \end{cases}   (2.11)

where

c = \left[ \operatorname{floor}\!\left( \frac{T_{u,v|w}}{T_F} \right) \right]^{-1}.   (2.12)

For example, if the processing system transition takes 4.67 frames to complete and the action is held constant until the completion of the transition, then the system will begin transitioning immediately following the triggering action, and will remain in transition for 4.67 frames before arriving at the destination state. In this case, the conditional transition matrix states that with probability 1, the processing system will transition from the starting state to the transition state in the first frame, and then for each subsequent frame will remain in the transition state with probability 3/4, and will jump to the destination state with probability 1/4. This is exactly how the transition would appear to an agent who naively observes the processing state as a stochastic process during just the transition sequence. This agent would observe 3 non-transitions and 1 transition out of 4 trials.

The observations during the transition can be modeled as a Bernoulli random variable, as was done in [34] through the use of Bernoulli trials. Here, the two random outcomes are interpreted as those of remaining in transition and completing the transition. The Maximum Likelihood Estimator (MLE) of the Bernoulli parameter can then be shown to be exactly as given by Equation 2.12. For this reason, the Bernoulli probability mass function is given by the corresponding row of the conditional transition matrix, as expressed in Equation 2.11. With knowledge (or an estimate) of the transition time from each state/action pair in the model, the entire set of SP_STM matrices can be populated in this manner.
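The following sketch populates the rows of one conditional SP_STM described by Equations 2.10 through 2.12 for a single source/destination/transition-state triple; rows for other source states are filled analogously. It is an illustration of the construction, not the implementation used in this work.

```python
import numpy as np

def sp_stm_for_action(num_states, u, v, m, T_uv, T_F):
    """Rows of the conditional SP_STM for one action (Equations 2.10-2.12).
    u, v and m are the indices of the source state, destination state and
    transition state; T_uv is the reconfiguration time and T_F the frame
    period."""
    P = np.zeros((num_states, num_states))
    P[v, v] = 1.0                          # destination holds under this action
    if T_uv < T_F:
        P[u, v] = 1.0                      # Eq. 2.10: completes within one frame
    else:
        c = 1.0 / np.floor(T_uv / T_F)     # Eq. 2.12
        P[u, m] = 1.0                      # enter the transition state
        P[m, m] = 1.0 - c                  # remain in transition
        P[m, v] = c                        # complete the transition
    return P

# The 4.67-frame example: floor(4.67) = 4, so c = 1/4, i.e. remain in the
# transition state with probability 3/4 and complete with probability 1/4.
```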
Such an agent would observe 3 non-transitions and 1 transition out of 4 trials. The observations during the transition can be modeled as a Bernoulli random variable, as was done in [34] through the use of Bernoulli trials. Here, the two random outcomes are interpreted as those of remaining in transition and completing the transition. The Maximum Likelihood Estimator (MLE) of the Bernoulli parameter can then be shown to be exactly as given by Equation 2.12. For this reason, the Bernoulli probability mass function is given by the corresponding row of the conditional transition matrix, as expressed in Equation 2.11. With knowledge (or an estimate) of the transition time from each state/action pair in the model, the entire set of SP STM matrices can be populated in this manner.

2.4.4 Factorization

In this chapter, the MDP model and solver components are implemented and invoked at design time, in order to generate a control policy that is used at runtime. However, in order to implement runtime adaptation it is ultimately necessary to transfer the MDP model and solver to the target system such that the solver can be invoked periodically at run time. The solver can then be used to dynamically re-optimize the control policy in response to a changing external environment. Working towards this goal, in this section an analysis is performed of the required target platform resources necessary for embedded deployment of the MDP model and solver. The main aspects of resource utilization that are investigated here are (1) the size of the four MDP constructs (S, A, STMs, R) that need to be held in memory, and (2) the execution time of the MDP solver required to generate the control policy. In this context, there are significant advantages to adopting the Factored MDP approach developed in [15]. In that work, knowledge of the stochastic interdependencies between the state space variables is exploited to reduce both the memory requirements and the solver execution time.

In MDP problems, the state s ∈ S is constructed to model the problem the MDP is being applied to. Often this results in the state being an instantiation of a discrete multivariate random variable Z = (Z_1, Z_2, ..., Z_{N_Z}), with each variable Z_i taking on values in DOM(Z_i), where DOM(X) represents the set of admissible values of the random variable X. A state is a set of instantiations of the N_Z random variables, and can be written as a vector z ∈ DOM(Z). The size of the state space is defined by the cardinality of this set, which is denoted as |DOM(Z)|. As a result, each row of each transition matrix for an MDP has width |DOM(Z)|, and describes the probability of reaching all possible combinations of the set of variables (Z_1, Z_2, ..., Z_{N_Z}). MDPs with this kind of formulation are said to have a multivariate state space.

When an MDP's STMs are stored in a structured way that uses knowledge of the causal relationships between these state variables to reduce storage size, the MDP is said to be factored. From the empirical observations of this application, it can be seen that this method can reduce STM storage size considerably. However, it requires a specific conditional probability structure to be present in an MDP, and the data structures must be created by hand with specific knowledge of the exact structure. This can require a subject matter expert in the loop anytime a transition probability changes, which can complicate runtime autonomous solving of an MDP that changes over time in unknown ways. In general, this requirement can be problematic if the underlying structure is not fully understood. The effectiveness of the technique is acknowledged here, but the fact that it cannot always be used (for these reasons) is important as well.
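Independently of whether factorization is applied, an implementation must map each multivariate state z = (Z_1, ..., Z_{N_Z}) to a scalar row index of the STMs. A minimal sketch of the usual mixed-radix flattening is shown below; the function name and the 0-based value ranges are illustrative assumptions.

    /* Flatten a multivariate state z = (z_1, ..., z_n) into a single index,
     * where dom[i] = |DOM(Z_i)| and z[i] is in {0, ..., dom[i]-1}.
     * The total number of states is the product of the dom[i], which is
     * the width of each STM row. */
    static unsigned flatten_state(const unsigned *z, const unsigned *dom, int n)
    {
        unsigned idx = 0;
        for (int i = 0; i < n; i++) {
            idx = idx * dom[i] + z[i];   /* mixed-radix (row-major) encoding */
        }
        return idx;
    }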
Using the factorization approach, the state space of the channelizer can be represented as:

s = (CR_1, CR_2, \ldots, CR_{N_C}, CF_1, CF_2).    (2.13)

Here, CR_i is the channelization request for channel i, CF1 is the top-level processing configuration, and CF2 is the processing subconfiguration. The benefit of using this scheme is that it enables the explicit specification of the stochastic interdependencies of the variables within the state space. With this in mind, factored MDPs make use of Dynamic Bayesian Network (DBN) diagrams [35] to explicitly define and illustrate these dependencies.

Figure 2.6: Dynamic Bayesian network representation of the channelizer state space (channelization requests CR, processing configuration CF1, and processing sub-configuration CF2 at frames n and n+1).

A DBN diagram of the channelizer's STMs when conditioned on an MDP action is shown in Figure 2.6. Note that the (CR_1, CR_2, ..., CR_{N_C}) requests are grouped together into a single vector CR for conciseness. A stochastic dependency between two variables in the state space (from one time frame to the next) is denoted via the presence of an arrow between the dependent variables. The absence of an arrow denotes independence. Thus, the diagram shows that the joint probability distribution of the channelization requests is dependent only on the requests in the previous frame, and is independent of the processing configuration. The processing configuration is dependent only on the previous processing configuration (since reconfigurations are not instantaneous). However, this dependency is only on the top-level processing configuration (e.g., DCM, DFTFB, etc.) and not on the subconfiguration (e.g., the filter coefficients).

Knowledge of this underlying stochastic structure within the state space allows for considerable reduction of the size of the data structures required to store the MDP model. The effect on the largest of these components (the STMs) is highlighted here. Only the conditional probabilities with respect to the dependent variables need to be stored, rather than with respect to all variables, as would be necessary in an equally sized state space where the underlying stochastic structure is unknown. The factorization made possible by this knowledge is represented in Equation 2.14, where the superscript n is used to denote the time index. The rearrangements are made possible through (1) independence between the channelization request and the processing configuration, and (2) independence between the channelization request and the MDP action.

p(s^{(n+1)} \mid s^{(n)}, a^{(n)}) = p(cr^{(n+1)}, cf_1^{(n+1)}, cf_2^{(n+1)} \mid cr^{(n)}, cf_1^{(n)}, cf_2^{(n)}, a^{(n)})
                                   = p(cr^{(n+1)} \mid cr^{(n)}) \; p(cf_1^{(n+1)}, cf_2^{(n+1)} \mid cf_1^{(n)}, a^{(n)})    (2.14)

The resulting reduction in the number of elements in the STMs is shown in Equation 2.15. This reduction represents a significant savings. Note that the quantity shown in Equation 2.15 is the cardinality of the sets, which is a count of the number of elements regardless of what underlying data type is used for representation in the MDP model and solver algorithms. For example, if the data type used is a 16-bit or 32-bit representation, the total storage size would be 2 bytes or 4 bytes per element, respectively.
|S|^2 \, |A| \;\longrightarrow\; |DOM(CR)|^2 + |DOM(CF_1, CF_2)| \, |DOM(CF_1)| \, |A|    (2.15)
121.8 \times 10^6 \;\longrightarrow\; 66.3 \times 10^3

2.4.5 Stability

Once an MDP-generated policy is used to control the processing system, the stability of the resulting dynamics can be analyzed. There are two aspects of stability that arise in the context of this SLRF.

First, it is important to design the model of the processing system such that for a given action, the model is guaranteed to eventually transition to the desired processing state (after any transition delays and transient dynamics have settled) and to remain in that state indefinitely for as long as that action is held constant. For example, the action that configures the channelizer to be in the DFTFB state should eventually result in it reaching that state, and it should remain in that state indefinitely until another action is selected. This characteristic of the model aids stability and can be defined formally in the context of Markov chains, which allows it to be verified systematically at design time before attempting to generate the optimal policy for the MDP. Such a step of systematic model checking can aid in larger modeling efforts where the criteria may not be immediately obvious, as in the case of complicated transition dynamics including multiple transition states.

The product of terms format of Equation 2.14 that results from factorization allows the transition probabilities of the processing system to be specified separately from those of the external environment. This subset of the full STMs can be viewed as a smaller set of state transition matrices M_CF for the processing system only, not the external environment. For a given action, the processing system is modeled to transition according to this state transition matrix and thus forms a Markov chain with state vector CF. The stability criterion described above is equivalent to requiring that each transition matrix in M_CF qualify as an absorbing Markov chain as defined in [36], and additionally that only one absorbing state exists per action. Such a Markov chain is one where each state can reach the absorbing state, and once the absorbing state is entered it cannot be exited. In this SLRF, the way to leave the absorbing state is to change the action, which effectively changes the processing system transition probabilities and creates a new Markov chain that allows for transition out of the state. This absorbing characteristic can be inspected numerically in the model, by verifying that all of the M_CF matrices conform to these numerical restrictions.
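A minimal C sketch of such a check is shown below: for one action's M_CF matrix it verifies that exactly one absorbing state exists and that every state can reach it. The function name, dense row-major storage, and exact-equality test against 1.0 are illustrative assumptions.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Check the first stability criterion for one action's processing-system
     * transition matrix mcf (n x n, row-major): exactly one absorbing state
     * exists and every state can reach it.  Reachability is checked by
     * iteratively marking states that have a nonzero-probability edge into
     * the already-marked set. */
    bool is_valid_absorbing_chain(const double *mcf, int n)
    {
        int absorbing = -1;
        for (int i = 0; i < n; i++) {
            if (mcf[i * n + i] == 1.0) {           /* once entered, never exited */
                if (absorbing >= 0) return false;  /* more than one absorbing state */
                absorbing = i;
            }
        }
        if (absorbing < 0) return false;

        bool *reaches = calloc((size_t)n, sizeof(bool));
        if (!reaches) return false;
        reaches[absorbing] = true;
        for (bool changed = true; changed; ) {     /* propagate reachability */
            changed = false;
            for (int i = 0; i < n; i++) {
                if (reaches[i]) continue;
                for (int j = 0; j < n; j++) {
                    if (mcf[i * n + j] > 0.0 && reaches[j]) {
                        reaches[i] = changed = true;
                        break;
                    }
                }
            }
        }
        bool ok = true;
        for (int i = 0; i < n; i++) ok = ok && reaches[i];
        free(reaches);
        return ok;
    }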
The second aspect of stability that arises is that of the dynamics of the full MDP, which includes both the processing system and the external environment. The dynamics of the external environment are exogenous and uncontrollable, and thus not a concern from a stability viewpoint. However, the environment affects what the MDP-generated policy will contain, and by extension the resulting sequences of actuations that the MDP-generated policy exposes the processing system to. Given that the processing configurations are a finite set, instability could materialize as excessive reconfiguration back and forth between processing states that negatively affects the overall closed loop performance with regard to the defined performance metrics. The presence of such instabilities can be detected using the simulations introduced in Section 2.5. In the case of this form of instability, the designer can correct the dynamics by adjusting the reward weights r_i introduced in Section 2.4.1. A thorough analysis of why some reward functions lead to this form of behavior can be found in [37].

2.5 Results

To evaluate the effectiveness of the MDP-based Reconfigurable Channelizer System (MRCS), a simulation was developed with external requests that follow the statistics of the two use cases: IID for the i.i.d. requests of Use Case A (introduced in Section 2.3.1), and SEQ for the sequential sensing of Use Case B (introduced in Section 2.3.2). In the following sections three evaluations are performed. First, the results are compared against those of manually generated policies, which are considered representative of a typical approach used in industry at the time of this writing. Second, the results are compared against another published method. Third, the effectiveness and trade-offs associated with modeling transition states are explored.

2.5.1 Comparison with Manually Generated Policies

In order to evaluate the effectiveness of the MDP generated control policy, several alternative control policies are created to compare it against. These are referred to as the "manually generated" policies, and contrasted with the set of "MDP generated" control policies. The manually generated policies were generated through intuitive heuristics, by first defining common sense rules for controlling the system in question, and then translating those rules into code. This represents the traditional method that an embedded software developer would use to create a reconfiguration policy. For the manually generated alternatives, the rules and resulting policies are as follows:

1. DFTFB: This policy keeps the DFTFB algorithm on the chip at all times, and invokes it in all frames regardless of the external requests. This policy was used purely as a starting baseline, as it represents the absence of reconfiguration options, using the most productive and processor intensive channelizer available in the system at all times to meet all requests.

2. DFTFB+Sleep: This policy also keeps the DFTFB algorithm on the chip at all times. However, if the number of requested channels is 0, the DFTFB is put into sleep mode. Otherwise, the DFTFB is kept on.

3. DCM+Sleep: This policy keeps the DCM algorithm on the chip at all times. If the number of requested channels is 0, the DCM is put into sleep mode. Otherwise, the DCM is kept on and applied to produce one of the requested channels.

4. DFTFB+DCM+Sleep: This is a set of policies that use both the DFTFB and DCM algorithms. The reconfiguration decision occurs based on how many channels are requested in the upcoming frame. If fewer than DFT THRESH channels are requested, the DCM algorithm is used. If more than this threshold are requested, the DFTFB algorithm is used. Additionally, if the number of requested channels is 0, the algorithm that is currently loaded is put into sleep mode. If a reconfiguration is in progress, it is allowed to finish regardless of incoming requests. The DFT THRESH parameter is varied from 2 to 6, resulting in 5 different control policies.

In order to compare the policies objectively, the following experimental setup was created on the EFM32GG development board. Both channelizer algorithms were implemented in C, compiled and stored on the system's non-volatile storage. A MATLAB simulation was created that produced a time series of channelization requests having the statistics described in the two use cases A and B.
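For illustration, a request generator of this kind is sketched below in C (the actual traces were generated in MATLAB). The assumption that each channel is requested independently in each frame with a single Bernoulli parameter is an illustrative simplification of Use Case A; the function and parameter names are hypothetical.

    #include <stdlib.h>

    /* Generate one frame of i.i.d. channelization requests, assuming each of
     * the n_chan channels (n_chan <= 32) is requested independently with
     * probability p_req per frame.  Returns a bit mask of requested channels. */
    unsigned gen_iid_requests(int n_chan, double p_req)
    {
        unsigned mask = 0;
        for (int c = 0; c < n_chan; c++) {
            if ((double)rand() / RAND_MAX < p_req)
                mask |= 1u << c;
        }
        return mask;
    }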
The time series output of the simulation was translated to a C array and stored on the EFM32GG. A test harness was written on the EFM32GG, driven by a periodic timer interrupt. At the interrupt rate, the next channelization request was pulled from the stored array, and that channelization request was then used as an input to the dynamically reconfigurable channelizer system. This system was implemented in C and executed on the EFM32GG. In order to facilitate an objective comparison of control policies, all of the manually generated policies were stored as Lookup Tables (LUTs) in addition to the MDP generated policies. This allowed both the manually- and MDP-generated policies to be invoked by suitably swapping out the contents of the LUT.

As part of the test harness, a small amount of diagnostic code was incorporated to compute performance objective 1 (channelization productivity) in real time. This computation was performed by comparing the produced channelizer outputs with the requests. A channelization request that was successfully carried out was labeled a success. Conversely, a request that was not carried out was labeled a failure (e.g., if the processing system was in a reconfiguration state during a frame with channelization requests in it, or if a configuration was in place that could not produce enough output channels). The ratio of the successful outcomes to the number of requests was used to compute a success rate, which was used as a measure of system productivity. The measured productivity results were periodically streamed to a laptop computer using the ARM on-chip Real-Time Trace (RTT) functionality and the EFM32GG Single Wire Output (SWO) port. The streamed output for each case was tabulated and used for comparison. Metric 2 (CPU power consumption) was measured by using the EFM32GG board's energy monitoring tools. These development tools allowed a very accurate current measurement to be taken, showing the exact current drawn by the CPU over time for each control policy. The total current drawn over the total simulation time was used to create a single metric for average power consumption. Thus, a highly repeatable experimental setup was applied, where all experimental settings were kept the same from case to case, with the only difference being the control policy in use.

Results of these experiments are summarized in Figure 2.7. Here, each point in the figure represents the average performance of one policy over the entire simulation. The MDP policies generated by different values of r1 are connected together, illustrating a Pareto front generated by the suite of MDP policies. The manually generated policies are plotted without any connecting lines. If the distance from the origin is used as a scalar metric of performance, the MDP generated policies all outperform or perform equally to the best manually generated policies.

Figure 2.7: Policy comparison results (normalized power savings (g2) versus normalized rate of successful requests (g1) for the MDP and manually generated policies).

2.5.2 Comparison with mHARP

Next, the MRCS is compared to a competing published method, the Highly Adaptive Reconfiguration Platform (HARP), introduced in [3]. One modification was needed, as the published HARP made decisions purely to optimize energy efficiency. This was inadequate for the channelizer case study, as the most energy-efficient result is one where the system never leaves its sleep state.
To remedy this, the single metric in HARP was replaced with the multidimensional reward formulation of Section 2.4.1, to construct a useful HARP policy and also to provide a fair comparison between the two methods. This modified method is referred to here as multiobjective HARP (mHARP).

For each of the two competing techniques, 10 scenarios were created by varying the Bernoulli parameter in use case A, and another 10 by varying the channel dwell time in use case B. The result is 20 simulations where the proposed method and the baseline method (mHARP) were allowed to implement and run the optimal control policy for the given use case and external environment. The system characteristics and measurements described in the previous section were used to define the processing system under control. The results from these experiments are summarized in Figures 2.8 and 2.9, for use cases A and B, respectively. As previously mentioned, HARP requires a priori tuning for a given desired system dynamic. In this simulation, mHARP was optimized for power savings. The results show that when tuned in this way, mHARP does well in this metric for all scenarios (producing slightly better performance than the MRCS approach), but greatly sacrifices performance in the success rate for half of the scenarios. Conversely, when mHARP was optimized for the success rate, large shortcomings in the power savings were observed. In contrast, the MRCS involves no a priori tuning, and optimizes all decision making for each scenario individually without compromises. These results show the MRCS to have greater robustness to a wide range of parameters in different applications, all without any human-in-the-loop intervention.

Figure 2.8: Experimental comparison between MRCS and mHARP for the IID use case (channelization success rate and normalized power savings across scenarios 1-10).

Figure 2.9: Experimental comparison between MRCS and mHARP for the SEQ use case (channelization success rate and normalized power savings across scenarios 1-10).

2.5.3 Trade-offs in Modeling Transition States

An analysis was performed into the effectiveness of modeling processing state transitions, as described in Section 2.4.3. Although the prototype system did not incur large reconfiguration delays, larger delays are anticipated in future work as it scales up to larger channelizer applications. Adding transition states to the MDP model has the undesirable effect of increasing the size of the state space, which is known to increase the size of the model's data structures as well as the execution time of the policy generation algorithms. In order to make informed modeling decisions, it is crucial to understand what is gained at the expense of these costs. With these goals in mind, one of the scenarios of the IID application was selected for exploration, and modified in two ways. First, the dynamics of the processing system were modified by changing the amount of time that transitions of the top-level reconfigurations would take to complete. This delay was varied between 1 and 5 frames, representative of a range from a small reconfiguration delay to a large delay. Second, two alternative MDP modeling approaches were used and compared: one with the transition states modeled and one without.
Delays Modeled    STM Size [Elements]    Solver Execution Time [Seconds]
No                66020                  17.2
Yes               66394                  24.0

Table 2.2: Modeling costs with and without transition delay modeling.

The cost of the additional modeling is shown in Table 2.2. The increase in the size of the STMs is practically negligible; however, the increase in the solver's execution time is not. The benefits of this more expensive modeling come at run-time, and are shown in Figure 2.10.

Figure 2.10: Run-time performance with and without transition delay modeling (channelization success rate and average power [mW] versus reconfiguration delay in frames).

This figure shows the resulting assessment in terms of the performance metrics defined in the previous section. From this assessment, it can be seen that when transitions are not modeled, the performance of the system (with respect to both metrics) degrades proportionally with the length of the reconfiguration delays. This degradation is attributed to the system spending more time in a non-productive reconfiguration state. In comparison, the MDP that has the transitions modeled does not exhibit this performance degradation. These results are attributed to the fact that the MDP with transition states is able to consider the reconfiguration penalties in its decision criteria, and as a result is more "reluctant" to trigger costly reconfigurations.

2.6 Conclusions and Future Work

In this chapter, a methodology was presented for design and implementation of adaptive digital channelizer systems, and a novel channelizer design was demonstrated, called the MDP-based reconfigurable channelizer system (MRCS), that is derived using the new methodology. This methodology and the MRCS employ compact, system-level models based on MDPs to generate control policies that optimize the required embedded signal processing tasks in terms of relevant, multidimensional design optimization metrics.

The MRCS was designed using a System-Level Reconfiguration Framework (SLRF), which provides a systematic methodology for dynamic adaptation of embedded signal processing configurations. The framework includes multiobjective optimization at its core, embracing the multifaceted nature of embedded systems design, where making strategic trade-offs among conflicting goals is critical. As a result of this emphasis, the framework can jointly optimize power consumption and real-time throughput, and can readily be adapted to address other combinations of metrics that are important for a given application.

The effectiveness of the method was shown in simulation, using empirical measurements for the properties of an experimental signal processing system. Through extensive simulations, it was shown that the MRCS outperforms the prior state-of-the-art in terms of robustness to changing applications and scenarios.

Useful directions for future work include adapting the MDP-based, reconfigurable channelizer design methodology to derive dynamically reconfigurable forms of other types or other combinations of channelizer architectures, and continuing to generalize the proposed design methodology to address broader classes of embedded signal processing applications.

One requirement of the proposed SLRF is that the statistics of the external environment and reconfiguration dynamics must be known at design time.
In certain applications, this may not be feasible, or the statistics may be time-varying to such a point that a policy generated offline at design time may experience a reduction in effectiveness as these properties change.

An important complement to this framework is in learning strategies to estimate these statistics at runtime for systems where they are not constant or not known up front. These running estimates can then be used to periodically re-optimize the control policy and keep it performing optimally across time-varying use cases and a time-varying environment. These concepts are explored in the chapters that follow.

Chapter 3
Efficient Solving of Markov Decision Processes on GPUs using Parallelized Sparse Matrices

3.1 Introduction

This chapter presents a novel algorithm that enables fast and efficient use of MDPs in real time on resource constrained signal processing systems that are equipped with graphics processing units (GPUs). Material in this chapter was published in preliminary form in [38]. For many years, researchers have been applying MDPs in limited ways to control computing systems at runtime [34]. Typical application domains that have applied MDPs in this way are artificial intelligence [35], multirate digital signal processing [13], and wireless sensor networks [39], among many others.

MDPs are often regarded as useful tools in theory. However, they have often been deemed too computationally demanding to be fully leveraged in resource constrained computing systems due to the processing time and RAM required to do so [16]. This suggests that their full potential and utility in this class of systems has not yet been reached. In this chapter, recent advancements in parallel processing made possible by embedded GPUs are applied to help bridge this gap. The benefits of GPUs in accelerating many important types of computations are now well accepted in the embedded systems community [40, 41]. The application of GPUs to accelerate MDP algorithms has been reported on in recent published work (e.g., see [42]). However, this previous work focuses on one MDP use case, and does not include analysis of how acceleration can be optimized for different kinds of MDPs. One important contribution of this chapter is to provide a general analysis of GPU-based MDP acceleration. In particular, an analysis is presented detailing how much parallelism exists and where it can be found, as well as a parameterized blueprint for how many parallel threads and GPU kernel executions can be invoked as a function of the MDP dimensions and parameters. Additionally, acceleration results are produced that are significantly beyond other MDP techniques that have used GPUs. These improvements are established by incorporating recent advancements in sparse linear algebra on GPUs [43].

By integrating GPU acceleration and sparse linear algebra techniques into MDP solver design, a novel algorithm is developed, called Sparse Parallel Value Iteration (SPVI). SPVI enables fast, memory-efficient implementation of MDP solvers on resource constrained GPU platforms, such as mobile and embedded GPU SoCs. With these capabilities, SPVI enables the use of MDPs in new and more useful ways than was previously possible. Also, SPVI relaxes constraints and limitations that other MDP techniques have imposed, which can further facilitate more widespread use of MDPs in signal processing systems. The source code to SPVI is included in the software package described in Chapter 5.
The remaining sections of this chapter are organized as follows. Section 3.2 contains a survey of the literature, and provides history and background for the chapter. In Section 3.3, SPVI is defined in detail, and in Section 3.4, applications are explored to illustrate how the concepts that SPVI uses materialize in real-world use across typical use cases. In Section 3.5, the SPVI algorithm is compared experimentally against the prior state-of-the-art in MDP solver algorithms, and shown to have considerable advantages with respect to performance and feasibility of implementation on modern embedded computing hardware.

3.2 Background and Related Work

The different ways in which an MDP can be applied to control a computing system are vast and varied. MDPs provide a generic decision making framework that uses abstract concepts including states, actions, transition probabilities and rewards. Once these concepts are defined, they are passed to an MDP solver, which is an algorithm that produces an optimal policy with respect to those definitions. The policy is a mapping from states to actions, such that an agent using the policy looks up what action to take for any given state. However, there is no consensus in the literature regarding exactly how to map elements of computing systems to components in the MDP framework. This mapping is in general left to the designer who is applying the MDP to solve a specific computing problem. For example, a processing system can be commanded to run a particular algorithm (and this can be modeled as a state), or that same command can be modeled as an action instead, or it could be modeled as both (an action that leads to a state). Also, the choice of granularity for these definitions is important; e.g., are two invocations of the same algorithm with a slightly different parameter value considered two different actions, or the same action? There are several approaches in the literature to map elements of computing systems to MDP states and actions, and these different approaches lead to different results, with implications in both the final policy performance as well as how hard it is to model and solve the MDP. One of the earliest known applications of using an MDP to control resources in computing systems at runtime is [34]. Notable examples of differing approaches include the reconfigurable digital filter presented in Chapter 2, a reconfigurable router [28], a power management module for a microprocessor [44], a smartphone scheduling program that synchronizes email efficiently [45], and a collision avoidance algorithm for commercial aircraft [46].

3.2.1 MDP Solvers

One of the first challenges associated with using MDPs is choosing what constitutes a state, an action and a reward. After that is decided, the associated MDP data structures must be stored on a computer and used as the inputs to the MDP solver to produce a policy. With this policy, the runtime decision framework consists of observing what state the system is in, and using that as input to the policy to determine what action to take. The classical methods to solve MDPs are algorithms known as Value Iteration, Policy Iteration and Modified Policy Iteration [35]. All of these algorithms produce an optimal solution to the MDP problem, with different approaches leading to different implications in the execution time, power requirements and memory use of the solver routines.
These classical MDP solver algorithms suffer from the same issue as most systems that try to reason using computations of probability distributions: the framework's data structures grow exponentially with the size of the state space. A large state space is desirable in order to have sufficient model expressiveness to tackle difficult decision problems, but this desire is at odds with the resource requirements needed to solve an MDP that has a large state space. The upper limits on memory consumption available on typical embedded computing systems can easily be reached before many important system details have been modeled.

More specifically, the total size of (number of elements in) an MDP's State Transition Matrices (STMs) is NS² · NA, where NS and NA are the number of elements in the state space and action space, respectively. The STMs are the largest data structures in the MDP, and usually the most difficult structures to store and process, due to their large size. The STMs are large matrices even for modest choices of NS and NA, and if one were to add a state variable with L values to the state space, this addition would increase the size of the STMs by a factor of L². Besides the storage space and memory requirements to store large data structures, increasing the state space also causes the solver's execution time and power consumption to grow exponentially as well. Thus, for computing systems that operate under strict resource constraints, it is not enough to frame an MDP in a way that produces a well performing solution. There is also the practical issue of whether the solver can be successfully implemented on the targeted platform, and whether it can complete in an amount of time reasonable for the application.

This so-called curse of dimensionality [16] usually results in limiting the use of MDPs to a mode of deployment that greatly hampers their usefulness: the solver is invoked only once offline, and then only the generated MDP policy (not the entire framework required to solve the MDP) is used on the target system. This scenario is suboptimal and limiting if the problem inputs are unpredictable, constantly changing, or dependent on the environment. To overcome these limitations, one approach is to design efficient and compact MDP solvers, which enable the full MDP to be stored and solved on demand on the target system. Such embedded MDP deployment leads to a more intelligent and adaptive class of embedded systems, which can learn, adapt and autonomously re-optimize themselves for changing conditions and use cases.

The design of compact solvers has been the goal of various researchers in recent years. For example, Boutilier et al. propose factored MDPs as a method for compact representation of large, structured MDPs [15]. This method can work quite well in principle. However, it requires a specific conditional probability structure to be present in an MDP, and the data structures must be created by hand with specific knowledge of the exact structure. The state transition matrices must be manually converted into tree-shaped conditional probability structures. This can be difficult, time-consuming and typically requires a subject matter expert in the loop anytime a transition probability changes, which effectively prevents or at least greatly complicates runtime autonomous solving of an MDP that changes over time. Additionally, this requirement can be problematic if the underlying structure is not fully understood.
In contrast, the SPVI algorithm has no such limitation. SPVI does not require having any knowledge of the structure of the probabilities in the STMs, nor does it require hand-crafting of the data structures into tree-shaped objects.

Hoey et al. detail an algorithm similar to Value Iteration using Algebraic Decision Diagrams [47]. This approach shows good results in taming the curse of dimensionality, but imposes the same restrictions as [15] and thus has the same limitations. In another example, Jonsson and Barto present an algorithm that performs hierarchical decomposition of factored MDPs into smaller subtasks to help alleviate the growth in complexity [48]. This approach can be effective, but also requires a priori knowledge of the causal structure within the MDP. This knowledge can be very difficult or impossible to obtain for many MDPs, a shortcoming identified by the authors themselves. Also, this method requires that the MDP have this decomposability property, which is not always the case. In a similar spirit, Lin and Dean [49] present a method to solve a large MDP by first decomposing the state space into regions, determining actions to take within those regions, and then using novel approaches to combine the resulting sub-policies into an overarching policy that solves the original, large MDP. The authors note in this work that the decomposition must be done a priori by a domain expert. This decomposition is not guaranteed to be feasible, and when it is feasible, it can be very difficult to perform. In contrast to all of this prior work, the proposed SPVI approach requires no special knowledge of the structure within an MDP in order to be used.

3.2.1.1 POMDPs and Approximate Solvers

Another critical issue in deploying MDPs into embedded systems is that the runtime system may not have any way to know exactly what state it is in. In this case, the problem statement changes to that of a Partially Observable MDP (POMDP). Observability here refers to the ability of the system to observe its place in the state space. In POMDPs, the policy is no longer a mapping from discrete states to discrete actions, but instead a mapping from a continuous space known as the belief vector to a discrete action [35]. The belief vector represents a probabilistic interpretation of the system's best guess for what state it is likely in. Since the discrete state space is a subset of the continuous belief space, POMDPs carry with them even more computational burdens for two reasons: 1) the solver is tasked with solving a harder problem, and 2) the system invoking the policy must maintain an evolving time series of the current belief vector, in order to use it as the input to the policy. The Artificial Intelligence community has spent considerable effort developing computationally efficient solvers for POMDPs. A good survey of work in this area can be found in [50]. Many of these solvers are approximate solvers (not exact solvers), in that they seek to reduce computation by settling for sub-optimal solutions that may be very close to the exact solution. POMDP solvers can be used to solve (fully observable) MDPs, but not vice-versa. Although this chapter focuses on solving MDPs and not POMDPs, it is worthwhile to investigate whether recent advances in approximate solutions to POMDPs can help in solving MDPs efficiently. In Section 3.5, findings of this survey are utilized to obtain and benchmark some contenders, and use these as candidates in comparisons for completeness.
3.2.2 Sparsity

A novel feature of the SPVI algorithm is the exploitation of sparsity in the MDP data structures. In this context, sparsity is defined as the percentage of zero-valued elements in the MDP STMs. Wijs et al. [51] present a promising method to decompose MDPs into subgraphs, exploiting sparsity on GPUs. However, the method is presented in the context of model checking for formal methods in software engineering. More research is required to incorporate this method into an MDP solver.

A caveat must be added to this approach, in that high sparsity is not guaranteed to exist in an MDP. In other words, one can easily synthesize an artificial MDP with its contents populated with random numbers that meet the definition of a valid MDP. Such a synthetic MDP can be generated in a way that does not exhibit high sparsity. However, when MDPs are constructed to solve real-world problems, the resulting data structures are often highly sparse. The proposed SPVI approach can be applied to any MDP (sparse or not); however, the largest performance increases will result only if the MDP is sparse. Section 3.4 elaborates the claim that sparsity is commonly found in MDPs, and provides both conceptual and empirical evidence to support it. The aim is to show that the assumption of sparse STMs is not a major limiting factor for practical MDPs.

3.3 Method

As mentioned in Section 3.2.1, there are three well known algorithms for solving MDPs. This chapter focuses strictly on Value Iteration and leaves exploration of the other two algorithms as an interesting direction for future research. First, the characteristics of the Value Iteration algorithm that make it suitable for acceleration via parallel processing are detailed. Then, the focus turns to the important concept of sparsity in MDPs. Finally, SPVI is presented in detail. SPVI can be viewed as a version of Value Iteration that uses both parallel processing and sparse matrix representations.

3.3.1 Parallelization

Value Iteration is an algorithm that is used to generate an optimal policy for an MDP. In Value Iteration, a real number (or value) V(s) is associated with each state s. This mapping is known as the Value Function. The value V(s) represents the expected reward that can be obtained from state s. The Value Function V is derived by using the iterative procedure shown in Equation 3.1, which starts out assigning a value of zero for each state and then incrementally converges from that to the optimal Value Function. Once sufficient iterations are performed, the optimal Value Function is known and the optimal MDP policy can be obtained trivially from it. This process of deriving the Value Function is a form of dynamic programming [52].

V^0(s_i) = 0
V^n(s_i) = \max_{a \in A} \Big\{ R(s_i, a) + \gamma \sum_{s_j \in S} \big[ P(s_j \mid s_i, a) \, V^{n-1}(s_j) \big] \Big\}    (3.1)

In Equation 3.1, V^n(s_i) is the approximation to the Value Function in state s_i at loop iteration n, S is the discrete state space, A is the discrete action space, R(s_i, a) is the reward function for each state-action pair (s_i, a), γ is a scalar discount factor, and P(s_j | s_i, a) is the probability of transitioning from state s_i to state s_j after taking action a. Arranging the conditional probabilities in a matrix with s_i as rows and s_j as columns gives the State Transition Matrix (STM) for action a. A good discussion of strategies for computing the stopping criteria used to terminate the iteration in Equation 3.1 can be found in [15].
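For reference, a minimal single-threaded, dense implementation of the iteration in Equation 3.1 is sketched below. The function signature, data layout, and stopping test are illustrative assumptions rather than the SPVI implementation.

    #include <math.h>
    #include <string.h>

    /* Single-threaded, dense Value Iteration implementing Equation 3.1.
     * P is indexed as P[a][i][j] flattened to a*NS*NS + i*NS + j, and R as
     * R[i*NA + a].  V and V_prev are length-NS work arrays, pi is the
     * resulting policy (length NS), gamma the discount factor, and eps the
     * stopping threshold on the infinity norm of the value update. */
    void value_iteration(const double *P, const double *R,
                         int NS, int NA, double gamma, double eps,
                         double *V, double *V_prev, int *pi)
    {
        memset(V, 0, (size_t)NS * sizeof(double));          /* V^0 = 0 */
        double delta;
        do {
            memcpy(V_prev, V, (size_t)NS * sizeof(double));
            delta = 0.0;
            for (int i = 0; i < NS; i++) {
                double best = -INFINITY;
                int best_a = 0;
                for (int a = 0; a < NA; a++) {
                    /* Expected discounted return of taking action a in state i. */
                    double q = R[i * NA + a];
                    const double *row = &P[((size_t)a * NS + i) * NS];
                    for (int j = 0; j < NS; j++)
                        q += gamma * row[j] * V_prev[j];
                    if (q > best) { best = q; best_a = a; }
                }
                V[i] = best;
                pi[i] = best_a;
                double d = fabs(V[i] - V_prev[i]);
                if (d > delta) delta = d;                   /* infinity norm */
            }
        } while (delta >= eps);
    }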
Under this optimal solution, the MDP policy contains the optimal action to take in a given state to maximize the expected reward. It is worthwhile to note that although the iteration is initialized here using the zero vector, it is shown in [53] that the iteration converges for any initialization vector.

In SPVI, the capability of modern embedded GPUs to compute using massive parallelization is utilized. The case studies in Section 3.4 and performance benchmarks in Section 3.5 use a device in the NVIDIA Jetson family of embedded GPUs. These processors are small, power-optimized General Purpose GPUs (GPGPUs) that can be used in embedded systems to enable hundreds or thousands of parallel threads of execution in highly Size, Weight and Power (SWaP) constrained systems. To describe the acceleration of Value Iteration through parallel processing, the following analysis considers how much parallelism can be exploited in Equation 3.1.

There are three components of Equation 3.1 that can be accelerated with parallel execution of threads, as long as two copies of the Value Function are maintained in memory (one for V^n and one for V^{n-1}). First, the elements inside the max{} operation can be computed independently for each action a ∈ A. Second, the entire second line of Equation 3.1 can be computed independently for each state s_i ∈ S. Third, the stopping criterion typically used is the comparison of ||V^n − V^{n-1}||_∞ to a scalar threshold, and this norm operation can be computed using a parallelized dimensionality reduction kernel. With these characteristics, the Value Iteration algorithm is inherently well suited to significant acceleration through the high levels of parallelization achievable with embedded GPUs.

3.3.2 Sparsity

The previous section described the Value Iteration algorithm as it would be implemented on a single-threaded CPU. If sufficient parallel processing resources are available, it is possible to achieve acceleration through computation of independent operations simultaneously. With this goal in mind, Equation 3.1 is rewritten into Equation 3.2, using matrix and vector representations to move away from the sequential-looping approach:

V^n = \max_{a \in A} \{ R + \gamma \cdot M \cdot V^{n-1} \}.    (3.2)

Here, R represents the reward function for each state and action flattened into a length NS·NA column vector, where NS and NA are the number of elements in the state space and action space, respectively; γ remains a scalar; and M represents the vertical concatenation of all NA of the NS × NS transition matrices into an (NS·NA) × NS matrix.

Through execution profiling, it was consistently observed that a large portion of the total computation time in Value Iteration is spent multiplying the V^{n-1} values by the transition probabilities. In other words, a large portion of the computation time in Equation 3.1 is spent performing the summation loop over s_j ∈ S, which needs to be repeated (NS·NA) times for each iteration. Equivalently, in Equation 3.2, the majority of the time is spent performing the large matrix-vector multiplication M · V^{n-1}. This trend was observed consistently on single threaded CPU implementations using profiling timestamps, as well as on parallelized GPU implementations using the NVIDIA profiler nvprof. Under the assumption that the matrix M is sparse, it follows that the majority of the computation time in Value Iteration solvers is spent multiplying a large sparse matrix by a vector.
In other words, this time is spent multiplying elements by zero and then summing those zeros with other zeros. SPVI exploits the same principle as all sparse linear algebra software libraries: an operation that is guaranteed to produce a known result (zero) can be skipped altogether, resulting in an improvement in execution time, memory use, and power consumption. By parallelizing computations using a GPU, and replacing linear algebra operations with GPU operations that are specifically optimized for sparse matrix-vector algebra, it is shown that a significant performance improvement beyond the current state-of-the-art is achieved.

3.3.3 SPVI Algorithm

A pseudocode description of the SPVI algorithm is shown in Algorithm 1. The key features of SPVI that enable its high performance are: (1) a parallel deployment scheme that takes full advantage of as many GPU cores, blocks and threads as are available, (2) the consolidation of (NS·NA) independent operations into one large parallelized operation, and (3) the use of a single compressed sparse matrix in place of the NA large STMs that are typically required.

ALGORITHM 1: Sparse Parallel Value Iteration (SPVI).
Input: S, A, R(s_i, a), P(s_j | s_i, a), γ, ε
Output: π
1  Compute N_NZ, the number of non-zero elements in M
2  Allocate memory for M_s, a sparse matrix sized for N_NZ
3  for each element in M do
4      if element is non-zero, add it to M_s
5  end
6  V^0 ← 0
7  n ← 0
8  repeat
9      n ← n + 1
10     T ← K_SPARSE_MULT(M_s, V^{n-1})
11     Q ← K_SAXPY(γ, T, R)
12     V^n, π^n ← K_MAX_REDUCE(Q)
13     N ← K_INF_NORM(V^n, V^{n-1})
14     δ ← MAX(N)
15 until δ < ε
16 π ← π^n

3.3.3.1 State Transition

The computation of the product M · V^{n-1} in Equation 3.2 is efficiently implemented in SPVI using a sparse matrix-vector multiplication. The sparsity in the transition matrices is exploited by the conversion of the large and sparse M to a much smaller, densely packed M_s on lines 1 through 5. The sparse matrix M_s is created in the Compressed Sparse Row matrix format. This conversion only needs to be performed once at initialization. Details on this sparse matrix format, as well as a thorough analysis of the history and performance advantages of performing sparse matrix-vector multiplications on GPUs, can be found in [43].

A sparse matrix format is a format for a data structure that represents a matrix. However, instead of the standard approach for matrix storage (a serialization of each element in the matrix regardless of its value), a sparse matrix structure contains an array of just the non-zero elements, along with two other arrays that indicate where those elements are located in the matrix. The format implicitly assumes that elements not specified are zero by default. In this form, a matrix can be represented with no loss of information, and if the sparsity of a matrix is high then the sparse representation can be much smaller than the matrix stored in a standard (fully serialized) format. Correspondingly, multiplying a sparse matrix by a vector can be much faster and more memory-efficient if the sparsity is high. In SPVI, the multiplication of a sparse matrix by a vector is performed using a CUDA kernel and is denoted by K_SPARSE_MULT in Line 10. This kernel is a standard sparse matrix-vector multiplication as provided by the GPU libraries CUSP (2016) and NVIDIA's cuSPARSE (2015).
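For illustration, the following CPU-side C sketch shows a CSR representation and the matrix-vector product that K_SPARSE_MULT computes. In SPVI this operation is carried out by the CUSP/cuSPARSE kernels on the GPU; the struct layout and names below are illustrative assumptions.

    /* Compressed Sparse Row (CSR) representation of the stacked transition
     * matrix M ((NS*NA) x NS): vals holds only the non-zero probabilities,
     * col_idx their column positions, and row_ptr the index in vals at which
     * each row starts (length n_rows + 1). */
    typedef struct {
        int           n_rows;
        const double *vals;
        const int    *col_idx;
        const int    *row_ptr;
    } csr_matrix;

    /* Single-threaded version of y = M * v; only non-zero entries are visited. */
    void csr_matvec(const csr_matrix *m, const double *v, double *y)
    {
        for (int r = 0; r < m->n_rows; r++) {
            double acc = 0.0;
            for (int k = m->row_ptr[r]; k < m->row_ptr[r + 1]; k++)
                acc += m->vals[k] * v[m->col_idx[k]];
            y[r] = acc;
        }
    }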
Next, the discount factor and rewards need to be applied. After computing the product M_s · V^{n-1}, the remaining steps can be implemented by scaling the product by the scalar γ and then adding it to the vector R. This algebraic operation is commonly referred to as a Single-Precision A·X Plus Y (SAXPY), and is very efficiently implemented in most linear algebra packages. Several CUDA linear algebra packages (e.g., CUBLAS) provide a highly optimized parallel execution version of SAXPY, which is used in SPVI and referred to here as K_SAXPY in Line 11.

3.3.3.2 Action Selection

The NS elements of the Value Function V^n and policy π^n are computed from the (NS·NA) elements of the Q vector, which constitutes the selection of the optimal action for a given state. This computation is implemented with NS parallel deployments of an NA-to-1 dimensionality reduction kernel invoked in Line 12 of Algorithm 1. The execution configuration for the kernel is set up to launch one kernel per MDP state. The kernel computes the maximum value (and the action associated with it) from a subset of Q, striding across the vector only on the elements associated with the state assigned to the given kernel deployment.

3.3.3.3 Stopping Criteria

In order to evaluate the stopping criteria for SPVI, the infinity norm of the incremental approximations to the Value Function must be computed. This operation is represented by lines 13 and 14 of Algorithm 1. Line 13 is an execution of NS parallel deployments of a reduction kernel. The kernel is deployed with an execution configuration of NB CUDA blocks, where NB is as large as possible for the specific CUDA hardware that is available. As a result of this execution configuration, the reduction kernel of Line 13 provides a reduction from two length-NS vectors to one length-NB vector. Line 14 is a maximization loop run on the CPU. This use of a loop on the CPU does not have much impact on the execution time because the length of the vector N is NB, which is typically much smaller than NS.

3.4 Applications

The previous section detailed the methodology used by SPVI to exploit sparsity in MDPs. This section begins with a listing of the sparsity found in several MDPs encountered during this research, followed by discussion and justification of why this is a typical finding for real-world MDPs.

3.4.1 Survey

This section contains a survey of several embedded systems challenges that are solved with MDPs. For these MDP solutions, the sparsity of the STMs was computed or estimated. The results are shown in Table 3.1. As is evident from the table, there is a high level of sparsity in each of these MDPs.

Problem               Domain                 Sparsity
Coffee Robot          AI/Robotics            96.9%
Russell/Norvig Maze   Navigation/Planning    99.7%
ACAS                  Avionics               99.1%
SPC                   Sensor Networks        99.8%
MRCS                  Cognitive Radio        90.9–93.3%

Table 3.1: Sparsity in surveyed MDPs.

In the Coffee Robot problem [15], researchers detailed a classic problem used in Artificial Intelligence and Robotics. The problem contains a decision framework used by a notional agent (the "robot") that leaves an office building to buy coffee for its owner. In the Russell/Norvig Maze [35], researchers create a grid representing a floor plan, and show how MDPs can be used to solve indoor navigation problems. Since the original problem was an illustrative grid of size 3x4, the analysis here includes an extension of the grid to be much bigger in order to make the problem more realistic. In the Airborne Collision Avoidance System (ACAS) [46], MIT researchers developed an MDP for collision avoidance on commercial airliners. The Solar Powered Computer (SPC) [2] problem is a challenging dynamic power management problem.
The MDP-based Reconfigurable Channelizer System (MRCS) presented in Chapter 2 is a signal processing system that uses an MDP to dynamically reconfigure itself for a given environment and use case.

The examples listed above show anecdotally that sparsity can appear prominently in real-world MDPs. It is reasonable to conclude that this is a typical finding, not just a handful of outliers, in the context of MDP problems for embedded systems. Reasons supporting this conclusion are explored in Section 3.4.2 and Section 3.4.3, where common forms of sparsity that arise are described.

3.4.2 Multivariate State Spaces

In Section 2.4.4, the concept of a multivariate state space was defined. From MDP surveys in the literature it was observed not only that this type of state space is very common, but also that MDPs with this type of state space exhibit a high degree of sparsity. One reason sparsity results is that an action rarely affects all of the state variables simultaneously. Rather, it may have an effect on one or only a few of the state variables, and thus a transition from a given state to most combinations of the other state variables is very unlikely or impossible. Only a small subset of the combinations can be reached, and this gives rise to sparsity in the STM for that action. This type of sparsity occurs very clearly in the Coffee Robot and MRCS problems. For example, in the Coffee Robot problem, the state space consists of the combination of several Boolean variables of the form listed below. Each of these is either true or false at any point in the state space.

- O: The robot is located in the office.
- W: The robot is wet.
- U: The robot has an umbrella.
- R: It is raining.
- HCR: The robot has a coffee in its possession.
- HCO: The owner has a coffee in its possession.

The MDP actions for this problem consist of simple, direct actions. For example, one action is to pick up the umbrella if it is in the office. This action only affects the state variable U (umbrella); it does not change the robot's location, whether it is wet or not, etc. When the STM is constructed for this action, any state that corresponds to a change in these other state variables is an unreachable state. This results in a large percentage of the STM being populated with zeros, representing that the probability of reaching those states is zero.

3.4.3 Sequential Decision Problems

Another reason to expect sparsity to be found in many real-world MDPs is that MDPs are often used to solve sequential decision problems. By definition, this implies having to make many small decisions to reach a desired long-term goal or state. For example, navigation problems usually involve a sequence of decision points where the agent must decide to turn left, turn right or go straight. These decisions move the agent from one location to another (nearby) location. The set of possible locations that can be reached by one decision makes up a tiny fraction of all of the possible locations in the entire state space. If there were an action available to simply jump to the destination state in one hop, then the navigation problem would not be very difficult to solve algorithmically (and thus an MDP would not be required). However, this is usually not the case, and often a desirable long-term goal or state can only be reached through a sequence of decisions that create a trajectory through many small and incrementally overlapping subsets of immediately reachable states.
When the STM is constructed for each action, only the tiny subset of states that are immediately reachable is populated with non-zero transition probabilities. The rest of the large matrix is filled with zeros. For this reason, it is reasonable to conclude that sparsity is an inherent property of MDPs when used to solve these types of problems. This type of sparsity occurs very clearly in the Russell/Norvig Maze, ACAS and SPC problems.

The following subsections contain discussions of specific case studies to show how MDPs can be used to enable runtime adaptation in complex scenarios, and to give more detailed illustrations of STM sparsity. These case studies were implemented in the MDP testbed and used to generate the performance benchmarks that are summarized in Section 3.5.

3.4.4 Case Study: Solar Powered Computing

In recent years, there has been strong interest and optimism in the design of systems that can run purely off of a photovoltaic energy harvesting source (e.g., see [5, 54]). These energy sources are commonly combined with an energy buffer such as a supercapacitor or a rechargeable battery. A primary design goal for this class of systems is to design them in such a way that they can run indefinitely on the energy provided by the harvesting source alone, a property known as being energy neutral [55]. This goal is challenging because solar energy is time-varying and uncontrollable. Furthermore, there is no solar energy at night. In order to keep a system powered at times other than those of high incoming solar energy, it is necessary to use a suitable power management scheme that is well matched to the power requirements of the processing system, and also to the performance goals of the application. Additionally, an intelligent power management scheme might employ strategic conservation of energy stores to be used at a later time of the day.

An MDP was created to dynamically control the power of an embedded computing system closely mimicking the setup in [2]. The performance of this MDP-based controller was compared to the competing technique proposed in that work, which solves a constrained optimization problem using Linear Programming. The performance of the MDP was nearly identical to that of the Linear Programming approach. It is important to note that the authors in [2] cited the Linear Programming solver as too computationally intensive to run in their resource constrained embedded system, reinforcing the theme that often there are not enough computing resources to house intelligent decision frameworks in systems that can benefit from them.

3.4.4.1 State Space and Action Space

The state space for this problem is defined using the multivariate approach described above in Section 3.4.2. The state space for the system is represented as:

s = (TD, SC),
DOM(TD) = \{1, 2, \ldots, N_{TD}\},    (3.3)
DOM(SC) = \{ k/N_{SC} \mid k \in \{1, 2, \ldots, N_{SC}\} \}

where TD represents the discretized Time of Day (in hours), SC represents the discretized State of Charge (SC) of the energy buffer, and N_TD and N_SC are positive integers. The SC is the equivalent of a fuel gauge for the energy buffer, with the units being percentage points (0% = empty, 100% = full). Both TD and SC are inherently continuous valued quantities (time and energy) that are mapped into discrete sets through a coarse-grained quantization so that they can be incorporated into a discrete state MDP. In order to manage power, the MDP action regulates the amount of application-specific processing that the system performs.
This is assumed to be in the form of a duty cycle. In other words, a percentage of time can be spent in a productive state versus a sleep or low-power state. The action a ∈ A is the choice of the duty cycle, which is discretized as shown in Equation 3.4:

A = { k / N_A | k ∈ {1, 2, ..., N_A} }.          (3.4)

3.4.4.2 State Transition Matrix

The STMs are defined as shown in Equation 3.5. This STM formulation utilizes the fact that the TD variable is independent of both the SC variable and the action. This allows for factorization (as described in Section 2.4.4) to be employed in the state space, a concept that plays a crucial role in the compact modeling and efficient solving of MDPs.

P(s_j | s_i, a) = P(TD_j, SC_j | TD_i, SC_i, a)
               = P(TD_j | TD_i) P(SC_j | TD_i, SC_i, a).          (3.5)

The entries for the transition probabilities of the TD variable are then computed as in Equation 3.6. Here, a simplification is made because the discretization width of the TD variable is equal to the control frame interval. Thus, for each iteration of the control frame, the time of day is guaranteed (with probability 1) to increment by 1 hour.

p(TD_j | TD_i) = { 1,  if TD_j = mod(TD_i, N_TD) + 1
                 { 0,  otherwise.          (3.6)

To compute the entries for the probabilistic transition of the SC variable, knowledge of how the SC will be affected by each control action is required. For this, the historical platform power consumption data is used, as well as the predicted incoming solar energy, as shown in Figure 3.1. The Platform Power Model block in this figure refers to a running estimate of how much average power will be consumed by each processing configuration (i.e., each MDP action), as predicted from the logged data. The mathematical details of the SC transitions are omitted here for conciseness.

An illustration of the underlying state transition structure that results from this MDP formulation is shown in Figure 3.2. Each circle represents a state in the state space, with the shaded circles denoting one possible trajectory. The arrows represent the reachable states (i.e., the transitions that have non-zero probability of occurrence). It can be seen from this formulation how sparsity is a central characteristic of the STM. A non-sparse STM would be a scenario where most or all of the circles in the figure can be reached from any given circle. Such a scenario is not possible for two reasons. First, from any circle (starting state), only the subset of circles in the next time frame (the column immediately to the right) is reachable. Thus, from the TD component of the state space alone, the sparsity of the STM will be at least 1 - (1/N_TD). Second, only a few rows of the next column are reachable, not the entire column. This further increases the sparsity.

Figure 3.1: Block diagram of MDP-based dynamic controller for a solar-powered computing system.

Figure 3.2: Time-Energy state grid and transitions.

In these experiments, it was found that sufficient granularity in the discretization of the state space (in order to attain good decision making performance) could be reached with N_TD = 24 and N_SC ≥ 50. With the state spaces sized in this way, the total STM sparsity is at least 99.8%, as is listed in Table 3.1.
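To make the structure of Equation 3.6 and the resulting sparsity concrete, the following minimal sketch builds the time-of-day transition component and computes the sparsity lower bound quoted above. It is an illustrative Python sketch, not the testbench implementation; the discretization sizes are the ones assumed in the text.

```python
# Minimal sketch (assumed parameters, not the dissertation's code): the deterministic
# time-of-day transition of Equation 3.6 and the resulting lower bound on STM sparsity
# for the solar-powered-computing MDP.
import numpy as np

N_TD, N_SC = 24, 50                      # discretization sizes assumed from the text
N_S = N_TD * N_SC                        # joint states s = (TD, SC)

# p(TD_j | TD_i): the time of day advances by one hour with probability 1 (Eq. 3.6)
P_TD = np.zeros((N_TD, N_TD))
for td_i in range(1, N_TD + 1):
    td_j = (td_i % N_TD) + 1
    P_TD[td_i - 1, td_j - 1] = 1.0

# Even before modeling the SC dynamics, each joint state can only reach states in the
# next time column, so at most N_SC of the N_S entries in each STM row are non-zero.
sparsity_lower_bound = 1.0 - (N_SC / N_S)        # equals 1 - 1/N_TD
print(f"sparsity from TD structure alone: {sparsity_lower_bound:.3f}")   # ~0.958
# Restricting each column to only a handful of reachable SC values pushes the total
# sparsity to the ~99.8% figure reported in Table 3.1.
```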
3.4.5 Case Study: MDP-Based Reconfigurable Channelizer System (MRCS)

Chapter 2 detailed significant advantages of using an MDP to derive dynamic reconfiguration policies for channelizers operating in stochastic environments. The MDP-based approach for digital channelizer design optimization resulted in increased robustness when used to periodically re-optimize the system policy specifically for the external environment in which it is being used. However, in that chapter the proposed method utilized a single MDP-based policy that was generated by invoking an MDP solver once offline (in MATLAB) and not in the target system. In order to predict the effectiveness of SPVI in addressing this limitation, the MRCS is revisited here to analyze its sparsity.

The state space for this problem was defined using the multivariate approach described in Section 3.4.2. The state variables consist of the channelization requests, combined with the available processing configurations for a given processing platform. The state space for the system is defined in Equation 2.13.

In this system, the state space contains all possible combinations of channelization requests with all possible platform configurations. The action space is defined by all combinations of the processing configuration variables, such that the MDP can select which processing configuration to use at any given time. Thus, the MDP actions are the set of selectable configurations and N_A = |DOM(CF)|. The channelization requests are exogenous and not under the control of the MDP. However, the processing configuration is fully under the control of the MDP. This scenario allows the STM to be defined as was shown in Equation 2.14. In this formulation, the CR state variables form their own independent, uncontrollable Markov chain embedded within the larger MDP state space and transition matrices.

A notional state transition diagram for an arbitrarily selected action is shown in Figure 3.3, for a scenario where N_A is 3. This diagram shows that when Action 3 is selected by the MDP policy, the system transitions with probability 1 to the region of the state space corresponding to platform configuration CF_3. Any time Action 3 is selected by the policy, the system can transition among different values of the CR variables as prescribed by the factored transition probabilities P(CR_j | CR_i); however, it cannot transition to any region of the state space corresponding to a different platform configuration.

As a result, the transition probabilities to all other states that do not correspond to configuration CF_3 are zero, and for this reason sparsity is an inherent part of this MDP formulation. Since only one selected configuration is reachable for a given action, an STM sparsity of at least 1 - (1/N_A) results. The system in Chapter 2 uses N_A values ranging from 11 to 15, and thus a sparsity ranging from 90.9% to 93.3% results.

Figure 3.3: Channelizer state transition diagram for Action 3.

3.5 Results

3.5.1 Experimental Setup

In order to objectively measure the performance of SPVI compared to alternative methods, a testbench was created using the NVIDIA Jetson TK1 development board. This board consists of an NVIDIA Tegra K1 SoC, which is a size, weight and power optimized processor used in many embedded and mobile applications. The SoC contains a Quad-Core ARM Cortex A15 CPU running an embedded Linux distribution.
The setup allows CPU-only programs to be run, which is an important part of the comparative benchmarking between CPU-only and GPU-accelerated approaches. The GPU acceleration is made possible on demand by enabling the K1 SoC's on-chip Kepler GPU, which contains 192 CUDA cores and can be programmed and controlled directly from a user space application on the embedded Linux distribution.

The testbench contains implementations of three of the MDPs introduced in Section 3.4. The MDPs are used as the input to multiple MDP solver implementations. In general, it was found that the competing solver algorithms all produced roughly the same policy output, and thus the comparisons are purely in terms of the computational resources expended in finding the policies.

Through the use of this testbench, it was possible to keep all test details consistent and change only the solver algorithm. This allows for a fair comparison of solvers using consistent and objective experiments. The solvers selected for comparison are described in Section 3.5.2.

3.5.2 Algorithm Comparison

Three algorithms were used for benchmarking, which are referred to here as VI, Thrust-VI and SPVI. The first algorithm is a CPU-only implementation of Value Iteration, referred to below as VI. It was created directly from Equation 3.1. This implementation represents a straightforward single-threaded implementation of the classical Value Iteration algorithm and was used purely as a baseline for comparison with the other algorithms.

Next, a survey of the literature was performed to find the current state of the art in high performance MDP solvers. Several algorithms and open source packages were analyzed. The search was restricted to open source candidates available in C or C++ source code, and avoided those in Java, MATLAB or Python. This decision was made so that the results could be applied with no caveats to resource constrained embedded systems that are not able to support the runtime requirements of higher level languages.

The solvers that were obtained or implemented are: SPUDD [47], AAPL [56], pomdp-solve [57], and Thrust-VI [42]. In the case of SPUDD, AAPL, and pomdp-solve, open source software was available to download and use in the experiments. For Thrust-VI, no software was available, but the algorithm description in the literature was detailed enough that direct implementation from pseudocode was possible without any ambiguity. After some experiments and analysis, it was determined that Thrust-VI had the best performance among the candidates. Thus, Thrust-VI was identified as the current state of the art and chosen for comparison against the SPVI algorithm.

Many details and logistical matters involved in searching for open source MDP solvers were identified as important issues during this work. Discussion of these issues, the criteria used to determine the state of the art, and a more in-depth analysis of the solver survey are presented in Chapter 5.

3.5.2.1 Thrust-VI

Thrust-VI is a GPU accelerated version of Value Iteration. The algorithm uses the open-source NVIDIA Thrust library. Thrust is a C++ parallel algorithms library that allows programmers to write portable C++ code, and then execute that code on CUDA GPUs, OpenMP systems, and other parallel execution platforms. Thrust-VI is an implementation of the classic Value Iteration algorithm, using Thrust's parallel functions.
The finding that this algorithm outperforms other, more algorithmically complex alternatives is an interesting outcome. It suggests that decades of novel and clever algorithmic advances in this field were surpassed by simply exploiting data parallelism with GPUs, a completely different approach with fewer restrictions.

Both SPVI and Thrust-VI leverage GPUs for execution time acceleration. The most significant difference between the two is that SPVI uses sparse matrix representations and sparse matrix-vector operations while Thrust-VI does not.

3.5.3 Measurements

In the experiments, the solver execution time was measured using Linux's native timing support. The results are listed in Table 3.2. Both GPU-accelerated solvers (Thrust-VI and SPVI) were considerably faster than the CPU-only Value Iteration solver.

MDP                             VI      Thrust-VI   SPVI
Solar Powered Computer          28.48   6.62        1.60
Russell/Norvig Maze (30x100)    31.71   7.33        1.62
MRCS Channelizer                44.66   9.14        3.19

Table 3.2: Solver execution time (seconds).

SPVI significantly outperformed Thrust-VI in all of the MDPs that were tested. The execution time reduction from Thrust-VI to SPVI was 76% for the Solar Powered Computer, 78% for the Russell/Norvig Maze (with a grid size of 30 rows by 100 columns), and 65% for the MRCS Channelizer. Although Thrust-VI did utilize the 192 CUDA cores effectively in all of the MDPs, it spent a great deal of time multiplying elements by zero, for the reasons detailed in Section 3.4. SPVI directly addresses this inefficiency via the use of sparse linear algebra optimizations and as a result is able to reduce execution time considerably.

3.6 Conclusion

In this chapter, the utility of sparse data structures was demonstrated for solving a large class of MDPs that are relevant to embedded computing systems. Conceptual reasoning was presented on why sparse matrices arise in MDP formulations, and significant performance gains were demonstrated by exploiting this sparsity in solver implementations on a mobile Graphics Processing Unit (GPU). The ability to solve MDPs efficiently on resource- and power-limited mobile GPUs enables novel applications in which MDP solvers are embedded in the applications rather than being restricted to offline use.

Additionally, the proposed methods reduce the restrictions and a priori assumptions required by compact MDP solvers developed previously by researchers. This advancement can lead to a wider consideration of MDPs in embedded computing systems where they previously may not have been feasible or practical.

Useful directions for future work include further acceleration by decomposing MDP transition matrices into groups of smaller matrices that can be multiplied using faster GPU block shared memory (instead of slower global memory). Another direction is to investigate sparsity-driven acceleration of the other two classical solver algorithms: Policy Iteration and Modified Policy Iteration.

Chapter 4
Runtime Adaptation in Wireless Sensor Nodes Using Structured Learning

4.1 Introduction

This chapter presents a novel algorithm and a detailed application of techniques that enable the fast and efficient use of MDPs in real time on resource constrained Cyber Physical Systems (CPSs). In this context, a resource constrained CPS is defined as a distributed system containing components that have embedded computers whose physical resources are constrained to be significantly below what is available in a typical, consumer-grade desktop or laptop computing system.
These limits are typically imposed in order to reduce the size, weight and power, as well as cost (SWaP-C), of each unit. While this is a relative definition of a resource constrained CPS, in terms of present technology it can be taken to mean an embedded system that would typically have less than 1 MB of RAM, less than 10 MB of non-volatile storage, and a single-core microcontroller with a clock speed under 100 MHz.

In previous chapters, MDPs have been explored as a means to control computing systems at runtime in ways that are more dynamic, robust and adaptable than alternatives. Using MDPs, engineers can create systems that effectively learn and reason using models of their own system dynamics, observations of their own inherent limitations, and the effectiveness of their actions toward reaching application-level goals. In this context, these systems exhibit a level of self-awareness in their behavior, with the ultimate design goals being continual autonomous optimization that leads to higher levels of runtime resiliency, robustness and efficiency.

When seeking to develop self-aware systems, researchers have recently turned to Reinforcement Learning (RL) [58], a field of Machine Learning that uses MDPs in restricted ways. However, this chapter explores an important class of CPSs that is not well served by current RL techniques. Specifically, this is the class of systems in which some components of the system's dynamics are known at design time, and the rest are unknown at design time or expected to be time-varying at runtime.

In general, RL frameworks do not try to learn the effect that control outputs have on the system's state. Instead, they seek to use runtime observations to find a relationship between control outputs and rewards. A critical part of this relationship, however, is how the control output affects the state of the system being modeled. In RL frameworks, this causality is implicit in the modeling abstraction and is neither defined nor learned explicitly. This method works well in many cases, for example in large systems where the state dynamics are too large and complex to be considered or modeled explicitly. However, this chapter shows that departing from this conventional method can be useful for resource constrained CPSs that have more manageable state spaces. In particular, if engineers possess a priori knowledge about how some of the control outputs might affect the system state, it can be advantageous to codify that knowledge into the learning algorithms at design time. The approach proposed in this work provides models and algorithms that enable designers to exploit a priori knowledge in this way.

Motivated by this deficiency of RL, an alternative class of MDP-based system modeling techniques is defined, referred to as Compact MDP Models (CMMs), and CMM-based approaches are developed as an alternative to RL for the design and implementation of adaptive CPSs. In general, it is envisioned that certain CPSs are better suited for RL, while others are better suited for CMM. Through this work, the reasons why one of these two approaches might be better than the other for a given application are explored, with the goal of improving understanding so that designers can pick the better option.

Building on the work in the previous chapters, this chapter contributes the following additional content:

- A detailed survey of techniques from the literature that enable the compact use of MDPs on resource constrained systems.
  This survey leads to the definition of a class of CMM methods that encapsulates several different approaches useful in streamlining the application of MDPs to CPSs.

- The introduction of the Sparse Value Iteration (SVI) algorithm, a variation of the SPVI algorithm presented in Chapter 3. While SPVI is a parallel processing algorithm developed for GPUs, in this chapter a scaled down variation is presented that runs effectively on small, single-threaded Microcontrollers (MCUs).

- A detailed example of how to apply MDP-based techniques to design a wireless sensor CPS.

- The results of a performance simulation, illustrating the differences between a CMM-based design approach and Q-Learning, a popular alternative technique from the Reinforcement Learning (RL) literature.

- An empirical study of an MDP solver running on a resource constrained MCU, including measurements of data storage requirements, execution time and power consumption.

- A power consumption model for an LTE-M wireless modem, derived from experimental lab measurements taken on a live wireless Internet Protocol (IP) data network.

The remainder of the chapter is organized as follows. A cursory review of the history of techniques for controlling CPSs is presented in Section 4.2. Section 4.3 contains a survey of recent advancements in CMMs. Section 4.4 presents Structured Learning with Sparse Value Iteration, a novel method for CMM-based design. In Section 4.5, a case study of a wireless sensor CPS is detailed, to explore the challenges and trade-offs inherent in creating an efficient control policy. Section 4.6 contains an illustration of how CMMs can be used to solve the design problem introduced in Section 4.5, along with a comparison of that approach to a competing RL-based approach. Finally, Section 4.7 contains simulations of the runtime performance, and Section 4.8 contains the results of an embedded system implementation of the competing techniques. Section 4.9 concludes the chapter with a discussion of the results and directions for future work.

4.2 Background and Related Work

Chapter 1 and Section 3.2.1 contained a discussion of the challenges involved in applying MDPs to control computing systems at runtime. To overcome the limitations of MDPs with these goals in mind, two main approaches have been pursued: RL and CMMs. RL essentially tries to arrive at the policy without explicitly modeling all of the MDP components or invoking a solver. On the other hand, CMMs are approaches that do define all of the MDP components and invoke a solver, but do so via algorithmic optimizations that significantly reduce computational requirements. These two alternatives are sometimes referred to as model-free and model-based RL, respectively. Henceforth in this chapter, the abbreviation "RL" refers to model-free RL unless otherwise stated. These two categories of techniques (RL and CMM) are described in Section 4.2.1 and Section 4.3, respectively.

4.2.1 Reinforcement Learning

Reinforcement Learning (RL) [58] is an area of machine learning that enables systems to formulate optimal decision policies using observations of rewards that are received for previous decisions at runtime. These techniques use MDPs as the framework for formulating the decision problem, but seek to learn the optimal policy directly using observations of rewards in response to decisions, rather than through the explicit definition of all of the MDP components followed by invocation of an MDP solver.
Figure 4.1: Block diagram of the reinforcement learning paradigm (the agent selects action a_t based on state s_t; the environment returns the next state s_{t+1} and reward r_{t+1}).

More specifically, an RL framework typically follows the top level block diagram shown in Figure 4.1. The learning takes place in some agent, which is responsible for selecting an action out of a set of actions, given a system state. This selection is typically done in a discrete-time setting and iterated at a fixed rate. Each selected action (in a given state) leads to some consequence in the environment, which causes the environment to transition into a new state in the state space. The selected action and the transition to that new state are associated with a scalar reward, which is fed back to the agent (positively or negatively). The agent in turn considers the reward it has been given along with the new state that resulted, and again selects the next action, repeating indefinitely.

As mentioned above, these techniques are sometimes referred to as model-free learning. Model-free learning techniques possess the advantage of completely bypassing the need to maintain large STMs and run computationally intensive MDP solvers. However, this advantage comes at a cost, as the consequences of all actions taken in all states have to be learned and periodically updated, even those that are constant and known a priori at design time. This learning comes at the cost of occasionally having to make random decisions at runtime to explore the effect of alternative decisions [59]. This cost is a central drawback of model-free techniques, and is associated with a complex trade-off called the exploration versus exploitation trade-off [58].

One popular RL technique that has shown promising results is Q-Learning [44, 60, 61, 62]. In Q-Learning, a scalar value Q is assigned to each action in each state; the collection of these values is referred to as the Q function. This function represents the average future rewards that can be expected by taking a given action in a given state. The Q-Learning method continually learns and updates this Q function using a simple technique called the method of Temporal Differences (TD) [63], and uses it to formulate an optimal control policy that is based entirely on the system and environment involved. As the environment and the system's dynamics change at runtime, the policy changes with them.

In one example [44], an Adaptive Power Management (APM) hardware module using Q-Learning was used to put a microcontroller in and out of low power states, and resulted in a learning controller that managed power transitions better than an expert user. In another example [60], Q-Learning was used to optimize the throughput of an energy harvesting wireless sensor node while meeting challenging constraints.

4.2.1.1 Stability

Under the control scheme proposed in this chapter, an MDP-based control policy that varies at runtime is used to control a computing system. This scenario requires a more involved analysis of stability compared to the scenario discussed in Section 2.4.5, where the MDP-based control policy was fixed at runtime. Both the proposed method and the competing Q-Learning method are subject to the same question of whether the resulting time-varying MDP-based control has the potential to exhibit runtime instability. These questions fall under the category of stability analysis of time-varying MDP-based control, which is an active area of ongoing research in the RL community.
The earliest known work on this topic is [64], where the concept of Lyapunov functions is applied to guarantee stability in time-varying MDP-based control. Lyapunov functions are a fundamental tool used in control systems theory to analyze the stability of systems evolving through time. These concepts are outside the scope of this thesis, and present an interesting area for future work. The reader is directed to [65] and references therein for the latest research in this area.

In the next section, an overview of CMM methods is presented; these can be viewed as model-based methods designed with an emphasis on computational efficiency. Recent advancements in this area are surveyed that result in performance on par with Q-Learning.

4.3 Survey of Compact MDP Models

As mentioned previously, the deployment of MDPs in resource constrained systems has typically been limited to usage modes where the MDP modeling and solving are done offline, and only the resulting policy is stored in the runtime system. Moving the MDP model and solver into the runtime system by mitigating the longstanding barriers has been a common goal among many researchers over the years, resulting recently in very creative and effective techniques. In this chapter, this category of approaches is referred to as compact MDP models (CMMs), given that they provide a smaller or computationally optimized representation of the system in question compared to a direct implementation of the MDP's data structures. The effectiveness of these recent developments, especially when applied in combination with one another, leads to questioning the conventional approach of limiting consideration to model-free RL approaches in the implementation of resource constrained, MDP-based systems.

Some of these CMM techniques seek to reduce the storage size of the MDP's data structures by exploiting some structural component embedded within the MDP (e.g., see [15, 38, 47, 48, 49]). Other techniques involve modeling approaches that reduce the MDP state space via generalization and abstraction of system dynamics (e.g., the approach in Chapter 2). Another approach has been to keep algorithms and data structures as is, and take advantage of recent advancements in parallel processing using embedded GPUs, for example [42] and the algorithm presented in Chapter 3.

In this chapter's case study, three CMM techniques are utilized. Each of these has been introduced in a previous section: transition states (Section 2.4.3), factorization (Section 2.4.4), and exploitation of sparsity (Section 3.2.2). The significant advantages each of these provides become evident when compared to both a direct classical MDP implementation and a competing model-free method.

4.4 Method

4.4.1 Structured Learning

Creating an MDP model on a computing system consists of defining the states and actions, the STMs, and the reward function for the given decision problem and its environment. The STMs are N_A stochastic matrices, each of size N_S by N_S (one matrix for each action). Each STM defines the probability of transitioning from the current state to any one of the possible other states, given an action. This is generally written as a discrete conditional probability distribution as in Equation 4.1:

p(s^(n+1) | s^(n), a^(n)),  ∀ s ∈ S, a ∈ A,          (4.1)

which gives the probability of the system transitioning to state s^(n+1) at time index n + 1 given that it was in state s^(n) and action a^(n) was selected at time index n.
The process of instantiating the STMs is the allocation of storage for N_S^2 N_A numerical quantities and the assignment of values from 0 to 1 to them. Given this viewpoint, the methods in the literature can be grouped into two categories: the model-based approach, where all N_S^2 N_A terms are defined a priori and treated as constants, and the model-free approach, where none of the N_S^2 N_A terms are defined and in fact storage for them is never even allocated.

In this chapter, these two are viewed as extremes of a continuum that has many other options. A blend between the two is proposed, where some of the STM terms are assumed to be known a priori, and others are not. More specifically:

p(s^(n+1) | s^(n), a^(n)) ∈ {Φ ∪ Θ},  ∀ s ∈ S, a ∈ A,          (4.2)

where the implication is that all of the probability values in the STMs come from one of two parameter subsets, Φ and Θ. The set Φ is the set of STM entries that are fully known a priori and can be set to fixed values at design time. The set Θ contains the remaining matrix elements; they are either not known a priori or are expected to be time-varying at runtime. θ̂ ∈ Θ̂ is used to denote the latest value of a running estimate of the true value, for each parameter θ ∈ Θ.

Rather than taking each entire STM as either constant or completely unknown, a flexible middle ground is adopted in which it is assumed to be partially known and partially unknown. In this way, the system model has some parts that are fixed, and other parts that are assumed to change over time. The runtime adaptation process then consists of learning only the set of parameters Θ̂, rather than the entire STMs. In this way, the model contains a mix of predetermined structure (from Φ) and runtime learning (from Θ̂). This approach is referred to as Structured Learning in this chapter.

Structured Learning allows the designer to restrict how much effort is spent trying to learn unknown parameters, and results in higher overall awareness and adaptation performance for a certain class of CPS devices, as will be demonstrated in the case study. The advantage comes from being able to focus the system's learning efforts on the relevant parts of the problem, and to prevent redundant attempts to constantly question and update assumptions about the system that a system designer knows will never change.

4.4.2 Temporal Difference Equations

The Structured Learning method defined above relies on continual runtime learning of parameters. For this, a very simple technique is used, which is prevalent in RL: weighted averaging through Temporal Difference (TD) equations, with the central concept shown in Equation 4.3:

θ̂^(n+1) = θ̂^(n) · (1 - α) + θ^(n) · α,          (4.3)

where θ^(n) is an observed value of one of the parameters θ ∈ Θ at timestep n, θ̂^(n) is the value of the running estimate of θ at timestep n, and α is a learning rate parameter which controls how sensitive the running estimates are to individual observations. The method essentially consists of performing a low-pass filtering or smoothing operation on observed values, and thus maintaining a running estimate that tracks the latest observed values for a given parameter.

As the observations change over time, the running estimates track the changes while also reducing the effect of statistical outliers. Using TD, the Structured Learning method can ingest the latest observations of each of the parameters θ ∈ Θ, compute the running set of estimates θ̂ ∈ Θ̂, and combine them with the constant set Φ
to assemble the fully populated, partially time-varying STMs at any timestep.

4.4.3 Sparse Value Iteration (SVI)

In Structured Learning, the full set of STMs is instantiated. In order to obtain a control policy from this, an MDP solver must be invoked. Although Structured Learning is fully compatible with the SPVI algorithm presented in Section 3.3, SPVI is not used in this chapter. This is because SPVI is a GPU algorithm, and the case study in this chapter is on a resource constrained system where it is assumed that a GPU is not available. For this reason, a modified version of SPVI is introduced here.

SPVI achieves runtime acceleration not only from the parallel processing performed on a GPU, but also from the use of sparse matrix representations and sparse linear algebra operations. Through experimental benchmarking, it was found that the sparse linear algebra techniques alone add performance benefits on single-threaded CPUs without the use of parallel processing on GPUs. The experimental results in the following sections show this to be true even on a resource constrained, single-threaded MCU, not just on the much more powerful processing environments of larger CPUs. The acceleration is not as significant as when a GPU is available, but it is considerable enough to be a valuable technique for a wide range of CPU-only systems.

The result of these findings is a new algorithm presented here, called Sparse Value Iteration (SVI). This algorithm is similar to SPVI, with the difference that SVI is suitable to run on single-threaded CPUs that are not equipped with GPUs. The performance benefits of SVI over VI materialize as decreases in runtime, processing energy consumption, and memory footprint. The specific decreases in the experimental setup are detailed in Section 4.8. A pseudocode description of the SVI algorithm is shown in Algorithm 2.

ALGORITHM 2: Sparse Value Iteration (SVI).
Input: S, A, R(s_i, a), P(s_j | s_i, a), γ, ε
Output: π
 1  Compute K_NZ, the number of non-zero elements in M
 2  Allocate memory for M_s, a sparse matrix sized for K_NZ
 3  for each element z in M do
 4      if z ≠ 0, add it to M_s
 5  end
 6  V^0 ← 0
 7  n ← 0
 8  repeat
 9      n ← n + 1
10      T ← SPARSE_MULT(M_s, V^(n-1))
11      Q ← SAXPY(γ, T, R)
12      V^n, π^n ← MAX_REDUCE(Q)
13      Δ ← INF_NORM(V^n, V^(n-1))
14  until Δ < ε
15  π ← π^n

4.4.3.1 State Transition

The computation of the product M · V^(n-1) in Equation 3.2 is efficiently implemented in SVI using a sparse matrix-vector multiplication. The sparsity in the transition matrices is exploited by the conversion of the large and sparse M to a much smaller, densely packed M_s on Lines 1 through 5. Then, a multiplication of the sparse matrix by a vector is performed using a subroutine denoted SPARSE_MULT in Line 10. This subroutine is a standard sparse matrix-vector multiplication. The sparse matrix M_s is created in the Compressed Sparse Row format, as was the case in SPVI. This conversion only needs to be performed once at initialization.

Next, the discount factor and rewards need to be applied. After computing the product M_s · V^(n-1), the remaining steps can be implemented by scaling the product by the scalar γ and then adding it to the vector R. This is the same SAXPY operation that was described in SPVI, and is also very efficiently implemented in CPU-based linear algebra packages. This subroutine is denoted SAXPY in Line 11 of Algorithm 2.
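The following minimal sketch illustrates Algorithm 2 together with the Structured Learning parameter update of Section 4.4.2, written with numpy/scipy for readability; it is not the dissertation's MCU implementation, and the parameter names and default values (alpha, gamma, eps) are assumptions. Lines 12 through 14 of Algorithm 2 correspond to the max/argmax and infinity-norm steps described in the next two subsections.

```python
# Illustrative sketch only. Row si*N_A + a of M holds P(. | si, a), and R is ordered
# the same way; these layout choices are assumptions made for this sketch.
import numpy as np
from scipy import sparse

def td_update(theta_hat, observation, alpha=0.1):
    """Equation 4.3: low-pass filter an observed parameter value into its estimate."""
    return theta_hat * (1.0 - alpha) + observation * alpha

def svi(M, R, n_states, n_actions, gamma=0.95, eps=1e-4):
    """Algorithm 2 (SVI): sparse value iteration returning the policy and value function."""
    M_s = sparse.csr_matrix(M)                 # Lines 1-5: keep only non-zero entries (CSR)
    V = np.zeros(n_states)                     # Line 6
    while True:
        T = M_s.dot(V)                         # Line 10: SPARSE_MULT, length N_S * N_A
        Q = gamma * T + R                      # Line 11: SAXPY
        Q = Q.reshape(n_states, n_actions)
        V_new = Q.max(axis=1)                  # Line 12: MAX_REDUCE (value ...
        policy = Q.argmax(axis=1)              # ... and maximizing action per state)
        delta = np.max(np.abs(V_new - V))      # Line 13: INF_NORM
        V = V_new
        if delta < eps:                        # Line 14: stopping criterion
            return policy, V
```

In a Structured Learning controller, td_update would be applied to each observed parameter in Θ, the refreshed estimates written back into M, and svi re-invoked to regenerate the policy.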
4.4.3.2 Action Selection

The N_S elements of the Value Function V^n and the policy π^n are computed from the N_S N_A elements of the Q vector, which constitutes the selection of an optimal action for each state. This computation is invoked in Line 12 of Algorithm 2. The subroutine computes the maximum value (and the action associated with it) from a subset of Q, striding across the vector only over the elements associated with each state.

4.4.3.3 Stopping Criteria

In order to evaluate the stopping criterion for SVI, the infinity norm of the difference between successive approximations to the Value Function must be computed. This operation is represented by Lines 13 and 14 of Algorithm 2.

4.5 Application

In this section, a specific type of CPS is detailed, which forms the basis for the case study in the remainder of the chapter. The CPS is an embedded system with constraints on its size, weight and power (SWaP), containing the physical components shown in the following list.

- A sensor and/or actuator to interact with the physical environment.
- A wireless modem used to provide Internet access to the system.
- A low-power Microcontroller Unit (MCU) executing a program that controls the sensor and/or actuator as well as the wireless modem.
- An energy source that is used to power the system. The source can be a battery that needs to be replaced periodically, or an energy harvesting source (such as a solar panel paired with a rechargeable battery).

This type of CPS is expected to exist as one instance of a plurality of identical nodes in an installed base, and the nodes are connected to an application server via an Internet connection. The design aims to empower each node with the ability to optimize its performance for its own specific environmental conditions, rather than centralizing all of this optimization responsibility in the cloud. This approach is advantageous in terms of scalability, reliability and robustness.

The MCU runs an application-specific program that utilizes the sensor and/or actuator to interact with its physical environment. The application is typically multi-modal, in the sense that it does not always perform exactly the same application-level operations at all times. The application may, for example, have two modes such as 1) a normal sensor and actuator operating mode, and 2) a firmware update mode where security patches and product updates are routinely downloaded to the CPS node. It is expected that the application could have more than two modes.

The CPS node's power source is typically very limited in its capacity. In the case of a solar panel, this may be due to a desire to keep the solar panel cost low and the size small. In the case of a battery, this may be due to a desire to maximize battery life in order to reduce how often the battery needs to be recharged or replaced. Regardless of the power source, the CPS node's application is generally tasked with carrying out its functions in the most energy efficient manner possible due to limitations in its energy source.

In order to maximize battery life, a low-power MCU is used, rather than a general purpose computing system. As a result, the CPS node will be limited in its CPU frequency, RAM size, and non-volatile storage capacity. This fact becomes important when considering the use of computationally intensive control algorithms. A more involved program requires more computing resources, which in turn requires the use of a more capable computing system that consumes more power (even while idle).
Thus, a thorough consideration of control algorithms must analyze not just the control performance, but also the computational requirements needed to deploy them on a self-contained and resource-limited MCU.

The MCU is also tasked with controlling the wireless modem to enable communication, typically with another Internet-connected device such as another CPS node or a cloud-based application server. In this case study, the focus is on one specific form of wireless network that is very common at the time of this writing: LTE-M (LTE for Machines), also known as LTE Cat-M1 [66]. This wireless protocol is a subset of the full LTE protocol, the wireless network that makes up the majority of cellular Internet connections at the time of this writing. LTE-M is a reduced form of the full protocol, and is designed specifically for resource constrained devices. In the following section, the power profile of an LTE-M modem is described in detail in order to illustrate the power management considerations an MCU controller must balance.

4.5.1 LTE-M

The LTE-M modem is usually the largest consumer of power in the processing and communication subsystems. When actively communicating with a cell tower, an LTE-M modem's average power consumption can be as much as 650 milliwatts. This quantity is very high relative to the consumption of other components in the CPS node. As a result, leaving the LTE-M modem powered on and connected to a cell tower continuously is usually not a feasible option for CPS nodes, from the point of view of maximizing the lifetime of a limited energy supply. Consequently, the MCU must turn the LTE-M modem on and off strategically in order to stay within a limited energy budget.

In order to fully understand this power management challenge, a Sequans Monarch VZM20Q LTE-M modem was obtained, and a cellular data plan for the Verizon Wireless LTE-M network in the United States was purchased. This is a live Internet Protocol (IP) network with nearly nationwide coverage, and it effectively provides a mobile Internet connection that can be accessed from almost any location by an MCU.

In this work, the focus is on the upstream flow of information, that is, the collection of data through a sensor and the transmission of that sensor data up to a cloud-based server. The resulting information flow within the CPS node is outlined in Figure 4.2, where the arrows interconnecting the blocks represent the flow of information from one component to another.

Figure 4.2: Block diagram of wireless sensor CPS.

Due to the LTE-M power profile, a feasible use case for a system like this would be to keep the modem powered off by default, periodically turn it on to exchange data with the Internet, and then power it back down to conserve energy. To model this use case, the Sequans Monarch LTE-M modem was controlled from a test application on a laptop computer that powered the modem on, connected to the nearest Verizon Wireless cell tower, transmitted some information (representing a sensor reading) from the test application to a cloud-based server, then disconnected the cell tower connection, and finally powered down the modem. During multiple runs of this test, the power consumption of the LTE-M modem was measured. A typical current versus time trace of the energy required by the modem to perform this procedure is shown in Figure 4.3.

Figure 4.3: Current consumption of LTE-M modem during data transfer (current in milliamperes at 4.0 V versus time in seconds).
The trace is a time series of current draw in milliamperes at a fixed voltage of 4.0 Volts, and thus the instantaneous power and total energy for the transaction can both be computed from the data. Additionally, the quantity of data transmitted to the server was varied across test runs. Specifically, a fixed packet size of 256 bytes was used, and the number of these fixed-size packets transmitted per run was varied.

These measurements helped to quantify the approximate energy consumption of the LTE-M modem as a function of the amount of data transmitted. This allowed an approximate model of the modem energy consumption to be created, as a function of the number of packets transmitted. The derived model is given by Equation 4.4:

E_TX(N_T) = c1 + c2 (N_T - 1)  [Joules],
N_T ≥ 1,  c1 = 6.62,  c2 = 1.55,          (4.4)

where E_TX is the energy consumption in Joules and N_T is the number of packets. This equation illustrates a defining characteristic of this type of wireless connection: the energy overhead of powering the modem on and connecting to a cell tower can be significantly higher than the incremental cost of transmitting a packet once the connection is active. Thus, if optimizing strictly for energy efficiency, it is advantageous to queue up multiple packets before powering on the modem, and then transmit the queued packets together. This approach, however, is the opposite of what should be done if optimizing for transmission latency. This is a fundamental trade-off in controlling the LTE-M modem in this CPS node.

The power control challenge for a system like this consists of implementing an algorithm that strategically determines when to allow new sensor data to accumulate in the queue, versus when to invoke a communication event (which would produce a power consumption profile similar to the one shown in Figure 4.3) that transfers all packets in the queue up to the cloud-based server. This decision problem involves a trade-off of energy efficiency versus communication latency.

A trivially simple policy is one where, any time a new packet arrives in the queue, it is immediately transmitted up to the cloud. In fact, this could be considered a default case where no control policy analysis has occurred, and the MCU program was designed simply to transmit whenever a new packet exists. However, this is still a control policy and is considered as such. This policy will have low energy efficiency (which is undesirable) and low communication latency (which is desirable). In another example, a controller could wait until a fixed number of packets accumulate in the queue before doing a batch transfer of all accumulated packets up to the cloud. This policy would have higher energy efficiency but also higher communication latency, as some packets might sit in the queue for some time before being sent to the server. Ultimately, the CPS node's application will dictate what type of latency is desirable for the system, and the system should be made as efficient as possible while meeting the specified latency goals.

The policies described above are both examples of very simple "fixed threshold" policies. These are easy to implement and analyze for the dynamics described thus far. An engineer could likely devise other clever policies similar to these, based on heuristics and system simulations. However, in a real-world CPS deployment, the system control dynamics are likely to be more complicated and time-varying than has been described so far.
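Before turning to those additional complications, the following minimal sketch illustrates the trade-off quantified by Equation 4.4: batching packets amortizes the connection overhead c1 at the cost of added queuing latency. The constants come from Equation 4.4; the batch sizes are arbitrary illustrative choices.

```python
# Minimal sketch (constants from Equation 4.4): total energy and per-packet cost as a
# function of how many 256-byte packets are batched per modem power-up.
C1, C2 = 6.62, 1.55          # Joules: connection overhead and per-extra-packet cost

def tx_energy(n_packets: int) -> float:
    """E_TX(N_T) = c1 + c2*(N_T - 1), valid for N_T >= 1."""
    if n_packets < 1:
        raise ValueError("at least one packet per transmission")
    return C1 + C2 * (n_packets - 1)

for n in (1, 2, 4, 8, 16):
    print(f"N_T={n:2d}: {tx_energy(n):5.2f} J total, {tx_energy(n)/n:4.2f} J/packet")
# Roughly 6.6 J/packet when transmitting packets one at a time, dropping toward
# ~1.9 J/packet at N_T = 16, at the cost of packets waiting longer in the queue.
```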
Additional complexity arises from the following phenomena:

- The application is likely to be multi-modal, and thus the rate of packet generation within the CPS node is in general time-varying.
- The number of different modes that the application may contain could be much greater than two.
- The modem connection time and power consumption are also time-varying, due to the physical mobility of the CPS node relative to the fixed location of the cell tower, as well as the presence of network congestion from other LTE-M users within the same cell during periods of heavy usage.

When these real-world factors are modeled, it becomes advantageous to consider more involved control policies that can monitor and adapt to time-varying conditions which may not be precisely modeled or understood prior to a system deployment. For this reason, two approaches are considered that contain runtime learning capabilities specifically to self-optimize continuously at runtime, adapting to the time-varying nature of the CPS node's environment. This feature avoids having to understand and anticipate all of the runtime conditions ahead of time, which for practical purposes is an infeasible requirement for real-world deployments that have very high numbers of nodes.

In Sections 4.6 through 4.8, the options for controlling the CPS node in the case study are described, and then the performance of these policies is compared with regard to the efficiency versus latency trade-off in simulation. Afterwards, the competing options are implemented on a resource constrained MCU, and the resulting computational requirements and deployment feasibility are discussed.

4.6 MDP-Based Control

In this section, two approaches to creating control policies for the CPS node introduced in Section 4.5 are detailed, and then their performance is compared using simulation. Both approaches use discrete-time MDPs to generate a control policy, which determines when to power the LTE-M modem on and off. The controllers are both designed with the assumption that several of the system's characteristics and dynamics are time-varying and uncontrollable. Thus, both controllers employ some form of learning, in that the policy that each controller employs is continually updated to reflect the time-varying changes in both the system being controlled and the environment that the system interacts with at any given time. The two controllers can be viewed as two different realizations of the agent component in the framework of Figure 4.1. These two methods are referred to as: 1) the Structured Learning controller and 2) the Q-Learning controller. The Structured Learning controller is a novel approach described for the first time in this chapter, and the Q-Learning controller is a well known technique in the RL literature (e.g., [58]).

A summary of the differences between the controllers is shown in Table 4.1. The MDP components that are common between the two controllers are the discrete state space S, the discrete action space A, and the reward function R(s, a). The other components of the controllers are different, namely the STM and the policy generation method. The Structured Learning controller contains a parameterized state transition matrix and employs an MDP solver at runtime to generate a
The Structured Learning controller contains a parameterized state transition matrix and employs an MDP solver at runtime to generate a con- 118 Structured Learning Q-Learning State Space Common: S Action Space Common: A Rewards Common: R(s, a) Explicitly defined, STM Not needed parameterized Temporal difference Policy Solver invoked at equation used to Generation runtime estimate Q function Table 4.1: Comparison of two MDP-based controllers. trol policy. In contrast, the Q-Learning controller does not contain an explicit state transition matrix, and instead contains a running estimate of the MDP?s Q function. This allows the Q-Learning controller to completely bypass having to store an STM and run an MDP solver. One fundamental difference between the two techniques is that the Structured Learning controller has an STM with a pre-populated structure, and only param- eters within the structure need to be learned at runtime. There are some aspects of the STM that are fixed at design time, and in this way, some structure does not need to be learned. In contrast, the Q-Learning controller has no a priori structural assumptions, and must learn the entire Q function at runtime. Depending on how much of the STM is known at design time in a given application, the Q-Learning controller may have to learn more parameters at runtime compared to the Structured 119 Learning controller. Structured Learning Q-Learning 0 ? |?| ? (NS ? 1)NSNA NSNA Table 4.2: Number of parameters to learn at runtime for each controller. This contrast is shown in Table 4.2, where ? is the number of unknown or time- varying parameters in the STM , as was defined in Equation 4.2. This motivates a general design guideline to be considered: to determine what type of learning approach to use, an engineer should first analyze the percentage of the STM that is known versus the percentage that is unknown or expected to be time-varying. In other words, the designer should seek to minimize the number of parameters that need to be learned at runtime, which is |?| in Structured Learning, and NSNA in Q-Learning (recall that |?| denotes the cardinality of a set ?). If many parameters out of the STM need to be learned at runtime, it is possible that |?| ? NSNA, in which case the Q-Learning approach would have less parameters to learn at runtime and likely provide better adaptation performance. The two different controller designs investigated in this section were formulated to concretely explore trade-offs between model-based (Structured Learning) and model-free (Q-Learning) in a tangible, real-world example. In the remainder of this section, the common components of the two techniques are presented first, followed by their differences. 120 4.6.1 State and Action Spaces Both controllers utilize a multivariate state space, a concept introduced in Section 2.4.4. The MDP state space is essentially the combination of each of the states of the sensor application, packet queue and LTE-M modem. This is defined as shown in Equation 4.5: s = (sa, sq, sm) ? S sa ? {0, 1, . . . , NSA ? 1}, (4.5) sq ? {0, 1, . . . , NSQ ? 1}, sm ? {0, 1, . . . , NSM ? 1}, where s ? S is the state variable, which is composed of three separate variables sa, sq, sm: the sensor application state, queue state and modem state, respectively. In general, the interest here is in applications that can run in one of a set of alternative modes (see Section 4.5). Thus sa is defined as the application mode, a state variable taking a value out of a discrete set of NSA modes. 
The packet queue contains a specific number of packets at any given time. In order to allow the controllers to make decisions based on the current state of the queue, s_q is defined as the number of packets in the queue and made part of the MDP state space. The queue is assumed to have a fixed capacity of N_SQ - 1 packets, due to the goal of housing it in a resource constrained MCU. The queue can only hold up to N_SQ - 1 packets, and any attempt to queue additional packets beyond this limit will result in the newest packet being discarded. The discarding of a packet represents a loss of data and is a very undesirable event, one that the controllers must seek to avoid through the decisions in their respective control policies. It should be noted that in some applications, some amount of data loss may be tolerable. Such applications can be accommodated readily in both controller designs by making suitable adaptations. The details of those adaptations are omitted in this chapter for brevity.

The LTE-M modem is a complex mixed-signal System-on-Chip (SoC), containing the LTE-M protocol implementation and runtime signal processing. An enormous state space could be defined for the inner workings of the modem. However, for these controllers the majority of that information is not relevant, and thus the modem state is collapsed into a small set of states that is detailed enough for the controller to implement a high performing control policy, without being overly burdened with the computational and storage implications of a large state space. In this spirit, the concept of transition states introduced in Section 2.4.3 is utilized.

In this design context, the system being controlled contains many discrete states. Depending on the level of modeling and decision making that is desired, as many as tens of thousands of states or more could be considered relevant. In general, the modeling process involves making design-time decisions as to what level of detail is modeled in the MDP, for each of the system characteristics and dynamics. More fine-grained detail allows for a more precise model, but this leads to a large state space and the computational challenges associated with it.

The concept of transition states allows for a significant reduction of the state space when the system passes through a large set of states for a limited time, and the only relevant detail is when the system enters and exits the set as a whole. In some way, these trajectories need to be modeled in the MDP. By utilizing the proposed concept of transition states, a large group of fine-grained states can be abstracted into a single coarse-grained state. A coarse-grained state derived in this way is what is referred to here as a transition state. While the system is actually in one of the fine-grained states abstracted by a transition state τ, the system is modeled using τ as being in transition between the predecessor and successor states of τ. The transition state encapsulates a set of discrete states that are present but not relevant to the decision process being designed. The only transition probability needed is derived from an estimate of the expected time until the system leaves the set of states abstracted by τ. Inaccuracies resulting from the associated estimation process represent a potential design trade-off: the creation of transition states out of larger sets of fine-grained states may result in lower accuracy in the overall MDP model.
Designers have significant flexibility to control the number and granularity of transition states to help optimize this trade-off.

In this spirit, the large number of fine-grained LTE-M modem states is collapsed into three coarse-grained states: M_OFF, M_CONNECTING, and M_CONNECTED. The M_CONNECTING state is a transition state, and the other two are not. M_OFF refers to the modem being fully powered off; the modem can remain in this state indefinitely until commanded otherwise. M_CONNECTING is the state that begins immediately after the modem has been powered on and ends when a successful Internet connection has been established. The modem cannot remain indefinitely in this state; by definition, it is guaranteed to transition out of it after a specified number of time steps. M_CONNECTED is the state in which a working Internet connection has been established and continues to be maintained. Transmission of packets is only possible in the M_CONNECTED state.

This modeling approach requires the selection of a scalar constant for the modem's power consumption in each of the three states. As can be observed from Figure 4.3, the power is roughly constant in the M_OFF and M_CONNECTED states. However, this is not the case in the M_CONNECTING state. This is addressed by representing the power consumption during the entire transition as a fixed value: the average power consumption over the duration of the transition. This simplification is a way of providing the MDP the information that it needs to implement a high performing policy, in as compact a representation as possible. This modeling approach is a design choice, and it is noted that an interesting area for future study is the trade-off among varying levels of modeling expressiveness in this area.

In this system, the controllers are tasked with turning the LTE-M modem on and off. This binary control implies an "on" action that powers on the LTE-M modem and commands it via the modem's interface to attach to the cell tower and establish an Internet connection. Conversely, the "off" action implies tearing down any existing Internet connections and shutting off the LTE-M modem gracefully via the modem's shutdown procedures. The resulting action space is shown in Equation 4.6. The two actions off and on in Equation 4.6 are abstractions of multi-step LTE-M modem command sequences.

a ∈ A = {off, on}          (4.6)

4.6.2 Rewards

In order to "motivate" the controllers to find an effective balance in the latency versus energy efficiency trade-off described in Section 4.5, the reward function is defined as shown in Equation 4.7. The reward function maps each state-action pair (s, a) to a scalar reward.

R(s, a) = r1 I(s, a) + r2 N_T(s, a) + r3 N_D(s, a),
r1 = -10,
r2 ∈ {3, 4, ..., 10, 100, 1000},          (4.7)
r3 = -100,

where I(s, a) is the average electrical current consumed by the modem, N_T(s, a) is the number of packets known to be transmitted, and N_D(s, a) is the number of packets dropped due to an overflowing queue in the previous timestep. For each of these quantities, the function arguments (s, a) denote the value of the term that is known, expected, or averaged when action a is taken in state s. Electrical current is used in place of power consumption because it is equally suitable (given a constant voltage) and more straightforward to measure with an MCU in an embedded system.
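A minimal sketch of Equation 4.7 is shown below. The constants r1 and r3 and the form of the equation are those given above; the per-decision current, transmission and drop statistics are illustrative placeholders rather than measured values. The discussion of how this formulation steers the controllers continues below.

```python
# Minimal sketch of the reward of Equation 4.7; the numeric inputs are placeholders.
R1, R3 = -10.0, -100.0        # penalties for current draw and dropped packets
R2 = 5.0                      # transmit reward; swept over {3,...,10,100,1000} in the text

def reward(avg_current_mA: float, n_transmitted: int, n_dropped: int) -> float:
    """R(s, a) = r1*I(s, a) + r2*N_T(s, a) + r3*N_D(s, a)."""
    return R1 * avg_current_mA + R2 * n_transmitted + R3 * n_dropped

# Example evaluations; larger r2 values make successful transmissions outweigh the
# current penalty of keeping the modem on.
print(reward(avg_current_mA=150.0, n_transmitted=4, n_dropped=0))   # -1480.0
print(reward(avg_current_mA=0.5, n_transmitted=0, n_dropped=0))     # -5.0
```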
With this formulation, the scalar reward is a linear combination of observable time-varying signals and quantities. This formulation steers both controllers to the desired goals, by rewarding them (with a positive reward value) when a packet is transmitted successfully and penalizing them (with a negative reward value) when electrical current is consumed or the packet queue overflows.

The reward constants r1 and r3 were selected via experimentation, and r2 was left as a free parameter in order to generate a set of instances for each controller. Each instance in the set places a different amount of importance on the latency requirement relative to the energy efficiency requirement. This approach allows for simulation of a suite of controllers for each method, and plotting of the resulting performance for a more robust comparison. The resulting policies are ones in which the respective controllers turn the modem on and off at each discrete time step, in the way they determine is optimal for obtaining the maximal rewards. In other words, they attempt to transmit the packets generated by the sensor application without incurring undesired delay or consuming more electrical power than is needed, through dynamic and changing conditions.

In Section 4.6.3 through Section 4.6.4, the differences between the two controllers being evaluated in this case study are detailed.

4.6.3 Structured Learning Controller

The Structured Learning controller consists of the common components described above, plus the addition of the STMs and an MDP solver. The STMs are described in this section, and the solver is described in Section 4.8.

The stored STMs at any given time are a combination of constants and time-varying parameter estimates. The constants are programmed in at design time, and the estimates are maintained by observing samples of the relevant quantities and using the Temporal Difference method (Equation 4.3) to update the estimates. The estimates are plugged into the STM data structures, which serve to maintain fully populated STMs at each time step.

The STMs are constructed using a factored formulation, which greatly reduces the storage requirements of the MDP. The factorization procedure is shown in Equation 4.8 through Equation 4.12. The factorization serves to convert one large multivariate conditional probability distribution into a product of functions, each having the form of a lower dimensional conditional probability distribution. The terms correspond to the subsystems of the sensor application, packet queue and LTE-M modem, respectively. This re-arrangement enables a significant reduction of the MDP storage requirements.

\[
p(s^{(n+1)} \mid s^{(n)}, a^{(n)}) = p(s_a^{(n+1)}, s_q^{(n+1)}, s_m^{(n+1)} \mid s^{(n)}, a^{(n)}) \qquad (4.8)
\]
\[
= p(s_a^{(n+1)} \mid s_q^{(n+1)}, s_m^{(n+1)}, s^{(n)}, a^{(n)}) \cdot p(s_q^{(n+1)}, s_m^{(n+1)} \mid s^{(n)}, a^{(n)}) \qquad (4.9)
\]
\[
= p(s_a^{(n+1)} \mid s_a^{(n)}) \cdot p(s_q^{(n+1)}, s_m^{(n+1)} \mid s^{(n)}, a^{(n)}) \qquad (4.10)
\]
\[
= p(s_a^{(n+1)} \mid s_a^{(n)}) \cdot p(s_q^{(n+1)} \mid s_m^{(n+1)}, s^{(n)}, a^{(n)}) \cdot p(s_m^{(n+1)} \mid s^{(n)}, a^{(n)}) \qquad (4.11)
\]
\[
= p(s_a^{(n+1)} \mid s_a^{(n)}) \cdot p(s_q^{(n+1)} \mid s^{(n)}, a^{(n)}) \cdot p(s_m^{(n+1)} \mid s_m^{(n)}, a^{(n)}) \qquad (4.12)
\]

4.6.3.1 Sensing Application

Given the observability of the sensing application's mode, the controller can maintain parameters that statistically characterize how often the application is in a given mode, and how likely the CPS is to transition from any mode to any other given mode. These characterization parameters are listed in Equation 4.13.
Using the latest values of these parameters, the s_a term of the factored STM can be fully instantiated.

\[
\hat{\phi}_{i,j} = p(s_a^{(n+1)} = j \mid s_a^{(n)} = i), \qquad \forall (i, j) \in \{0, 1, \ldots, N_{SA} - 1\}^2 \qquad (4.13)
\]

4.6.3.2 Packet Queue

Since it is part of the state space, the dynamics of the queue must also be modeled as a transition matrix. This requires representing a fully deterministic process with stochastic structures in order to fit into the MDP framework, and thus most of the probabilities are either 1 or 0. The transition probabilities are almost all known at design time, since the dynamics of the queue do not change. The only uncertainty comes from the rate of packets entering the queue from the sensing application, and the rate leaving the queue through the LTE-M modem connection.

The packet queue's state is defined as the number of packets in it at a specific timestep. The transition probabilities amount to the likelihood of transition to another state, in other words the change in the number of packets. The transition to a state where more packets are inserted corresponds directly to the sensing application's packet generation rate. These events are combined with the probability of packets being removed from the queue by being transmitted to the cloud server. If the modem state and action are such that the modem is not yet connected, then no packets can leave the queue, and thus transitions to states where the number of packets is reduced are not possible. If the modem is connected, then packets can leave the queue.

4.6.3.3 LTE-M Modem

In the model, the dynamics of the LTE-M modem are a direct application of the transition state concept. In order to instantiate this component of the STMs, the controller needs to maintain a running estimate of how long the LTE-M modem takes to connect to the network. This time-varying quantity is referred to as TC and its most recent estimate as T̂C. With this estimate, the resulting transition probabilities are shown in Figure 4.4. Following the transition state theory introduced in Section 2.4.3, the value of the parameter β̂ is defined in Equation 4.14, where TF is the duration of one control frame. As can be observed from Figure 4.4 and Equation 4.14, this component of the STMs is parameterized by the running estimate T̂C (through β̂) in two of the matrix elements, and combined with known constants for the remaining matrix elements.

Figure 4.4: State transition matrices for LTE-M Modem.

\[
\hat{\beta} = \left[ \mathrm{floor}\!\left( \frac{\hat{T}_C}{T_F} \right) \right]^{-1} \qquad (4.14)
\]

4.6.4 Q-Learning Controller

The Q-Learning controller was implemented directly from the description of the technique in [58]. In this method, a function Q is created as a mapping Q(s, a) : (S × A) → R, where R denotes the set of real numbers. Each mapping in the function represents an estimate of the total amount of reward an agent can expect to accumulate over the future, starting from a given state s and taking a given action a. The Q function is updated on each iteration of the controller using the temporal difference equation. Using the latest estimated version of Q(s, a), the action is selected by comparing all actions for the given state s and selecting the action with the largest value. The remaining details of the Q-Learning method can be found in [58].

Figure 4.5: Simulation results: energy efficiency versus communication latency.
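To make the update and action-selection steps described above concrete, the following minimal C sketch shows one common tabular form of the Q-Learning iteration: a temporal-difference update of the Q table followed by a greedy action choice. The learning rate and discount factor values, and the array layout, are illustrative assumptions rather than details taken from [58] or from the case study implementation.

    /* Illustrative tabular Q-Learning update and greedy action selection. */
    #define NUM_STATES  66     /* matches the case study's state count     */
    #define NUM_ACTIONS 2      /* off, on                                  */
    #define ALPHA 0.1f         /* learning rate (assumed value)            */
    #define GAMMA 0.9f         /* discount factor (assumed value)          */

    static float Q[NUM_STATES][NUM_ACTIONS];

    /* One temporal-difference update after observing (s, a, r, s_next). */
    static void q_update(int s, int a, float r, int s_next)
    {
        float best_next = Q[s_next][0];
        for (int b = 1; b < NUM_ACTIONS; b++)
            if (Q[s_next][b] > best_next)
                best_next = Q[s_next][b];
        Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a]);
    }

    /* Greedy selection of the action with the largest Q value. */
    static int q_select_action(int s)
    {
        int best = 0;
        for (int a = 1; a < NUM_ACTIONS; a++)
            if (Q[s][a] > Q[s][best])
                best = a;
        return best;
    }

In practice the controller must also take occasional exploratory actions (for example, via epsilon-greedy selection) so that under-visited state-action pairs continue to be sampled.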
4.7 Simulation

In order to objectively compare the runtime performance of the Structured Learning and Q-Learning controllers presented in Section 4.6, a MATLAB simulation was created containing models of all the subsystems described in the case study. In the simulation testbed, a sensor application generates packets at rates consistent with a given mode, and also simulates the transition between modes at specified transition rates. A packet queue object models a generic fixed length queue, which acts as a First In First Out (FIFO) data structure and overflows if the maximum number of elements is exceeded. A dynamic model of the LTE-M modem was created using the collected time-series data and electrical power measurements (see Section 4.5.1).

The simulation was run with three separate controllers: the two MDP-based controllers described in the previous section, as well as a third, manually-generated policy. The manually-generated policy simply checks the queue and turns on the modem any time a specified number Nq of packets are in the queue, where Nq is a parameter of the policy. Once the modem is turned on, it remains on until the queue is empty.

The simulation was run multiple times for each controller, and each simulation run is represented by a single point in the graph of Figure 4.5. The fixed threshold (manually generated), Structured Learning, and Q-Learning approaches are denoted by the "on-off", "mdp", and "ql" traces, respectively. The best performance corresponds to points that have the lowest average communication latency (the vertical axis) and simultaneously the lowest energy per packet (the horizontal axis).

For the fixed threshold technique, multiple policies were generated by varying the Nq threshold at which the modem was powered on. For the MDP-based controllers, the r2 constant in the reward function from Equation 4.7 was varied. This approach produced a set of control policies for each controller, allowing a full exploration of the performance limits of each technique.

The first conclusion that can be drawn from the data in Figure 4.5 is that both MDP-based controllers outperform the fixed threshold approach, for all possible values of the Nq threshold. This is likely a result of the MDP-based policies being richer and more expressive; while the fixed threshold policies are a function of the packet queue state only (ignoring the application and modem characteristics), the MDP-based policies materialized as a non-trivial function of the entire state space. In this case study, since the MDP is able to reason using algorithmic methods on data structures and computations on conditional probabilities, it can consider more effects and consequences systematically and produce highly optimized policies that are more expressive relative to the simple, manually-derived, fixed threshold heuristic.

The second conclusion that can be drawn from the data is that the Structured Learning controller outperforms the Q-Learning controller. As an example, if the rewards are tuned such that both learning controllers achieve an average packet transmission latency of 20 seconds, the Structured Learning controller is able to accomplish this with an average energy efficiency of 135 mJ per packet, compared to 163 mJ per packet for the Q-Learning controller. This amounts to a 17% savings in the transmission energy for sending the exact same packets at the same average latency.
This can be an important difference, since transmission energy is often the largest source of energy consumption on an LTE-M connected sensor.

This improvement is a direct consequence of the Structured Learning algorithm focusing the learning on only the time-varying aspects of the system (e.g., modem power, cell tower connection time) and accepting as unquestionable truth the other dynamics and attributes (e.g., that the packet queue contains one less packet after a packet is removed from it). In contrast, the Q-Learning controller is forced to learn (and continue to update indefinitely) all aspects of the system transition probabilities in response to selected actions. It must continually experiment with exploratory actions and accumulate data to learn all of the system dynamics (including well understood behavior, such as the packet queue's dynamics). These results are consistent with the claim that expending learning effort on such immutable aspects is both unnecessary and detrimental to the overall system performance.

4.8 Implementation

This section details the results of implementation experiments performed to assess the viability of the competing MDP-based control strategies in the context of a state-of-the-art processing platform for resource constrained CPSs. The alternatives are implemented on a typical MCU that would be used to realize the CPS case study, and are compared in terms of their execution time, memory usage and processing power consumption.

4.8.1 Experimental Setup

The competing controllers were implemented on the Silicon Labs EFM32GG, a small and low power ARM Cortex-M3 based MCU. The processor was running on the EFM32 STK3700 development kit, which houses the CPU as well as sophisticated energy monitoring circuitry. The EFM32GG contains 128 kB of RAM and 1 MB of flash, which makes it a typical example of a resource constrained platform for the CPS in the case study at the time of this writing.

In order to compare the controllers objectively, the following experimental setup was created on the EFM32GG development board. All controllers were implemented in C and stored in the MCU's program memory one at a time. Memory usage was computed by statically allocating all data structures and examining the map file that the MCU's compiler generates. A common test harness was written for the EFM32GG, driven by a periodic timer interrupt. The interrupt period was configured to be 100 ms, which was used as the fixed-period discrete-time iteration rate for the controllers.

The C program initially puts the CPU into its low power sleep mode. It remains in that mode until the periodic interrupt fires. Once the interrupt fires, the MCU is woken from sleep and executes the computations needed for one iteration of the controller under test. Once the iteration has completed, the MCU returns to sleep mode, where it waits for the next firing of the periodic interrupt. Since the sleep current is extremely low compared to the run current (microamperes compared to milliamperes), this approach enabled precise measurement of both the execution time and computational energy required to execute each controller on the MCU, by observing the current versus time profile of each controller. Real-time MCU current consumption was measured using the EFM32GG board's energy monitoring tools, which allow very accurate current versus time data to be observed in the form of a high resolution time-series waveform capture.
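The structure of that test harness can be sketched as follows in C. The timer configuration and sleep-entry calls are shown only schematically (the function names used here are placeholders, not the EFM32 SDK's actual API); the point is the wake-run-sleep pattern that makes the controller's execution time and energy visible in the current waveform.

    /* Schematic of the measurement harness: wake on a 100 ms periodic
     * timer interrupt, run one controller iteration, return to sleep.
     * All hardware-facing functions are placeholder stubs.             */
    #include <stdbool.h>

    static volatile bool tick_pending = false;

    /* Placeholder stubs standing in for board support code. */
    static void timer_init_100ms(void)     { /* configure periodic timer */ }
    static void enter_sleep(void)          { /* enter MCU sleep mode     */ }
    static void controller_iteration(void) { /* one control period       */ }

    void TIMER_IRQHandler(void)            /* fires every 100 ms         */
    {
        tick_pending = true;
    }

    int main(void)
    {
        timer_init_100ms();
        for (;;) {
            enter_sleep();                 /* CPU draws only sleep current */
            if (tick_pending) {
                tick_pending = false;
                controller_iteration();    /* run current visible here     */
            }
        }
    }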
Using this simple fixed rate scheduling scheme, combined with the Cortex-M3 sleep modes and the development board's current monitoring tools, it was possible to observe the execution time and processing current consumed by the CPU for each control policy. This testbench provided a highly repeatable experimental setup in which all settings were kept the same from case to case, with the only difference being the control policy under test.

Figure 4.6: Transition matrix storage sizes (STM storage size in kB versus packet queue size, for direct, factored, and sparse representations).

4.8.2 Matrix Format

In the case study's MDP, the number of states NS is 66 and the number of actions NA is 2. The number of non-zero elements KNZ in the STMs is 444. This represents a sparsity of 94.9%. Aside from the direct implementation of the full STMs, two CMM techniques were evaluated: a factored implementation and a sparse implementation. Figure 4.6 shows the resulting STM storage sizes for these techniques on the case study, over a range of packet queue sizes.

As can be observed from the data, both the factored and sparse implementations reduce the storage size considerably. However, the sparse method is the most effective in this regard, and for this reason it was selected as the approach for the implementation. The implementation used a single byte to store each element of the policy lookup table, and four bytes to store the floating point elements of the STMs and reward functions.

4.8.3 Measurements

4.8.3.1 Memory Usage

First, the techniques are compared in terms of how much data storage each required on the MCU. The results are shown in Table 4.3, where the column labeled Sparse Structured Learning represents the Structured Learning method implemented with sparse matrices, as described in Section 4.4.3.

                  Structured       Sparse Structured
                  Learning (VI)    Learning (SVI)      Q-Learning
    STM           34.0 kB          2.60 kB             -
    Rewards       0.51 kB          0.51 kB             0.51 kB
    Q function    -                -                   528 B
    Total         34.5 kB          3.11 kB             1.03 kB

Table 4.3: Data storage size in kilobytes.

The results in Table 4.3 show that the Q-Learning approach is the most favorable in this metric, and furthermore that it requires significantly less data storage than the Structured Learning (VI) approach. However, when the SVI method is applied (Structured Learning (SVI)), there is a significant reduction in the required data storage. The data shows that in this case study, the CMM techniques reduce the storage requirements of Structured Learning to be much closer to that of Q-Learning, while not beating it in this regard.

Generalizing these results beyond the case study, it can be seen from Table 4.3 that all of the methods being compared require the same amount of storage for the reward function. Thus, the difference in memory usage is attributed to the storage of the STMs in the VI and SVI controllers, compared to only the Q function in the Q-Learning controller. This difference can be calculated in the general case as follows. Assuming STM entries are 4-byte single precision floating point values, the Q function can be stored in 4·NS·NA bytes, as it consists of a table of NS·NA floating point values. The storage size of the STMs in the Structured Learning (VI) method is 4·NS²·NA bytes, as it consists of NA stochastic matrices, each of size (NS × NS). The STMs in the Sparse Structured Learning (SVI) method were implemented with sparse matrices stored in coordinate format [43].
This format stores only the non-zero elements of a matrix, along with two indices representing the column and row index of each element, respectively. Assuming that ⌈log2(N)/8⌉ bytes are required to store an integer index that can take on one of N values, the storage size required for the STMs in coordinate format is shown in Equation 4.15, where KNZ is the number of non-zero elements in the STMs.

\[
K_{NZ} \left( \left\lceil \frac{\log_2(N_S N_A)}{8} \right\rceil + \left\lceil \frac{\log_2(N_S)}{8} \right\rceil + 4 \right) \qquad (4.15)
\]

Evaluating Equation 4.15 using the constants from the case study (NS = 66, NA = 2, KNZ = 444) results in the storage sizes shown in Table 4.3. The formula can be used to predict the required storage sizes for other case studies by appropriately changing the values of (NS, NA, KNZ).

4.8.3.2 Computation

                              Structured          Sparse Structured
                              Learning (VI)       Learning (SVI)       Q-Learning
    Per Control Iteration:    3.5 µs / 117 nJ     3.5 µs / 117 nJ      211 µs / 7.06 µJ
    Per Solver Iteration:     50.1 s / 1.67 J     5.61 s / 187 mJ      -
    Average Power:            483 µW              69.8 µW              78.8 µW

Table 4.4: Execution time, computation energy and average power.

Next, the execution time and power consumption of the MCU were measured when executing each of the competing techniques. The results are shown in Table 4.4. These results show that Q-Learning requires the same computation on every control period. The control operation consists of updating the Q function based on the observed state transition and reward, and computing the best action for a given state using the latest values in the Q function. The Q-Learning method performs these operations on every control period.

In contrast, the Structured Learning techniques (VI and SVI) run a solver, which was invoked once per hour. These techniques compute a new control policy every hour, and the time and energy required to do this is shown in the second row of Table 4.4. After computing a new policy, the Structured Learning techniques simply look up which action to use from a stored table for the remainder of that hour. The data in Table 4.4 shows that the Structured Learning techniques involve much less computation time and energy consumption during a typical control period relative to Q-Learning. Note that the first row of data for Structured Learning in Table 4.4 excludes the computation associated with solver execution, while the third row of data (labeled Average Power) includes the effects of solver computation.

It is important to note that the Structured Learning implementation would likely require a priority-based preemptive scheduling scheme, such that the control iteration takes higher execution priority over the solver and any real-time deadlines associated with the controller are not missed due to running the solver.

The data in Table 4.4 also shows that although Q-Learning consumes less average power than Structured Learning (VI), when the SVI method is applied to Structured Learning the average power is reduced below that of Q-Learning. SVI reduces computation time by replacing standard matrix operations with sparse matrix operations. This results in a significant reduction in computation time, given that the STMs are extremely sparse.

Generalizing these results beyond the case study, it is possible to identify the factors that affect which of the methods is favorable in terms of computation costs, e.g., processing time and energy consumption.
In the Structured Learning techniques, the size of the MDP and the complexity of the STMs determine how long it takes to execute a solver to produce a control policy. If the MDP is very large and complex, the solver will take longer to execute. In contrast, the Q-Learning technique is not affected at all by this attribute. In this respect, it can be concluded that Q-Learning is better suited to dealing with large MDPs than Structured Learning in terms of computation expense.

Another relevant factor is how often the Structured Learning techniques are required to compute an updated policy. In general, a suitable update rate is determined by the application's adaptation requirements and by how quickly the time-varying environment is changing. In the case study presented here, the policy is updated once per hour, but a system that adapts to more slowly changing dynamics may only need to update the control policy once per day or even less frequently. On the other hand, a system adapting to fast changing dynamics may need to update the policy much more often, such as once per second. A faster update rate will generally increase the computational cost of the Structured Learning techniques, whereas Q-Learning is not affected by this consideration at all. In this regard, Structured Learning is better suited to applications where the adaptation is on dynamics that vary relatively slowly in time.

Figure 4.7: Average power consumed by the MCU on competing control algorithms as a function of the policy update period.

In this case study, the data point of a 1 hour update period leads to Sparse Structured Learning having lower computational cost than Q-Learning. The crossover point at which Q-Learning consumes less computational power is shown in Figure 4.7 to be approximately 45 minutes. It is important to note that this crossover point is specific to the MDP size and complexity (affecting solver execution time) and the choice of MCU (affecting run current, sleep current, and solver execution time).

4.8.4 Adaptation Overhead

The improvement in energy efficiency shown in Figure 4.5 is the result of the runtime adaptation making dynamic modem actuation decisions that result in improved performance trade-offs. However, these improvements are obtained at the cost of expending processing energy to run the adaptation algorithm, as shown in Table 4.4. An objective analysis of the overall benefits of the runtime adaptation can only be made if the positive benefits of the adaptation are combined with its negative effects.

                                     Fixed Threshold    VI         SVI       Q-Learning
    Energy Efficiency [mJ/packet]    151                125        125       139
    Average Modem Power [mW]         1.49               1.24       1.24      1.38
    Adaptation Overhead [µW]         -                  483        69.8      78.8
    Total Power [mW]                 1.49               1.72       1.31      1.45
    Overall Improvement              -                  -15.4 %    12.1 %    2.7 %

Table 4.5: Overall improvement from runtime adaptation.

The numerical result of such an analysis is shown in Table 4.5. The first row is the energy efficiency achieved by each competing method, obtained by selecting an operating point for each method from Figure 4.5. In order for the selections to be a fair comparison, the operating points are chosen to have roughly equal performance in the other performance metric defined for the application, transmission latency. All three selections have an average latency of approximately 35 seconds.
The second row is obtained by multiplying the energy efficiency (in millijoules per packet) by the average packet generation rate in the simulation (in packets per second), which was 0.0099. Thus, the second row represents the average power consumed by the LTE-M modem when under the control of each competing adaptation method, under identical packet generation statistics. The third row gives the power expended by running the adaptation algorithm, and is taken directly from Table 4.4. This represents the cost at which the modem energy efficiency improvements are obtained. The fourth row is the sum of the second and third rows, and represents the total power consumed by the sensor node. The last row quantifies the percent improvement over the Fixed Threshold case, which represents a baseline in which no online adaptation method is used.

The numbers in the table show that the best overall option is Structured Learning using the SVI algorithm, with a 12.1% improvement in power consumption over the baseline, when the other metric, transmission latency, is held roughly constant across the alternatives. Additionally, it can be seen that although Q-Learning runs at a slightly lower computational cost, its adaptation performance in terms of energy efficiency is not as good, leading to a worse overall result. A final observation that can be made from this data is that Structured Learning provides superior modem efficiency gains with both the VI and SVI algorithms; however, under the VI algorithm those gains are effectively lost due to the relatively large processing power required to run the VI algorithm. The overall improvement of this configuration is -15.4%, meaning it actually makes the performance worse than not having any adaptation at all. It is only through the use of the much more power-efficient SVI algorithm that Structured Learning becomes an attractive choice, and in fact it even outperforms Q-Learning in this regard.

4.9 Conclusion

In this chapter, a survey of recent developments in Compact MDP Models (CMMs) was provided, and by integrating several complementary CMM techniques, a novel CMM-based approach to CPS design was developed. Comparisons between CMM-based methods and Q-Learning in the context of CPS were made. The differences between the two approaches were explored conceptually, as well as through a detailed case study involving both simulation and a prototype implementation.

From the results of this chapter, it can be concluded that Q-Learning can be considered a more robust technique when either very little is known about the system a priori, or a large percentage of the dynamics is expected to continually change at runtime. In contrast, when a significant portion of the system's environment or its dynamics is fixed, the CMM option can provide a more efficient and robust approach.

An LTE-M connected sensor was detailed as a CPS case study to compare a CMM-based learning controller to an alternative controller that used Q-Learning. For a specified average packet transmission latency, the CMM-based controller resulted in a 17% reduction in LTE-M transmission energy, which is often the largest source of energy consumption on an LTE-M connected sensor. The energy savings are accomplished through strategic management of the LTE-M modem and connection status, using learned dynamics of the system and its environment.
Since the learning controller must be implemented in the deployed system, and its processing can be considered an overhead to the LTE-M connected sensor's main purpose, the implementation costs of the two learning controllers were also analyzed. The implementation was on a small microcontroller that is typical of what would be used for such a CPS. In this experiment, it was found that the CMM-based controller used 69.8 µW compared to 78.8 µW for the Q-Learning controller, an 11.4% savings. However, the CMM-based controller required more RAM to store its data structures: 3.11 kB compared to 1.03 kB.

Finally, the overall benefits of using runtime adaptation were analyzed by comparing the improvement in LTE-M transmission energy alongside the processing overhead of running the adaptation algorithm. In this analysis, Structured Learning was found to outperform Q-Learning only when a power efficient implementation of the algorithm is used. This result concretely demonstrates the need for runtime adaptation techniques to be very lightweight in the overhead they add to the system in terms of memory and processing power consumption. If the adaptation techniques are too resource intensive, they can cancel out the energy efficiency gains made by the runtime adaptation, and even lead to worse overall performance than executing the system without any adaptation.

Useful directions for future work include explorations into other challenging CPS case studies with larger state spaces, and continued development of compact techniques that provide self-awareness and runtime adaptation capabilities at all levels of embedded implementation.

Chapter 5

GEMBench: A Platform for Collaborative Development of GPU Accelerated Embedded Markov Decision Systems

5.1 Introduction

The previous chapters have shown the use of MDPs as increasingly relevant tools in the design of Embedded Computing Systems (ECSs). However, progress in this area currently suffers from a lack of common benchmarking methodologies. The work presented in this chapter helps to bridge this gap. Material in this chapter was published in preliminary form in [67].

More specifically, this chapter presents a summary of challenges associated with MDP-based design for ECSs, a survey of the state-of-the-art in MDP solvers and datasets that are relevant to embedded systems, and a novel open source software package for facilitating experimental research in the implementation and application of embedded MDPs.

In recent years, a new class of MDP solver implementations has emerged that uses GPUs for acceleration. Examples of solvers in this class include those presented by Noer [68], Ruiz and Hernandez [42], and the SPVI algorithm introduced in Chapter 3. The results in these works show performance improvements of roughly an order of magnitude beyond what is possible with CPU-only solvers. Additionally, in Chapter 3 a variable dependency analysis was presented to provide insight into why MDP solvers can benefit significantly from the parallelism available in GPUs.

Motivated by the increasing interest in deploying MDPs within resource constrained embedded systems, and the promising role of GPUs to support such deployments, this chapter presents a novel open source platform for testing and benchmarking GPU-accelerated ECSs that employ MDPs. This platform is referred to as GEMBench, which stands for the Gpu accelerated Embedded Mdp testBench.
GEMBench is designed to help researchers address significant logistical challenges in incorporating MDPs and their solvers into novel embedded system designs. An important decision point in this context is whether to develop a new MDP solver or to use an existing implementation. This decision point leads naturally to the following questions: Which existing open source MDP solver alternative is the best to use or compare against for a given set of system design constraints? How much processing time, memory and power does a given solver consume in order to solve a given MDP? Can an MDP solver's performance be improved through optimizations or algorithmic innovations?

GEMBench is targeted to a specific embedded GPU platform, the NVIDIA Jetson platform, and is designed for future retargetability to other platforms. The orientation to a specific platform is important for the objectives of GEMBench, which include promoting quantitative comparison among alternative MDP solvers and implementations.

Additionally, an open source software package [69] is contributed, called the GEMBench Package. The GEMBench Package can be downloaded onto the targeted platform to create a development testbench. The testbench contains implementations of published solvers, datasets to run the solvers on, reference solutions to the datasets, documentation on how to measure relevant performance metrics, and guidance on how to contribute future developments to the framework in a consistent manner. The testbench, which encompasses the GEMBench-compatible platform (NVIDIA Jetson) and the GEMBench Package, is what is referred to here as GEMBench.

Figure 5.1: Block diagram of the GEMBench Package (datasets, execution scripts, parsers, reference solutions, and solvers).

A block diagram of the GEMBench Package, illustrating the major components, is shown in Figure 5.1. The arrows denote the flow of information, where, in a given experiment, a selected solver utilizes a selected MDP format-specification parser to solve a selected MDP. The solver, parser, and MDP are selected from three extensible libraries, respectively. The package is intended to allow new solvers to be written by researchers, who would immediately have datasets to run them on and reference solutions to compare against in order to validate correctness. Additionally, performance measurements, such as execution time and power consumption, can be obtained and compared to other reference solvers provided by the package. Furthermore, all of this can be automated to efficiently create extensive benchmarking data through the use of execution scripts, which are also included with the package.

The remainder of this chapter is organized as follows. In Section 5.2, a summary is presented of the current landscape in MDP solvers, common datasets and platforms. In Section 5.3, GEMBench and its associated software package are described in detail. In Section 5.4, results for representative experiments using GEMBench are provided. Finally, Section 5.5 concludes with interesting directions for future work.

5.2 Survey

This section presents the results of a survey of existing MDP solvers, reference platforms and benchmarking efforts to date. The survey focuses on aspects of MDP solvers that are important to understand from the viewpoint of experimenting with and deploying them on embedded systems.

This survey covers a subset of a much larger body of work. There are dozens of papers in the literature documenting MDP solver algorithms or techniques.
Out of those efforts, only a subset contain an attached or referenced software implementation of the technique. Those works with a corresponding software implementation are the primary focus here.

5.2.1 Solver Implementations

Table 5.1 and Table 5.2 summarize the CPU-based and GPU-based MDP solver implementations that were found to date, respectively. From this survey, one can conclude first that MDP solvers have been an active area of research and development for the last 25 years. Second, researchers have commonly contributed their own solver implementations to the growing body of work.

    Solver Name                     Development Period    Language
    pomdp-solve [70]                1994-2007             C/C++
    Symbolic HSVI [71]              1998-Today            Java, Perl
    MDP Toolbox for MATLAB [72]     1999-2002             MATLAB
    ZMDP [73]                       2004-2016             C/C++
    SPUDD [47]                      2007-2011             C/C++
    Symbolic Perseus [74]           2007-2009             MATLAB, Java
    APPL [56]                       2009-2017             C/C++
    C++ MDP Solver [75]             2010-2018             C/C++
    MDPSOLVE [33]                   2011-2015             MATLAB
    libpomdp [76]                   2011-2014             MATLAB, Java
    AI-Toolbox [77]                 2015-Today            C/C++

Table 5.1: CPU-based MDP solver implementations.

Based on these survey results, attempts to determine which of these implementations could be labeled as the current state-of-the-art in solvers were unsuccessful. Several types of complications arose. In some cases, algorithms did not compile on the available computing system; version incompatibilities with dependencies were suspected in some cases, and poor documentation of multi-step build sequences in others. In other cases, solvers compiled and ran successfully, but were not compatible with the same MDP files as other solvers. This is a critical shortcoming that effectively prevented objective comparison (without laborious conversion between different file formats).

5.2.2 Datasets

Benchmarking a solver requires not only a working solver implementation, but also an MDP for the solver to solve. A survey of MDPs used in solver implementations was performed, and the results are summarized in Table 5.3.

    Solver Name      Development Period    Language
    Noer13 [68]      2013                  C++/CUDA
    Thrust-VI [42]   2015                  C++/CUDA
    SPVI [38]        2015-2018             C++/CUDA

Table 5.2: GPU-based MDP solver implementations.
In [68], researchers contribute a MATLAB 157 parser for the .pomdp format, which ingests .pomdp files and instantiates an MDP in the MATLAB workspace. The strength of the .pomdp format is that it is easy to read and straightforward to inspect visually. The downside is that it is inefficient in terms of file size. In one case [73], an 180MB file was required to specify an MDP. In another case, in an experiment in the work of Section 2.4 an MDP went above 750MB when stored in .pomdp format. In practical use, hundreds of MBs is likely too large of a storage size requirement in general for ECSs, at the time of this writing. To circumvent the logistical issues of transporting files of this size from one system to another, occasionally researchers have resorted to writing scripts that generate a .pomdp file immediately prior to solving it, rather than copying a large .pomdp file from one benchmarking setup to another. In general, it was found that although MDP data structures can be very large, their information content is relatively low compared to their size. As a result, their associated information is highly compressible. See Section 3.4.1 for elaboration on these findings. The findings support the claim that the .pomdp format can introduce needlessly high files storage requirements. The second format that has been used across research groups is the .spudd format, first created for use by the SPUDD solver [47]. This format is also a human- readable text file, but the MDP format is in the form of tree-shaped data structures known as Algebraic Decision Diagrams (ADDs). ADDs are used in .spudd files because the SPUDD solver operates Factored MDPs, as described in Section 2.4.4. Factored MDPs can often be stored very compactly using tree-shaped struc- 158 tures, and SPUDD even operates on the MDP using tree-shaped structures for the intermediate solver calculations. Other solvers inspired by or derived from SPUDD also use ADDs and .spudd files. The downside to this format is that an unfactored MDP must be factored before being stored in this format, and this factorization process can be very difficult or even impossible if the MDP does not contain a spe- cific underlying conditional dependence property. Another downside to this format is that it is difficult to parse (compared to the .pomdp format), due to its use of tree-shaped structures and Lisp syntax. A third effort at defining MDP file formats arose from a series of MDP solv- ing competitions held as part of the International Conference on AI Planning and Scheduling (AIPS), which later merged with the International Conference on Auto- mated Planning and Scheduling (ICAPS). In these conferences, MDP solver com- petitions were held on 10 occasions from 1998 through 2018. These competitions defined their own file formats, documented them for use by competitors, and pro- vided the MDP files in those formats. The file formats steadily evolved over the years, including the Planning Domain Definition Language (PDDL) [78] version 1.0, PDDL version 2.1, the Probabilistic Planning Domain Definition Language (PPDDL) [79] version 1.0, and the Relational Dynamic Influence Diagram Language (RDDL) [80]. These formats are by far the most complex (compared with .spudd and .pomdp), but also the most powerful and expressive. These file formats are designed to specify many different classes of planning and decision problems, beyond just MDPs and POMDPs. 
5.2.3 Benchmarking Platforms

In the documents associated with each of the solver implementations that were found, there were no common hardware platforms used to conduct performance or benchmarking experiments. One exception is the ICAPS conferences, where the competition organizers ran the candidate solvers on a common computer in order to compare performance objectively. Aside from the ICAPS competitions, it was found that some researchers did not take any runtime performance measurements at all, and of the researchers who did perform such measurements, usually the workstation that was available to that group was used. Commonly, researchers detailed the specifications of their computing systems.

This appears to be the standard approach to date: each research group runs its algorithm on its own computing system, whatever it may happen to be. With this approach, it becomes virtually impossible (without large amounts of reimplementation effort) to compare the runtime performance of alternative approaches introduced in two papers describing algorithms that are evaluated on different computing systems. There were no instances found where two research groups had the same computing hardware (in terms of processor type, CPU speed, RAM, etc.), even by coincidence.

5.2.4 Dimensionality of Rewards

An issue that further complicates the landscape is that of the dimensionality of the reward function. In Equation 3.1, R(si, a) is the reward for selecting action a from state si. This is an example of a two-dimensional reward function, since the function is a mapping from S × A to a scalar reward value. However, as noted in [37], some works use a three-dimensional reward function R(si, a, sj) (a mapping from S × A × S), and some use a one-dimensional reward function R(si) (a mapping from S only). The versions of Equation 3.1 that use three-dimensional and one-dimensional reward functions are shown in Equation 5.1 and Equation 5.2, respectively. These different representations can present logistical challenges when attempting to piece together various works.

\[
V^n(s_i) = \max_{a \in A} \Big\{ \sum_{s_j \in S} P(s_j \mid s_i, a) \big[ R(s_i, a, s_j) + \gamma V^{n-1}(s_j) \big] \Big\} \qquad (5.1)
\]

\[
V^n(s_i) = R(s_i) + \max_{a \in A} \Big\{ \gamma \sum_{s_j \in S} P(s_j \mid s_i, a) V^{n-1}(s_j) \Big\} \qquad (5.2)
\]

For example, if an MDP in .pomdp format from [70] is imported into MATLAB using the .pomdp parser from [74], the reward function will exist in the MATLAB workspace as a three-dimensional rewards object. That representation is then incompatible with the MATLAB solver MDPSOLVE [33], which only accepts at most a two-dimensional reward function.

Conversion strategies to mitigate this are as follows. Increasing the dimensionality of the rewards is trivial, as the extra function input can simply be ignored. Decreasing the dimensionality (as is required for the .pomdp/MDPSOLVE example above) requires some care. To reduce from a three-dimensional reward to a two-dimensional reward, Equation 5.1 is rewritten as Equation 5.3, and then the first summation is evaluated to arrive at the equivalent expression using a two-dimensional reward function in Equation 5.4. The resulting conversion formula is shown in Equation 5.5.

\[
V^n(s_i) = \max_{a \in A} \Big\{ \sum_{s_j \in S} P(s_j \mid s_i, a) R(s_i, a, s_j) + \gamma \sum_{s_j \in S} P(s_j \mid s_i, a) V^{n-1}(s_j) \Big\} \qquad (5.3)
\]

\[
V^n(s_i) = \max_{a \in A} \Big\{ R(s_i, a) + \gamma \sum_{s_j \in S} P(s_j \mid s_i, a) V^{n-1}(s_j) \Big\} \qquad (5.4)
\]

\[
R(s_i, a) = \sum_{s_j \in S} P(s_j \mid s_i, a) R(s_i, a, s_j) \qquad (5.5)
\]

This reduction can always be done without loss of information, and thus the constraint of at most a two-dimensional reward function is not a limitation for any solver.
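As a small illustration of how the reduction of Equation 5.5 might be carried out when preparing an MDP for a solver that accepts only two-dimensional rewards, the following C sketch collapses a dense three-dimensional reward array into a two-dimensional one; the array layout and function name are assumptions made for this example rather than part of any solver's actual interface.

    /* Collapse a 3-D reward R(s,a,s') to a 2-D reward R(s,a) via Eq. 5.5.
     * P and R3 are dense row-major arrays indexed as [s][a][s'].          */
    #include <stddef.h>

    void reduce_reward_dimension(size_t NS, size_t NA,
                                 const double *P,    /* NS*NA*NS entries */
                                 const double *R3,   /* NS*NA*NS entries */
                                 double *R2)         /* NS*NA entries    */
    {
        for (size_t s = 0; s < NS; s++) {
            for (size_t a = 0; a < NA; a++) {
                double acc = 0.0;
                for (size_t sp = 0; sp < NS; sp++) {
                    size_t idx = (s * NA + a) * NS + sp;
                    acc += P[idx] * R3[idx];  /* expected reward over s' */
                }
                R2[s * NA + a] = acc;
            }
        }
    }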
In spite of this, some solvers and parsers use three-dimensional rewards anyway. There is no such formula to reduce from two-dimensional to one-dimensional rewards without loss of information. This can only be done in the special case where the content of the two-dimensional reward function is only a function of one of the arguments to begin with. In such a special case, the reward can simply be collapsed along the dimension of the unused argument to convert to a one-dimensional reward function, typically R(s).

    Source                          Number of MDPs    Format
    SPUDD [47]                      70                .spudd
    pomdp-solve [70]                55                .pomdp
    MDPSOLVE [33]                   32                Constants in source code
    Symbolic Perseus [74]           21                .spudd
    ICAPS-04                        20                PPDDL 1.0
    ICAPS-16                        15                RDDL
    libpomdp [76]                   15                .spudd and .pomdp
    ZMDP [73]                       9                 .pomdp
    APPL [56]                       8                 .pomdp and .pomdpx
    AIPS-02                         8                 PDDL 2.1
    ICAPS-18                        8                 RDDL
    AIPS-98                         6                 PDDL 1.0
    AIPS-00                         5                 PDDL 1.0
    ICAPS-06                        10                PPDDL 1.0
    ICAPS-11                        11                RDDL
    Noer13 [68]                     2                 .pomdp
    AI-Toolbox [77]                 4                 Constants in source code
    MDP Toolbox for MATLAB [72]     2                 Constants in source code
    C++ MDP Solver [75]             1                 Constants in source code

Table 5.3: MDP datasets.

5.3 GEMBench

As motivated in Section 5.1, this section proposes the use of a common benchmarking platform called GEMBench. This section details the components of GEMBench, and the justification for the decisions made in its design.

5.3.1 Selection Criteria

The selection of the hardware and operating system to target in the first version of GEMBench was made with the following considerations:

- Availability: the platform should be easily accessible to researchers, imposing minimal cost and logistical barriers.
- Repeatability: researchers should be able to reproduce published results from other researchers on their own platform instance, with their own experiments.
- Observability: researchers should be able to easily measure performance metrics that are relevant to ECS design, such as execution time, memory requirements, and power consumption.
- Ease of Use: a file system and robust networking stack are required to move MDP datasets and source code onto the platform with minimal effort.
- Development Flexibility: the platform should be compatible with and contain rich support for toolchains of many different programming languages used in technical and scientific programming (e.g., C/C++, Python, Java, MATLAB, Go, Rust, Julia).
- GPU Support: it is reasonable to expect that GPUs will play a big role in future advancements in MDP solvers, and thus the platform must have a programmable GPU.
- Documentation: the platform must be well documented in order to minimize the amount of initial time researchers need to spend to become productive.
- Long Term Support: the platform must have planned support for many years to come.

5.3.2 Hardware and Operating System

The common reference hardware of GEMBench, referred to as the GEMBench-compatible platform, is selected as the NVIDIA Tegra TX-1 Development Board. This platform satisfies the requirements summarized in Section 5.3.1. It is a Linux-based platform that contains a quad-core ARM A57 CPU and an NVIDIA Maxwell GPU with 256 CUDA cores. The board contains 4 GB of RAM and 16 GB of eMMC storage. The board runs a Linux distribution known as Linux4Tegra (L4T), which is based on Ubuntu Linux. The software development kit provided with the board provides a well documented ecosystem.
The use of a self-contained single-board computer allows for computing hardware replication across research labs with minimal effort compared to setting up larger general purpose computing systems such as desktop computers. Additionally, the NVIDIA Jetson module is a small (50 mm x 75 mm) embedded computing module that can be designed into a small form factor ECS, outside of the development board. The Jetson module is well supported by a rich ecosystem of compatible peripherals including cameras, robotic platforms and drones.

The Linux-based OS provides a full-featured set of capabilities for software development and benchmarking, such as TCP/IP networking, USB, Wi-Fi, and HDMI video, to name a few. Linux is favored in this context over smaller embedded operating systems, such as Real-Time Operating Systems (RTOSs), due to its ease of use. Linux enables efficient use of common toolchains, and its file system and networking stack allow datasets to be copied onto the board easily. These two operations can be much more complicated on RTOS-based or smaller embedded OS systems. The lighter-weight OSs generally trade off productivity in exchange for higher levels of optimization. In the design of GEMBench, ease of use is favored over optimization in this context, in order to provide a lower barrier to researchers getting up and running with the testbench.

5.3.3 Solvers

Part of the GEMBench Package [69] is an open source release of an implementation of one of the GPU-accelerated solvers listed in Table 5.2: SPVI. This software package is intended to allow researchers who purchase the development board to easily reproduce the performance benchmarks of that solver. Additionally, the package contains guidance on how to contribute additional solvers that can be run on the platform. Specifically, the dependencies (and their versions) should be documented, along with compile and run instructions.

Ideally, researchers will be able to easily run any existing solvers, then develop and implement new solvers on the platform, and easily produce benchmarking measurements that objectively compare multiple solvers on a robust collection of MDP datasets. The new performance claims can then be easily replicated across other research groups.

5.3.4 Datasets

A survey of datasets and their file formats used in MDP research was presented in Section 5.2. Due to the existing adoption of the .pomdp format across multiple research efforts, along with the large number of MDPs already available in that format, its continued use is encouraged here and it is selected for the primary benchmarking dataset. A curated set of .pomdp files is included as a packaged benchmarking dataset. Along with this set, the corresponding solutions to each of the MDPs are provided, which is something not previously found anywhere to date. The reference solutions were obtained using the pomdp-solve [70] and MDPSOLVE [33] solvers, due to their maturity relative to newer solver packages.

Also included in the GEMBench Package is an open source C/C++ example of how to ingest and parse MDP files in the .pomdp format. This example is intended to save researchers time in incorporating this file format into their solvers.

A secondary candidate for a benchmarking file format is the .spudd format, due to the large collection of MDP datasets available in it. As previously noted, this format is extremely difficult to work with due to its tree-shaped data structures and Lisp syntax.
There is interest in using the SPUDD MDPs for a secondary dataset, and this is left as a future research effort. Two .spudd parsers have been found to date: one embedded in the SPUDD solver itself, and another provided by the Symbolic Perseus solver package [74] that can ingest .spudd files into MATLAB. The current plan is to use these parsers to convert .spudd files into .pomdp files, although it is understood that some files will be very large after this conversion. Thus, only a subset will be converted. The benefit of this conversion is that any solver that can consume .pomdp files will also be able to solve the MDP datasets currently only available in .spudd format.

An open task is the parsing of .spudd files directly within a solver; any interested researcher is highly encouraged to contribute a utility that can ingest .spudd files and output them in a more general-purpose format that does not depend so heavily on the concepts of ADDs and factorization that are central to the SPUDD solver. Additionally, the survey in Section 3.4 and the analyses in Section 4.8 strongly suggest that using sparse matrix representations would likely have a significant effect on MDP storage size and solver computation. Researchers are encouraged to propose a standard file format for sparse MDP representations and either convert existing .pomdp or .spudd MDPs into sparse formats, or contribute entirely new sparse MDPs to the collective datasets.

5.3.5 Measurements

Execution time can be measured on the board using one of two methods. One method is the use of software-based timestamps. This involves making calls to the Linux time API from user space, and taking the difference between successive calls to compute elapsed time. Another method is the use of GPIO combined with an external oscilloscope. A GPIO line can be set from Linux user space at the start of a solver routine, and then cleared at the end. By measuring the resulting square wave voltage on this GPIO pin with an oscilloscope, a very precise timing measurement can be made.

Both CPU and GPU memory use can be measured using the NVIDIA CUDA API. The cudaMemGetInfo() function is well documented and allows for objective measurement of memory consumption. The power consumption of the entire board can be measured directly from Linux user space. The board contains a Texas Instruments INA3221 current and voltage monitor, and instantaneous values can be read directly from the device using a device driver provided by the NVIDIA board support package.

5.4 Experiments

To demonstrate the use of GEMBench, two MDP solvers were implemented on the NVIDIA Jetson TX-1 development board. The first implementation is a CPU-only implementation of Value Iteration, referred to here as VI. This was created directly from the equations detailed in [53]. It represents a straightforward single-threaded implementation of the classical Value Iteration algorithm and was used purely as a baseline for comparison with the other implementation. The second implementation is Sparse Parallel Value Iteration (SPVI), which was described in detail in Chapter 3. SPVI leverages the GPU for execution time acceleration. A key feature of SPVI is that it uses sparse representations and sparse matrix-vector arithmetic operations on the GPU.

In the experiments, the solver execution time was measured on the entire Cassandra dataset using Linux's native timing support. The results for the full dataset are shown in Figure 5.2.
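As an example of the software-timestamp method described in Section 5.3.5, the following C sketch wraps a solver invocation with clock_gettime() calls and queries GPU memory usage with cudaMemGetInfo(); it assumes compilation against the CUDA runtime, and solve_mdp() is a placeholder for whichever solver is under test rather than an actual GEMBench function.

    /* Illustrative measurement of solver wall-clock time and GPU memory. */
    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>

    static void solve_mdp(void) { /* placeholder: invoke the solver under test */ }

    int main(void)
    {
        struct timespec t0, t1;
        size_t free_before, free_after, total;

        cudaMemGetInfo(&free_before, &total);
        clock_gettime(CLOCK_MONOTONIC, &t0);

        solve_mdp();                          /* solver runs here */

        clock_gettime(CLOCK_MONOTONIC, &t1);
        cudaMemGetInfo(&free_after, &total);

        double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                       + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("solver time: %.6f s, GPU memory in use: %zu bytes\n",
               elapsed, free_before - free_after);
        return 0;
    }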
The curated Cassandra dataset contains 55 MDPs on the testbench, and the results of solving all of these MDPs are shown in the figure. Each data point represents one invocation of the solver on a specific MDP. The MDPs in Figure 5.2 are sorted by a scalar size metric NS²NA, where NS and NA are the number of states and actions in each MDP, respectively. NS²NA is used to denote the size of the MDP because the largest data structures in an MDP are the state transition matrices, which contain a total of NS²NA entries.

Figure 5.2: Solver execution time (in seconds) versus MDP size (NS²NA) for the Cassandra dataset, for the VI and SPVI solvers.

One can draw several conclusions from Figure 5.2. First, the GPU-based implementation (SPVI) takes at least one second to solve any MDP, regardless of how small it is. This is due to the amount of time it takes to set up the GPU and initialize the CUDA cuSPARSE library used by SPVI. There does not appear to be any such setup time effect in the CPU-based VI algorithm.

Second, there is a crossover point around NS²NA ≈ 2e6 beyond which the GPU-based solver is faster. To focus on this aspect, the execution times for solving the four largest MDPs are listed in Table 5.4. For "baseball.pomdp" (the largest MDP in the dataset), SPVI (the GPU-based solver) is 48.5% faster than VI (the CPU-based solver), a considerable difference.

    MDP Name           NS²NA      VI        SPVI
    cit.pomdp          3.22e5     0.043     0.964
    sunysb.pomdp       3.60e5     0.058     0.980
    fourth.pomdp       4.42e6     1.390     1.061
    baseball.pomdp     3.53e8     37.653    19.373

Table 5.4: Solver execution time (seconds).

Third, one can conclude that the Cassandra dataset is rich in small MDPs and lacking in large MDPs. Considerably more data is needed to explore GPU-based solver performance on larger MDPs. The GEMBench project aims to spur collaborative research in directions such as this, which support more insightful and comprehensive evaluation of alternative MDP implementation approaches.

5.5 Conclusion

This chapter has introduced GEMBench, a benchmarking tool for evaluating implementations of solvers for MDPs. The utility of common benchmarking environments has been demonstrated in many application areas. With the increasing relevance of MDPs in embedded systems, and the complex trade-offs involved in MDP solver deployment, GEMBench helps to bridge an important gap in the design and implementation of MDP-equipped applications.

Along with the presentation of GEMBench, surveys of the landscape of MDP solvers, reference platforms and benchmarking were presented, with an emphasis on details that are relevant for experimenting with embedded implementations. GEMBench is designed for extensibility with additional file formats, datasets, and solvers, as well as retargetability to other processing platforms. Useful extensions for future work include continuing to add support for additional MDP file types and datasets, implementing more solvers from the literature, and exploring the trade-offs between using an MDP versus a POMDP model for a given ECS.

Additionally, the amount of system knowledge put into the MDP before deployment is an important consideration. An interesting area for future work relating to this is the exploration of making system assumptions offline versus learning those aspects of the system at runtime. The implementation aspects of these concepts on GPUs are likely to be an important research area in coming years.
Chapter 6
Conclusions and Future Work

In this thesis, a methodology for the design and implementation of runtime adaptation in embedded computing systems was presented. The methodology employs compact, system-level models based on Markov Decision Processes (MDPs) to generate control policies that optimize the required embedded computing tasks in terms of relevant, multidimensional design optimization metrics. Through simulations and implementations of case studies, the methodology was explored in several application areas.

In Chapter 2, the methodology was applied to reconfigure a digital channelizer, and the results were shown to outperform the state-of-the-art in various metrics. One major aspect of the work in this chapter is that the channelization requests were treated as being generated externally, and the reconfigurable channelizer was only a subsystem tasked with responding to those requests. An interesting area for future research is the incorporation of the runtime adaptation methodology into the upper layers as well, effectively creating an adaptive cognitive radio system. In that case, the channelization requests would be generated by the runtime adaptation framework in response to higher-level application goals, for example maintaining communications throughput in the presence of changing interference.

Various new challenges would arise in that effort, and those would be interesting areas for future research. Specifically, it can be anticipated that although the state representation used to encode the channelization requests (a bit vector, with one bit for each channel) worked well for the 8-channel case, the representation would likely be problematic for systems with larger numbers of channels. With 8 channels, the number of states required to represent the channelization request is $2^8 = 256$, which is manageable in our design. However, for 16- or 24-channel systems the same encoding requires $2^{16}$ or $2^{24}$ states, and the size of the state space becomes a much more serious issue. It should be noted that this challenge is not remedied by the factorization techniques presented in Chapter 2, which were effective at containing state space growth arising from other sources of complexity. It is possible that a different encoding method could be used if some application-level context is incorporated. Progress in this direction may be possible with further analysis of a specific cognitive radio case study.

Additionally, the penalty costs of reconfiguration were modeled as being due to time spent reconfiguring the processing hardware, or channelization requests that were missed. If a full cognitive radio application were considered, there would likely be application-level goals that could be used in the rewards, such as an end-to-end communications Quality of Service (QoS) objective, instead of or in addition to energy efficiency. In this form of multi-layered runtime adaptation, it is anticipated that some form of control and reconfiguration hierarchy would be beneficial.

In Chapter 3, an MDP solver was developed that solves certain MDPs faster than the state-of-the-art on embedded processors containing GPUs. The solver made use of the sparsity typically found in an MDP's transition matrices, and utilized efficient sparse linear algebra operations to achieve its fast runtime (the sparse storage layout involved is sketched below). This advancement was possible in part because of the relatively recent availability of general-purpose programmable GPUs on embedded SoCs, compared with the multiple decades of MDP solver research on CPUs (without GPUs).
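To make the role of sparsity concrete, the following minimal sketch (with purely illustrative numbers) stores one action's state transition matrix in compressed sparse row (CSR) form, the layout consumed by sparse matrix-vector routines such as those in cuSPARSE, and applies it to a value vector. It is a CPU-side illustration of the data structure only, not the SPVI implementation itself.

    #include <stdio.h>

    #define NS 4   /* number of states in this toy example */

    /* Dense equivalent of the stored matrix (zeros are not stored below):
     *   row 0: 0.9 0.1 0   0
     *   row 1: 0   0.5 0.5 0
     *   row 2: 0   0   0.2 0.8
     *   row 3: 1.0 0   0   0                                              */
    static const double csr_val[] = {0.9, 0.1, 0.5, 0.5, 0.2, 0.8, 1.0};
    static const int    csr_col[] = {0, 1, 1, 2, 2, 3, 0};
    static const int    csr_row[] = {0, 2, 4, 6, 7};   /* NS + 1 row offsets */

    /* y = P * v: the core operation of a value iteration backup,
     * touching only the stored nonzero probabilities. */
    static void spmv(const double *v, double *y)
    {
        for (int s = 0; s < NS; s++) {
            double acc = 0.0;
            for (int k = csr_row[s]; k < csr_row[s + 1]; k++)
                acc += csr_val[k] * v[csr_col[k]];
            y[s] = acc;
        }
    }

    int main(void)
    {
        double v[NS] = {1.0, 2.0, 3.0, 4.0};
        double y[NS];
        spmv(v, y);
        for (int s = 0; s < NS; s++)
            printf("y[%d] = %.2f\n", s, y[s]);
        return 0;
    }

Only 7 of the 16 entries are stored in this toy matrix; for realistic MDP transition matrices the fraction of nonzeros is typically much smaller, which is the source of the storage and runtime savings.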
It was also found that the Value Iteration (VI) algorithm is very well suited to acceleration with the parallel processing capabilities that GPUs provide.

There are several directions for future work in this area. For example, an immediate question is how well the acceleration scales for much larger MDPs. Various optimization opportunities will certainly arise when solving MDPs whose state spaces are much larger than the number of parallel threads that can be spawned on a GPU.

Aside from the intricacies that GPUs bring, there is much work left to do on the periodic solving of time-varying MDPs. Specifically, every MDP solved in this thesis was solved without any prior knowledge of the solution. In other words, an MDP was solved to produce a policy; then the MDP was modified slightly (due to tracking of time-varying system or environmental factors); and then that second MDP was solved to produce a second policy, without any knowledge of anything produced by the first solver invocation.

In the experimental observations, there was typically a high degree of similarity or correlation between the solver's inputs from run to run. Correspondingly, it was also typical to see similarity or correlation between the outputs. Given these trends, it is reasonable to hypothesize that some of the solver computations on subsequent runs are redundant, and that efficiency improvements in the solver algorithm could be achieved if this redundancy were understood and exploited, for example by carrying forward some artifact or intermediate calculation from the previous invocation to the subsequent one.

The VI algorithm involves starting from an arbitrary initial choice of the value function and iterating towards a final solution. Typically, the starting point used was all zeros, for consistency and to aid in debugging. However, it has been shown that the algorithm converges from any starting point. One approach that was tried was starting from the previous solution rather than from all zeros or a random point; a sketch of this warm-start variant is given at the end of this discussion. In some preliminary experiments, the result was a faster solve time in some cases and a slower solve time in others, and it was not clear why. Perhaps a starting point can be chosen more intelligently based on how the MDP has changed and on other aspects of the previous run's artifacts.

Another possible area for future work related to Chapter 3 is the exploration of other solver algorithms altogether. This thesis has focused entirely on the VI algorithm; however, there are other algorithms that have proved popular in the past, most notably Policy Iteration (PI) and Modified Policy Iteration (MPI). It remains to be seen whether these other algorithms can also benefit from being implemented on GPUs in the way that VI does.

Another aspect to consider is that these three algorithms (VI, PI, MPI) are known as exact solvers of MDPs: viewed from an analytical standpoint, they converge on a provably optimal solution. However, there also exists a class of MDP solvers known as approximate solvers, typically found in POMDP contexts. It remains to be seen whether the computational advantages of producing an approximate solution outweigh the disadvantages of producing a possibly suboptimal control solution in the context of embedded runtime adaptation frameworks. The exploration of this trade-off on different applications and case studies could provide examples of important overall system improvements.
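The warm-start experiment mentioned above can be sketched as follows. This is a minimal, dense, CPU-side illustration with hypothetical dimensions and data layout; the only difference from a standard value iteration loop is that the caller supplies the initial value function, which may be either all zeros or the previous solution.

    #include <math.h>
    #include <string.h>

    #define NS    64      /* number of states  (illustrative) */
    #define NA    4       /* number of actions (illustrative) */
    #define GAMMA 0.95    /* discount factor */
    #define EPS   1e-6    /* convergence threshold */

    /* Transition probabilities P[a][s][s'] and rewards R[a][s] are
     * assumed to be filled in elsewhere (e.g., from a parsed MDP file). */
    static double P[NA][NS][NS];
    static double R[NA][NS];

    /* Runs value iteration in place, starting from V.  Passing the value
     * function produced by the previous solver invocation warm-starts the
     * iteration; passing all zeros reproduces the default behavior.
     * Returns the number of iterations until convergence. */
    int value_iteration(double V[NS])
    {
        double Vnew[NS];
        double delta;
        int iters = 0;

        do {
            delta = 0.0;
            for (int s = 0; s < NS; s++) {
                double best = -INFINITY;
                for (int a = 0; a < NA; a++) {
                    /* Bellman backup: immediate reward plus expected
                     * discounted value of the successor state. */
                    double q = R[a][s];
                    for (int sp = 0; sp < NS; sp++)
                        q += GAMMA * P[a][s][sp] * V[sp];
                    if (q > best)
                        best = q;
                }
                Vnew[s] = best;
                if (fabs(Vnew[s] - V[s]) > delta)
                    delta = fabs(Vnew[s] - V[s]);
            }
            memcpy(V, Vnew, sizeof(Vnew));
            iters++;
        } while (delta > EPS);

        return iters;
    }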
In Chapter 4, a runtime adaptation framework named Structured Learning was proposed and analyzed in the context of an LTE-M connected system. Simulated control performance was compared against Q-Learning, a competing technique that has recently been popularized in the research community. Additionally, the competing frameworks were implemented on a resource-constrained MCU and their memory and computational requirements were compared objectively.

Various aspects of the work in Chapter 4 generated new directions for future research. In the case study that was analyzed, the MCU contained a very simple application that generated packets in an idealized way. It would be interesting to dissect a relevant consumer product available today and observe its traffic pattern. Attention could then turn to how well the runtime adaptation framework performs on that specific traffic pattern. If shortcomings are identified, exploring how to modify the runtime framework for that traffic pattern would be important. In addition to improving the performance of the given consumer application, this kind of study could provide useful insight into how the methods proposed in this thesis can be further streamlined in application-specific ways.

Additionally, the work in Chapter 4 focused on adaptation of the LTE-M connection when the process of establishing the cellular modem's connection was time-varying. However, other parameters would likely also be time-varying, for example the transmit power, transmission latency and overall data rate. Modeling these additional parameters and redefining the adaptation rewards to take them into account would be an interesting effort that is likely to yield new advances.

Another direction for future work stems from the fact that this thesis compared Structured Learning only to a basic form of Q-Learning. Q-Learning, however, belongs to the larger area of Reinforcement Learning (RL), which is a rapidly evolving area of research. It is anticipated that comparisons between Structured Learning and emerging advancements in RL in the context of embedded systems would prove insightful. As a relevant example, RL has found recent success [81] in beating the world champion of the board game Go by approximating the MDP value function using a deep neural network (DNN). While these techniques have traditionally been implemented on large computing systems with vast resources, recent advancements [82] have made it possible to implement DNNs on smaller and smaller computing systems, including extremely resource-constrained MCUs. As a result, an interesting area for future research is the design of runtime adaptation frameworks using DNN-based RL for highly resource-constrained embedded computing systems, and how these techniques would compare to the Structured Learning techniques proposed in this thesis.

In Chapter 5, the GEMBench platform was created to aid in the development and exploration of MDP-based techniques on embedded computing systems. The platform consists of software written throughout the work performed for this thesis research, targeted at a popular family of off-the-shelf NVIDIA embedded computing development boards.
The software has been released to the academic community in open-source form, with a permissive license that allows it to be used broadly by researchers working in this area. The aim is to accelerate future developments by lowering the amount of time required to build critical software infrastructure, namely generating example MDPs, writing parsers to convert from one format to another, and implementing alternative solvers to compare policy outputs and computational requirements.

Several directions for future work remain in this area. Specifically, it would be valuable to add support for at least one more MDP file format (.spudd). This file format is commonly found in other research efforts, and an interesting set of example MDPs could immediately be added to the GEMBench platform.

Additionally, since the MDP state transition matrices are already located on the platform, it is possible to create a simulator that models a system and performs Monte Carlo analyses of the discrete-time trajectories generated from those state transition matrices; a minimal sketch of such a simulator is given at the end of this chapter. This would allow different policies to be evaluated in terms of the resulting closed-loop dynamics. This capability does not currently exist on GEMBench; instead, experiments requiring it were performed in MATLAB throughout the research involved in this thesis. The advantage of having this capability on GEMBench is that it would be easy to share the Markovian simulator with other researchers. It should be noted that this type of simulator could also be used for exploring RL-based techniques such as Q-Learning.

MDPs are not always the most appropriate modeling framework for every possible use of runtime adaptation. There are cases where they are highly effective, and cases where they are not. Various factors affecting the suitability of MDPs for specific applications were detailed in Section 3.4. For applications where the use of MDPs is advantageous, collaborative partnerships between researchers on common platforms like GEMBench can help to accelerate progress. As research and technology advance, applications will continue to arise that pose bigger and more complicated runtime adaptation challenges. Addressing them will be aided partly by technology trends that yield faster and more efficient computing hardware. However, an equally important category of continual advancement must be made in algorithms and modeling techniques.
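As a closing illustration of the Markovian trajectory simulator proposed above for GEMBench, the following minimal sketch (hypothetical data layout, illustrative dimensions) samples one discrete-time trajectory from a set of state transition matrices under a fixed policy and accumulates the discounted reward along it. Averaging many such rollouts gives a Monte Carlo estimate of a policy's closed-loop performance, and the same sampling loop could drive RL-style experiments such as Q-Learning.

    #include <stdio.h>
    #include <stdlib.h>

    #define NS     8      /* number of states  (illustrative) */
    #define NA     2      /* number of actions (illustrative) */
    #define STEPS  1000   /* trajectory length */
    #define GAMMA  0.99   /* discount factor */

    static double P[NA][NS][NS];  /* transition matrices, filled in elsewhere */
    static double R[NA][NS];      /* rewards, filled in elsewhere             */
    static int    policy[NS];     /* action selected in each state            */

    /* Draws the next state from the categorical distribution P[a][s][.]. */
    static int sample_next(int s, int a)
    {
        double u = (double)rand() / RAND_MAX;
        double cum = 0.0;
        for (int sp = 0; sp < NS; sp++) {
            cum += P[a][s][sp];
            if (u <= cum)
                return sp;
        }
        return NS - 1;   /* guard against rounding in the cumulative sum */
    }

    /* Simulates one trajectory and returns its discounted reward. */
    static double rollout(int s0)
    {
        double total = 0.0, discount = 1.0;
        int s = s0;
        for (int t = 0; t < STEPS; t++) {
            int a = policy[s];
            total += discount * R[a][s];
            discount *= GAMMA;
            s = sample_next(s, a);
        }
        return total;
    }

    int main(void)
    {
        /* With P and R left at zero this prints 0.0; in practice they would
         * be loaded from a parsed MDP file on the GEMBench platform. */
        printf("discounted return: %f\n", rollout(0));
        return 0;
    }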
Bibliography

[1] L. Benini, A. Bogliolo, and G. De Micheli, "A survey of design techniques for system-level dynamic power management," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, June 2000.
[2] A. Kansal, J. Hsu, S. Zahedi, and M. B. Srivastava, "Power management in energy harvesting sensor networks," ACM Transactions on Embedded Computing Systems, vol. 6, no. 4, 2007.
[3] C. Hsieh, F. Samie, M. S. Srouji, M. Wang, Z. Wang, and J. Henkel, "Hardware/software co-design for a wireless sensor network platform," in Hardware/Software Codesign and System Synthesis (CODES+ISSS), International Conference on, New Delhi, India, Oct 2014, pp. 1-10.
[4] Q. Liu et al., "Power-adaptive computing system design for solar-energy-powered embedded systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 8, pp. 1402-1414, 2015.
[5] C. Moser, L. Thiele, D. Brunelli, and L. Benini, "Adaptive power management in energy harvesting systems," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2007, pp. 1-6.
[6] C. Moser, L. Thiele, D. Brunelli, and L. Benini, "Adaptive power management for environmentally powered systems," IEEE Transactions on Computers, vol. 59, no. 4, pp. 478-491, April 2010.
[7] L. Esterle and B. Rinner, "An architecture for self-aware IoT applications," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 6588-6592.
[8] N. Dutt, A. Jantsch, and S. Sarma, "Toward smart embedded systems: A self-aware system-on-chip perspective," ACM Trans. Embed. Comput. Syst., vol. 15, no. 2, pp. 22:1-22:27, Feb. 2016.
[9] P. R. Lewis, M. Platzner, B. Rinner, J. Torresen, and X. Yao, Self-aware Computing Systems: An Engineering Approach, Springer, 2016.
[10] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Dorling Kindersley, first edition, 1993.
[11] S. J. Darak, A. P. Vinod, R. Mahesh, and E. M-K. Lai, "A reconfigurable filter bank for uniform and non-uniform channelization in multi-standard wireless communication receivers," in Proceedings of the International Conference on Telecommunications, 2010, pp. 951-956.
[12] C. Xu, W. Liang, and H. Yu, "Green-energy-powered cognitive radio networks: Joint time and power allocation," ACM Trans. Embed. Comput. Syst., vol. 17, no. 1, pp. 13:1-13:18, Aug. 2017.
[13] A. Sapio, M. Wolf, and S. S. Bhattacharyya, "Compact modeling and management of reconfiguration in digital channelizer implementation," in Proceedings of the IEEE Global Conference on Signal and Information Processing, Washington, D.C., December 2016, pp. 595-599.
[14] A. Sapio, L. Li, J. Wu, M. Wolf, and S. S. Bhattacharyya, "Reconfigurable digital channelizer design using factored Markov decision processes," Journal of Signal Processing Systems, December 2017.
[15] C. Boutilier, R. Dearden, and M. Goldszmidt, "Exploiting structure in policy construction," in Proceedings of the International Joint Conference on Artificial Intelligence, 1995, pp. 1104-1111.
[16] O. Sigaud and O. Buffet, Eds., Markov Decision Processes in Artificial Intelligence, Wiley, 2010.
[17] C. Lee, W. Chen, S. S. Bhattacharyya, and T. Lee, "Dynamic, data-driven spectrum management in cognitive small cell networks," in 8th International Conference on Signal Processing and Communication Systems (ICSPCS), Gold Coast, Australia, Dec 2014, pp. 15-17.
[18] J. Hu, Z. Zuo, Z. Huang, and Z. Dong, "Dynamic digital channelizer based on spectrum sensing," PLOS ONE, August 2015.
[19] D. Zhou, "A review of polyphase filter banks and their application," Tech. Rep. AFRL-IF-RS-TR-2006-277, Air Force Research Laboratory, Rome, NY, USA, September 2006.
[20] F. Harris, C. Dick, X. Chen, and E. Venosa, "Wideband 160-channel polyphase filter bank cable TV channeliser," IET Signal Processing, vol. 5, no. 4, pp. 325-332, 2011.
[21] F. Harris, E. Venosa, X. Chen, C. Dick, and B. Adams, "A novel and efficient multi-resolution channelizer for software defined radio," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2013, pp. 2649-2653.
[22] S. Dhabu, Smitha K. G., and A. P. Vinod, "A low complexity reconfigurable channel filter based on decimation, interpolation and frequency response masking," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, May 2013, pp. 5583-5587.
[23] A. Edison and T. G. James, "Reconfigurable perfect reconstruction filter bank channelizer for software defined radio," in IEEE India Conference (INDICON), Kochi, India, Dec 2012, pp. 1138-1141.
[24] W. A. Abu-Al-Saud and G. L. Stuber, "Efficient wideband channelizer for software radio systems using modulated PR filterbanks," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 2807-2820, 2004.
[25] Z. Chang, A. P. Vinod, and P. K. Meher, "Reconfigurable architectures for low complexity software radio channelizers using hybrid filter banks," in Proceedings of the IEEE Singapore International Conference on Communication Systems, 2006, pp. 1-5.
[26] S. J. Darak, S. K. P. Gopi, V. A. Prasad, and E. Lai, "Low-complexity reconfigurable fast filter bank for multi-standard wireless receivers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 5, pp. 1202-1206, 2014.
[27] P. K. Devi and R. S. Bhuvaneswaran, "Flexible reconfigurable architecture for SDR receiver," in IEEE Malaysia International Conference on Communications (MICC), Kuala Lumpur, Nov 2013, pp. 265-270.
[28] Y. Wei, X. Wang, F. Guo, G. Hogan, and M. Collier, "Energy saving local control policy for green reconfigurable routers," in IEEE International Conference on Communications, 2015, pp. 221-225.
[29] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Prentice Hall, second edition, 1999.
[30] M. Xu, H. Li, and X. Gan, "Energy efficient sequential sensing for wideband multi-channel cognitive network," in IEEE International Conference on Communications, 2011, pp. 1-5.
[31] B. Farzad, "Variable bandwidth polyphase filter banks," M.S. thesis, San Diego State University, 2014.
[32] B. E. Bjornson, E. A. Jorswieck, M. Debbah, and B. Ottersten, "Multiobjective signal processing optimization: The way to balance conflicting metrics in 5G systems," IEEE Signal Processing Magazine, vol. 31, no. 6, pp. 14-23, 2014.
[33] P. L. Fackler, "MDPSOLVE: a MATLAB toolbox for solving Markov decision problems with dynamic programming - user's guide," Tech. Rep., North Carolina State University, January 2011.
[34] L. Benini, A. Bogliolo, A. Paleologo, and G. De Micheli, "Policy optimization for dynamic power management," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 6, pp. 813-833, June 1999.
[35] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson, third edition, 2009.
[36] C. M. Grinstead and J. L. Snell, Introduction to Probability, American Mathematical Society, second edition, 1991.
[37] A. Y. Ng, Shaping and Policy Search in Reinforcement Learning, Ph.D. thesis, University of California, Berkeley, 2003.
[38] A. Sapio, S. Bhattacharyya, and M. Wolf, "Efficient solving of Markov decision processes on GPUs using parallelized sparse matrices," in Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), Oct 2018.
[39] A. Munir and A. Gordon-Ross, "An MDP-based application oriented optimal policy for wireless sensor networks," in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2009, pp. 183-192.
[40] A. Singh, A. Prakash, K. Basireddy, G. Merrett, and B. Al-Hashimi, "Energy-efficient run-time mapping and thread partitioning of concurrent OpenCL applications on CPU-GPU MPSoCs," ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, pp. 147:1-147:22, Sept. 2017.
[41] S. Wang, G. Zhong, and T. Mitra, "CGPredict: Embedded GPU performance estimation from single-threaded applications," ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, pp. 146:1-146:22, Sept. 2017.
[42] S. Ruiz and B. Hernandez, "A parallel solver for Markov decision process in crowd simulations," in Artificial Intelligence (MICAI), Fourteenth Mexican International Conference on, Mexico, October 2015, pp. 107-116.
[43] S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, "Sparse matrix-vector multiplication on GPGPUs," ACM Trans. Math. Softw., vol. 43, no. 4, pp. 30:1-30:49, Jan. 2017.
[44] Y. Debizet, G. Lallement, F. Abouzeid, P. Roche, and J. Autran, "Q-learning-based adaptive power management for IoT system-on-chips with embedded power states," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), May 2018, pp. 1-5.
[45] E. Jung, F. Maker, T. L. Cheung, X. Liu, and V. Akella, "Markov decision process (MDP) framework for software power optimization using call profiles on mobile phones," Journal of Design Automation for Embedded Systems, vol. 14, no. 2, pp. 131-159, 2010.
[46] Z. N. Sunberg, M. J. Kochenderfer, and M. Pavone, "Optimized and trusted collision avoidance for unmanned aerial vehicles using approximate dynamic programming," in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 1455-1461.
[47] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier, "SPUDD: stochastic planning using decision diagrams," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999, pp. 279-288.
[48] A. Jonsson and A. Barto, "Causal graph based decomposition of factored MDPs," Journal of Machine Learning Research, vol. 7, pp. 2259-2301, 2006.
[49] T. Dean and S. Lin, "Decomposition techniques for planning in stochastic domains," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), Volume 2, 1995, pp. 1121-1127.
[50] A. Somani, N. Ye, D. Hsu, and W. Sun Lee, "DESPOT: online POMDP planning with regularization," Advances in Neural Information Processing Systems, vol. 58, 2013.
[51] A. Wijs, J.-P. Katoen, and D. Bošnački, "Efficient GPU algorithms for parallel decomposition of graphs into strongly connected and maximal end components," Formal Methods in System Design, vol. 48, no. 3, pp. 274-300, Jun 2016.
[52] R. Bellman and E. Lee, "History and development of dynamic programming," IEEE Control Systems Magazine, vol. 4, no. 4, pp. 24-28, November 1984.
[53] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley and Sons, Inc., first edition, 2005.
[54] S. Shresthamali, M. Kondo, and H. Nakamura, "Adaptive power management in solar energy harvesting sensor node using reinforcement learning," ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, pp. 181:1-181:21, Sept. 2017.
[55] P. Wägemann, T. Distler, H. Janker, P. Raffeck, V. Sieh, and W. Schröder-Preikschat, "Operating energy-neutral real-time systems," ACM Trans. Embed. Comput. Syst., vol. 17, no. 1, pp. 11:1-11:25, Aug. 2017.
[56] O. Brock, J. Trinkle, and F. Ramos, SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces, MIT Press, 2009.
[57] T. Smith and R. Simmons, "Point-based POMDP algorithms: Improved analysis and implementation," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, July 2005.
[58] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, first edition, 1998.
[59] A. D. Tijsma, M. M. Drugan, and M. A. Wiering, "Comparing exploration strategies for Q-Learning in random stochastic mazes," in Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Dec 2016, pp. 1-8.
[60] R. C. Hsu, C. T. Liu, and H. L. Wang, "A reinforcement learning-based ToD provisioning dynamic power management for sustainable operation of energy harvesting wireless sensor node," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 2, pp. 181-191, June 2014.
[61] W. Liu, Y. Tan, and Q. Qiu, "Enhanced Q-learning algorithm for dynamic power management with performance constraint," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), March 2010, pp. 602-605.
[62] S. Yue, D. Zhu, Y. Wang, and M. Pedram, "Reinforcement learning based dynamic power management with a hybrid power supply," in Proceedings of the IEEE International Conference on Computer Design (ICCD), Sept 2012, pp. 81-86.
[63] J. Modayil, A. White, P. M. Pilarski, and R. S. Sutton, "Acquiring a broad range of empirical knowledge in real time by temporal-difference learning," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), Oct 2012, pp. 1903-1910.
[64] T. J. Perkins and A. G. Barto, "Lyapunov design for safe reinforcement learning," J. Mach. Learn. Res., vol. 3, pp. 803-832, Mar. 2003.
[65] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, "A Lyapunov-based approach to safe reinforcement learning," in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), Curran Associates Inc., USA, 2018, pp. 8103-8112.
[66] S. Dawaliby, A. Bradai, and Y. Pousset, "In depth performance evaluation of LTE-M for M2M communications," in Proceedings of the IEEE 12th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), Oct 2016, pp. 1-8.
[67] A. Sapio, R. Tatiefo, S. Bhattacharyya, and M. Wolf, "GEMBench: a platform for collaborative development of GPU accelerated embedded Markov decision systems," in Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), Jul 2019.
[68] D. Noer, "Parallelization of the value-iteration algorithm for partially observable Markov decision processes," M.S. thesis, Technical University of Denmark, 2013.
[69] "GEMBench Package," 2019, https://ece.umd.edu/DSPCAD/projects/csm/packages/gembench.tar.gz, visited on March 14, 2019.
[70] A. R. Cassandra, Exact and Approximate Algorithms for Partially Observable Markov Decision Processes, Ph.D. thesis, Brown University, 1998.
[71] H. S. Sim, K.-E. Kim, J. H. Kim, D.-S. Chang, and M.-W. Koo, "Symbolic heuristic search value iteration for factored POMDPs," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2008.
[72] K. Murphy, "Markov decision process toolbox for MATLAB," https://www.cs.ubc.ca/~murphyk/Software/MDP/mdp.html, accessed: 2019-01-05.
[73] T. Smith, Probabilistic Planning for Robotic Exploration, Ph.D. thesis, Carnegie Mellon University, 2007.
[74] P. Poupart, Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes, Ph.D. thesis, University of Toronto, 2005.
[75] P. Elod, "Vision-based quadcopter navigation for following indoor corridors and outdoor railways," M.S. thesis, Tech. Univ. of Cluj-Napoca, 2014.
[76] D. Maniloff, "libpomdp," https://www.cs.uic.edu/~dmanilof/code.html, accessed: 2019-01-05.
[77] E. Svalorzen, "AI-Toolbox," https://github.com/Svalorzen/AI-Toolbox, accessed: 2019-01-05.
[78] D. McDermott, PDDL - The Planning Domain Definition Language, Yale University, 1998.
[79] H. L. S. Younes and M. L. Littman, "PPDDL1.0: the language for the probabilistic part of IPC-4," Tech. Rep.
[80] S. Sanner, "Relational dynamic influence diagram language (RDDL): Language description," 2010.
[81] D. Silver et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, p. 354, Oct 2017.
[82] L. Lai and N. Suda, "Enabling deep learning at the IoT edge," in Proceedings of the International Conference on Computer-Aided Design (ICCAD '18), 2018.