ABSTRACT

Title of dissertation: PARTICLE FILTERING FOR STOCHASTIC CONTROL AND GLOBAL OPTIMIZATION

Enlu Zhou, Doctor of Philosophy, 2009

Dissertation directed by: Professor Steven I. Marcus, Department of Electrical and Computer Engineering, and Professor Michael C. Fu, Department of Decision, Operations, and Information Technologies

This thesis explores new algorithms and results in stochastic control and global optimization through the use of particle filtering. Stochastic control and global optimization are two areas that have many applications but are often difficult to solve. In stochastic control, an important class of problems, namely, partially observable Markov decision processes (POMDPs), provides an ideal paradigm to model discrete-time sequential decision making under uncertainty and partial observation. However, POMDPs usually do not admit analytical solutions, and are computationally very expensive to solve most of the time. While many efficient numerical algorithms have been developed for finite-state POMDPs, there are only a few proposed for continuous-state POMDPs, and even more sparse are relevant analytical results regarding convergence and error bounds. From the modeling viewpoint, many application problems are modeled more naturally by continuous-state POMDPs than by finite-state POMDPs. Therefore, one part of the thesis is devoted to developing a new efficient algorithm for continuous-state POMDPs and studying the performance of the algorithm both analytically and numerically. Based on the idea of density projection with particle filtering, the proposed algorithm reduces the infinite-dimensional problem to a finite low-dimensional one, and also has the flexibility and scalability for better approximation if given more computational power. Error bounds are proved for the algorithm, and numerical experiments are carried out on an inventory control problem.
In global optimization, many problems are very difficult to solve due to the presence of multiple local optima or badly scaled objective functions. Many approximate solution methods have been developed and studied. Among them, a recent class of simulation-based methods share the common characteristic of repeatedly drawing candidate solutions from an intermediate probability distribution and then updating the distribution using these candidate solutions, until the probability distribution becomes concentrated on the optimal solution. The efficiency and accuracy of these algorithms depend very much on the choice of the intermediate probability distributions and the updating schemes. Using a novel interpretation of particle filtering, these algorithms are unified under one framework, and hence, many new insights are revealed. By better understanding these existing algorithms, the framework also holds the promise for developing new improved algorithms. Some directions for new improved algorithms are proposed, and numerical experiments are carried out on a few benchmark problems.

PARTICLE FILTERING FOR STOCHASTIC CONTROL AND GLOBAL OPTIMIZATION

by Enlu Zhou

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2009

Advisory Committee: Professor Steven I. Marcus, Chair/Advisor; Professor Michael C. Fu, Co-Chair/Co-Advisor; Professor P. S. Krishnaprasad; Professor André L. Tits; Professor S. Raghavan

© Copyright by Enlu Zhou 2009

Dedication

In memory of my grandfather Mr. Mengquan Yu.

Acknowledgments

First of all, I would like to express my sincerest gratitude to my advisors Professor Steven I. Marcus and Professor Michael C. Fu. They have guided me in research and in career, provided me with tremendous support, and always had faith in me.
I regard them as role models for my career because of their dedication to the quality and integrity of research, their love and patience for students, and their being wonderful human beings. I have become a better person because of the five years of interaction with them. I will always cherish my experience as a graduate student under their supervision. I would also like to thank Professor P. S. Krishnaprasad. I took five graduate courses taught by him, and often went to ask him questions or discuss problems. He is also the first person who introduced me to the field of particle filtering. His lectures are insightful, inspiring, and challenging. I have been greatly influenced by his great passion for research and pursuit of new knowledge. I also want to thank Professor André L. Tits. He is not only a great teacher but also a good friend. I have learned a lot from him in class, and especially benefited from his technical rigor. I also had a lot of fun with him skiing, playing ping pong, and chatting at coffee hours. I cannot express enough thanks to my mother Zhilin Yu and father Huaishen Zhou, who always give me unconditional love and support, and are the source of my courage in hard times. I also want to thank Peng Qiu, for the continuous happy time spent together and the discrete times listening to my complaints.

Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
1.1 Particle Filtering
1.2 Partially Observable Markov Decision Processes
1.3 Contributions
1.4 Outline

2 Nonlinear Filtering
2.1 Extended Kalman Filter
2.2 Weighted Extended Kalman Filter
2.3 Particle Filtering
2.4 Fading Memory Particle Filter

3 Stochastic Control
3.1 Markov Decision Processes
3.1.1 Value Iteration
3.1.2 Policy Iteration
3.1.3 Simulation-Based Methods
3.2 Partially Observable Markov Decision Processes
3.2.1 Belief MDPs
3.2.2 Finite-State vs. Continuous-State

4 Solving Continuous-State POMDPs
4.1 Related Work and Motivation
4.2 Dimension Reduction
4.2.1 Density Projection
4.2.2 Projected Belief MDP
4.3 Projection Particle Filtering
4.4 Analysis of Projected Belief MDP
4.4.1 Main Idea
4.4.2 Error Bound
4.5 Analysis of Projection Particle Filtering
4.5.1 Notations
4.5.2 Main Idea
4.5.3 Error Bound
4.6 Validation of Assumptions
4.7 Numerical Experiments
4.7.1 Scalability and Computational Issues
4.7.2 Simulation Results
4.8 Conclusions
5 Particle Filtering Framework for Optimization
5.1 Related Work and Motivation
5.2 Filtering for Optimization
5.3 Particle Filtering Framework for Optimization
5.4 Interpretation of EDAs, CE, MRAS
5.4.1 Estimation of Distribution Algorithms
5.4.2 Cross Entropy Method
5.4.3 Model Reference Adaptive Search
5.5 Implication for New Algorithms
5.5.1 Balancing Exploration and Exploitation
5.5.2 Combining Global Search with Local Search
5.5.3 Overcoming Premature Convergence
5.6 Numerical Experiments
5.7 Conclusions and Future Research

6 Conclusions and Future Research
6.1 Conclusions
6.2 Future Research

Bibliography

List of Tables

2.1 Performances under different real system parameters θ_r, with model system parameter θ_m = 0.5 and system noise u_k ~ Γ(3,1). Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.
2.2 Performances under different system noises, with system parameter θ_r = θ_m = 0.5. Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.
4.1 Notations of Different Conditional Distributions
4.2 Optimal average cost estimates for the inventory control problem using different methods. Each entry represents the average cost of a run of horizon 10^5.
4.3 Optimal discounted cost estimates for the inventory control problem using different methods. Each entry represents the discounted cost (mean, standard error in parentheses) based on 1000 independent runs of horizon 40.
4.4 Table 4.3, continued
5.1 Average performance of CEA and CE on some benchmark problems. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.
5.2 Average performance of CEA with different parameter values of α and β on the Rosenbrock function. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.

List of Figures

2.1 A graphic illustration of Table 2.1: Performances (mean and standard error of MSEs based on 100 independent runs) under different values of the real system parameter θ_r.
2.2 A graphic illustration of Table 2.2: Performances (mean and standard error of MSEs based on 100 independent runs) under different system noises.
2.3 A typical run of the estimated states tracking the true state, when the system noise u_k ~ Γ(3,2) and the system parameter θ_r = θ_m = 0.5.
2.4 A typical run of the estimated states tracking the true state when the system model is inaccurate. The real system parameter θ_r = 0.6, the model system parameter θ_m = 0.5, and the system noise u_k ~ Γ(3,2).
4.1 Our algorithm: actions taken for different inventory levels under different observation noise variances.
5.1 At the first three iterations, the conditional density b_k becomes more and more concentrated on the optimal solution.
5.2 Some benchmark problems in two dimensions.
5.3 Average performance of CEA and CE on some benchmark problems.

List of Abbreviations

CE Cross Entropy Method
DP Dynamic Programming
EDA Estimation of Distribution Algorithm
EKF Extended Kalman Filter
MDP Markov Decision Process
MRAS Model Reference Adaptive Search
KF Kalman Filter
KL Kullback-Leibler
PF Particle Filter
POMDP Partially Observable Markov Decision Process

Chapter 1
Introduction

Stochastic control and global optimization are fields that share two characteristics: (i) their models can be used to formulate problems in many application areas; (ii) algorithms for solving problems in these fields are usually intractable and difficult to analyze. Attacking difficult problems of the size found in most applications requires significant new methodologies. This dissertation attempts to solve problems in stochastic control and global optimization through the use of particle filtering. This chapter gives a brief introduction to particle filtering and to an important class of problems in stochastic control, namely, partially observable Markov decision processes (POMDPs).

1.1 Particle Filtering

The goal of filtering is to estimate the unobserved states of a dynamic system given a noisy observation process of the states. The classic problem is to estimate the conditional density of the current state given the history of observations [24]. The conditional density usually evolves according to some equation, for instance, in the case of a diffusion process the normalized Kushner-Stratonovich equation [52] [84] or the unnormalized Zakai equation [92].
However, these equations usually do not admit an analytical form of the solution except in some special cases, such as linear Gaussian systems and finite-state-space hidden Markov models. For more general systems, many approximate filtering methods have been developed. One approach to approximation is to modify the system in such a way that exact filters can be applied to the modified system. For example, a well-known method is the extended Kalman filter (EKF) (pp. 182-190, [31]) for nonlinear/non-Gaussian systems. The EKF linearizes the system equation and observation equation, and approximates the system noise and observation noise by their first two moments, i.e., by Gaussian random variables. Hence, the approximated linear Gaussian system can be filtered using the standard Kalman filter [46] [47]. The EKF is computationally efficient, but the convergence of the filter is not guaranteed and the performance can be poor in many situations. Another method is to discretize the state space and transform the system into a finite-state hidden Markov model. The disadvantage of this method is that the number of grid points has to be sufficiently large to obtain a good approximation, and the number of grid points increases dramatically as the dimension of the state space increases. A recent and powerful method for nonlinear filtering is particle filtering. It is a class of Monte Carlo simulation-based filtering methods for nonlinear/non-Gaussian systems, a setting where traditional methods often lead to computational intractability. Particle filtering was first introduced by Gordon et al. [34], and is also called bootstrap filtering [34], sampling importance resampling [4], the condensation algorithm [42], and the sequential Monte Carlo method [4]. A good tutorial can be found in [4], a good survey of the field and recent trends can be found in [21], and more details can be found in the book [29].
The idea of particle filtering is to approximate the conditional density by a finite number of particles/samples and let these particles propagate in a certain way to mimic the evolution of the conditional density. The approximated conditional density converges to the true conditional density in a certain sense as the number of particles increases to infinity [23], [54]. The plain particle filter consists of three steps at each time/iteration: importance sampling, weighting, and resampling. In the importance sampling step, i.i.d. particles are sampled from an importance density. In the weighting step, particles are weighted according to the ratio of the conditional density to the importance density, and also according to Bayes' rule using the new observation of the current state. These weighted particles represent a discrete distribution with support points equal to the locations of the particles and the associated probabilities equal to the weights of the particles. This discrete distribution is an approximation to the true conditional distribution. In the last step, new particles are resampled from the old particles according to their weights. These new particles go through the same steps at the next time/iteration. The importance distribution is often chosen as the distribution of the current state given the previous state. The benefit of this choice is twofold. First, this importance distribution is easy to sample, since sampling from it is equivalent to propagating the particles through the system equation. Thus, the importance sampling step is also called the propagation or prediction step in this case. Second, the weight of each particle is proportional to the likelihood of the current observation given that particle, and hence is very easy to calculate. However, this is not the optimal choice of importance distribution. The simulation can be very inefficient if the importance distribution is very different from the target distribution.
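The three steps just described can be sketched in a short program. The following is a minimal bootstrap filter, i.e., the importance distribution is the state transition distribution, for a generic additive-Gaussian model; the particular functions f and h, the noise levels, and the standard-normal prior are illustrative placeholders, not a system from this thesis.

```python
import numpy as np

def bootstrap_particle_filter(y, f, h, sigma_u, sigma_v, n_particles=500, rng=None):
    """Bootstrap particle filter sketch for x_k = f(x_{k-1}) + u_k, y_k = h(x_k) + v_k,
    with scalar state, Gaussian noises, and a standard-normal prior (all assumptions).
    Returns the posterior-mean estimates of x_k given y_{0:k}."""
    rng = np.random.default_rng(0) if rng is None else rng
    particles = rng.normal(0.0, 1.0, n_particles)  # samples from the prior p0
    estimates = []
    for yk in y:
        # Importance sampling / propagation: push particles through the dynamics.
        particles = f(particles) + rng.normal(0.0, sigma_u, n_particles)
        # Weighting: likelihood of the new observation under each particle (Bayes' rule).
        w = np.exp(-0.5 * ((yk - h(particles)) / sigma_v) ** 2)
        w /= w.sum()
        estimates.append(np.sum(w * particles))
        # Resampling: draw new particles according to the weights.
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(estimates)
```

With this choice of importance distribution, the weighting step reduces to evaluating the observation likelihood at each particle, which is exactly the second benefit noted above.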
For instance, if the conditional distribution is peaked and the importance distribution is flat, it is very likely that most of the particles sampled from the importance distribution lie in an area of probability close to zero according to the conditional distribution. Therefore, many improved particle filters have been proposed based on a better choice of the importance distribution, such as the extended Kalman particle filter [25], the auxiliary particle filter [68], and the unscented particle filter [88]. The plain particle filter often suffers from the problem of sample impoverishment or loss of diversity, meaning that all particles collapse to a small number of particles within a few iterations. Since new particles are resampled from the old particles according to their weights/probabilities, old particles with large weights/probabilities will have more copies in the next iteration. The problem of sample impoverishment is especially severe when the system noise is small, because the copies of the same particle cannot be dispersed far enough through propagation. Therefore, improving the resampling step is also a goal of many improved particle filters, such as the regularized particle filter [65] and the particle filter with MCMC steps [3].

1.2 Partially Observable Markov Decision Processes

POMDPs provide a powerful paradigm for modeling discrete-time optimal decision making under uncertainty and partial observation. They have been used to model application problems arising in areas such as manufacturing systems, artificial intelligence, financial engineering, and operations research. A POMDP model consists of a time-indexed state equation that models the system dynamics, a time-indexed observation equation that models the observation or measurement of the state, and an objective function that models the cost (or reward) that needs to be minimized (or maximized).
A decision maker interacts with the environment over a finite or infinite time horizon that is divided into stages or periods. At each stage, the decision maker receives a partial observation of the current state, takes an action based on this piece of information along with the history of observations and actions, and then transitions to a new state probabilistically at the next stage. The action taken incurs a one-step cost (or reward). The objective is to minimize (or maximize) a cost (or reward) function, where the one-step costs (or rewards) are accrued at each stage. The difference between a POMDP and a Markov decision process (MDP) is in the observation of the state. The state can be fully recovered from the observation in an MDP, whereas the observation provides only partial information about the state in a POMDP. A POMDP can be transformed into a so-called belief MDP, which is an MDP whose states are conditional probability distributions of the true states of the POMDP, called belief states. This transformation allows us to utilize the various existing techniques for solving MDPs. However, it also creates new difficulties. The first difficulty is how to accurately and efficiently estimate the distribution of the states, which is the goal of a filtering problem. The second difficulty is that although the transformed problem is an MDP, its state space is generally continuous and may be infinite-dimensional. The reason is as follows. Consider a POMDP with a finite state space. The estimate of the state is a discrete probability distribution, which sits in a finite-dimensional probability simplex. Hence, the corresponding belief MDP has a finite-dimensional continuous state space, which suffers from the curse of dimensionality of the continuous state space of the MDP. Now consider a POMDP that has a continuous state space.
The probability distribution of the state sits in an infinite-dimensional space of continuous probability distributions. Hence, the corresponding belief MDP has an infinite-dimensional continuous state space, which makes the problem even more difficult to solve. Therefore, efficient numerical solution of POMDPs with large or continuous state spaces is a challenging research area. Past research on numerical solutions of POMDPs has mostly focused on finite-state problems. For finite-state POMDPs, it is proved that the value function is a piecewise linear convex function after a finite number of iterations, provided that the one-step cost function is piecewise linear and convex [81]. This feature has been exploited in various exact and approximate algorithms such as those found in [82], [81], [83], [57], and [36]. The algorithm in [81] and the Witness algorithm in [57] carry out exact value iterations by finding and updating the minimum set of linear functions that determine the value function at each iteration for a finite-horizon problem. Because the number of such linear functions grows exponentially with the number of iterations, these algorithms are computationally very expensive and are limited to very simple problems in practice. Howard [39] and Sondik [83] use policy iteration to solve infinite-horizon discounted-cost problems. Hauskrecht [36] summarizes several value function approximation methods in a nice framework of modifying the value iteration equation by changing the order of summations and maximization, including approximation with a fully observable MDP, approximation with Q-functions, the fast informed bound method, and approximation with an unobservable MDP. [36] also summarizes many grid-based methods with various techniques for choosing the set of grid points and approximating the value function.
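For a finite-state POMDP, the belief state is simply a probability vector, and its transition is one application of Bayes' rule on the simplex; this is the belief-MDP state update underlying the algorithms cited above. A minimal sketch, where the two-state transition matrix and observation likelihoods are illustrative numbers rather than a model from this thesis:

```python
import numpy as np

def belief_update(b, T, O, obs):
    """One belief-MDP state transition for a finite-state POMDP.

    b:   current belief, a probability vector over states
    T:   T[i, j] = P(next state j | state i) under the chosen action
    O:   O[j, o] = P(observation o | state j)
    obs: index of the observation just received
    """
    predicted = b @ T                      # prior over the next state
    unnormalized = predicted * O[:, obs]   # Bayes' rule with the new observation
    return unnormalized / unnormalized.sum()

# Illustrative two-state example: the updated belief stays on the simplex.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
O = np.array([[0.7, 0.3], [0.4, 0.6]])
b = belief_update(np.array([0.5, 0.5]), T, O, obs=0)
```

The update costs only a matrix-vector product per stage; the difficulty discussed above is that for a continuous state space the belief is a density, and no such finite-dimensional recursion is available in general.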
Despite the abundance of algorithms for finite-state POMDPs, the aforementioned infinite dimensionality of continuous-state POMDPs suggests that simple generalizations of many of the discrete-state algorithms to continuous-state models are not appropriate or applicable. Therefore, some researchers have been motivated to look for efficient algorithms for continuous-state POMDPs [87] [69] [75] [17]. However, compared to finite-state POMDPs, the research on continuous-state POMDPs is more recent and still sparse. Therefore, it is a field worth exploring.

1.3 Contributions

My doctoral research has yielded new solution methods for partially observable Markov decision processes and global optimization through the use of particle filtering. To the best of our knowledge, application of particle filtering to POMDPs is relatively sparse, and it has never before been applied to the field of global optimization. The first line of our research aims to establish a framework for solving large or continuous-state POMDPs. Most of the existing approximate solutions for continuous-state POMDPs involve some kind of dimension reduction of the belief space, i.e., reducing the infinite dimensionality to a finite (low) dimensionality. Dimension reduction is a very broad idea, and it is worth studying how to do dimension reduction in an efficient and meaningful way. On the one hand, we want to reduce the dimensionality as much as possible so that the computational demand can be lowered. On the other hand, we do not want to lose too much information during dimension reduction, so that the approximation remains useful. Moreover, we would like to have a mechanism that allows us to control the tradeoff between complexity and accuracy: we can increase the accuracy of the solution by increasing the computational power, and vice versa.
These considerations have motivated our research to investigate a new dimension reduction technique to develop efficient numerical solutions for large/continuous-state POMDPs. Based on the idea of density projection with particle filtering, we have developed a theoretically sound method that effectively reduces the dimension of the belief state and has the flexibility to represent arbitrary belief states, such as multimodal or heavy-tailed distributions. We have proved rigorous convergence results and error bounds for the algorithm. We have also applied the approach to an inventory control problem and to portfolio optimization problems in financial engineering, obtaining good numerical results. The development and analysis of the approach appear in our papers [95] [96]. Applications of the approach to financial engineering are to appear in our working paper [94]. As a second line of our research, we have proposed a filtering approach to optimization, and in particular, have developed a framework based on particle filtering. Global optimization problems can be extremely difficult to solve, due to the presence of multiple local optimal solutions in the general setting. A class of simulation-based methods for global optimization has been introduced recently, which includes the estimation of distribution algorithms (EDAs), the cross-entropy (CE) method, and model reference adaptive search (MRAS). In these algorithms, new solutions are generated from an intermediate probability distribution that is updated or induced from the previously generated solutions. Algorithms that fall into this category differ in the choice and updating of the intermediate probability distribution, which plays a key role in determining the effectiveness of the algorithm. This has motivated us to look for a unifying and systematic approach to such simulation-based methods for optimization.
We have introduced a framework based on particle filtering that unifies EDAs, the CE method, and MRAS, as well as combining simulation-based global search with gradient-based local search in a nice way. This flexible framework holds the promise of generating new improved algorithms by incorporating many of the vast array of techniques that have been developed to improve particle filters, as well as other recent results in nonlinear filtering. We are currently developing and testing new algorithms under this framework, and analyzing convergence properties and error bounds of the framework. Preliminary results of this work appear in our paper [97]. A more complete and developed treatment appears in our paper [96]. Besides the two main lines of research, we have also developed a fading memory particle filter. It is an improved particle filter for situations where the system model is inaccurate and simulation is insufficient.

1.4 Outline

The rest of the dissertation is organized as follows. Chapter 2 provides the necessary background and literature review on nonlinear filtering, and proposes a fading memory particle filter. We describe the problem formulation of nonlinear filtering, and describe in detail some approximate nonlinear filters, including the extended Kalman filter, the weighted extended Kalman filter, and particle filtering in general. At the end of the chapter, we present the derivation of a fading memory particle filter and results from numerical experiments. Chapter 3 provides the necessary background and literature review on stochastic control.
In particular, we focus on Markov decision processes and partially observable Markov decision processes, and describe the problem formulation of each. For MDPs, we describe in detail the value iteration method and the policy iteration method, and briefly describe simulation-based methods. For POMDPs, we describe the transformation to belief MDPs, and discuss the numerical complexity of finite-state vs. continuous-state POMDPs. Chapter 4 presents our work on a new efficient numerical method for solving continuous-state POMDPs. Section 4.1 reviews related work and explains our motivation for this work; section 4.2 describes the density projection technique, and uses it to develop the projected belief MDP; section 4.3 develops the projection particle filter; section 4.4 derives error bounds for the projected belief MDP; section 4.5 derives an error bound for the projection particle filter; section 4.6 presents a validation of the assumptions used in the proofs of the error bounds; section 4.7 discusses scalability and computational issues of the method, and applies the method to a simulation example of an inventory control problem; section 4.8 concludes and discusses the work. Chapter 5 presents a particle filtering framework for simulation-based optimization algorithms.
Section 5.1 reviews related work and explains our motivation for this work; section 5.2 transforms a global optimization problem into a filtering problem; section 5.3 applies particle filtering to the transformed filtering problem and develops a general framework for simulation-based optimization algorithms; section 5.4 uses the framework to interpret some existing algorithms and reveals some new insights; section 5.5 discusses directions for developing new improved algorithms under this framework; section 5.6 presents numerical results of a new improved algorithm in comparison with the cross-entropy method on some benchmark problems; section 5.7 concludes our current work and discusses some future research directions. Chapter 6 concludes the dissertation and outlines some future research.

Chapter 2
Nonlinear Filtering

The nonlinear filtering problem [24] [61] [45] involves the estimation of a stochastic process x (called the state process) which cannot be observed directly. Information concerning x is obtained from observations of a related process y (the observation process). The objective is the computation, for each t, of least-squares estimates of functions of the state x_t given the "past" observations \{y_s, 0 \leq s \leq t\}, i.e., to compute quantities of the form E[\ell(x_t) \mid y_s, 0 \leq s \leq t], or perhaps to compute the entire conditional distribution of the state x_t given \{y_s, 0 \leq s \leq t\}. When the observations are received sequentially, it is also desired that this computation be done recursively. Except in some special cases such as the linear Gaussian case, there do not exist statistics of the conditional distribution that can be computed recursively with finite-dimensional filters [16] [63] [62] [40] [90]. Hence, there have been many attempts to develop recursive finite-dimensional suboptimal filters, and significant effort has been dedicated to developing numerical methods for solving nonlinear filtering problems (see [19] for a recent survey).
In this thesis, we focus on a discrete-time model
\[
x_k = f(x_{k-1}, u_{k-1}), \quad k = 1, 2, \ldots,
\]
\[
y_k = h(x_k, v_k), \quad k = 0, 1, \ldots, \tag{2.1}
\]
where for all $k$, $x_k \in \mathbb{R}^{n_x}$ is the state, $y_k \in \mathbb{R}^{n_y}$ is the observation, the random disturbances $u_k \in \mathbb{R}^{n_u}$ and $v_k \in \mathbb{R}^{n_v}$ form sequences of i.i.d. continuous random vectors, and $n_x$, $n_y$, $n_u$, and $n_v$ are the dimensions of $x_k$, $y_k$, $u_k$, and $v_k$, respectively. Assume that $\{u_k\}$ and $\{v_k\}$ are independent of each other, and independent of $x_0$, which follows a distribution $p_0$. The goal of filtering is to estimate the conditional density
\[
p(x_k \mid y_{0:k}), \quad k = 0, 1, \ldots, \tag{2.2}
\]
where $y_{0:k} = \{y_0, \ldots, y_k\}$ denotes all the observations from time $0$ to $k$.

2.1 Extended Kalman Filter

In (2.1), if $f$ and $h$ are linear functions in $x$, $u$ and $v$, $u$ and $v$ are Gaussian noises, and the initial condition $x_0$ is Gaussian, then the conditional density $p(x_k \mid y_{0:k})$ is Gaussian for all time $k$, and there exists a finite-dimensional optimal filter, namely, the Kalman filter (KF) [46] [47]. If (2.1) takes the form
\[
x_k = f(x_{k-1}) + u_{k-1}, \quad k = 1, 2, \ldots,
\]
\[
y_k = h(x_k) + v_k, \quad k = 0, 1, \ldots, \tag{2.3}
\]
where $f$ and $h$ are nonlinear functions, then the Kalman filter can be applied to (2.3) by linearizing the system equations, and approximating the system noise and observation noise as Gaussian noises. This approach is referred to as the extended Kalman filter (EKF) [31] [1] [2]. More specifically, the true conditional density $p(x_k \mid y_{0:k})$ is approximated by a Gaussian density with mean $\hat{x}_{k|k}$ and covariance $P_{k|k}$ that are defined as
\[
\hat{x}_{k|k} \triangleq E[x_k \mid y_{0:k}], \quad P_{k|k} \triangleq E[(x_k - \hat{x}_{k|k})(x_k - \hat{x}_{k|k})^T \mid y_{0:k}].
\]
The estimates $\hat{x}_{k|k}$ and $P_{k|k}$ are often called the posterior mean and covariance of the state $x_k$. Similarly, the predicted mean and covariance at time $k$ are defined as
\[
\hat{x}_{k|k-1} \triangleq E[x_k \mid y_{0:k-1}], \quad P_{k|k-1} \triangleq E[(x_k - \hat{x}_{k|k-1})(x_k - \hat{x}_{k|k-1})^T \mid y_{0:k-1}].
\]
To linearize the system (2.3), define
\[
F_k = \left.\frac{\partial f(x)}{\partial x}\right|_{x = \hat{x}_{k|k}}, \quad H_k = \left.\frac{\partial h(x)}{\partial x}\right|_{x = \hat{x}_{k|k-1}}.
\]
Without loss of generality, we assume that $u_k$ and $v_k$ are zero mean.
Let $Q_k$ and $R_k$ denote the covariance matrices of $u_k$ and $v_k$, respectively. Then the linearized and Gaussianized system of (2.3) is
\[
x_k = F_{k-1} x_{k-1} + u_{k-1},
\]
\[
y_k = H_k x_k + v_k, \tag{2.4}
\]
where $u_{k-1} \sim N(0, Q_{k-1})$ and $v_k \sim N(0, R_k)$. For (2.4), if the initial condition $x_0$ follows a Gaussian distribution $N(\bar{x}_0, P_0)$, the conditional density is Gaussian for every time $k$, and the optimal filter is the Kalman filter, which recursively propagates the predicted mean $\hat{x}_{k|k-1}$ and the predicted covariance matrix $P_{k|k-1}$, and updates the posterior mean $\hat{x}_{k|k}$ and the posterior covariance matrix $P_{k|k}$ using the new observation $y_k$. Mathematically, the Kalman filter consists of the following recursive equations
\[
\hat{x}_{k|k-1} = F_{k-1} \hat{x}_{k-1|k-1},
\]
\[
P_{k|k-1} = F_{k-1} P_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P_{k|k-1} H_k^T (R_k + H_k P_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - H_k \hat{x}_{k|k-1}),
\]
\[
P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1}, \tag{2.5}
\]
with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P_{0|-1} = P_0$. The EKF extends the KF to the nonlinear system (2.3) as follows:
\[
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}),
\]
\[
P_{k|k-1} = F_{k-1} P_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P_{k|k-1} H_k^T (R_k + H_k P_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - h(\hat{x}_{k|k-1})),
\]
\[
P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1}, \tag{2.6}
\]
with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P_{0|-1} = P_0$. The EKF (2.6) looks almost the same as the KF (2.5), except that $f(\hat{x}_{k-1|k-1})$ replaces $F_{k-1} \hat{x}_{k-1|k-1}$ and $h(\hat{x}_{k|k-1})$ replaces $H_k \hat{x}_{k|k-1}$. The EKF is a suboptimal filter, and its convergence is not guaranteed.

2.2 Weighted Extended Kalman Filter

Sometimes the system model is not an accurate model of the actual system. The model inaccuracy degrades the value of past information [44]. Therefore, we sometimes want to discount old observations, or equivalently, to put more weight on the more recent observations. Fading memory of old observations can compensate for the model inaccuracy, and the physical justification is that "old observations, when predicted over long time arcs through an erroneous system, can be valueless" (pp. 305, [44]).
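For concreteness, one step of the EKF recursion (2.6) can be sketched in code before we modify it. This is a minimal sketch for the additive-noise model (2.3); the function names, the Jacobian arguments, and the interface are our own illustrative choices, not from the text.

```python
import numpy as np

def ekf_step(x_post, P_post, y, f, h, F_jac, H_jac, Q, R):
    """One step of the EKF recursion (2.6) for the additive-noise model
    x_k = f(x_{k-1}) + u_{k-1}, y_k = h(x_k) + v_k (a sketch; the
    interface is illustrative)."""
    # Prediction: propagate the posterior mean through f and the
    # covariance through the Jacobian F_{k-1} evaluated at the old mean.
    x_pred = f(x_post)
    F = F_jac(x_post)
    P_pred = F @ P_post @ F.T + Q
    # Update: Kalman gain and correction using the new observation y,
    # with the Jacobian H_k evaluated at the predicted mean.
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new
```

For linear $f$ and $h$ this reduces to one step of the KF (2.5).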
This idea of fading memory of old observations has been proposed and studied theoretically and empirically for several decades for the Kalman filter [1] [56] [31]. One approach often used is exponential data weighting, or in other words, exponentially discounting the old observations. Anderson and Moore derived the Kalman filter with exponential data weighting (KF-EDW) (pp. 135-138, [1]), by observing that the KF estimate $\hat{x}_{k|k-1}$ is equal to $x_k^*$, the last component of the solution $(x_0^*, \ldots, x_k^*)$ to the minimization problem
\[
\min_{x_0, \ldots, x_k} J_k = \frac{1}{2}(x_0 - \bar{x}_0)^T P_0^{\dagger} (x_0 - \bar{x}_0) + \frac{1}{2} \sum_{i=0}^{k-1} (y_i - H_i^T x_i)^T R_i^{\dagger} (y_i - H_i^T x_i) + \frac{1}{2} \sum_{i=0}^{k-1} u_i^T Q_i^{\dagger} u_i, \tag{2.7}
\]
\[
\text{s.t.} \quad x_i = F_{i-1} x_{i-1} + u_{i-1}, \quad i = 1, \ldots, k,
\]
where the superscript $\dagger$ denotes pseudo-inverse. The function (2.7) can be viewed as a total cost function on the estimation errors from time $0$ to $k-1$. With this interpretation, placing more emphasis on the more recent observations is equivalent to penalizing the more recent estimation errors more heavily. This suggests increasing the weighting matrices $R_i^{\dagger}$ and $Q_i^{\dagger}$ in (2.7) for larger values of $i$, such as replacing $R_i$ and $Q_i$ by $\alpha^{-2i} R_i$ and $\alpha^{-2(i+1)} Q_i$ in (2.7), yielding
\[
J_k = \frac{1}{2}(x_0 - \bar{x}_0)^T P_0^{\dagger} (x_0 - \bar{x}_0) + \frac{1}{2} \sum_{i=0}^{k-1} (y_i - H_i^T x_i)^T \alpha^{2i} R_i^{\dagger} (y_i - H_i^T x_i) + \frac{1}{2} \sum_{i=0}^{k-1} u_i^T \alpha^{2(i+1)} Q_i^{\dagger} u_i, \tag{2.8}
\]
where $\alpha$ is some constant greater than $1$. In view of the relationship between (2.7) and the Kalman filter, there could exist a similar relationship between (2.8) and a designed filter. More specifically, we can replace the actual covariance matrices $R_k$ and $Q_{k-1}$ in (2.5) by the design values $\alpha^{-2k} R_k$ and $\alpha^{-2k} Q_{k-1}$, to obtain the designed filter that achieves the minimum of (2.8). The resultant designed filter is the KF-EDW:
\[
\hat{x}_{k|k-1} = F_{k-1} \hat{x}_{k-1|k-1},
\]
\[
P^{\alpha}_{k|k-1} = \alpha^2 F_{k-1} P^{\alpha}_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P^{\alpha}_{k|k-1} H_k^T (R_k + H_k P^{\alpha}_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - H_k \hat{x}_{k|k-1}),
\]
\[
P^{\alpha}_{k|k} = P^{\alpha}_{k|k-1} - K_k H_k P^{\alpha}_{k|k-1}, \tag{2.9}
\]
where $P^{\alpha}_{k|k-1} = \alpha^{2k} P_{k|k-1}$ and $P^{\alpha}_{k|k} = \alpha^{2k} P_{k|k}$, with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P^{\alpha}_{0|-1} = P_0$.
The parameter $\alpha$ is called the forgetting factor, because it indicates how fast the observations are forgotten. Notice that the main difference between (2.9) and (2.5) is the coefficient $\alpha^2$ in the second equation. Also notice that while $P_{k|k}$ in the KF is the error covariance $E[(x_k - \hat{x}_{k|k})(x_k - \hat{x}_{k|k})^T \mid y_{0:k}]$, the estimate $P^{\alpha}_{k|k}$ in the KF-EDW is not. Although the KF-EDW is not optimal, the exponential data weighting promotes exponential stability of the filter. In a similar spirit to how the EKF extends the KF, we extend the KF-EDW to the nonlinear system (2.3) as follows to obtain the weighted extended Kalman filter (WEKF):
\[
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}),
\]
\[
P^{\alpha}_{k|k-1} = \alpha^2 F_{k-1} P^{\alpha}_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P^{\alpha}_{k|k-1} H_k^T (R_k + H_k P^{\alpha}_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - h(\hat{x}_{k|k-1})),
\]
\[
P^{\alpha}_{k|k} = P^{\alpha}_{k|k-1} - K_k H_k P^{\alpha}_{k|k-1}, \tag{2.10}
\]
with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P^{\alpha}_{0|-1} = P_0$. Unlike the optimality difference between the KF and the KF-EDW, the EKF and the WEKF are both suboptimal, and neither of the estimates $P_{k|k}$ and $P^{\alpha}_{k|k}$ is the error covariance. Like the EKF, the convergence of the WEKF is not guaranteed either.

2.3 Particle Filtering

Particle filtering is a class of filters that utilize Monte Carlo simulation and importance sampling techniques to estimate the conditional density of the state given the past observations [4] [29] [21]. It is also called bootstrap filtering [34], sampling importance resampling [4], the condensation algorithm [42], and the sequential Monte Carlo method [21]. Particle filtering approximates the conditional density $p(x_k \mid y_{0:k})$ using a finite number of particles, mimicking the evolution of the conditional density through the propagation of these particles. More specifically, the particle filter approximates $p(x_k \mid y_{0:k})$ by a probability mass function
\[
\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i), \tag{2.11}
\]
where $\delta$ denotes the Dirac delta function, $\{x_k^i, i = 1, \ldots, N\}$ are the random support points, and $\{w_k^i, i = 1, \ldots, N\}$ are the associated weights satisfying $w_k^i \geq
0$, $i = 1, \ldots, N$, and $\sum_{i=1}^N w_k^i = 1$. Since $p(x_k \mid y_{0:k})$ is unknown, we opt to generate the particles by sampling from another known density $q(x_k \mid y_{0:k})$, and adjust the weights of the samples to get an estimate of $p(x_k \mid y_{0:k})$. This approach is known as importance sampling, and the density $q(x_k \mid y_{0:k})$ is referred to as the importance density. The rationale of importance sampling can be observed from the following transformation:
\[
E_p[\ell(x)] = \int \ell(x) p(x)\, dx = \int \ell(x) \frac{p(x)}{q(x)} q(x)\, dx = E_q\!\left[\ell(x) \frac{p(x)}{q(x)}\right], \tag{2.12}
\]
where $\ell$ is an arbitrary integrable function, $p$ is the target density, and $q$ is the importance density. Hence, from (2.12) it is easy to see that in order to approximate $p(x_k \mid y_{0:k})$, for samples $\{x_k^i, i = 1, \ldots, N\}$ drawn i.i.d. from $q(x_k \mid y_{0:k})$, their weights should be
\[
w_k^i \propto \frac{p(x_k^i \mid y_{0:k})}{q(x_k^i \mid y_{0:k})}, \tag{2.13}
\]
where $\propto$ means "proportional to", and the weights should be normalized. To carry out the estimation recursively, we use Bayes' rule to derive the following recursive equation for the conditional density:
\[
\begin{aligned}
p(x_k \mid y_{0:k}) &= \frac{p(x_k, y_k \mid y_{0:k-1})}{p(y_k \mid y_{0:k-1})} \\
&= \frac{p(y_k \mid y_{0:k-1}, x_k)\, p(x_k \mid y_{0:k-1})}{p(y_k \mid y_{0:k-1})} \\
&= \frac{p(y_k \mid x_k) \int p(x_k \mid y_{0:k-1}, x_{k-1})\, p(x_{k-1} \mid y_{0:k-1})\, dx_{k-1}}{p(y_k \mid y_{0:k-1})} \\
&\propto p(y_k \mid x_k) \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid y_{0:k-1})\, dx_{k-1},
\end{aligned} \tag{2.14}
\]
where $p(y_k \mid y_{0:k-1}, x_k) = p(y_k \mid x_k)$ and $p(x_k \mid y_{0:k-1}, x_{k-1}) = p(x_k \mid x_{k-1})$ both follow from the Markovian property of model (2.1), the denominator $p(y_k \mid y_{0:k-1})$ does not explicitly depend on $x_k$, and $\propto$ means $p(x_k \mid y_{0:k})$ is the normalized version of the right-hand side. The state transition density $p(x_k \mid x_{k-1})$ is induced from the state equation in (2.1) and the distribution of the system noise $u_{k-1}$, and the likelihood $p(y_k \mid x_k)$ is induced from the observation equation in (2.1) and the distribution of the observation noise $v_k$.
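The identity (2.12) is easy to check numerically. In the sketch below (an illustrative example, not from the text), the target $p$ is $N(0,1)$, the importance density $q$ is $N(1,2)$, and $\ell(x) = x^2$, so that $E_p[\ell(x)] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

ell = lambda x: x ** 2                     # E_p[x^2] = 1 for p = N(0,1)
x = rng.normal(1.0, 2.0, size=200_000)     # draw from the importance density q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)   # weights p(x)/q(x)
est = np.mean(w * ell(x))                  # importance-sampling estimate of E_p[ell(x)]
print(est)                                 # close to 1.0
```

In the filtering context the normalizing constant of the target is unknown, which is why the weights in (2.13) are specified only up to proportionality and must be normalized.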
Substituting (2.14) into (2.13), we get
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid y_{0:k})}\, p(x_{k-1}^i \mid y_{0:k-1}). \tag{2.15}
\]
If the importance density $q(x_k \mid y_{0:k})$ is chosen to be factored as
\[
q(x_k \mid y_{0:k}) = q(x_k \mid x_{k-1}, y_k)\, q(x_{k-1} \mid y_{0:k-1}), \tag{2.16}
\]
then
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, y_k)} \frac{p(x_{k-1}^i \mid y_{0:k-1})}{q(x_{k-1}^i \mid y_{0:k-1})} \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, y_k)}\, w_{k-1}^i. \tag{2.17}
\]
Moreover, to avoid sample degeneracy, new samples are resampled i.i.d. from the approximate conditional density $\hat{p}(x_k \mid y_{0:k})$ at each step; hence the weights are reset to $w_{k-1}^i = 1/N$, and (2.17) reduces to
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, y_k)}, \quad i = 1, \ldots, N. \tag{2.18}
\]
In the plain particle filter, the importance density $q(x_k \mid x_{k-1}^i, y_k)$ is chosen to be the state transition density $p(x_k \mid x_{k-1}^i)$, which is independent of the current observation $y_k$, yielding
\[
w_k^i \propto p(y_k \mid x_k^i), \quad i = 1, \ldots, N. \tag{2.19}
\]
The plain particle filter recursively propagates the support points and updates the associated weights. The algorithm is as follows:

Algorithm 2.1 Plain Particle Filter

1. Initialization: Sample $x_0^1, \ldots, x_0^N$ i.i.d. from an initial p.d.f./p.m.f. $p_0$. Set $k = 1$.

2. Importance Sampling/Propagation: Sample $x_k^i$ from $p(x_k \mid x_{k-1}^i)$, $i = 1, \ldots, N$.

3. Bayes' Updating: Receive the new observation $y_k$. The conditional density is approximated by $\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i)$, where $w_k^i$ is computed according to (2.19) and normalized.

4. Resampling: Sample $x_k^1, \ldots, x_k^N$ i.i.d. from $\hat{p}(x_k \mid y_{0:k})$.

5. $k \leftarrow k + 1$ and go to step 2.

For the resampling step, we can similarly use the importance sampling technique to resample from an importance density $g_{k-1}(x_{k-1} \mid y_{0:k-1})$, which we will refer
Hence, in general, the weights are updated according to wik / p(ykjx i k)p(x i kjx i k?1)p(x i k?1jy0:k?1) qk(xikjxik?1;yk)gk?1(xik?1jy0:k?1) ; (2.20) The plain particle fllter (Algorithm 2.2) is a special case of the general particle fllter, with the particular choice of the importance density qk(xkjxk?1;yk) and the resampling importance density gk(xkjy0:k). Algorithm 2.2 General Particle Filter 1. Initialization: Sample xi0;:::;xN0 i.i.d. from an initial p.d.f./p.m.f. p0. Set k = 1. 2. Importance Sampling: Sample xik from qk(xkjxik?1;yk), i = 1;:::;N. 3. Bayes? Updating: Receive new observation yk. The conditional density is ap- proximated by ^p(xkjy0:k) = PNi=1 wik?(xk ?xik), where wik is computed according to (2.20) and normalized. 4. Importance Resampling: Sample x1k;:::;xNk i.i.d. from gk(xkjy0:k). 5. k ? k +1 and go to step 2. It has been proved that ^p(xkjy0:k) converges to p(xkjy0:k) as the sample size N increases to inflnity [23] [54]. However, uniform convergence in time has only been proved for the special case where the system dynamics has a mixing kernel that ensures that any error is forgotten (exponentially) in time. Usually, the particle fllter needs an increasing number of samples as time k increases to ensure a given precision of the approximation ^p(xkjy0:k) for all k. 22 2.4 Fading Memory Particle Filter Particle flltering weights equally all the observations when estimating the cur- rent state. However, if, for example, the system model is inaccurate, the prediction based on all the past observations and the system model may not be a good one, which we have already mentioned in Section 2.2. Bad predictions can be also caused by the simulation in the particle flltering, when the system noise is large and the number of samples is not su?cient. This can also be viewed as model inaccuracy, since the system noise is inaccurately represented by the samples due to simulation. The model inaccuracy degrades the value of past information [44]. 
Therefore, we sometimes want to discount old observations, or equivalently, to put more weight on the more recent observations. This idea of fading memory of old observations has been proposed and studied theoretically and empirically for several decades for the Kalman filter, but rarely in particle filtering. Related work on particle filtering with fading memory [25] [28] [66] addresses the issue by incorporating the fading memory into the model rather than the filter, and the models are tailored for special cases and do not address the general setting.

Our approach is to incorporate the weighted extended Kalman filter to generate the importance density in the particle filter. The weighted extended Kalman filter is based on the Kalman filter with exponential data weighting. The exponential data weighting weights the more recent observations more heavily, and it promotes exponential stability of the Kalman filter. Since the fading memory only affects the locations of the generated particles, the convergence property of the particle filter is preserved. Moreover, we expect that the fading memory can enhance the convergence of the particle filter, as it does for the Kalman filter. Like many improved particle filters, such as the auxiliary particle filter [68], the extended Kalman particle filter [25], and the unscented particle filter [88], the fading memory particle filter also incorporates the current observation into generating the importance density and thus holds the promise of generating better importance densities than the plain particle filter, which takes $p(x_k \mid x_{k-1})$ as the importance density. The fading memory particle filter looks similar to the extended Kalman particle filter [25] [88]; however, the fading memory particle filter uses the weighted extended Kalman filter to generate the importance density instead of the unweighted one.
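Since the WEKF is the building block of the proposed filter, one step of the recursion (2.10) can be sketched as follows (a minimal sketch for the additive-noise model (2.3); the interface is illustrative, not from the text):

```python
import numpy as np

def wekf_step(x_post, P_post, y, f, h, F_jac, H_jac, Q, R, alpha=1.2):
    """One step of the WEKF (2.10): identical to the EKF except that the
    predicted covariance is inflated by alpha^2, where alpha > 1 is the
    forgetting factor, which exponentially discounts old observations."""
    x_pred = f(x_post)
    F = F_jac(x_post)
    P_pred = alpha ** 2 * F @ P_post @ F.T + Q   # the only change vs. the EKF
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new
```

With $\alpha > 1$ the gain $K_k$ stays larger, so the filter leans more on the current observation than on predictions through the (possibly erroneous) model.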
With the simple addition of a forgetting factor, the weighted extended Kalman filter has the advantages of giving the filter a prescribed degree of stability and curing many error problems. Retaining the convergence property of the particle filter, we incorporate the WEKF into the particle filter through the importance density. At time $k$, the WEKF generates a Gaussian approximation, denoted as $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$. Let the importance density be
\[
q(x_k \mid x_{k-1}^i, y_k) = N(\hat{x}_{k|k}, P^{\alpha}_{k|k}), \quad i = 1, \ldots, N, \tag{2.21}
\]
from which the random support points $\{x_k^i\}_{i=1}^N$ are drawn. Substituting (2.21) into the weight updating equation (2.20), and assuming resampling is applied at time $k-1$, we obtain
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{N(x_k^i \mid \hat{x}_{k|k}, P^{\alpha}_{k|k})}, \quad i = 1, 2, \ldots, N. \tag{2.22}
\]
Now that an approximate conditional density $\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i)$ is obtained, it can be used to update the WEKF mean $\hat{x}_{k|k}$ and variance $P^{\alpha}_{k|k}$ according to
\[
\hat{x}_{k|k} = E_{\hat{p}_k}[x], \quad P^{\alpha}_{k|k} = \mathrm{Var}_{\hat{p}_k}(x),
\]
where $\hat{p}_k$ is short for $\hat{p}(x_k \mid y_{0:k})$. The algorithm is as follows:

Algorithm 2.3 Fading Memory Particle Filter (FMPF)

1. Initialization: Set the forgetting factor $\alpha > 1$. Sample $x_0^1, \ldots, x_0^N$ i.i.d. from the initial density $p_0$. Set $\hat{x}_0 = E_{p_0}[x]$, $P^{\alpha}_0 = \mathrm{Var}_{p_0}(x)$. Set $k = 1$.

2. Importance Sampling:
- Generate the importance density: compute $\hat{x}_{k|k}$ and $P^{\alpha}_{k|k}$ by the WEKF (2.10), using $\hat{x}_{k-1|k-1}$, $P^{\alpha}_{k-1|k-1}$ and $y_k$.
- Sample $x_k^1, \ldots, x_k^N$ i.i.d. from $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$.

3. Bayes' Updating: Receive the new observation $y_k$. The conditional density is approximated by $\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i)$, where $w_k^i$ is computed according to (2.22) and normalized.

4. Resampling: Sample $x_k^1, \ldots, x_k^N$ i.i.d. from $\hat{p}(x_k \mid y_{0:k})$.

5. Parameter Updating: $\hat{x}_{k|k} = E_{\hat{p}_k}[x]$, $P^{\alpha}_{k|k} = \mathrm{Var}_{\hat{p}_k}(x)$.

6. $k \leftarrow k + 1$ and go to step 2.

The FMPF can be viewed from two viewpoints. The obvious viewpoint is that the WEKF is incorporated to improve the PF, since it is used to generate a better importance density for the PF.
From another viewpoint, the PF is incorporated to improve the WEKF, since particles sampled from the WEKF approximated density $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$ are updated to yield an empirical density $\hat{p}(x_k \mid y_{0:k})$, which is used to tune the WEKF parameters $\hat{x}_{k|k}$ and $P^{\alpha}_{k|k}$. Therefore, we expect that the FMPF retains the advantages of both the PF and the WEKF, namely the asymptotic convergence of the PF and the stability of the WEKF. Fading memory is not only a technique of weighting the more recent data more heavily, but also a way to prevent the divergence of the filter (see pp. 137-138, [1] for details). As the exponential data weighting promotes exponential stability of the Kalman filter, our hope is that fading memory also enhances uniform convergence of the particle filter.

Note that the FMPF only needs to compute one importance density $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$ at time $k$, whereas the extended Kalman particle filter (EKPF) needs to compute a different importance density $N(\hat{x}^i_{k|k}, P^i_{k|k})$ in order to draw each particle $x_k^i$ (see [88], [4] and [25] for details on the EKPF). Hence, the EKPF is much more computationally expensive than the FMPF, especially when the state is multi-dimensional, since the computation of the EKF involves matrix inversions.

We test the FMPF, EKPF, and PPF numerically on the example in [88]:
\[
x_{k+1} = 1 + \sin(0.04\pi k) + \phi x_k + u_k, \quad x_0 = 0,
\]
\[
y_k = \begin{cases} 0.2 x_k^2 + v_k, & k \leq 30, \\ 0.5 x_k - 2 + v_k, & k > 30, \end{cases}
\]
where $\phi$ is a scalar system parameter. The system noise $u_k$ follows a Gamma distribution $\Gamma(3, \lambda)$, where $\lambda$ is the scale parameter. The observation noise $v_k$ follows a Gaussian distribution $N(0, 10^{-5})$. A larger $\lambda$ means a more spread-out distribution and hence implies a larger system noise. We compare the performance of the PPF, EKPF, and FMPF in terms of tracking the true state $x_k$ by the filtered state mean $E[x_k \mid y_{0:k}]$ in two cases:

- Inaccurate models of the system dynamics, i.e., the real system parameter $\phi_r$ is not equal to the model system parameter $\phi_m$.
- Different values of the system noise parameter $\lambda$, with other parameters fixed and $\phi_r = \phi_m$.

For each case, we simulate 100 independent runs of length 60 with random initializations, calculate the mean square error (MSE) of the estimated states for each run, and then calculate the means and standard errors of the MSEs over the 100 runs. All three filters use 200 particles and stratified resampling [74]. The forgetting factor in the FMPF is chosen as $\alpha = 1.2$.

Table 2.1 lists the means and standard errors of the MSEs under different real system parameters $\phi_r$, when the model system parameter is $\phi_m = 0.5$ and the system noise is $u_k \sim \Gamma(3, 1)$ (see Fig. 2.1 for a graphical illustration). As we can see, when the model is accurate, i.e., $\phi_r = 0.5 = \phi_m$, the MSEs are smallest for all three filters, which is consistent with our intuition. The PPF is very sensitive to the inaccuracy of the model parameter, while the EKPF and FMPF are much more robust. The FMPF performs best among all three.

Table 2.2 lists the means and standard errors of the MSEs for different values of the system noise, with the system parameter $\phi_r = \phi_m = 0.5$ (see Fig. 2.2 for a graphical illustration). As the system noise increases, the PPF and EKPF deteriorate quickly, and their performance at $\lambda = 2$ is probably unacceptable, whereas the FMPF performs well in all cases, with only slightly increasing MSEs.

Fig. 2.3 shows a typical run of the estimated states generated by the three filters tracking the true state, with $\lambda = 2$ and $\phi_r = \phi_m = 0.5$. Fig. 2.4 shows a typical run of the estimated states tracking the true state when the system model is inaccurate, where the real system parameter is $\phi_r = 0.6$ and the other parameters are the same as in Fig. 2.3. The PPF apparently misses many points before time $k = 30$ in Fig. 2.3, and even beyond $k = 30$ in Fig. 2.4. The EKPF tends to overshoot when there is a big upward change in the trajectory, such as at times $k = 2, 19, 22, 29$ in Fig. 2.3, and $k = 4, 9, 11, 13, 17, 22, 26, 30$ in Fig. 2.4.
In contrast, the FMPF tracks the true states very closely in both cases, and shows its robustness with respect to the inaccuracy in the model system parameter. All three filters perform better after time $k = 30$ than before $k = 30$, because the observation function is not uniquely invertible before $k = 30$.

Table 2.1: Performances under different real system parameters $\phi_r$, with model system parameter $\phi_m = 0.5$ and system noise $u_k \sim \Gamma(3, 1)$. Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.

      | $\phi_r = 0.3$  | $\phi_r = 0.4$   | $\phi_r = 0.5$   | $\phi_r = 0.6$
PPF   | 3.77 (0.111)    | 1.63 (0.0801)    | 1.48 (0.102)     | 3.62 (0.207)
EKPF  | 0.323 (0.0305)  | 0.290 (0.0294)   | 0.290 (0.0392)   | 0.344 (0.0219)
FMPF  | 0.112 (0.0092)  | 0.0696 (0.0069)  | 0.0630 (0.0124)  | 0.0854 (0.0092)

Table 2.2: Performances under different system noises, with system parameter $\phi_r = \phi_m = 0.5$. Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.

      | $\lambda = 0.5$  | $\lambda = 1$    | $\lambda = 1.5$  | $\lambda = 2$
PPF   | 0.226 (0.025)    | 1.48 (0.102)     | 4.10 (0.271)     | 8.71 (0.412)
EKPF  | 0.0087 (0.0010)  | 0.290 (0.0392)   | 1.38 (0.123)     | 5.18 (0.440)
FMPF  | 0.0086 (0.0010)  | 0.0630 (0.0124)  | 0.192 (0.0265)   | 0.421 (0.0614)

Figure 2.1: A graphical illustration of Table 2.1: performances (mean and standard error of MSEs based on 100 independent runs) under different values of the real system parameter $\phi_r$.

Figure 2.2: A graphical illustration of Table 2.2: performances (mean and standard error of MSEs based on 100 independent runs) under different system noises.
Figure 2.3: A typical run of the estimated states tracking the true state, when the system noise is $u_k \sim \Gamma(3, 2)$ and the system parameter is $\phi_r = \phi_m = 0.5$.

Figure 2.4: A typical run of the estimated states tracking the true state when the system model is inaccurate. The real system parameter is $\phi_r = 0.6$, the model system parameter is $\phi_m = 0.5$, and the system noise is $u_k \sim \Gamma(3, 2)$.

Chapter 3 Stochastic Control

The stochastic control problem involves designing a controller for a stochastic process in order to minimize (or maximize) some cost (or reward) function. In this thesis, we focus on discrete-time problems, namely, Markov decision processes and partially observable Markov decision processes.

3.1 Markov Decision Processes

Consider a stationary discrete-time system model:
\[
x_k = f(x_{k-1}, a_{k-1}, u_{k-1}), \quad k = 1, 2, \ldots, \tag{3.1}
\]
where for all $k$, the state $x_k$ is in a state space $S \subseteq \mathbb{R}^{n_x}$, the action $a_k$ is in an action space $A \subseteq \mathbb{R}^{n_a}$, and the random disturbances $u_k \in D \subseteq \mathbb{R}^{n_u}$ form a sequence of i.i.d. random vectors with known distributions. Assume that $\{u_k\}$ is independent of $x_0$, which follows a distribution $p_0$. For simplicity, we assume that all actions in $A$ are admissible for each state $x \in S$, and that $D$ is a countable set, to avoid mathematical complications. Given an initial state $x_0$, the objective is to find a policy $\pi$ that consists of a sequence of functions $\pi = \{\pi_0, \pi_1, \ldots\}$, where $\pi_k : S \to A$, such that $\pi$ minimizes the infinite-horizon discounted cost function
\[
J_{\pi}(x_0) = \lim_{H \to \infty} E_{\{u_k\}_{k=0}^{H}} \left\{ \sum_{k=0}^{H} \gamma^k g(x_k, \pi_k(x_k), u_k) \right\},
\]
subject to the system equation (3.1).
The one-step cost function $g : S \times A \times D \to \mathbb{R}$ is given, $\gamma \in (0, 1)$ is the discount factor, and $E_{\{u_k\}_{k=0}^{H}}$ denotes the expectation with respect to the joint distribution of $u_0, \ldots, u_H$. For simplicity, we assume that the above limit defining $J_{\pi}(x_0)$ exists. Denote the set of all admissible policies by $\Pi$. The optimal value function is defined by
\[
J^*(x) = \min_{\pi \in \Pi} J_{\pi}(x), \quad \forall x \in S.
\]
An optimal policy, denoted by $\pi^*$, is an admissible policy that achieves $J^*$. A stationary policy is an admissible policy of the form $\pi = \{\mu, \mu, \ldots\}$ (i.e., $\pi_k$ is independent of $k$), referred to as the stationary policy $\mu$ for brevity, and its corresponding value function is denoted by $J_{\mu}$. For any function $J : S \to \mathbb{R}$, the dynamic programming (DP) mapping applied to $J$ is defined as
\[
(TJ)(x) = \min_{a \in A} E_u[g(x, a, u) + \gamma J(f(x, a, u))], \quad x \in S.
\]
Hence, the mapping $T$ transforms the function $J$ on $S$ into the function $TJ$ on $S$. For a given stationary policy $\mu$, the mapping $T_{\mu}$ is defined as
\[
(T_{\mu}J)(x) = E_u[g(x, \mu(x), u) + \gamma J(f(x, \mu(x), u))], \quad x \in S.
\]
Throughout this section, we assume the following:

Assumption 3.1 The one-step cost function $g$ satisfies
\[
|g(x, a, u)| \leq M < \infty, \quad \forall (x, a, u) \in S \times A \times D.
\]
Note that this is the simplest type of infinite-horizon discounted cost MDP. Nonetheless, the boundedness of the one-step cost function is not as restrictive as it seems. The boundedness is satisfied if $S$, $A$, and $D$ are of finite cardinality. Even if they are not, the boundedness is satisfied in many computational methods, because these sets are approximated by finite sets. MDPs involving unbounded one-step cost functions require more complicated technical treatment, which we will not delve into. The following proposition (cf. prop. 1.2.2 [10]) shows that the value function $J^*$ is the unique solution of Bellman's equation.

Proposition 1 The optimal value function $J^*$ satisfies
\[
J^*(x) = \min_{a \in A} E_u[g(x, a, u) + \gamma J^*(f(x, a, u))], \quad x \in S,
\]
or equivalently, $J^* = TJ^*$. Furthermore, $J^*$ is the unique solution of this equation within the class of bounded functions.
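Proposition 1 suggests computing $J^*$ as the fixed point of $T$. For a small finite MDP, where the expectation over $u$ is folded into one transition matrix and one expected-cost vector per action, the DP mapping and its fixed-point iteration can be sketched as follows (the representation and the MDP data are illustrative assumptions of this sketch):

```python
import numpy as np

def dp_mapping(J, P, g, gamma):
    """(TJ)(x) = min_a E_u[g(x, a, u) + gamma * J(f(x, a, u))], with the
    expectation encoded by transition matrices P[a] and cost vectors g[a]."""
    return np.min([g[a] + gamma * P[a] @ J for a in range(len(P))], axis=0)

def fixed_point(P, g, gamma, tol=1e-10):
    """Iterate T starting from J = 0 until numerical convergence to J*."""
    J = np.zeros(P[0].shape[0])
    while True:
        J_next = dp_mapping(J, P, g, gamma)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
```

The resulting $J$ satisfies $J = TJ$ up to the tolerance, i.e., Bellman's equation.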
3.1.1 Value Iteration

Value iteration is an iterative method to compute the value function $J^*$. It is basically the dynamic programming algorithm, and its performance is based on the convergence of dynamic programming, as shown in the next proposition (cf. prop. 1.2.1 in [10]).

Proposition 2 For any bounded function $J : S \to \mathbb{R}$, the value function satisfies
\[
J^*(x) = \lim_{k \to \infty} (T^k J)(x), \quad \forall x \in S.
\]
Value iteration starts with an arbitrary bounded function $J_0(x)$, $\forall x \in S$, and then at each iteration $k$ applies the DP mapping $T$ to the old function $J_k(x)$ to get a new function $J_{k+1}(x)$ for all $x \in S$. More specifically, at each iteration $k$,
\[
J_{k+1}(x) = (TJ_k)(x) = \min_{a \in A} E_u[g(x, a, u) + \gamma J_k(f(x, a, u))], \quad \forall x \in S.
\]
In view of Proposition 2, the value iteration method usually needs an infinite number of iterations to achieve convergence. Hence, it is desirable to have an estimate of the convergence rate. The following convergence rate [10] [73] has been established for any bounded function $J$:
\[
\max_{x \in S} |(T^k J)(x) - J^*(x)| \leq \gamma^k \max_{x \in S} |J(x) - J^*(x)|, \quad k = 0, 1, \ldots.
\]
The value iteration method may be implementable only approximately if the state space is continuous or has a finite but large number of states. Instead of computing $(TJ)(x)$ for all states $x \in S$, $(TJ)(x)$ can be computed for only some of the states and estimated for the other states. The expectation in the DP mapping may also be computed through approximation or simulation.

3.1.2 Policy Iteration

As an alternative to value iteration, the policy iteration method works directly with policies to find an optimal stationary policy. It starts with an arbitrary stationary policy $\mu_0$; at each iteration $k$, it evaluates the current policy $\mu_k$ to obtain the associated value function $J_{\mu_k}$, and then computes a new policy $\mu_{k+1}$ based on $J_{\mu_k}$. The evaluation of the policy $\mu_k$ is based on the following corollary (cf. cor. 1.2.2.1 in [10]), which is derived from Proposition 2.
Corollary 1 For every stationary policy $\mu$, the associated value function satisfies
\[
J_{\mu}(x) = E_u[g(x, \mu(x), u) + \gamma J_{\mu}(f(x, \mu(x), u))], \quad \forall x \in S,
\]
or, equivalently, $J_{\mu} = T_{\mu} J_{\mu}$. Furthermore, $J_{\mu}$ is the unique solution of this equation within the class of bounded functions.

Therefore, the detailed algorithm for the policy iteration method is as follows.

Algorithm 3.1 Policy Iteration Algorithm

1. Initialization: Guess an initial stationary policy $\mu_0$.

2. Policy Evaluation: Evaluate the value function $J_{\mu_k}$ associated with the current stationary policy $\mu_k$ by solving $J_{\mu_k} = T_{\mu_k} J_{\mu_k}$.

3. Policy Improvement: Compute a new stationary policy $\mu_{k+1}$ by applying the DP mapping $T$ to the value function $J_{\mu_k}$, i.e., $T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k}$.

4. If $J_{\mu_k} = T J_{\mu_k}$, stop; otherwise return to the policy evaluation step and repeat the process.

It can be proved that $J_{\mu_{k+1}} \leq J_{\mu_k}$, and the strict inequality $J_{\mu_{k+1}}(x) < J_{\mu_k}(x)$ holds for at least one $x \in S$ if $\mu_k$ is not optimal. Hence, $\mu_{k+1}$ is strictly better than $\mu_k$ if $\mu_k$ is not optimal. For finite state and action spaces, the number of admissible policies is finite, and hence, the policy iteration algorithm terminates after a finite number of iterations to yield an optimal policy.

Similar to the value iteration method, the policy iteration method is implementable only approximately if the state space is continuous or has a finite but large number of states. The policy evaluation step may be approximated using a finite number of value iterations or linear programming. The policy improvement step may be carried out for only some of the states and approximated for the remaining states. In addition to variations of the value iteration and policy iteration methods, there are other approximation methods for solving large-state MDPs, such as the state aggregation approach [11], approximation using basis functions and linear programming [80], and approximation using a post-decision state variable [72].
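Algorithm 3.1 can be sketched for a finite MDP as follows, with the policy evaluation step solved exactly as the linear system $J_{\mu} = g_{\mu} + \gamma P_{\mu} J_{\mu}$ given by Corollary 1 (the representation of the MDP by per-action transition matrices and expected-cost vectors is an illustrative assumption of this sketch):

```python
import numpy as np

def policy_iteration(P, g, gamma):
    """Algorithm 3.1 for a finite MDP, as a sketch.  P[a] is the transition
    matrix and g[a] the expected one-step cost vector under action a."""
    n = P[0].shape[0]
    mu = np.zeros(n, dtype=int)                     # step 1: initial policy
    while True:
        # Step 2, policy evaluation: solve J = g_mu + gamma * P_mu J.
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = np.array([g[mu[x]][x] for x in range(n)])
        J = np.linalg.solve(np.eye(n) - gamma * P_mu, g_mu)
        # Step 3, policy improvement: greedy action with respect to J_mu.
        Q = np.array([g[a] + gamma * P[a] @ J for a in range(len(P))])
        mu_next = np.argmin(Q, axis=0)
        # Step 4: stop when the policy is unchanged, i.e. J_mu = T J_mu.
        if np.array_equal(mu_next, mu):
            return mu, J
        mu = mu_next
```

At termination the returned $J_{\mu}$ satisfies Bellman's equation $J_{\mu} = T J_{\mu}$, so $\mu$ is optimal.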
3.1.3 Simulation-Based Methods

The above methods are applicable only when the system model and cost function are available. However, sometimes the system model is not available, but instead, the system and cost structure can be simulated. The above methods can still be applied in this case, with the help of Monte Carlo simulation to approximately calculate the transition probabilities and cost functions. Another branch of simulation-based methods is so-called reinforcement learning [86] or neuro-dynamic programming, such as temporal difference learning [85] and Q-learning [89]. Reinforcement learning methods learn the system model (and sometimes a policy) by repeatedly simulating the system using the current policy and improving the current policy. Some of the recent simulation-based methods concern efficient allocation of the simulation budget by focusing on the more promising actions or policies, such as the multi-stage adaptive sampling algorithm and the population-based evolutionary approaches, both described in [22].

3.2 Partially Observable Markov Decision Processes

Consider a stationary discrete-time continuous-state system model:
\[
x_k = f(x_{k-1}, a_{k-1}, u_{k-1}), \quad k = 1, 2, \ldots, \tag{3.2}
\]
\[
y_k = h(x_k, a_{k-1}, v_k), \quad k = 1, 2, \ldots, \qquad y_0 = h_0(x_0, v_0), \tag{3.3}
\]
where for all $k$, the state $x_k$ is in a continuous state space $S \subseteq \mathbb{R}^{n_x}$, the action $a_k$ is in an action space $A \subseteq \mathbb{R}^{n_a}$, the observation $y_k$ is in a continuous observation space $O \subseteq \mathbb{R}^{n_y}$, and the random disturbances $u_k \in \mathbb{R}^{n_u}$ and $v_k \in \mathbb{R}^{n_v}$ form sequences of i.i.d. random vectors with known distributions. Assume that $\{u_k\}$ and $\{v_k\}$ are independent of each other, and independent of $x_0$, which follows a distribution $p_0$. Also assume that $f(x, a, u)$ is continuous in $x$ for every $a \in A$ and $u \in \mathbb{R}^{n_u}$, $h(x, a, v)$ is continuous in $x$ for every $a \in A$ and $v \in \mathbb{R}^{n_v}$, and $h_0(x, v)$ is continuous in $x$ for every $v \in \mathbb{R}^{n_v}$. Eq. (3.2) is often called the state equation, and (3.3) the observation equation.
All the information available to the decision maker at time $k$ can be summarized by means of an information vector $I_k$, defined as
\[ I_k = (y_0, y_1, \ldots, y_k, a_0, a_1, \ldots, a_{k-1}), \quad k = 1, 2, \ldots; \qquad I_0 = y_0. \]
The objective is to find a policy $\pi$ consisting of a sequence of functions $\pi = \{\mu_0, \mu_1, \ldots\}$, where each function $\mu_k$ maps the information vector $I_k$ into the action space $A$, that minimizes the cost function
\[ J_\pi = \lim_{H \to \infty} E_{x_0, \{u_k\}_{k=0}^{H-1}, \{v_k\}_{k=0}^{H}} \left\{ \sum_{k=0}^{H} \gamma^k g(x_k, \mu_k(I_k)) \right\}, \]
where $g : S \times A \to \mathbb{R}$ is the one-step cost function, $\gamma \in (0,1)$ is the discount factor, and $E_{x_0, \{u_k\}_{k=0}^{H-1}, \{v_k\}_{k=0}^{H}}$ denotes the expectation with respect to the joint distribution of $x_0, u_0, \ldots, u_{H-1}, v_0, \ldots, v_H$. For simplicity, we assume that the above limit defining $J_\pi$ exists. The value function is defined by
\[ J^* = \min_{\pi \in \Pi} J_\pi, \]
where $\Pi$ is the set of all admissible policies. An optimal policy, denoted by $\pi^*$, is an admissible policy that achieves $J^*$. A stationary policy is an admissible policy of the form $\pi = \{\mu, \mu, \ldots\}$, referred to as the stationary policy $\mu$ for brevity, and its corresponding value function is denoted by $J_\mu$.

3.2.1 Belief MDPs

The information vector $I_k$ grows as the history expands. The standard approach to encoding historical information is the use of the belief state, which is the conditional probability density of the current state $x_k$ given the past history, i.e., $b_k : S \to [0, \infty)$:
\[ b_k(x) \triangleq p(x_k = x \mid I_k). \]
Given our assumptions on (3.2) and (3.3), $b_k$ exists, and can be computed recursively via Bayes' rule:
\begin{align*}
b_k(x) &= p(x_k = x \mid I_{k-1}, a_{k-1}, y_k) \\
&\propto p(y_k \mid x_k = x, a_{k-1}) \int_S p(x_k = x \mid I_{k-1}, a_{k-1}, x_{k-1}) \, p(x_{k-1} \mid I_{k-1}, a_{k-1}) \, dx_{k-1} \\
&\propto p(y_k \mid x_k = x, a_{k-1}) \int_S p(x_k = x \mid a_{k-1}, x_{k-1}) \, b_{k-1}(x_{k-1}) \, dx_{k-1}, \tag{3.4}
\end{align*}
where the second line follows from the Markovian property; $\propto$ means "proportional to," because the denominator $p(y_k \mid I_{k-1}, a_{k-1})$ does not depend on $x$; and the third line follows from the Markovian property of $\{x_k\}$ and the fact that $a_{k-1}$ is a function of $I_{k-1}$ given a policy.
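On a finite grid of states, one step of the Bayes recursion (3.4) is a matrix-vector product followed by a pointwise multiplication and normalization. The sketch below assumes the transition density and likelihood have already been evaluated on the grid; it is an illustration of the recursion, not an algorithm from the thesis.

```python
import numpy as np

def belief_update(b_prev, trans, lik_y):
    """One step of the Bayes recursion (3.4) on a state grid.

    b_prev : belief over grid points at time k-1 (sums to 1)
    trans  : trans[i, j] = p(x_k = grid[j] | x_{k-1} = grid[i], a_{k-1})
    lik_y  : lik_y[j]    = p(y_k | x_k = grid[j], a_{k-1})
    """
    predicted = b_prev @ trans               # integral over x_{k-1}
    unnormalized = lik_y * predicted         # multiply by the likelihood
    return unnormalized / unnormalized.sum() # divide by p(y_k | I_{k-1}, a_{k-1})
```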
The right-hand side of (3.4) can be expressed in terms of $b_{k-1}$, $a_{k-1}$ and $y_k$. Hence,
\[ b_k = \tau(b_{k-1}, a_{k-1}, y_k), \tag{3.5} \]
where $y_k$ is characterized by the time-homogeneous conditional distribution $P_Y(y_k \mid b_{k-1})$ that is induced by (3.2) and (3.3), and does not depend on $\{y_0, \ldots, y_{k-1}\}$.

A POMDP can be converted to an MDP by conditioning on the information vectors, and the converted MDP is called the belief MDP (Chapter 5, [10]). The states of the belief MDP are the belief states, which follow the system dynamics (3.5), where $y_k$ can be seen as the system noise with distribution $P_Y$. The state space of the belief MDP is the belief space, denoted by $B$, which is the set of all belief states, i.e., a set of probability densities. A policy $\pi$ is a sequence of functions $\pi = \{\mu_0, \mu_1, \ldots\}$, where each function $\mu_k$ maps the belief state $b_k$ into the action space $A$. Notice that
\[ E_{x_0, \{u_i\}_{i=0}^{k-1}, \{v_i\}_{i=0}^{k}} \{ g(x_k, a_k) \} = E\{ E_{x_k}\{ g(x_k, a_k) \mid I_k \} \}; \]
thus the one-step cost function can be written in terms of the belief state as the belief one-step cost function
\[ \tilde{g}(b_k, a_k) \triangleq E_{x_k}\{ g(x_k, a_k) \mid I_k \} = \int_{x \in S} g(x, a_k) b_k(x) \, dx \triangleq \langle g(\cdot, a), b \rangle. \]
Assuming there exists a stationary optimal policy, the optimal value function is given by
\[ J^*(b) = \lim_{k \to \infty} T^k J(b), \quad \forall b \in B, \]
where $T$ is the dynamic programming (DP) mapping, which operates on any bounded function $J : B \to \mathbb{R}$ according to
\[ TJ(b) = \min_{a \in A} \left[ \langle g(\cdot, a), b \rangle + \gamma E_Y\{ J(\tau(b, a, Y)) \} \right], \tag{3.6} \]
where $E_Y$ denotes the expectation with respect to the distribution $P_Y$.

3.2.2 Finite-State vs. Continuous-State

For finite-state POMDPs, the belief state $b$ is a vector with each entry being the probability of being at one of the states, and hence the belief space $B$ is a finite-dimensional probability simplex. Past research on numerical solutions of POMDPs has mostly focused on finite-state problems. For finite-state POMDPs, it is proved that the value function is a piecewise linear convex function after a finite number of iterations, provided that the one-step cost function is piecewise linear and convex [81].
This feature has been exploited in various exact and approximate algorithms, such as those found in [82], [81], [83], [57], and [36]. The algorithm in [81] and the Witness algorithm in [57] carry out exact value iteration by finding and updating the minimum set of linear functions that determine the value function at each iteration for a finite-horizon problem. Because the number of such linear functions grows exponentially with the number of iterations, these algorithms are computationally very expensive and are limited to very simple problems in practice. Howard [39] and Sondik [83] use policy iteration to solve infinite-horizon discounted-cost problems. Hauskrecht [36] summarizes several value function approximation methods in a nice framework of modifying the value iteration equation by changing the order of summations and maximization, including approximation with the fully observable MDP, approximation with Q-functions, the fast informed bound method, and approximation with the unobservable MDP. [36] also summarizes many grid-based methods with various techniques for choosing the set of grid points and approximating the value function.

For continuous-state POMDPs, the belief state $b$ is a continuous density, and thus the belief space $B$ is an infinite-dimensional space that contains all sorts of continuous densities. For continuous-state POMDPs, the value function preserves convexity [91], but value iteration algorithms are not directly applicable because the belief space is infinite dimensional. The infinite-dimensionality of the belief space also creates difficulties in applying the approximate algorithms that were developed for finite-state POMDPs. For example, one straightforward and commonly used approach is to approximate a continuous-state POMDP by a finite-state one via discretization of the state space.
In practice, this can lead to computational difficulties, resulting either in a belief space of huge dimension or in a solution that is not sufficiently accurate. In addition, note that even for a relatively nice prior distribution $b_k$ (e.g., a Gaussian distribution), the exact evaluation of the posterior distribution $b_{k+1}$ is computationally intractable; moreover, the update $b_{k+1}$ may not have any structure, and therefore can be very difficult to handle. Past research is relatively sparse on numerically solving continuous-state POMDPs.

Chapter 4
Solving Continuous-State POMDPs

4.1 Related Work and Motivation

As described at the end of the last chapter, a POMDP can be converted to a continuous-state Markov decision process (MDP) by introducing the notion of the belief state [10], which is the conditional distribution of the current state given the history of observations and actions. For a finite-state POMDP, the belief space is finite dimensional (i.e., a probability simplex), whereas for a continuous-state POMDP, the belief space is an infinite-dimensional space of continuous probability distributions. This difference suggests that simple generalizations of many of the finite-state algorithms to continuous-state models are not appropriate or applicable. For example, discretization of the continuous state space may result in a finite-state POMDP of dimension either too large to solve computationally or not sufficiently precise. As another example, many algorithms for solving finite-state POMDPs (see [36] for a survey) are based on discretization of the finite-dimensional probability simplex; however, it is usually not feasible to discretize an infinite-dimensional space of probability distributions. Throughout this chapter, when we use the word "dimension" or "dimensional," we refer to the dimension of the belief space/state.
Despite the abundance of algorithms for finite-state POMDPs, the aforementioned difficulty has motivated some researchers to look for efficient algorithms for continuous-state POMDPs [69], [70], [87], [75], [17], [18]. Assuming discrete observation and action spaces, Porta et al. [69] showed that the optimal finite-horizon value function is defined by a finite set of "α-functions," and modeled all functions of interest by Gaussian mixtures. In later work [70], they extended their result and method to continuous observation and action spaces using sampling strategies. However, the number of Gaussian mixture components used to represent belief states and α-functions grows exponentially as the number of value iterations increases. Thrun [87] addressed continuous-state POMDPs using particle filtering to simulate the propagation of belief states, representing each belief state by a finite number of samples. The number of samples determines the dimension of the belief space, and the dimension may need to be very high in order to approximate the belief states closely. Roy [75] and Brooks et al. [17] used sufficient statistics to reduce the dimension of the belief space, an approach often referred to as belief compression in the artificial intelligence literature. Roy [75] proposed an augmented MDP (AMDP), using the maximum likelihood state and entropy to characterize belief states; these are usually not sufficient statistics except for a linear Gaussian model. As shown by Roy himself, the algorithm fails in a simple robot navigation problem, since the two statistics are not sufficient for distinguishing between a unimodal distribution and a bimodal distribution. Brooks et al. [17] proposed a parametric POMDP, representing the belief state as a Gaussian distribution so as to convert the POMDP to a problem of computing the value function over a two-dimensional continuous space, and using the extended Kalman filter to estimate the transition of the approximated belief state.
The restriction to the Gaussian representation has the same problem as the AMDP. The algorithm recently proposed by Brooks and Williams [18] is similar to ours, in that they also approximate the belief state by a parameterized density and solve the approximate belief MDP on the parameter space using Monte Carlo simulation-based methods. However, they did not specify how to calculate the parameters except for Gaussian densities, whereas we explicitly provide an analytical way to calculate the parameters for exponential families of densities. Moreover, we develop rigorous theoretical error bounds for our algorithm. There are some other belief compression algorithms designed for finite-state POMDPs, such as value-directed compression [71] and exponential family principal components analysis (E-PCA) belief compression [76]. They are not suitable for generalization to continuous-state models, since they are based on a fixed set of support points.

Motivated by the work of [87], [75] and [17], we develop a computationally tractable algorithm that effectively reduces the dimension of the belief state and has the flexibility to represent arbitrary belief states, such as multimodal or heavy-tailed distributions. The idea is to project the original high-/infinite-dimensional belief space onto a low-dimensional family of parameterized distributions by minimizing the Kullback-Leibler (KL) divergence between the belief state and its projection on that family of distributions. For an exponential family, the minimization of the KL divergence can be carried out in analytical form, making the method very easy to implement. The projected belief MDP can then be solved on the parameter space by using simulation-based algorithms, or can be further approximated by a finite-state MDP via a suitable discretization of the parameter space and thus solved by using standard solution techniques such as value iteration and policy iteration.
Our method can be viewed as a generalization of the AMDP in [75] and the parametric POMDP in [17], which consider only the family of Gaussian distributions. In addition, we provide theoretical results on the error bound of the value function and the performance of the near-optimal policy generated by our method. We also develop a projection particle filter for online filtering and decision making, by incorporating the density projection technique into particle filtering. The projection particle filter we propose here is a modification of the projection particle filter in [5]. Unlike in [5], where the predicted conditional density is projected, we project the updated conditional density, so as to ensure that the projected belief state remains in the given family of densities. Although this is a seemingly small modification of the algorithm, we prove a similar bound on the error between our projection particle filter and the exact filter under much less restrictive assumptions.

4.2 Dimension Reduction

4.2.1 Density Projection

Density projection is a useful idea for approximating an arbitrary (most likely, infinite-dimensional) density as accurately as possible by a density in a chosen family that is characterized by only a few parameters. A projection mapping from the belief space $B$ to a family of parameterized densities $\Omega$, denoted as $\mathrm{Proj}_\Omega : B \to \Omega$, is defined by
\[ \mathrm{Proj}_\Omega(b) \triangleq \arg\min_{f \in \Omega} D_{KL}(b \,\|\, f), \quad b \in B, \tag{4.1} \]
where $D_{KL}(b \,\|\, f)$ denotes the Kullback-Leibler (KL) divergence (or relative entropy) between $b$ and $f$, which is
\[ D_{KL}(b \,\|\, f) \triangleq \int b(x) \log \frac{b(x)}{f(x)} \, dx. \tag{4.2} \]
Hence, the projection of $b$ on $\Omega$ has the minimum KL divergence from $b$ among all the densities in $\Omega$. When $\Omega$ is an exponential family of densities, the minimization (4.1) has an analytical solution and can be carried out easily. The exponential families include many common families of densities, such as Gaussian, binomial, Poisson, Gamma, etc.
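For the Gaussian family, the analytical solution of (4.1) reduces to matching the expectations of the sufficient statistics $(x, x^2)$, i.e., matching the mean and variance of $b$. A minimal sketch for a weighted empirical density in one dimension (an illustration, not the thesis's code):

```python
import numpy as np

def project_to_gaussian(samples, weights):
    """KL projection of a weighted empirical density onto the 1-D Gaussian family.

    For a Gaussian exponential family the sufficient statistics are
    c(x) = (x, x^2), so minimizing the KL divergence amounts to matching
    the first two moments of b. Returns the matched (mean, variance).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mean = np.sum(weights * samples)                 # match E_b[x]
    var = np.sum(weights * (samples - mean) ** 2)    # match E_b[x^2] via the variance
    return mean, var
```

Even when $b$ is multimodal, the projection is well defined; it simply returns the member of the family with the same matched moments.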
An exponential family of densities is defined as follows [6]:

Definition 4.1 Let $\{c_1(\cdot), \ldots, c_m(\cdot)\}$ be affinely independent scalar functions defined on $\mathbb{R}^n$, i.e., for distinct points $x_1, \ldots, x_{m+1}$, $\sum_{i=1}^{m+1} \lambda_i c(x_i) = 0$ and $\sum_{i=1}^{m+1} \lambda_i = 0$ imply $\lambda_1 = \cdots = \lambda_{m+1} = 0$, where $c(x) = [c_1(x), \ldots, c_m(x)]^T$. Assuming that $\Theta_0 = \{\theta \in \mathbb{R}^m : \psi(\theta) = \log \int \exp(\theta^T c(x)) \, dx < \infty\}$ is a convex set with a nonempty interior, then $\Omega$ defined by
\[ \Omega = \{ f(\cdot, \theta) : \theta \in \Theta \}, \qquad f(x, \theta) = \exp[\theta^T c(x) - \psi(\theta)], \]
where $\Theta \subseteq \Theta_0$ is open, is called an exponential family of probability densities, with $\theta$ its parameter and $c(x)$ its sufficient statistic.

Substituting $f(x) = f(x, \theta)$ into (4.2) and expressing it further as
\[ D_{KL}(b \,\|\, f(\cdot, \theta)) = \int b(x) \log b(x) \, dx - \int b(x) \log f(x, \theta) \, dx, \]
we can see that the first term does not depend on $f(\cdot, \theta)$; hence minimizing $D_{KL}(b \,\|\, f(\cdot, \theta))$ is equivalent to
\[ \max_\theta \int b(x) \log f(x, \theta) \, dx, \]
which by Definition 4.1 is the same as
\[ \max_\theta \int \left( \theta^T c(x) - \psi(\theta) \right) b(x) \, dx. \tag{4.3} \]
Recall the fact that the log-likelihood $l(\theta) = \theta^T c(x) - \psi(\theta)$ is strictly concave in $\theta$ [55]; therefore, $\int (\theta^T c(x) - \psi(\theta)) b(x) \, dx$ is also strictly concave in $\theta$. Hence, (4.3) has a unique maximum, and the maximum is achieved when the first-order optimality condition is satisfied, i.e.,
\[ \int \left( c_j(x) - \frac{\int c_j(x) \exp(\theta^T c(x)) \, dx}{\int \exp(\theta^T c(x)) \, dx} \right) b(x) \, dx = 0. \]
Therefore, $b$ and its projection $f(\cdot, \theta)$ are related by
\[ E_b[c_j(X)] = E_\theta[c_j(X)], \quad j = 1, \ldots, m, \tag{4.4} \]
where $E_b$ and $E_\theta$ denote the expectations with respect to $b$ and $f(\cdot, \theta)$, respectively.

4.2.2 Projected Belief MDP

Using the idea of density projection, we can transform the belief MDP to another MDP confined to a low-dimensional belief space, and then solve this MDP problem. We call such an MDP the projected belief MDP. Its state is the projected belief state $b^p_k \in \Omega$
that satisfies the system dynamics
\[ b^p_0 = \mathrm{Proj}_\Omega(b_0); \qquad b^p_k = \tau(b^p_{k-1}, a_{k-1}, y_k)^p, \quad k = 1, 2, \ldots, \]
where $\tau(b^p_{k-1}, a_{k-1}, y_k)^p = \mathrm{Proj}_\Omega(\tau(b^p_{k-1}, a_{k-1}, y_k))$, and the dynamic programming mapping on the projected belief MDP is
\[ T^p J(b^p) = \min_{a \in A} \left[ \langle g(\cdot, a), b^p \rangle + \gamma E_Y \{ J(\tau(b^p, a, Y)^p) \} \right]. \tag{4.5} \]
For the projected belief MDP, a policy is denoted as $\pi^p = \{\mu^p_0, \mu^p_1, \ldots\}$, where each function $\mu^p_k$ maps the projected belief state $b^p_k$ into the action space $A$. Similarly, a stationary policy is denoted as $\mu^p$; an optimal stationary policy is denoted as $\mu^{p*}$; and the optimal value function is denoted as $J^{p*}(b^p)$.

The projected belief MDP is in fact a low-dimensional continuous-state MDP, and can be solved in numerous ways. One common approach is to use value iteration or policy iteration by converting the projected belief MDP to a discrete-state MDP problem via a suitable discretization of the projected belief space (i.e., the parameter space) and then estimating the one-step cost function and transition probabilities on the discretized mesh. The effect of the discretization procedure on dynamic programming has been studied in [9]. We describe this approach in detail below.

Discretization of the projected belief space $\Omega$ is equivalent to discretization of the parameter space $\Theta$, which yields a set of grid points, denoted by $G = \{\theta^i, i = 1, \ldots, N\}$. Let $\tilde{g}(\theta^i, a)$ denote the one-step cost function associated with taking action $a$ at the projected belief state $b^p = f(\cdot, \theta^i)$. Let $\tilde{P}(\theta^i, a)(\theta^j)$ denote the transition probability from the current projected belief state $b^p_k = f(\cdot, \theta^i)$ to the next projected belief state $b^p_{k+1} = f(\cdot, \theta^j)$ under action $a$. Estimation of $\tilde{P}(\theta^i, a)(\theta^j)$ is done using a variation of the projection particle filtering algorithm, to be described in the next section. $\tilde{g}(\theta^i, a)$ can be estimated by its sample mean:
\[ \tilde{g}(\theta^i, a) = \frac{1}{N} \sum_{j=1}^{N} g(x^j, a), \tag{4.6} \]
where $x^1, \ldots, x^N$ are sampled i.i.d. from $f(\cdot, \theta^i)$.
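Once $\tilde{g}$ and $\tilde{P}$ have been estimated on the grid, the discretized projected belief MDP can be solved by standard value iteration. A minimal sketch (the array layout is an assumption for illustration):

```python
import numpy as np

def solve_projected_mdp(g_tilde, P_tilde, gamma, n_iter=500):
    """Value iteration on the discretized projected belief MDP.

    g_tilde[i, a]    : estimated one-step cost at grid point theta^i, eq. (4.6)
    P_tilde[a][i, j] : estimated transition probability theta^i -> theta^j under a
    Returns the value function on the grid and a greedy stationary policy.
    """
    n_grid, n_actions = g_tilde.shape
    J = np.zeros(n_grid)
    for _ in range(n_iter):
        # One application of the mapping T^p restricted to the grid, cf. (4.5).
        Q = g_tilde + gamma * np.stack(
            [P_tilde[a] @ J for a in range(n_actions)], axis=1)
        J = Q.min(axis=1)
    return J, Q.argmin(axis=1)
```

Since $\gamma < 1$, the iteration is a contraction and converges geometrically to the fixed point on the grid.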
Remark 4.1 The approach for solving the projected belief MDP described here is probably the most intuitive, but not necessarily the most computationally efficient. Other more efficient techniques for solving continuous-state MDPs can be used to solve the projected belief MDP, such as the linear programming approach [27], neuro-dynamic programming methods [12], and simulation-based methods [22].

4.3 Projection Particle Filtering

Solving the projected belief MDP gives us a policy, which tells us what action to take at each projected belief state. In an online implementation, at each time $k$, the decision maker receives a new observation $y_k$, estimates the belief state $b_k$, and then chooses the action $a_k$ according to $b_k$ and that policy. Hence, implementing our approach requires addressing the problem of estimating the belief state. Estimation of $b_k$, simply called filtering, does not have an analytical solution in most cases except for linear Gaussian systems, but it can be approached using many approximation methods, such as the extended Kalman filter and particle filtering. Here we focus on particle filtering, because 1) it outperforms the extended Kalman filter in many nonlinear/non-Gaussian systems [4], and 2) we will develop a projection particle filter to be used in conjunction with the projected belief MDP.

To obtain a reasonable approximation of the belief state, particle filtering needs a large number of samples/particles. Since the number of samples/particles is the dimension of the approximate belief state $\hat{b}$, particle filtering is not very helpful in reducing the dimensionality of the belief space. Moreover, particle filtering does not give us an approximate belief state in the projected belief space $\Omega$; hence the policy we obtained by solving the projected belief MDP is not immediately applicable. We incorporate the idea of density projection into particle filtering, so as to approximate the belief state by a density in $\Omega$.
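One step of this idea (predict, reweight by the likelihood, project by moment matching, resample from the projected density) can be sketched for a scalar Gaussian family. The functions `f_dyn` and `lik` are placeholders for the problem's state equation and observation likelihood; this is an illustration under those assumptions, not the thesis's implementation.

```python
import numpy as np

def ppf_step(theta_hat, y, f_dyn, lik, N=500, rng=None):
    """One projection-particle-filtering step for a scalar Gaussian family.

    theta_hat : (mean, var) of the current projected belief f(., theta_{k-1})
    y         : new observation y_k
    f_dyn     : propagates an array of states through the state equation
    lik       : lik(y, x) evaluates p(y_k | x_k) on an array of states
    """
    rng = rng or np.random.default_rng(0)
    mean, var = theta_hat
    # Resampling from the projected density (Step 5 of the previous iteration).
    x = rng.normal(mean, np.sqrt(var), size=N)
    # Prediction: propagate particles through the state equation.
    x_pred = f_dyn(x, rng)
    # Bayes' updating: importance weights from the likelihood.
    w = lik(y, x_pred)
    w = w / w.sum()
    # Projection: match the Gaussian sufficient statistics, cf. (4.4).
    new_mean = np.sum(w * x_pred)
    new_var = np.sum(w * (x_pred - new_mean) ** 2)
    return new_mean, new_var
```

Because the resampling draws from the continuous projected density rather than from a discrete empirical measure, repeated particles do not accumulate from step to step.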
The projection particle filter we propose here is a modification of the one in [5]. Their projection particle filter projects the empirical predicted belief state, not the empirical updated belief state, onto a parametric family of densities, and hence the approximate belief state might not be in that family after Bayes' updating. We will directly project the empirical updated belief state onto a parametric family. In addition, we will need much less restrictive assumptions than [5] to obtain similar error bounds. Since resampling is from a continuous distribution instead of an empirical (discrete) one, the proposed projection particle filter also overcomes the difficulty of sample impoverishment [4] that occurs in the bootstrap filter.

Applying the density projection technique described in the last section, projecting the empirical belief state $\hat{b}_k$ onto an exponential family $\Omega$ involves finding an $f(\cdot, \theta)$ with the parameter $\theta$ satisfying (4.4). Hence, letting $b = \hat{b}_k$ in (4.4) and plugging in (2.11), $\theta$ should satisfy
\[ \sum_{i=1}^{N} w^i c_j(x^i_{k|k-1}) = E_\theta[c_j], \quad j = 1, \ldots, m, \]
which constitutes the projection step in the projection particle filter.

Algorithm 4.1 (Projection particle filtering for an exponential family of densities (PPF))

- Input: a (stationary) policy $\mu^p$ on the projected belief MDP; an exponential family of densities $\Omega = \{f(\cdot, \theta) : \theta \in \Theta\}$; a sequence of observations $y_1, y_2, \ldots$ arriving sequentially at times $k = 1, 2, \ldots$.
- Output: a sequence of approximate belief states $f(\cdot, \hat{\theta}_1), f(\cdot, \hat{\theta}_2), \ldots$.
- Step 1. Initialization: Sample $x^1_0, \ldots, x^N_0$ i.i.d. from the approximate initial belief state $f(\cdot, \hat{\theta}_0)$. Set $k = 1$.
- Step 2. Prediction: Compute $x^1_{k|k-1}, \ldots, x^N_{k|k-1}$ by propagating $x^1_{k-1}, \ldots, x^N_{k-1}$ according to the system dynamics (3.2) using the action $a_{k-1} = \mu^p(f(\cdot, \hat{\theta}_{k-1}))$ and randomly generated noise $\{u^i_{k-1}\}_{i=1}^N$, i.e., sample $x^i_{k|k-1}$ from $p(\cdot \mid x^i_{k-1}, a_{k-1})$, $i = 1, \ldots, N$.
- Step 3. Bayes' updating: Receive a new observation $y_k$.
Calculate the weights as
\[ w^i_k = \frac{p(y_k \mid x^i_{k|k-1}, a_{k-1})}{\sum_{i=1}^{N} p(y_k \mid x^i_{k|k-1}, a_{k-1})}, \quad i = 1, \ldots, N. \]
- Step 4. Projection: The approximate belief state is $f(\cdot, \hat{\theta}_k)$, where $\hat{\theta}_k$ satisfies the equations
\[ \sum_{i=1}^{N} w^i_k c_j(x^i_{k|k-1}) = E_{\hat{\theta}_k}[c_j], \quad j = 1, \ldots, m. \]
- Step 5. Resampling: Sample $x^1_k, \ldots, x^N_k$ from $f(\cdot, \hat{\theta}_k)$.
- Step 6. $k \leftarrow k + 1$ and go to Step 2.

In an online implementation, at each time $k$, the PPF approximates $b_k$ by $f(\cdot, \hat{\theta}_k)$, and then decides an action $a_k$ according to $a_k = \mu^p(f(\cdot, \hat{\theta}_k))$, where $\mu^p$ is an optimal policy solved for the projected belief MDP. As mentioned in the last section, the PPF can be varied slightly for estimating the transition probabilities of the discretized projected belief MDP, as follows:

Algorithm 4.2 (Estimation of the transition probabilities)

- Input: $\theta^i$, $a$, $N$.
- Output: $\tilde{P}(\theta^i, a)(\theta^j)$, $j = 1, \ldots, N$.
- Step 1. Sampling: Sample $x^1, \ldots, x^N$ from $f(\cdot, \theta^i)$.
- Step 2. Prediction: Compute $\tilde{x}^1, \ldots, \tilde{x}^N$ by propagating $x^1, \ldots, x^N$ according to the system dynamics (3.2) using the action $a$ and randomly generated noise $\{u^i\}_{i=1}^N$.
- Step 3. Sampling observations: Compute $y^1, \ldots, y^N$ from $\tilde{x}^1, \ldots, \tilde{x}^N$ according to the observation equation (3.3) using randomly generated noise $\{v^i\}_{i=1}^N$.
- Step 4. Bayes' updating: For each $y^k$, $k = 1, \ldots, N$, the updated belief state is
\[ \tilde{b}^k = \sum_{i=1}^{N} w^k_i \, \delta(x - \tilde{x}^i), \quad \text{where } w^k_i = \frac{p(y^k \mid \tilde{x}^i, a)}{\sum_{i=1}^{N} p(y^k \mid \tilde{x}^i, a)}, \quad i = 1, \ldots, N. \]
- Step 5. Projection: For $k = 1, \ldots, N$, project each $\tilde{b}^k$ onto the exponential family, i.e., find the $\tilde{\theta}^k$ that satisfies (4.4).
- Step 6. Estimation: For $k = 1, \ldots, N$, find the nearest neighbor of $\tilde{\theta}^k$ in $G$. For each $\theta^j \in G$, set $\tilde{P}(\theta^i, a)(\theta^j) = (\text{number of } \tilde{\theta}^k \text{ whose nearest neighbor is } \theta^j)/N$.

4.4 Analysis of the Projected Belief MDP

4.4.1 Main Idea

Our method solves the projected belief MDP instead of the original belief MDP, and that raises two questions: How well does the optimal value function of the projected belief MDP approximate the optimal value function of the original belief MDP?
How well does the optimal policy obtained by solving the projected belief MDP perform on the original belief MDP? To answer these questions, we first need to rephrase them mathematically. Here we assume perfect computation of the belief states and the projected belief states. We also assume the existence of an optimal policy that is stationary, as stated below.

Assumption 4.1 There exist a stationary optimal policy for the belief MDP, denoted by $\mu^*$, and a stationary optimal policy for the projected belief MDP, denoted by $\mu^{p*}$.

Assumption 4.1 holds under some mild conditions [10], [37]. Using the stationarity, and the dynamic programming mappings on the belief MDP and the projected belief MDP given by (3.6) and (4.5), the optimal value function $J^*(b)$ for the belief MDP can be obtained by
\[ J^*(b) \triangleq J_{\mu^*}(b) = \lim_{k \to \infty} T^k J_0(b), \]
and the optimal value function for the projected belief MDP by
\[ J^{p*}(b^p) \triangleq J^p_{\mu^{p*}}(b^p) = \lim_{k \to \infty} (T^p)^k J_0(b^p). \]
Therefore, the questions posed at the beginning of this section can be formulated mathematically as:

1. How well the optimal value function of the projected belief MDP approximates the true optimal value function can be measured by $|J^*(b) - J^{p*}(b^p)|$.

2. How well the optimal policy $\mu^{p*}$ for the projected belief MDP performs on the original belief space can be measured by $|J^*(b) - J_{\bar{\mu}^{p*}}(b)|$, where $\bar{\mu}^{p*}(b) \triangleq \mu^{p*} \circ \mathrm{Proj}_\Omega(b) = \mu^{p*}(b^p)$.

4.4.2 Error Bound

The next assumption bounds the difference between the belief state $b$ and its projection $b^p$, and also the difference between their one-step evolutions $\tau(b, a, y)$ and $\tau(b^p, a, y)^p$. It is an assumption on the projection error.

Assumption 4.2 There exists $\epsilon_1 > 0$ such that for all $a \in A$, $y \in O$ and $b \in B$,
\[ |\langle g(\cdot, a), b - b^p \rangle| \le \epsilon_1, \qquad |\langle g(\cdot, a), \tau(b, a, y) - \tau(b^p, a, y)^p \rangle| \le \epsilon_1. \]
The following assumption can be seen as a continuity property of the value function.

Assumption 4.3 Given $\epsilon > 0$ such that $|\langle g(\cdot, a), b - b' \rangle| \le \epsilon$, there exists $\delta > 0$ such that $|J_k(b) - J_k(b')| \le \delta$, $\forall b, b' \in B$, $\forall k$, and there exists $\tilde{\delta} > 0$ such that $|J_\mu(b) - J_\mu(b')| \le \tilde{\delta}$, $\forall b, b' \in B$, $\forall \mu \in \Pi$.

Now we present our main result.

Theorem 4.1 Under Assumptions 4.1, 4.2 and 4.3, for all $b \in B$,
\[ |J^*(b) - J^{p*}(b^p)| \le \frac{\epsilon_1 + \gamma \epsilon_2}{1 - \gamma}, \tag{4.7} \]
\[ |J^*(b) - J_{\bar{\mu}^{p*}}(b)| \le \frac{2\epsilon_1 + \gamma(\epsilon_2 + \epsilon_3)}{1 - \gamma}, \tag{4.8} \]
where $\epsilon_1$ is the constant in Assumption 4.2, and $\epsilon_2$, $\epsilon_3$ are the constants $\delta$ and $\tilde{\delta}$, respectively, in Assumption 4.3 corresponding to $\epsilon = \epsilon_1$.

Proof: Denote $J_k(b) \triangleq T^k J_0(b)$, $J^p_k(b^p) \triangleq (T^p)^k J_0(b^p)$, $k = 0, 1, \ldots$, and define
\begin{align*}
Q_k(b, a) &= \langle g(\cdot, a), b \rangle + \gamma E_Y\{ J_{k-1}(\tau(b, a, Y)) \}, & \mu_k(b) &= \arg\min_{a \in A} Q_k(b, a), \\
Q^p_k(b, a) &= \langle g(\cdot, a), b^p \rangle + \gamma E_Y\{ J^p_{k-1}(\tau(b^p, a, Y)^p) \}, & \mu^p_k(b) &= \arg\min_{a \in A} Q^p_k(b, a).
\end{align*}
Hence,
\[ J_k(b) = \min_{a \in A} Q_k(b, a) = Q_k(b, \mu_k(b)); \qquad J^p_k(b^p) = \min_{a \in A} Q^p_k(b, a) = Q^p_k(b, \mu^p_k(b)). \]
Denote $\mathrm{err}_k \triangleq \max_{b \in B} |J_k(b) - J^p_k(b^p)|$, $k = 1, 2, \ldots$.

We consider the first iteration. Initialize with $J_0(b) = J^p_0(b^p) = 0$. By Assumption 4.2, $\forall a \in A$,
\[ |Q_1(b, a) - Q^p_1(b, a)| = |\langle g(\cdot, a), b - b^p \rangle| \le \epsilon_1, \quad \forall b \in B. \tag{4.9} \]
Hence, with $a = \mu^p_1(b)$, the above inequality yields $Q_1(b, \mu^p_1(b)) \le J^p_1(b^p) + \epsilon_1$. Using $J_1(b) \le Q_1(b, \mu^p_1(b))$, we get
\[ J_1(b) \le J^p_1(b^p) + \epsilon_1, \quad \forall b \in B. \tag{4.10} \]
With $a = \mu_1(b)$, (4.9) yields $Q^p_1(b, \mu_1(b)) - \epsilon_1 \le J_1(b)$. Using $J^p_1(b^p) \le Q^p_1(b, \mu_1(b))$, we get
\[ J^p_1(b^p) - \epsilon_1 \le J_1(b), \quad \forall b \in B. \tag{4.11} \]
From (4.10) and (4.11), we conclude $|J_1(b) - J^p_1(b^p)| \le \epsilon_1$, $\forall b \in B$. Taking the maximum over $b$ on both sides of the above inequality yields
\[ \mathrm{err}_1 \le \epsilon_1. \tag{4.12} \]
Now we consider the $(k+1)$th iteration. For a fixed $Y = y$, by Assumption 4.2,
\[ |\langle g(\cdot, a), \tau(b, a, y) - \tau(b^p, a, y)^p \rangle| \le \epsilon_1. \]
Let $\epsilon_1$ be the $\epsilon$ in Assumption 4.3 and denote the corresponding $\delta$ by $\epsilon_2$. Then
\[ |J_k(\tau(b, a, y)) - J_k(\tau(b^p, a, y)^p)| \le \epsilon_2, \quad \forall b \in B. \tag{4.13} \]
Therefore, $\forall a \in A$,
\begin{align*}
|Q_{k+1}(b, a) - Q^p_{k+1}(b, a)| &\le |\langle g(\cdot, a), b - b^p \rangle| + \gamma E_Y\{ |J_k(\tau(b, a, Y)) - J^p_k(\tau(b^p, a, Y)^p)| \} \\
&\le \epsilon_1 + \gamma E_Y\{ |J_k(\tau(b, a, Y)) - J_k(\tau(b^p, a, Y)^p)| + |J_k(\tau(b^p, a, Y)^p) - J^p_k(\tau(b^p, a, Y)^p)| \} \\
&\le \epsilon_1 + \gamma(\epsilon_2 + \mathrm{err}_k), \quad \forall b \in B.
\end{align*}
The third inequality follows from (4.13) and the definition of $\mathrm{err}_k$. Using an argument similar to that used to prove (4.12) from (4.9), we conclude that
\[ \mathrm{err}_{k+1} \le \epsilon_1 + \gamma(\epsilon_2 + \mathrm{err}_k). \tag{4.14} \]
Using induction on (4.14) with the initial condition (4.12) and taking $k \to \infty$, we obtain
\[ |J^*(b) - J^{p*}(b^p)| \le \sum_{k=0}^{\infty} \gamma^k \epsilon_1 + \sum_{k=1}^{\infty} \gamma^k \epsilon_2 = \frac{\epsilon_1 + \gamma \epsilon_2}{1 - \gamma}. \tag{4.15} \]
Therefore, (4.7) is proved.

Fixing a policy $\mu$ on the original belief MDP, define the mappings under policy $\mu$ on the belief MDP and the projected belief MDP as
\[ T_\mu J(b) = \langle g(\cdot, \mu(b)), b \rangle + \gamma E_Y\{ J(\tau(b, \mu(b), Y)) \}, \tag{4.16} \]
\[ T^p_\mu J(b) = \langle g(\cdot, \mu(b)), b^p \rangle + \gamma E_Y\{ J(\tau(b^p, \mu(b), Y)^p) \}, \tag{4.17} \]
respectively. Since $\mu^{p*}$ is a stationary policy for the projected belief MDP, $\bar{\mu}^{p*} = \mu^{p*} \circ \mathrm{Proj}_\Omega$ is stationary for the original belief MDP. Hence,
\[ J^{p*}(b^p) = T^p_{\bar{\mu}^{p*}} J^{p*}(b^p), \qquad J_{\bar{\mu}^{p*}}(b) = T_{\bar{\mu}^{p*}} J_{\bar{\mu}^{p*}}(b). \]
Subtracting both sides of the above two equations, and substituting in the definitions of $T^p_\mu$ and $T_\mu$ (i.e., (4.17) and (4.16)) on the right-hand sides, we get
\[ J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b) = \langle g(\cdot, \mu^{p*}(b^p)), b^p - b \rangle + \gamma E_Y\left\{ J^{p*}(\tau(b^p, \mu^{p*}(b^p), Y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), Y)) \right\}. \tag{4.18} \]
For a fixed $Y = y$,
\[ |J^{p*}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))| \le |J^{p*}(\tilde{b}) - J_{\bar{\mu}^{p*}}(\tilde{b})| + |J_{\bar{\mu}^{p*}}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))|, \]
where $\tilde{b} = \tau(b^p, \mu^{p*}(b^p), y)^p \in B$. By Assumption 4.2,
\[ |\langle g(\cdot, a), \tau(b^p, \mu^{p*}(b^p), y)^p - \tau(b, \mu^{p*}(b^p), y) \rangle| \le \epsilon_1. \]
Letting $\epsilon = \epsilon_1$ in Assumption 4.3 and denoting the corresponding $\tilde{\delta}$ by $\epsilon_3$, we get for the second term
\[ |J_{\bar{\mu}^{p*}}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))| \le \epsilon_3. \]
Denoting $\mathrm{err} \triangleq \max_{b \in B} |J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b)|$, we obtain
\[ |J^{p*}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))| \le \mathrm{err} + \epsilon_3. \]
Therefore, (4.18) becomes
\[ |J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b)| \le \epsilon_1 + \gamma(\mathrm{err} + \epsilon_3). \]
Taking the maximum over $b$ on both sides of the above inequality yields $\mathrm{err} \le \epsilon_1 + \gamma(\mathrm{err} + \epsilon_3)$. Hence,
\[ \mathrm{err} \le \frac{\epsilon_1 + \gamma \epsilon_3}{1 - \gamma}. \tag{4.19} \]
With (4.15) and (4.19), we obtain
\[ |J^*(b) - J_{\bar{\mu}^{p*}}(b)| \le |J^*(b) - J^{p*}(b^p)| + |J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b)| \le \frac{2\epsilon_1 + \gamma(\epsilon_2 + \epsilon_3)}{1 - \gamma}, \quad \forall b \in B. \]
Therefore, (4.8) is proved. $\square$

Remark 4.2 In (4.7) and (4.8), $\epsilon_1$ is a projection error, and $\epsilon_2$ and $\epsilon_3$ decrease as the projection error $\epsilon_1$ decreases.
Therefore, as the projection error decreases, the optimal value function of the projected belief MDP $J^{p*}(b^p)$ converges to the true optimal value function $J^*(b)$, and the corresponding policy $\mu^{p*}$ converges to the true optimal policy $\mu^*$. Roughly speaking, the projection error decreases as the number of sufficient statistics in the chosen exponential family increases (for details, please see Section 4.6: Validation of the Assumptions).

4.5 Analysis of Projection Particle Filtering

In the above analysis, we assumed perfect computation of the belief states and the projected belief states. In this section, we consider the filtering error, and compute an error bound on the approximate belief state generated by the projection particle filter (PPF).

4.5.1 Notations

Let $C_b(\mathbb{R}^n)$ be the set of all continuous bounded functions on $\mathbb{R}^n$. Let $B(\mathbb{R}^n)$ be the set of all bounded measurable functions on $\mathbb{R}^n$. Let $\|\cdot\|$ denote the supremum norm on $B(\mathbb{R}^n)$, i.e., $\|\phi\| \triangleq \sup_{x \in \mathbb{R}^n} |\phi(x)|$, $\phi \in B(\mathbb{R}^n)$. Let $M^+(\mathbb{R}^n)$ and $P(\mathbb{R}^n)$ be the sets of nonnegative measures and probability measures on $\mathbb{R}^n$, respectively. If $\nu \in M^+(\mathbb{R}^n)$ and $\phi : \mathbb{R}^n \to \mathbb{R}$ is an integrable function with respect to $\nu$, then
\[ \langle \nu, \phi \rangle \triangleq \int \phi \, d\nu. \]
Moreover, if $\nu \in P(\mathbb{R}^n)$,
\[ E_\nu[\phi] = \langle \nu, \phi \rangle, \qquad \mathrm{Var}_\nu(\phi) = \langle \nu, \phi^2 \rangle - \langle \nu, \phi \rangle^2. \]
We will use the representations on the two sides of the above equalities interchangeably in the sequel.

The belief state and the projected belief state are probability densities; however, we will prove our results in terms of their corresponding probability measures, which we refer to as "conditional distributions" (belief states are conditional densities). The two representations are essentially the same once we assume the probability measures admit probability densities. Therefore, the notations used for probability densities before are used to denote their corresponding probability measures from now on.
Namely, we use $b$ to denote a probability measure on $\mathbb{R}^{n_x}$ and assume it admits a probability density with respect to the Lebesgue measure, which is the belief state. Similarly, we use $f(\cdot, \theta)$ to denote a probability measure on $\mathbb{R}^{n_x}$ and assume it admits a probability density with respect to the Lebesgue measure in the chosen exponential family with parameter $\theta$.

A probability transition kernel $K(E, x)$, with $E$ a Borel set and $x \in \mathbb{R}^{n_x}$, acts on a measure $\nu$ by
\[ K\nu(E) \triangleq \int_{\mathbb{R}^{n_x}} \nu(dx) K(E, x), \]
where $E$ is a set in the Borel $\sigma$-algebra on $\mathbb{R}^{n_x}$. For $\phi : \mathbb{R}^{n_x} \to \mathbb{R}$, an integrable function with respect to $K(\cdot, x)$,
\[ K\phi(x) \triangleq \int_{\mathbb{R}^{n_x}} \phi(x') K(dx', x). \]
Let $K_k(dx_k, x_{k-1})$ denote the probability transition kernel of the system (3.2) at time $k$, which satisfies
\[ b_{k|k-1}(dx_k) = K_k b_{k-1}(dx_k) = \int_{\mathbb{R}^{n_x}} b_{k-1}(dx_{k-1}) K_k(dx_k, x_{k-1}). \]
We let $\psi_k$ denote the likelihood function associated with the observation equation (3.3) at time $k$, and assume that $\psi_k \in C_b(\mathbb{R}^{n_x})$. Hence,
\[ b_k = \frac{\psi_k b_{k|k-1}}{\langle b_{k|k-1}, \psi_k \rangle}. \]

4.5.2 Main Idea

The exact filter (EF) at time $k$ can be described as
\[ b_{k-1} \xrightarrow{\text{prediction}} b_{k|k-1} = K_k b_{k-1} \xrightarrow{\text{updating}} b_k = \frac{\psi_k b_{k|k-1}}{\langle b_{k|k-1}, \psi_k \rangle}. \]
The PPF at time $k$ can be described as
\[ \hat{f}(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{prediction}} \hat{b}_{k|k-1} = K_k f(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{updating}} \hat{b}_k = \frac{\psi_k \hat{b}_{k|k-1}}{\langle \hat{b}_{k|k-1}, \psi_k \rangle} \xrightarrow{\text{projection}} f(\cdot, \hat{\theta}_k) \xrightarrow{\text{resampling}} \hat{f}(\cdot, \hat{\theta}_k). \]
To facilitate our analysis, we introduce a conceptual filter (CF), which at each time $k$ is reinitialized with $f(\cdot, \hat{\theta}_{k-1})$, performs exact prediction and updating to yield $b'_{k|k-1}$ and $b'_k$, respectively, and then performs projection to obtain $f(\cdot, \theta'_k)$. It can be described as
\[ f(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{prediction}} b'_{k|k-1} = K_k f(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{updating}} b'_k = \frac{\psi_k b'_{k|k-1}}{\langle b'_{k|k-1}, \psi_k \rangle} \xrightarrow{\text{projection}} f(\cdot, \theta'_k). \]
The CF serves as a bridge connecting the EF and the PPF, as we describe below.

We are interested in the difference between the true conditional distribution $b_k$ and the PPF-generated projected conditional distribution $f(\cdot, \hat{\theta}_k)$ for each time $k$.
The total error between the two can be decomposed as follows:
\[ b_k - f(\cdot;\hat{\theta}_k) = (b_k - b'_k) + (b'_k - f(\cdot;\theta'_k)) + (f(\cdot;\theta'_k) - f(\cdot;\hat{\theta}_k)). \tag{4.20} \]
The first term $(b_k - b'_k)$ is the error due to the inexact initial condition of the CF compared to the EF, i.e., $(b_{k-1} - f(\cdot;\hat{\theta}_{k-1}))$, which is also the total error at time $k-1$.

Table 4.1: Notations for the different conditional distributions
\[
\begin{array}{ll}
b_k & \text{exact conditional distribution} \\
\hat{b}_k & \text{PPF conditional distribution before projection} \\
f(\cdot;\hat{\theta}_k) & \text{PPF projected conditional distribution} \\
b'_k & \text{CF conditional distribution before projection} \\
f(\cdot;\theta'_k) & \text{CF projected conditional distribution}
\end{array}
\]

The second term $(b'_k - f(\cdot;\theta'_k))$ evaluates the minimum deviation from the exponential family generated by one step of exact filtering, since $f(\cdot;\theta'_k)$ is the projection of $b'_k$. The third term $(f(\cdot;\theta'_k) - f(\cdot;\hat{\theta}_k))$ is purely due to Monte Carlo simulation, since $f(\cdot;\theta'_k)$ and $f(\cdot;\hat{\theta}_k)$ are obtained by applying the same steps to $f(\cdot;\hat{\theta}_{k-1})$ and its empirical version $\hat{f}(\cdot;\hat{\theta}_{k-1})$, respectively. We will find error bounds on each of the three terms, and finally find the total error at time $k$ by induction.

4.5.3 Error Bound

We shall look at the case in which the observation process has an arbitrary but fixed value $y_{0:k} = \{y_0, \ldots, y_k\}$. Hence, all the expectations $E$ in this section are with respect to the sampling in the algorithm only. We consider test functions $\phi \in B(\mathbb{R}^{n_x})$. It can be seen that $K\phi \in B(\mathbb{R}^{n_x})$ and $\|K\phi\| \le \|\phi\|$, since
\[ |K\phi(x)| = \left| \int_{\mathbb{R}^{n_x}} \phi(x') K(dx'; x) \right| \le \int_{\mathbb{R}^{n_x}} |\phi(x')| K(dx'; x) \le \|\phi\| \int_{\mathbb{R}^{n_x}} K(dx'; x) = \|\phi\|. \]
Since $\psi_k \in C_b(\mathbb{R}^{n_x})$, we know that $\psi_k \in B(\mathbb{R}^{n_x})$ and $\psi_k \phi \in B(\mathbb{R}^{n_x})$. We also need the following assumptions.

Assumption 4.4 All the projected distributions are in a compact subset of the given exponential family. In other words, there exists a compact set $\Theta_0 \subset \Theta$ such that $\hat{\theta}_k \in \Theta_0$ and $\theta'_k \in \Theta_0$, $\forall k$.

Assumption 4.5 For all $k \in \mathbb{N}$,
\[ \langle b_{k|k-1}, \psi_k \rangle > 0; \qquad \langle b'_{k|k-1}, \psi_k \rangle > 0, \ \text{w.p.1}; \qquad \langle \hat{b}_{k|k-1}, \psi_k \rangle > 0, \ \text{w.p.1}. \]

Assumption 4.5 guarantees that the normalizing constant in the Bayes' updating is nonzero, so that the conditional distribution is well defined. Under Assumption 4.4, the second inequality in Assumption 4.5 can be strengthened using the compactness of $\Theta_0$. Since $f(\cdot; a_k, u_k)$ in (3.2) is continuous in $x$, $K_k$ is weakly continuous (pp. 175-177, [37]). Hence, $\langle b'_{k|k-1}, \psi_k \rangle = \langle K_k f(\cdot;\hat{\theta}_{k-1}), \psi_k \rangle = \langle f(\cdot;\hat{\theta}_{k-1}), K_k \psi_k \rangle$ is continuous in $\hat{\theta}_{k-1}$, where $\hat{\theta}_{k-1} \in \Theta_0$. Since $\Theta_0$ is compact, there exists a constant $\Delta > 0$ such that for each $k$
\[ \langle b'_{k|k-1}, \psi_k \rangle \ge \frac{1}{\Delta}, \ \text{w.p.1}. \tag{4.21} \]

The assumption below guarantees that the conditional distribution stays close to the given exponential family after one step of exact filtering if the initial distribution is in the exponential family. Recall that starting with initial distribution $f(\cdot;\hat{\theta}_{k-1})$, one step of exact filtering yields $b'_k$, which is then projected to yield $f(\cdot;\theta'_k)$, where $\hat{\theta}_{k-1} \in \Theta_0$, $\theta'_k \in \Theta_0$.

Assumption 4.6 There exists a constant $\delta > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$ and all $k \in \mathbb{N}$,
\[ E[|\langle b'_k, \phi \rangle - \langle f(\cdot;\theta'_k), \phi \rangle|] \le \delta \|\phi\|. \]

Remark 4.3 Assumption 4.6 is our main assumption, which essentially imposes a bound on the projection error. Our assumptions are much less restrictive than the assumptions in [5], while our conclusion is similar to but slightly different from that in [5], as will be seen later. Although Assumption 4.6 appears similar to Assumption 3 in [5], it is essentially different. Assumption 3 in [5] says that the optimal conditional density stays close to the given exponential family for all time, whereas Assumption 4.6 only assumes that if the exact filter starts in the given exponential family, after one step the conditional distribution stays close to the family. Moreover, we do not need any assumption like the restrictive Assumption 4 in [5].

Lemma 4.1 considers the bound on the first term in (4.20).

Lemma 4.1 For each $k \in \mathbb{N}$, suppose $E[|\langle b_{k-1} - f(\cdot;\hat{\theta}_{k-1}), \phi \rangle|] \le e_{k-1} \|\phi\|$, $\forall \phi \in B(\mathbb{R}^{n_x})$, where $e_{k-1}$ is a positive constant.
Then under Assumptions 4.4 and 4.5, for each $k \in \mathbb{N}$, there exists a constant $\alpha_k > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$
\[ E[|\langle b_k - b'_k, \phi \rangle|] \le \alpha_k e_{k-1} \|\phi\|. \tag{4.22} \]

Proof: $E[|\langle b_{k-1} - f(\cdot;\hat{\theta}_{k-1}), \phi \rangle|]$ is the error from time $k-1$, which is also the initial error for time $k$. Hence, the prediction step yields
\[ E[|\langle b_{k|k-1} - b'_{k|k-1}, \phi \rangle|] = E[|\langle K_k(b_{k-1} - f(\cdot;\hat{\theta}_{k-1})), \phi \rangle|] = E[|\langle b_{k-1} - f(\cdot;\hat{\theta}_{k-1}), K_k \phi \rangle|] \le e_{k-1} \|K_k \phi\| \le e_{k-1} \|\phi\|. \tag{4.23} \]
Under Assumptions 4.4 and 4.5, we have shown (4.21). Using (4.21) and (4.23), the Bayes' updating step yields
\begin{align*}
E[|\langle b_k - b'_k, \phi \rangle|]
&= E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b_{k|k-1}, \psi_k \rangle} - \frac{\langle b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b_{k|k-1}, \psi_k \rangle} - \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
 + E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} - \frac{\langle b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle \, \langle b_{k|k-1} - b'_{k|k-1}, \psi_k \rangle}{\langle b_{k|k-1}, \psi_k \rangle \, \langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
 + E\left[\left| \frac{\langle b_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le \Delta \frac{|\langle b_{k|k-1}, \psi_k \phi \rangle|}{\langle b_{k|k-1}, \psi_k \rangle} E[|\langle b_{k|k-1} - b'_{k|k-1}, \psi_k \rangle|]
 + \Delta E[|\langle b_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle|] \\
&\le \Delta e_{k-1} \|\phi\| \|\psi_k\| + \Delta e_{k-1} \|\psi_k \phi\|
 \le 2\Delta e_{k-1} \|\psi_k\| \|\phi\| = \alpha_k e_{k-1} \|\phi\|,
\end{align*}
where $\alpha_k = 2\Delta \|\psi_k\|$. $\Box$

Lemma 4.2 considers the bound on the third term in (4.20) before projection.

Lemma 4.2 Under Assumptions 4.4 and 4.5, for each $k \in \mathbb{N}$, there exists a constant $\gamma_k > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$
\[ E[|\langle \hat{b}_k - b'_k, \phi \rangle|] \le \gamma_k \frac{\|\phi\|}{\sqrt{N}}. \]

Proof: This lemma uses essentially the same proof technique as Lemmas 3 and 4 in [23]. However, it is not quite obvious how those lemmas imply our lemma here; therefore, we state the proof to make this chapter more accessible. After the resampling step, $\hat{f}(\cdot;\hat{\theta}_{k-1}) = \frac{1}{N} \sum_{i=1}^N \delta(x - x^i_{k-1})$, where $x^i_{k-1}, i = 1, \ldots, N$, are i.i.d. samples from $f(\cdot;\hat{\theta}_{k-1})$. Hence,
\begin{align*}
\left( E\left[ \langle \hat{f}(\cdot;\hat{\theta}_{k-1}) - f(\cdot;\hat{\theta}_{k-1}), \phi \rangle^2 \right] \right)^{1/2}
&= \left( E\left[ \left( \frac{1}{N} \sum_{i=1}^N \left( \phi(x^i_{k-1}) - \langle f(\cdot;\hat{\theta}_{k-1}), \phi \rangle \right) \right)^{\!2} \right] \right)^{1/2} \\
&= \frac{1}{\sqrt{N}} \left( E\left[ \frac{1}{N} \sum_{i=1}^N \left( \phi(x^i_{k-1}) - \langle f(\cdot;\hat{\theta}_{k-1}), \phi \rangle \right)^2 \right] \right)^{1/2} \\
&= \frac{1}{\sqrt{N}} \left( \langle f(\cdot;\hat{\theta}_{k-1}), \phi^2 \rangle - \langle f(\cdot;\hat{\theta}_{k-1}), \phi \rangle^2 \right)^{1/2} \\
&\le \frac{1}{\sqrt{N}} \langle f(\cdot;\hat{\theta}_{k-1}), \phi^2 \rangle^{1/2} \le \frac{\|\phi\|}{\sqrt{N}}. \tag{4.24}
\end{align*}
The Bayes' updating step yields
\begin{align*}
E[|\langle \hat{b}_k - b'_k, \phi \rangle|]
&= E\left[\left| \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle}{\langle \hat{b}_{k|k-1}, \psi_k \rangle} - \frac{\langle b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le E\left[\left| \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle \, \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \rangle}{\langle \hat{b}_{k|k-1}, \psi_k \rangle \, \langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
 + E\left[\left| \frac{\langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right].
\end{align*}
Under Assumptions 4.4 and 4.5, we have shown (4.21). Using the Cauchy-Schwarz inequality, (4.21) and (4.24), the first term can be bounded as
\begin{align*}
E\left[\left| \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle \, \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \rangle}{\langle \hat{b}_{k|k-1}, \psi_k \rangle \, \langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
&\le \Delta \left( E\left[ \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle^2}{\langle \hat{b}_{k|k-1}, \psi_k \rangle^2} \right] \right)^{1/2} \left( E\left[ \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \rangle^2 \right] \right)^{1/2} \\
&= \Delta \left( E\left[ \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle^2}{\langle \hat{b}_{k|k-1}, \psi_k \rangle^2} \right] \right)^{1/2} \left( E\left[ \langle \hat{f}(\cdot;\hat{\theta}_{k-1}) - f(\cdot;\hat{\theta}_{k-1}), K_k \psi_k \rangle^2 \right] \right)^{1/2} \\
&\le \Delta \|\phi\| \frac{\|\psi_k\|}{\sqrt{N}},
\end{align*}
and the second term as
\begin{align*}
E\left[\left| \frac{\langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
&\le \Delta \left( E\left[ \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle^2 \right] \right)^{1/2} \\
&= \Delta \left( E\left[ \langle \hat{f}(\cdot;\hat{\theta}_{k-1}) - f(\cdot;\hat{\theta}_{k-1}), K_k (\psi_k \phi) \rangle^2 \right] \right)^{1/2} \\
&\le \Delta \frac{\|\psi_k \phi\|}{\sqrt{N}} \le \Delta \frac{\|\psi_k\| \|\phi\|}{\sqrt{N}}.
\end{align*}
Therefore, adding these two terms yields
\[ E[|\langle \hat{b}_k - b'_k, \phi \rangle|] \le 2\Delta \|\psi_k\| \frac{\|\phi\|}{\sqrt{N}} = \gamma_k \frac{\|\phi\|}{\sqrt{N}}, \]
where $\gamma_k = 2\Delta \|\psi_k\|$. $\Box$

Lemma 4.3 considers the bound on the third term in (4.20), based on the result of Lemma 4.2. The key idea of the proof is to connect the errors before and after projection through (4.4), which we derived for the density projection that minimizes the KL divergence.

Lemma 4.3 Let $c_j, j = 1, \ldots, m$, be the sufficient statistics of the exponential family as defined in Definition 4.1, and assume $c_j \in B(\mathbb{R}^{n_x}), j = 1, \ldots, m$. Then under Assumptions 4.4 and 4.5, for each $k \in \mathbb{N}$, there exists a constant $d_k > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$
\[ E[|\langle f(\cdot;\hat{\theta}_k) - f(\cdot;\theta'_k), \phi \rangle|] \le d_k \frac{\|\phi\|}{\sqrt{N}}. \tag{4.25} \]

Proof: The key idea of the proof of Lemma 4 in [5] is used here. From (4.4), we know that $E_{\hat{\theta}_k}[c_j(X)] = E_{\hat{b}_k}[c_j(X)]$ and $E_{\theta'_k}[c_j(X)] = E_{b'_k}[c_j(X)]$. Hence, we obtain
\[ E[|E_{\hat{\theta}_k}[c_j(X)] - E_{\theta'_k}[c_j(X)]|] = E[|\langle \hat{b}_k - b'_k, c_j \rangle|], \quad j = 1, \ldots, m. \]
Summing over $j$, we obtain
\[ E\left[ \sum_{j=1}^m |E_{\hat{\theta}_k}[c_j(X)] - E_{\theta'_k}[c_j(X)]| \right] = \sum_{j=1}^m E[|\langle \hat{b}_k - b'_k, c_j \rangle|]. \]
Since $c_j \in B(\mathbb{R}^{n_x})$, we apply Lemma 4.2 with $\phi = c_j$ and thus obtain
\[ E[|\langle \hat{b}_k - b'_k, c_j \rangle|] \le \gamma_k \frac{\|c_j\|}{\sqrt{N}}, \quad j = 1, \ldots, m. \]
Therefore,
\[ E\left[ \left\| E_{\hat{\theta}_k}[c(X)] - E_{\theta'_k}[c(X)] \right\|_1 \right] \le \frac{\tilde{\gamma}_k}{\sqrt{N}}, \]
where $\|\cdot\|_1$ denotes the $L^1$ norm, $c = [c_1, \ldots, c_m]^T$, and $\tilde{\gamma}_k = \gamma_k \sum_{j=1}^m \|c_j\|$. Since $\Theta_0$ is compact and the Fisher information matrix
\[ \left[ E_\theta[c_i(X) c_j(X)] - E_\theta[c_i(X)] E_\theta[c_j(X)] \right]_{ij} \]
is positive definite, there exists a constant $\kappa > 0$ such that (cf. Fact 2 in [5] for a detailed proof)
\[ \|\hat{\theta}_k - \theta'_k\|_1 \le \kappa \left\| E_{\hat{\theta}_k}[c(X)] - E_{\theta'_k}[c(X)] \right\|_1. \]
Taking the expectation on both sides yields
\[ E[\|\hat{\theta}_k - \theta'_k\|_1] \le \kappa E\left[ \left\| E_{\hat{\theta}_k}[c(X)] - E_{\theta'_k}[c(X)] \right\|_1 \right] \le \kappa \tilde{\gamma}_k \frac{1}{\sqrt{N}}. \]
On the other hand, taking the derivative of $E_\theta[\phi(X)]$ with respect to $\theta_i$ yields
\[ \left| \frac{d}{d\theta_i} E_\theta[\phi(X)] \right| = |E_\theta[c_i(X)\phi(X)] - E_\theta[c_i(X)] E_\theta[\phi(X)]| \le \sqrt{\mathrm{Var}_\theta(c_i)\,\mathrm{Var}_\theta(\phi)} \le \sqrt{E_\theta(c_i^2)\,E_\theta(\phi^2)} \le \|c_i\| \|\phi\|. \]
Hence,
\[ \left\| \frac{d}{d\theta} E_\theta[\phi(X)] \right\|_1 \le \left( \sum_{i=1}^m \|c_i\| \right) \|\phi\|. \]
Since $\Theta_0$ is compact, there exists a constant $\lambda > 0$ such that $E_\theta[\phi(X)]$ is Lipschitz over $\theta \in \Theta_0$ with Lipschitz constant $\lambda \|\phi\|$ (cf. the proof of Fact 2 in [5]), i.e.,
\[ |E_{\hat{\theta}_k}[\phi] - E_{\theta'_k}[\phi]| \le \lambda \|\phi\| \|\hat{\theta}_k - \theta'_k\|_1. \]
Taking expectation on both sides yields
\[ E[|\langle f(\cdot;\hat{\theta}_k) - f(\cdot;\theta'_k), \phi \rangle|] \le \lambda \|\phi\| E[\|\hat{\theta}_k - \theta'_k\|_1] \le \lambda \|\phi\| \kappa \tilde{\gamma}_k \frac{1}{\sqrt{N}} = d_k \frac{\|\phi\|}{\sqrt{N}}, \]
where $d_k = \kappa \lambda \tilde{\gamma}_k$. $\Box$

Now we present our main result on the error bound of the projection particle filter.

Theorem 4.2 Suppose $E[|\langle b_0 - f(\cdot;\hat{\theta}_0), \phi \rangle|] \le e_0 \|\phi\|$, $e_0 \ge 0$, $\forall \phi \in B(\mathbb{R}^{n_x})$. Under Assumptions 4.4, 4.5 and 4.6, and assuming that $c_j \in B(\mathbb{R}^{n_x}), j = 1, \ldots, m$, there exist $\alpha_i > 0, d_i > 0, i = 1, \ldots, k$, such that for all $\phi \in B(\mathbb{R}^{n_x})$,
\[ E[|\langle b_k - f(\cdot;\hat{\theta}_k), \phi \rangle|] \le e_k \|\phi\|, \quad k = 1, 2, \ldots, \]
where
\[ e_k = \alpha_1^k e_0 + \left( \sum_{i=2}^k \alpha_i^k + 1 \right) \delta + \left( \sum_{i=2}^k \alpha_i^k d_{i-1} + d_k \right) \frac{1}{\sqrt{N}}, \tag{4.26} \]
$\alpha_i^k = \prod_{j=i}^k \alpha_j$ for $k \ge i$, $\alpha_i^k = 0$ for $k < i$, and $\delta$ is the constant in Assumption 4.6.

Proof: Applying Lemma 4.1, Assumption 4.6, and Lemma 4.3, we have that for all $\phi \in B(\mathbb{R}^{n_x})$ and $k \in \mathbb{N}$, there exist $\alpha_i > 0, d_i > 0, i = 1, \ldots, k$, such that
\begin{align*}
E[|\langle b_k - f(\cdot;\hat{\theta}_k), \phi \rangle|]
&\le E[|\langle b_k - b'_k, \phi \rangle|] + E[|\langle b'_k - f(\cdot;\theta'_k), \phi \rangle|] + E[|\langle f(\cdot;\theta'_k) - f(\cdot;\hat{\theta}_k), \phi \rangle|] \\
&\le \left( \alpha_k e_{k-1} + \delta + d_k \frac{1}{\sqrt{N}} \right) \|\phi\| = e_k \|\phi\|.
\end{align*}
It is easy to deduce (4.26) by induction. $\Box$

Remark 4.4 As we mentioned in Remark 4.2, the projection errors $e_0$ and $\delta$ decrease as the number of sufficient statistics in the chosen exponential family, $m$, increases. The error $e_k$ decreases at the rate $1/\sqrt{N}$ as we increase the number of samples in the projection particle filter. However, notice that the coefficient in front of $1/\sqrt{N}$ grows with time, so we have to use an increasing number of samples as time goes on in order to ensure a uniform error bound with respect to time.

4.6 Validation of the Assumptions

Assumptions 4.2 and 4.6 are the main assumptions of our analysis. They are assumptions on the projection error, asserting that the density projection introduces a "small" error. We will show that in certain cases these assumptions hold, and the projection error converges to 0 as the number of sufficient statistics, $m$, goes to infinity. We will first state a convergence result from [7]. However, as their convergence result is in the sense of KL divergence, we will further show convergence in the sense employed in our assumptions by using an intermediate result in [7].

Consider a probability density function $b$ defined on a bounded interval, and approximate it by $b_p$, a density in an exponential family whose sufficient statistics consist of polynomials, splines, or trigonometric series. The following theorem is proved in [7].
Theorem 4.3 If $\log b$ has $r$ square-integrable derivatives, i.e., $\int |D^r \log b|^2 < \infty$, then $D_{KL}(b \| b_p)$ converges to 0 at rate $m^{-2r}$ as $m \to \infty$.

Theorem 4.3 says that the projected density $b_p$ converges to $b$ in the sense of KL divergence as $m$ goes to infinity. An intermediate result (see (6.6) in [7]) is
\[ \|\log(b / b_p)\| \le \epsilon_m, \]
where $\epsilon_m$ is a constant that depends on $m$, and $\epsilon_m \to 0$ as $m \to \infty$. Since $b$ is bounded and $\log(\cdot)$ is a continuously differentiable function, there exists a constant $\zeta$ such that $\|b - b_p\| \le \zeta \|\log b - \log b_p\|$. Hence, with the intermediate result above,
\[ |\langle \phi, b - b_p \rangle| \le \|\phi\| \int |b - b_p| \, dx \le \|\phi\| \int \zeta \left| \log \frac{b}{b_p} \right| dx \le \|\phi\| \zeta l \epsilon_m, \]
where $l$ is the length of the bounded interval on which $b$ is defined. Since $\epsilon_m$ can be made arbitrarily small by taking $m$ large enough, it is easy to see that Assumptions 4.2 and 4.6 hold in the cases that we consider.

4.7 Numerical Experiments

4.7.1 Scalability and Computational Issues

Estimation of the one-step cost function (4.6) and of the transition probabilities (Algorithm 4.2) is executed for every belief-action pair in the discretized mesh $G$ and the action space $A$. Hence, these computations scale as $O(|G||A|N)$ and $O(|G||A|N^2)$, respectively, where $|G|$ is the number of grid points, $|A|$ is the number of actions, and $N$ is the number of samples specified in the algorithms. In our implementation, we found that most of the computation time is spent on executing Algorithm 4.2 over all belief-action pairs. However, the estimation of the cost functions and transition probabilities can be precomputed and stored, and hence only needs to be done once. The advantage of the algorithms is that their scalability is independent of the size of the actual state space, since $G$ is a grid mesh on the parameter space of the projected belief space. That is exactly what is desired by employing density projection. However, to obtain a better approximation, more parameters in the exponential family should be used, and that leads to a higher-dimensional parameter space to discretize.
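As a back-of-the-envelope illustration of these scaling laws, the count below assumes a two-parameter Gaussian grid, a binary action space, and $N = 200$ samples per estimate; the sizes are illustrative, chosen to match the order of magnitude of the experiments reported later in Section 4.7.2.

```python
# Illustrative operation counts for the offline precomputation, assuming a
# two-parameter Gaussian grid (mean x standard deviation), a binary action
# space, and N samples per estimate.
n_mean, n_std = 31, 26           # grid points per parameter (illustrative)
G = n_mean * n_std               # |G|: grid points on the projected belief space
A = 2                            # |A|: number of actions
N = 200                          # samples per estimate

cost_ops = G * A * N             # O(|G||A|N): one-step cost estimation
transition_ops = G * A * N ** 2  # O(|G||A|N^2): transition-probability estimation
```

The $N^2$ term dominates by two orders of magnitude, which is consistent with the observation above that executing Algorithm 4.2 over all belief-action pairs accounts for most of the computation time.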
Increasing the number of parameters in the exponential family also makes sampling more difficult. Sampling from a general exponential family is usually not easy, and may require advanced techniques such as Markov chain Monte Carlo (MCMC) [33], and hence more computation time. This difficulty can be avoided by resampling from the discrete particles instead of from the projected density, which is equivalent to using the plain particle filter and then doing the projection outside the filter; this may, however, lead to sample degeneracy. The trade-off between a better approximation and less computation time is complicated and needs more research. We plan to study how to appropriately choose the exponential family and how to improve the simulation efficiency in the future.

4.7.2 Simulation Results

Since most of the benchmark POMDP problems in the literature assume a discrete state space, it is difficult to compare against the state of the art. Here we consider an inventory control problem obtained by adding a partial observation equation to a fully observable inventory control problem. The fully observable problem has an optimal threshold policy [60], which allows us to verify our method in the limiting case by setting the observation noise very close to 0. In our inventory control problem, the inventory level is reviewed at discrete times, and the observations are noisy because of, e.g., inventory spoilage, misplacement, or distributed storage. In each period, the inventory is either replenished by an order of a fixed amount or not replenished. Customer demands arrive randomly with a known distribution. The demand is filled if there is enough inventory remaining. Otherwise, in the case of a shortage, the excess demand is not satisfied and a penalty is charged on the lost sales amount. We assume that the demand and the observation noise are both continuous random variables; hence the state, i.e., the inventory level, and the observation are continuous random variables.
Let $x_k$ denote the inventory level at period $k$, $u_k$ the i.i.d. random demand at period $k$, $a_k$ the replenishment decision at period $k$ (i.e., $a_k = 0$ or $1$), $Q$ the fixed order amount, $y_k$ the observation of the inventory level $x_k$, $v_k$ the i.i.d. observation noise, $h$ the per-period per-unit inventory holding cost, and $s$ the per-period per-unit inventory shortage penalty cost. The system equations are as follows:
\begin{align*}
x_{k+1} &= \max(x_k + a_k Q - u_k, 0), \quad k = 0, 1, \ldots, \\
y_k &= x_k + v_k, \quad k = 0, 1, \ldots.
\end{align*}
The cost incurred in period $k$ is
\[ g_k(x_k, a_k, u_k) = h \max(x_k + a_k Q - u_k, 0) + s \max(u_k - x_k - a_k Q, 0). \]
We consider two objective functions, the average cost per period and the discounted total cost, given by
\[ \limsup_{H \to \infty} \frac{E\left[ \sum_{k=0}^H g_k \right]}{H}, \qquad \lim_{H \to \infty} E\left[ \sum_{k=0}^H \gamma^k g_k \right], \]
where $\gamma \in (0, 1)$ is the discount factor.

In the simulation, we first choose an exponential family and specify a grid mesh on its parameter space, then implement (4.6) and Algorithm 4.2 on the grid mesh, and use value iteration to solve for a policy. These steps are done offline. In an online run, Algorithm 4.1 (PPF) is used for filtering and making decisions with the policy obtained offline. We also consider a small variation of this method: instead of using the PPF, we use Algorithm 2.1 (PF) and do the density projection outside the filter each time.

We compare our two methods (called "Ours 1" and "Ours 2", respectively) to four other algorithms: (1) certainty equivalence using the mean estimate (CE-Mean); (2) certainty equivalence using the maximum likelihood estimate (CE-MLE); (3) the EKF-based parametric POMDP (EKF-PPOMDP) of [17]; (4) the greedy policy. Certainty equivalence methods treat the state estimate as the true state in the solution to the full-observation problem. We use the bootstrap filter to obtain the mean estimate and the MLE of the states for the certainty equivalence methods. The EKF-PPOMDP approximates the belief state by a Gaussian distribution, and uses the extended Kalman filter to estimate the transition of the belief state.
Similar to our method, it also solves a discretized MDP defined on the parameter space of the Gaussian density. The greedy policy chooses an action $a_k$ that attains the minimum in the expression $\min_{a_k \in A} E_{x_k, u_k}[g_k(x_k, a_k, u_k) \mid I_k]$.

Numerical experiments are carried out with the following settings:

- Problem parameters: initial inventory level $x_0 = 5$, holding cost $h = 1$, shortage penalty cost $s = 10$, fixed order amount $Q = 10$, random demand $u_k \sim \exp(5)$, discount factor $\gamma = 0.9$, inventory observation noise $v_k \sim N(0, \sigma^2)$ with $\sigma$ ranging from 0.1 to 3.3 in steps of 0.2.

- Algorithm parameters: the number of particles in both the plain particle filter and the projection particle filter is $N = 200$; the exponential family in the projection particle filter is chosen as the Gaussian family; the set of grid points on the projected belief space is $G = \{\text{mean} = [0:0.5:15], \text{standard deviation} = [0:0.2:5]\}$ for both our methods and the EKF-PPOMDP; one run of horizon length $H = 10^5$ for each average cost criterion case, and 1000 independent runs of horizon length $H = 40$ for each discounted total cost criterion case; nearest neighbor is used as the value function approximator in both our methods and the EKF-PPOMDP.

- Simulation issues: we use common random variables across the different policies and the different $\sigma$'s.

In order to implement the certainty equivalence methods, we use Monte Carlo simulation to find the optimal threshold policy for the fully observed problem (i.e., $y_k = x_k$): if the inventory level is below the threshold $L$, the store/warehouse should order to replenish its inventory; otherwise, if the inventory level is above $L$, the store/warehouse should not order. That is,
\[ a_k = \begin{cases} 0, & \text{if } x_k > L, \\ 1, & \text{if } x_k < L. \end{cases} \tag{4.27} \]
The simulation results indicate that both the average and discounted cost functions are convex in the threshold, and the minimum is achieved at $L = 7.7$ for both.
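The fully observed baseline can be checked with a short simulation. The sketch below estimates the long-run average cost of a threshold policy under the problem parameters listed above; the horizon, seed, and helper name are arbitrary choices, not from the text.

```python
import numpy as np

def avg_cost_threshold(L, horizon=100_000, Q=10.0, h=1.0, s=10.0, x0=5.0, seed=0):
    """Average cost per period of the fully observed inventory problem
    under the threshold policy (4.27): order iff the inventory is below L."""
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(horizon):
        a = 1 if x < L else 0                 # threshold policy
        u = rng.exponential(5.0)              # demand u_k with mean 5
        level = x + a * Q - u                 # inventory after demand
        total += h * max(level, 0.0) + s * max(-level, 0.0)  # holding + shortage
        x = max(level, 0.0)                   # excess demand is lost
    return total / horizon

cost_at_optimum = avg_cost_threshold(L=7.7)
```

With these parameters the estimate should land near the small-noise average costs of about 12.8 reported in Table 4.2 (a consistency check, not a proof), while a degenerate threshold such as $L = 0$ (never order) drives the cost toward $s \cdot E[u_k] = 50$ per period.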
Table 4.2 and Table 4.3 list the simulated average costs and discounted total costs using the different policies under different observation noises, respectively. Our methods generally outperform all the other algorithms under all observation noise levels. CE-Mean also performs very well, and slightly outperforms CE-MLE. The EKF-PPOMDP performs better in the average cost case than in the discounted cost case. The greedy policy is much worse than all the other algorithms. While our methods and the EKF-PPOMDP involve offline computation, the more critical online computation time of all the simulated methods is approximately the same.

For all the algorithms, the average cost/discounted cost increases as the observation noise increases. That is consistent with the intuition that we cannot perform better with less information. Fig. 4.1 shows 1000 actions taken by our method versus the true inventory levels in the average cost case (the discounted total cost case is similar and is omitted here). The dotted vertical line is the optimal threshold $L$ under full observation. Our algorithm yields a policy that picks actions very close to those of the optimal threshold policy when the observation noise is small (cf. Fig. 4.1(a)), indicating that our algorithm is indeed finding the optimal policy. As the observation noise increases, more of the actions picked by our policy violate the optimal threshold, which again shows the value of information in determining the actions.

The performances of our two methods are very close, with one slightly better than the other. Solely for the purpose of filtering, doing the projection outside the filter is easier to implement if we want to use a general exponential family, and also gives a better estimate of the belief state, since the projection error does not accumulate.
However, for solving POMDPs, we conjecture that the PPF works better in conjunction with the policy solved from the projected belief MDP, since the projected belief MDP assumes that the transition of the belief state is also projected; nevertheless, we do not know for certain which one is better. Our method outperforms the EKF-PPOMDP, mainly because the projection particle filter used in our method is better than the extended Kalman filter used in the EKF-PPOMDP for estimating the belief transition probabilities. This agrees with the results in [18], which also observed that Monte Carlo simulation of the belief transitions is better than the EKF estimate. Although the performance of the certainty equivalence method is comparable to that of our methods for this particular example, certainty equivalence is generally a suboptimal policy except in some special cases (cf. Section 6.1 in [10]), and it does not have a theoretical error bound. Moreover, using a certainty equivalence method requires solving the full-observation problem, which is itself very difficult in many cases.

[Figure 4.1: Our algorithm: actions taken for different inventory levels under different observation noise variances. Panels (a)-(d) plot the (state, action) pairs yielded by our method against the optimal threshold under full observation, for observation noise $\sigma$ = 0.1, 1.1, 2.1, and 3.1.]
In contrast, our method has a proven error bound on its performance, and works with the POMDP directly without having to solve the MDP problem under full observation.

4.8 Conclusions

We developed a method that effectively reduces the dimension of the belief space via the orthogonal projection of the belief states onto a parameterized family of probability densities. For an exponential family, the density projection has an analytical form and can be carried out efficiently. The exponential family is fully represented by a finite (small) number of parameters; hence the belief space is mapped to a low-dimensional parameter space, and the resultant belief MDP is called the projected belief MDP. The projected belief MDP can then be solved in numerous ways, such as by standard value iteration or policy iteration, to generate a policy. This policy is used in conjunction with the projection particle filter for online decision making. We analyzed the performance of the policy generated by solving the projected belief MDP in terms of the difference between the value function associated with this policy and the optimal value function of the POMDP. We also provided a bound on the error between our projection particle filter and exact filtering.

We applied our method to an inventory control problem, and it generally outperformed the other methods. When the observation noise is small, our algorithm yields a policy that picks actions very close to those of the optimal threshold policy for the fully observed problem. Although we only proved theoretical results for discounted cost problems, the simulation results indicate that our method also works well on average cost problems. We should point out that our method is also applicable to finite-horizon problems, and is suitable for large-state POMDPs in addition to continuous-state POMDPs.

Table 4.2: Optimal average cost estimates for the inventory control problem using different methods.
Each entry represents the average cost of a run of horizon $10^5$.

sigma   Ours 1   Ours 2   CE-Mean  CE-MLE   PPOMDP   Greedy
0.1     12.849   12.849   12.842   12.837   12.941   25.454
0.3     12.845   12.837   12.857   12.861   12.950   25.467
0.5     12.864   12.862   12.867   12.884   12.964   25.457
0.7     12.881   12.884   12.882   12.890   12.990   25.452
0.9     12.904   12.918   12.908   12.940   13.006   25.450
1.1     12.938   12.943   12.945   12.969   13.059   25.428
1.3     12.973   12.986   12.977   12.993   13.105   25.356
1.5     13.016   13.017   13.034   13.029   13.141   25.293
1.7     13.066   13.067   13.100   13.117   13.182   25.324
1.9     13.110   13.105   13.159   13.172   13.214   25.343
2.1     13.123   13.140   13.183   13.227   13.255   25.332
2.3     13.210   13.201   13.263   13.292   13.307   25.355
2.5     13.250   13.246   13.314   13.333   13.380   25.402
2.7     13.323   13.324   13.382   13.454   13.441   25.428
2.9     13.374   13.384   13.458   13.497   13.491   25.478
3.1     13.444   13.459   13.527   13.580   13.553   25.553

Table 4.3: Optimal discounted cost estimates for the inventory control problem using different methods. Each entry represents the discounted cost (mean, with standard error in parentheses) based on 1000 independent runs of horizon 40.
sigma   Ours 1         Ours 2         CE-Mean        CE-MLE         PPOMDP         Greedy
0.1     126.79 (1.64)  127.26 (1.63)  129.12 (1.81)  129.09 (1.81)  137.41 (1.65)  241.67 (2.99)
0.3     126.86 (1.63)  126.95 (1.63)  129.17 (1.78)  129.11 (1.78)  137.64 (1.62)  242.08 (2.98)
0.5     126.61 (1.63)  126.35 (1.62)  129.12 (1.77)  129.16 (1.78)  138.16 (1.60)  242.66 (2.98)
0.7     126.42 (1.62)  126.99 (1.61)  129.30 (1.77)  129.62 (1.79)  141.78 (1.55)  243.33 (2.98)
0.9     126.78 (1.63)  126.86 (1.63)  129.59 (1.76)  129.76 (1.78)  138.23 (1.60)  244.00 (2.97)
1.1     127.49 (1.64)  127.74 (1.63)  130.19 (1.77)  130.23 (1.75)  140.86 (1.57)  244.81 (2.97)
1.3     128.74 (1.65)  128.30 (1.64)  130.49 (1.76)  130.54 (1.72)  146.02 (1.52)  245.67 (2.96)
1.5     129.30 (1.68)  129.45 (1.66)  130.74 (1.75)  131.09 (1.77)  144.88 (1.52)  246.71 (2.95)

Table 4.4: Continuation of Table 4.3.

sigma   Ours 1         Ours 2         CE-Mean        CE-MLE         PPOMDP         Greedy
1.7     129.71 (1.67)  128.93 (1.67)  130.95 (1.76)  131.45 (1.77)  146.80 (1.52)  247.70 (2.96)
1.9     130.11 (1.69)  129.85 (1.67)  131.29 (1.75)  131.60 (1.73)  147.16 (1.56)  248.55 (2.93)
2.1     130.67 (1.69)  130.17 (1.67)  131.76 (1.74)  132.24 (1.79)  144.67 (1.54)  249.45 (2.95)
2.3     130.96 (1.68)  130.36 (1.67)  132.22 (1.75)  132.76 (1.78)  145.35 (1.55)  250.07 (2.97)
2.5     131.90 (1.68)  130.86 (1.68)  132.54 (1.76)  133.47 (1.78)  145.06 (1.58)  250.49 (2.96)
2.7     131.81 (1.68)  131.66 (1.68)  133.18 (1.75)  133.98 (1.78)  148.39 (1.54)  250.76 (2.96)
2.9     132.36 (1.68)  131.78 (1.68)  133.61 (1.75)  134.56 (1.83)  146.27 (1.57)  250.81 (2.96)
3.1     132.95 (1.70)  133.51 (1.70)  134.09 (1.76)  135.83 (1.79)  147.96 (1.54)  250.89 (2.95)
3.3     133.08 (1.69)  132.76 (1.69)  134.81 (1.76)  136.12 (1.84)  145.32 (1.60)  250.77 (2.94)

Chapter 5

Particle Filtering Framework for Optimization

5.1 Related Work and Motivation

We consider the global optimization problem
\[ x^* = \arg\max_{x \in X} H(x), \]
and assume that it has a unique global optimal solution $x^*$.
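As a toy instance of this problem class, the objective below (an illustrative construction, not taken from the text) has many local maxima but a unique global maximizer, which a dense grid search can locate for reference:

```python
import numpy as np

def H(x):
    """Illustrative 1-D multimodal objective: the cos(5x) term creates many
    local maxima, and the quadratic penalty makes x* = 0 the unique global one."""
    return np.cos(5 * x) - 0.1 * x ** 2

# Brute-force reference solution on a dense grid over the solution space X
xs = np.linspace(-10.0, 10.0, 200_001)
x_star = xs[np.argmax(H(xs))]
```

Any local ascent started outside the central basin stalls on one of the side peaks; this is exactly the difficulty that the model-based methods reviewed next attempt to overcome.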
Many simulation-based global optimization methods, such as the estimation of distribution algorithms (EDAs) [64] [59], the cross-entropy (CE) method [78] [79], and the model reference adaptive search (MRAS) method [41], fall into the category of "model-based methods" as classified by [98]. They share the common characteristic of iteratively repeating the following two steps:

- Generate candidate solutions from an intermediate distribution over the solution space;
- Update the intermediate distribution using the candidate solutions.

This intermediate distribution is often referred to as a probabilistic model, since it often imposes a model on the relationship between the components that are needed to represent a solution. The choice and updating of the probabilistic model (or intermediate distribution) play a key role in determining the efficiency and accuracy of the algorithm.

EDAs were first proposed in [64] in the field of evolutionary computation, with the goal of eliminating the mutation and cross-over operations of genetic algorithms (GAs) in order to avoid partial solutions. EDAs generate offspring by sampling from a distribution over the solution space that is estimated from the candidate solutions of the previous iteration. The estimation of this distribution is often based on a probabilistic model that explicitly expresses the relationship between the underlying variables [53].

The cross-entropy (CE) method was originally introduced for estimating the probabilities of rare events in complex stochastic networks [77], and was later modified slightly to solve combinatorial and continuous optimization problems [78]. A key idea of the CE method is to minimize the Kullback-Leibler (KL) divergence between a desired density (the optimal importance sampling density) and a family of parameterized densities, in particular an exponential family, since an analytical solution can be computed in this case.

The MRAS method was introduced in [41].
MRAS implicitly constructs a sequence of reference distributions and uses this sequence to facilitate and guide the updating of the parameters associated with a family of parameterized distributions. At each iteration, candidate solutions are sampled from the distribution (in the prescribed family) that has the minimum KL divergence with respect to the reference distribution of the previous iteration.

The aforementioned various ways of updating the probabilistic model motivate us to look for a unifying and systematic approach to probabilistic model-based methods for optimization. Our main idea is to transform the optimization problem into a filtering problem. The goal of filtering is to estimate the unobserved state in a dynamic system through a sequence of noisy observations of the state. The unobserved state corresponds to the optimal solution to be estimated; the noisy observations in filtering bring randomization into the optimization algorithm; and the conditional distribution of the unobserved state is a distribution over the solution space, which approaches a delta function concentrated on the optimal solution as the system evolves. Hence, the task of searching for the optimal solution is carried out through the procedure of estimating the conditional density sequentially. This idea is only conceptual, since filtering can rarely be solved analytically, and approximate filtering methods are needed. We apply particle filtering to solve the transformed filtering problem. Based on particle filtering, we propose a plain particle filtering framework and a general particle filtering framework for optimization. The former framework is a special case of the latter and more intuitive, while the latter is a generalization and hence provides more opportunities for developing new algorithms. To the best of our knowledge, this is the first work applying particle filtering to the field of optimization.
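The two-step loop shared by these model-based methods can be sketched in CE style: sample candidates from a parameterized (here Gaussian) model, keep an elite subset, and refit the model to the elites; for an exponential family this refit is exactly the minimum-KL update mentioned above, which reduces to moment matching. The toy objective, elite count, smoothing, and iteration budget below are illustrative choices, not prescribed by the text.

```python
import numpy as np

def H(x):
    # illustrative multimodal objective with unique global maximum at x* = 0
    return np.cos(5 * x) - 0.1 * x ** 2

rng = np.random.default_rng(3)
mu, sigma = 0.0, 10.0             # parameters of the Gaussian probabilistic model
n, n_elite, alpha = 500, 50, 0.7  # samples per iteration, elite count, smoothing

x_best, h_best = None, -np.inf    # best candidate seen so far
for _ in range(60):
    xs = rng.normal(mu, sigma, size=n)      # step 1: generate candidate solutions
    hs = H(xs)
    order = np.argsort(hs)
    if hs[order[-1]] > h_best:
        x_best, h_best = xs[order[-1]], hs[order[-1]]
    elites = xs[order[-n_elite:]]           # best-performing candidates
    # Step 2: update the model. For a Gaussian family, the minimum-KL fit to
    # the elite samples is moment matching, smoothed to avoid premature collapse.
    mu = alpha * elites.mean() + (1 - alpha) * mu
    sigma = alpha * elites.std() + (1 - alpha) * sigma
```

The model concentrates around a high-quality solution as `sigma` shrinks; tracking the best candidate ever sampled guards against the model collapsing onto a local peak.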
The particle filtering framework unifies EDAs, the CE method, and MRAS, and provides new insights into the three optimization methods from another viewpoint. More specifically, EDAs and the CE method fit in the plain particle filtering framework, and in particular, the CE method corresponds to the projection particle filtering described in section 4.3. EDAs and the CE method differ only in their ways of constructing an approximation for the conditional density based on the samples/particles. The MRAS method fits in the general particle filtering framework, with a specific way to construct the resampling importance density.

The particle filtering framework also sheds light on developing new improved algorithms for global optimization. The possibilities for new algorithms come from the freedom in the particle filtering framework, as well as the vast array of techniques for improving nonlinear filtering and particle filtering. We focus on three promising directions: adjusting the trade-off between exploration and exploitation in the search by appropriately choosing the importance densities in particle filtering; incorporating gradient-based local search into simulation-based global search by including the gradient term of the objective function in the state-space model of the filtering problem; and preventing premature convergence in simulation-based optimization methods by introducing the idea of "persistent excitation" to optimization.

5.2 Filtering for Optimization

We consider the global optimization problem:

    x* = argmax_{x ∈ X} H(x),        (5.1)

where the solution space X is a nonempty set in R^n, and H(·) : X → R is a real-valued function that is bounded, i.e., there exist M_1 > −∞ and M_2 < ∞ such that M_1 ≤ H(x) ≤ M_2 for all x ∈ X. We assume that (5.1) has a unique global optimal solution, i.e., there exists x* ∈ X such that H(x) < H(x*) for all x ≠ x*, x ∈ X.

The optimization problem (5.1) can be transformed into a filtering problem by constructing an appropriate state-space model.
Let the state-space model be

    x_k = x_{k-1},          k = 1, 2, ...,
    y_k = H(x_k) − v_k,     k = 0, 1, ...,        (5.2)

where x_k ∈ R^n is the unobserved state to be estimated, y_k ∈ R is the observation, v_k ∈ R is the observation noise, which is an i.i.d. sequence, and the initial state is x_0 = x*, which is unobserved. We assume that v_k has a p.d.f. φ(·) that is nondecreasing on its support.

As shown before, the conditional density b_k(x_k) ≜ p(x_k | y_{1:k}) is updated recursively according to

    b_k(x_k) ∝ p(y_k | x_k) ∫ p(x_k | x_{k-1}) b_{k-1}(x_{k-1}) dx_{k-1}.        (5.3)

For the above state-space model (5.2), the transition density is

    p(x_k | x_{k-1}) = δ(x_k − x_{k-1}),        (5.4)

where δ denotes the Dirac delta function. The likelihood function is

    p(y_k | x_k) = φ(H(x_k) − y_k).        (5.5)

Substituting (5.4) and (5.5) into (5.3), we obtain

    b_k(x_k) = φ(H(x_k) − y_k) b_{k-1}(x_k) / ∫ φ(H(x_k) − y_k) b_{k-1}(x_k) dx_k.        (5.6)

The connection between the filtering problem (5.2) and the optimization problem (5.1) appears in several places. The underlying value of the unobserved state x_k is x*, and hence, the goal of filtering (to estimate x_k) is the same as that of optimization (to find x*). In filtering, at each time k, we estimate a conditional density b_k of x_k. In optimization, b_k is a density over the solution space, interpreted as our belief about the possible values that x* might take, and it serves as the intermediate distribution for drawing new candidate solutions at the next iteration. At each iteration k, b_k is updated according to (5.6) with an incoming new observation y_k, which reveals new information about x*. The noise in y_k, or in other words, a nondegenerate p.d.f. φ, brings randomization into the optimization algorithm. We can choose φ to determine how the conditional density (i.e., b_{k-1}) is tuned by the performance of the solutions to yield a new conditional density (i.e., b_k) at the next iteration.
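As a concrete illustration, the exact recursion (5.6) can be run numerically on a discretized one-dimensional solution space, where the integral in the denominator reduces to a sum. This is a minimal sketch, not from the thesis: the objective H, the noise density φ, and the observation values are illustrative assumptions.

```python
import numpy as np

# Discretized solution space; the integral in (5.6) becomes a sum over X.
X = np.linspace(-2.0, 2.0, 401)          # grid over the solution space
H = lambda x: -(x - 0.7) ** 2            # toy objective with maximizer x* = 0.7
phi = lambda u: np.exp(u)                # an increasing phi favors better solutions

b = np.full(X.shape, 1.0 / X.size)       # b_0: uniform prior over X
for y in (-1.0, -0.5, -0.1):             # a nondecreasing observation sequence {y_k}
    b = phi(H(X) - y) * b                # numerator of (5.6)
    b /= b.sum()                         # normalization = the integral in (5.6)

x_hat = float(X[np.argmax(b)])           # the mode of b_k concentrates near x* = 0.7
```

Each pass through the loop reweights the density by the likelihood φ(H(x) − y_k), so mass accumulates where H is largest, exactly as the text describes.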
For example, if φ(x) is an increasing function of x, the likelihood function (5.5) assigns more weight to the candidate solutions that have better performance; if φ(x) = 0 for x < 0, then (5.5) discards inferior candidate solutions whose performance is worse than y_k. It should be expected that if y_k increases with k, the conditional density b_k will get closer to the density of x_k, i.e., a Dirac delta function concentrated on x*. From the viewpoint of filtering, b_k is the posterior density of x_k that approaches the density of x_k. From the optimization viewpoint, b_k is a density defined on the solution space that becomes more and more concentrated on the optimal solution as k increases. Fig. 5.1 illustrates how b_k changes in the first three iterations: at k = 0, there is no prior knowledge of x* (the true value of x_k), and hence, b_0 is uniform over the solution space; at k = 1, as observation y_0 provides new information about x*, b_1 has more weight on the region of the solution space centered around x*; at k = 2, b_2 becomes even more concentrated on x*.

Figure 5.1: At the first three iterations, the conditional density b_k becomes more and more concentrated on the optimal solution.

In summary, solving the maximization problem (5.1) is equivalent to recursively estimating b_k of the model (5.2) while constructing an increasing sequence of observations {y_k}.

5.3 Particle Filtering Framework for Optimization

The filtering idea is only conceptual, and the resulting problem is not analytically solvable. Hence, we need approximation methods to solve the above filtering problem. We apply the plain particle filter and the general particle filter to the filtering problem, with a slight tweak to adapt them to the optimization problem. The application of particle filtering yields a framework encompassing many simulation-based optimization algorithms.
In the following, we present the plain particle filter framework for optimization (PPFO), and then the general particle filter framework for optimization (GPFO). The former framework is a special case of the latter and provides more intuition, while the latter framework is more general and allows more variations in the development of new algorithms.

Algorithm 5.1 Plain Particle Filter Framework for Optimization (PPFO)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. b_0 that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from b_0. Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. Set b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where the weights are calculated according to w_k^i ∝ φ(H(x_k^i) − y_k), i = 1, 2, ..., N, and normalized.

4. Resampling. Construct a continuous approximation b̃_k(x_k) from b̂_k(x_k). Sample {x_{k+1}^i}_{i=1}^N i.i.d. from b̃_k(x_k).

5. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

At initialization, the PPFO algorithm draws samples from an initial distribution b_0 that is defined on X. A parameter ρ is specified to determine the (1−ρ)-quantile samples that will be used to construct a nondecreasing observation sequence {y_k}. The requirement of nondecreasingness is to ensure that the observation sequence does not become worse as the algorithm proceeds. Since the transition probability is 1, the importance sampling step is omitted, with a suitable change of the indices. The Bayes' updating step assigns weights to the samples according to their performance. Slightly different from the plain particle filter, the resampling step here first constructs a continuous density b̃_k, since the discrete approximation b̂_k does not provide any new samples. The new samples drawn from b̃_k are more concentrated in the promising areas than the old samples.
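The steps of Algorithm 5.1 can be sketched in code. This is a hedged illustration, not the thesis implementation: the continuous approximation b̃_k is taken to be a single Gaussian fitted to the weighted particles, and φ, N, ρ, and the iteration budget are illustrative choices.

```python
import numpy as np

def ppfo(H, n, N=200, rho=0.1, iters=60, seed=0):
    """Sketch of Algorithm 5.1 (PPFO) for maximizing H over R^n."""
    rng = np.random.default_rng(seed)
    phi = lambda u: np.exp(np.minimum(u, 0.0))   # a nondecreasing weight function
    x = rng.normal(0.0, 2.0, size=(N, n))        # step 1: sample from b_0
    y_prev = -np.inf
    for _ in range(iters):
        h = H(x)
        y = max(np.quantile(h, 1.0 - rho), y_prev)    # step 2: nondecreasing y_k
        w = phi(h - y)                                # step 3: Bayes' updating
        w /= w.sum()
        mu = w @ x                                    # step 4: fit a Gaussian b~_k
        cov = (w[:, None] * (x - mu)).T @ (x - mu) + 1e-9 * np.eye(n)
        x = rng.multivariate_normal(mu, cov, size=N)  # ...and resample from it
        y_prev = y
    return mu

# e.g., maximizing H(x) = -||x - 1||^2 drives the particles toward (1, ..., 1)
x_best = ppfo(lambda x: -np.sum((x - 1.0) ** 2, axis=1), n=2)
```

Fitting a single Gaussian is only one choice of b̃_k; as discussed later, kernel density estimates or Gaussian mixtures fit the same slot with more exploration.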
To adapt to optimization, a stopping step is added to the algorithm, whereas in filtering, the algorithm keeps running as the real system operates.

Similarly, by applying the general particle filter, we introduce the general particle filtering framework for optimization as follows:

Algorithm 5.2 A General Particle Filtering Framework for Optimization (GPFO)

1. Initialization. Specify ρ ∈ (0, 1], a nonnegative nonincreasing sequence {ε_k}, and an initial p.d.f./p.m.f. b_0 that is defined on X. Sample {x_0^i}_{i=1}^N i.i.d. from b_0. Set k = 1.

2. Importance Sampling. Sample x_k^i from q_k(x_k | x_{k-1}^i, y_k), i = 1, ..., N.

3. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

4. Bayes' Updating. Set b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where the normalized weights are calculated as

    w_k^i ∝ φ(H(x_k^i) − y_k) b_{k-1}(x_{k-1}^i) / [ q_k(x_k^i | x_{k-1}^i, y_k) g_{k-1}(x_{k-1}^i | y_{0:k-1}) ].

5. Importance Resampling. Sample {x_k^i}_{i=1}^N i.i.d. from g_k(x_k | y_{0:k}).

6. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

5.4 Interpretation of EDAs, CE, and MRAS

In this section, we use the particle filtering framework to interpret some of the existing optimization algorithms: estimation of distribution algorithms (EDAs), the cross-entropy (CE) method, and model reference adaptive search (MRAS). A class of EDAs fits in the plain particle filtering framework. The CE method can be viewed as projection particle filtering, which also fits in the plain particle filtering framework with a specific way to construct a continuous approximation of the conditional density, namely the density projection approach. The MRAS method fits in the general particle filtering framework, where the resampling importance density is constructed via density projection. This interpretation also provides another insight into the relationships between the three methods, in addition to the view in [41].
Specifically, the main difficulty in EDAs is to estimate a distribution from the samples, and this difficulty is resolved in the CE method by projecting the empirical distribution of the samples to obtain an approximate continuous density. However, the density projection introduces an error, which is corrected in the MRAS method by taking the projected density as a resampling importance density and hence weighting the samples differently from the CE method.

In all three methods, the most common sample selection scheme is the so-called truncated selection [93], which selects the elite samples whose performance is above a threshold. In the following, we will focus on the truncated selection scheme.

Recall that in the optimization problem (5.1), the objective function H(x) is bounded: M_1 ≤ H(x) ≤ M_2. In the state-space model (5.2), let the observation noise v_k follow a uniform distribution U(0, M_2 − M_1). Since v_k = H(x_k) − y_k, the likelihood function is

    p(y_k | x_k) = 1/(M_2 − M_1),  if 0 ≤ H(x_k) − y_k ≤ M_2 − M_1;
                   0,              otherwise.        (5.7)

Since y_k = H(x) for some x ∈ X, the inequality H(x_k) − y_k ≤ M_2 − M_1 always holds. Hence, (5.7) can be written more compactly as

    p(y_k | x_k) = (1/(M_2 − M_1)) I{H(x_k) ≥ y_k},        (5.8)

where I{·} denotes the indicator function. Substituting (5.8) into (5.6), we obtain

    b_k(x_k) = I{H(x_k) ≥ y_k} b_{k-1}(x_k) / ∫ I{H(x_k) ≥ y_k} b_{k-1}(x_k) dx_k.        (5.9)

With i.i.d. samples {x_k^i}_{i=1}^N drawn from b_{k-1}, b_k(x_k) can be approximated by

    b̂_k(x_k) = Σ_{i=1}^N I{H(x_k^i) ≥ y_k} δ(x_k − x_k^i) / Σ_{i=1}^N I{H(x_k^i) ≥ y_k}.        (5.10)

Thus, (5.9) is equivalent to selecting the elite solutions to tune the sampling distribution of the previous iteration, and (5.10) is the Monte Carlo version of (5.9). These two equations are the cornerstone of the rest of this section.

5.4.1 Estimation of Distribution Algorithms

EDAs are a class of optimization algorithms based on the key idea of iteratively performing the following two steps:

1. Select elite samples from a pool of samples that are generated from a probability distribution;

2.
Estimate the probability distribution of the selected samples and generate new samples from it.

With the truncated selection scheme, one class of EDAs can be viewed as an instantiation of the plain particle filtering framework as follows:

Algorithm 5.3 Instantiation 1 of the Plain Particle Filter Framework for Optimization (PPFO1)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. b_0 that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from b_0. Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. The discrete approximation b̂_k(x_k) is given by (5.10).

4. Resampling. Estimate a continuous approximation b̃_k(x_k) from b̂_k(x_k). Sample {x_{k+1}^i}_{i=1}^N i.i.d. from b̃_k(x_k).

5. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

It is obvious that the observation construction and Bayes' updating steps essentially select the elite samples according to the truncated selection scheme, corresponding to step 1 in EDAs; and the resampling step corresponds to step 2 in EDAs.

The main difficulty in EDAs is to estimate the distribution of the selected samples. In doing so, EDAs often take into account the interaction between the underlying variables that represent a solution, and express the interaction explicitly through the use of different probabilistic models. One way is to use a dynamic Bayesian network (DBN) to represent such a probabilistic model and infer the underlying probability distribution [53]. Put in our context, the relationship between the components of the state vector x_k is expressed through the use of a DBN, and the estimate of the joint distribution of the components is b̃_k(x_k). Interestingly, there is a particular particle filter designed especially for DBNs [49], which samples x_k according to the relationship between its components so that the sampling is more efficient.
This particle filter can be adopted to improve EDAs that use the DBN representation.

5.4.2 Cross-Entropy Method

The standard CE method (we use the word "standard" to distinguish it from the extended version of standard CE [26]) for the optimization problem (5.1) is as follows:

Algorithm 5.4 Standard CE Algorithm for Optimization

1. Choose an initial p.d.f./p.m.f. f(·; θ_0), θ_0 ∈ Θ. Specify the parameter ρ ∈ (0, 1], and set k = 1.

2. Generate samples {x_k^i}_{i=1}^N from the density f(·; θ_{k-1}) and compute the sample (1−ρ)-quantile y_k of the performances {H(x_k^i)}_{i=1}^N.

3. Compute the new parameter according to

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ).        (5.11)

4. If a stopping criterion is satisfied, then terminate; else, set k = k + 1 and go to step 2.

Equation (5.11) comes from the density projection of the optimal importance sampling density onto a parameterized family of densities {f(·; θ), θ ∈ Θ}. Projection particle filtering [95] also uses the density projection technique, but for a very different reason. It projects the discrete approximation b̂_k onto the parameterized family {f(·; θ), θ ∈ Θ} in order to obtain a continuous approximation b̃_k that is characterized by only a few parameters, which is very useful in reducing the complexity of dynamic programming in a decision making problem. Specifically, projection particle filtering chooses a value of the parameter θ such that the Kullback-Leibler (KL) divergence between b̂_k and f(·; θ) is minimized. The KL divergence between b̂_k and f(·; θ) is

    D_KL(b̂_k ‖ f(·; θ)) = ∫ b̂_k log (b̂_k / f(·; θ)) = ∫ b̂_k log b̂_k − ∫ b̂_k log f(·; θ).

Since the first term does not depend on θ, minimizing the above expression is equivalent to solving the maximization problem

    max_{θ ∈ Θ} E_{b̂_k}[log f(·; θ)].

Since b̂_k(x_k) satisfies (5.10), the above maximization problem can be approximated by

    max_{θ ∈ Θ} Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ) / Σ_{i=1}^N I{H(x_k^i) ≥ y_k},

which is equivalent to

    max_{θ ∈ Θ}
(1/N) Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ).        (5.12)

Therefore, the optimization algorithm adapted from projection particle filtering is as follows:

Algorithm 5.5 Instantiation 2 of the Plain Particle Filter Framework for Optimization (PPFO2)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. f(x_0; θ_0) that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from f(x_0; θ_0). Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. The discrete approximation b̂_k(x_k) is given by (5.10).

4. Resampling. Construct a continuous approximation b̃_k(x_k) = f(x_k; θ_k), where

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ).        (5.13)

Sample {x_{k+1}^i}_{i=1}^N i.i.d. from b̃_k(x_k).

5. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

It is easy to see that this algorithm is essentially the same as the standard CE algorithm (note that (5.11) and (5.13) are exactly the same). Compared with EDAs, the CE method avoids complicated estimation of the density b_k through the use of density projection, without the need to specify the relationships among the components of x_k. However, from a filtering viewpoint, the projection particle filtering introduces a projection error that accumulates over iterations. The reason can be seen by scrutinizing the one-step evolution of the approximate density. Since the samples {x_k^i}_{i=1}^N are drawn from b̃_{k-1} = f(·; θ_{k-1}), the density that the algorithm actually tries to approximate at iteration k is

    b_k(x_k) = I{H(x_k) ≥ y_k} f(x_k; θ_{k-1}) / ∫ I{H(x_k) ≥ y_k} f(x_k; θ_{k-1}) dx_k.

Compared with the original equation (5.9) for b_k, b_{k-1} is replaced by its approximation f(·; θ_{k-1}), which introduces a projection error that is carried over to the next iteration. This projection error can be corrected by taking f(·; θ_{k-1}) as an importance density, so that it is taken care of by the weights of the samples.
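The contrast between the CE update (5.13) and the importance-density correction just described can be made concrete for a diagonal Gaussian family f(·; θ) = N(μ, diag(σ²)): with the indicator weights of (5.10), the maximizer of (5.13) is simply the sample mean and variance of the elite particles, while the corrected (MRAS-style) update additionally divides each elite sample by its previous sampling density. This is a hedged sketch with illustrative names and toy values, not the thesis code.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Diagonal Gaussian density f(x; theta) with theta = (mu, sigma2)."""
    z = ((x - mu) ** 2 / sigma2).sum(axis=-1)
    return np.exp(-0.5 * z) / np.sqrt((2.0 * np.pi) ** x.shape[-1] * np.prod(sigma2))

def ce_projection(x, h, y):
    """CE update (5.13): equal weights on elites -> elite mean and variance."""
    elite = x[h >= y]
    return elite.mean(axis=0), elite.var(axis=0)

def corrected_projection(x, h, y, mu_prev, s2_prev):
    """Projection with importance correction: elites divided by f(.; theta_{k-1})."""
    w = (h >= y).astype(float) / gaussian_pdf(x, mu_prev, s2_prev)
    w /= w.sum()
    mu = w @ x
    s2 = w @ (x - mu) ** 2
    return mu, s2

x = np.array([[0.0], [1.0], [2.0], [3.0]])
h = np.array([0.0, 1.0, 2.0, 3.0])             # e.g. H(x) = x on these samples
mu_ce, s2_ce = ce_projection(x, h, y=2.0)      # elites are x = 2 and x = 3
mu_c, s2_c = corrected_projection(x, h, y=2.0,
                                  mu_prev=np.zeros(1), s2_prev=np.ones(1))
```

In this toy example the elite at x = 3 was unlikely under the previous density N(0, 1), so the corrected update up-weights it and pulls the new mean beyond the plain elite average.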
This leads to another instantiation of the particle filtering framework, which coincides with the instantiation of MRAS developed in [41].

5.4.3 Model Reference Adaptive Search

As in the CE method, at each iteration the MRAS method projects a desired density onto a family of parameterized densities to yield a density from which the candidate solutions are drawn. In the CE method the target distribution is a single optimal importance sampling density, whereas in the MRAS method, the parameter updating is guided by an implicit sequence of distributions called the reference distributions. Here, we consider the Monte Carlo version of the MRAS_0 algorithm with the truncated selection scheme presented in [41]:

Algorithm 5.6 Model Reference Adaptive Search Method with Truncated Selection Scheme

1. Choose an initial p.d.f./p.m.f. f(·; θ_0), θ_0 ∈ Θ. Specify the parameter ρ ∈ (0, 1], and set k = 1.

2. Generate samples {x_k^i}_{i=1}^N from the density f(·; θ_{k-1}) and compute the sample (1−ρ)-quantile y_k of the performances {H(x_k^i)}_{i=1}^N.

3. Compute the new parameter according to

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N [I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1})] log f(x_k^i; θ).        (5.14)

4. If a stopping criterion is satisfied, then terminate; else, set k = k + 1 and go to step 2.

Equation (5.14) comes from the projection of the implicit reference distribution onto the family of parameterized densities. In the particle filtering framework, we will see that the sequence of reference distributions is the sequence of approximated conditional densities {b̂_k}, which guide the design of the resampling importance densities {f(·; θ_k)}. Specifically, let the initial p.d.f./p.m.f. be

    b_0(x_0) = I{H(x_0) ≥ y_0} / ∫ I{H(x_0) ≥ y_0} dx_0.

Since {y_k} is a nondecreasing sequence, using the recursive equation (5.9), it can be shown by induction that

    b_k(x_k) = I{H(x_k) ≥ y_k} / ∫ I{H(x_k) ≥ y_k} dx_k.        (5.15)

Suppose the resampling importance density g_{k-1} is

    g_{k-1}(x_{k-1} | y_{0:k-1}) = f(x_{k-1}; θ_{k-1}),

from which the i.i.d.
samples {x_k^i}_{i=1}^N are drawn; then the weight of each x_k^i is

    w_k^i = b_k(x_k^i) / g_{k-1}(x_{k-1}^i | y_{0:k-1}) ∝ I{H(x_k^i) ≥ y_k} / f(x_{k-1}^i; θ_{k-1}).        (5.16)

As shown before, projection of b̂_k(x_k) onto the parameterized family of densities {f(·; θ), θ ∈ Θ} is equivalent to the maximization problem

    max_{θ ∈ Θ} E_{b̂_k}[log f(·; θ)].        (5.17)

Since b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where w_k^i satisfies (5.16), and x_{k-1}^i = x_k^i, (5.17) can be approximated by

    max_{θ ∈ Θ} Σ_{i=1}^N [I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1})] log f(x_k^i; θ).

Thus, the algorithm is as follows:

Algorithm 5.7 Instantiation 1 of the General Particle Filter Framework for Optimization (GPFO1)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. f(x_0; θ_0) that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from f(x_0; θ_0). Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. Set b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where the weights are

    w_k^i ∝ I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1}).

4. Resampling. Construct a resampling importance density g_k(x_k | y_{0:k}) = f(x_k; θ_k), where

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N [I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1})] log f(x_k^i; θ).        (5.18)

Sample {x_{k+1}^i}_{i=1}^N i.i.d. from g_k(x_k | y_{0:k}).

5. Stopping. If a specified stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

Note that (5.18) is exactly the same as the updating equation (5.14) in MRAS, and Algorithm 5.7 and Algorithm 5.6 are essentially the same.

5.5 Implications for New Algorithms

The particle filtering framework for optimization not only provides a unifying framework for some of the existing algorithms, but also opens up the possibility for new algorithms. There is a considerable amount of freedom in the framework, such as the choices of the observation noise, the observation sequence, and the sampling and resampling importance densities, each of which can lead to a different algorithm.
In addition, many of the techniques that have been used to improve particle filtering can also be adapted to optimization, such as the improvement of EDAs with the particle filter for DBNs mentioned in the last section. In the rest of this section, we will focus on three promising directions for developing new improved algorithms under this framework.

5.5.1 Balancing Exploration and Exploitation

Proper sampling and resampling importance densities can be chosen to adjust the trade-off between exploitation and exploration. Construction of the resampling importance density using the kernel method for density estimation [65], or approximation with a Gaussian mixture [50], is very easy to implement, and the obtained continuous distributions are easy to sample from. They add more exploration of the solution space, compared to the single Gaussian density that is often used in CE and MRAS. A Markov chain Monte Carlo (MCMC) step can be added after resampling [32] to further adjust the trade-off between exploitation and exploration, or to add some local search.

5.5.2 Combining Global Search with Local Search

We can incorporate local/gradient search into global search systematically by including the gradient of the objective function in the state-space model. First, assume that the objective function H(x) is differentiable with respect to x, and denote its gradient by

    G(x) ≜ ∇H(x).        (5.19)

Let the state-space model be

    x_k = x_{k-1} + λ_k G(x_{k-1}),    k = 1, 2, ...,
    y_k = H(x_k) − v_k,                k = 0, 1, ...,        (5.20)

where {λ_k} is a sequence of properly chosen step sizes. The intuition behind model (5.20) is that the unobserved state x_k has a stationary underlying value x* if the initial condition is x_0 = x*. Since G(x*) = 0, the state stays at x* for all time k if it starts at x*. For this model, the propagation step in the particle filtering framework moves each sample along its gradient.
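The propagation step implied by model (5.20) simply moves every particle along the gradient before reweighting. A minimal sketch, with an illustrative objective and step size:

```python
import numpy as np

def propagate(x, G, lam):
    """One propagation step of model (5.20): x_k = x_{k-1} + lam * G(x_{k-1})."""
    return x + lam * G(x)

# e.g., for H(x) = -||x||^2 the gradient is G(x) = -2x, so each particle
# takes a gradient-ascent step toward the maximizer x* = 0
G = lambda x: -2.0 * x
x = np.array([[1.0, 1.0], [-2.0, 0.5]])
x_next = propagate(x, G, lam=0.25)
```

With lam = 0.25 each coordinate is halved, i.e., every particle moves halfway toward the stationary point of this toy objective.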
Therefore, the resulting algorithm incorporates a gradient-based local search into the simulation-based global search.

5.5.3 Overcoming Premature Convergence

Many simulation-based optimization algorithms suffer from the problem of premature convergence, i.e., they converge so fast that they get stuck at a local optimum. This problem is especially severe if the objective function has many local optima. A similar phenomenon often occurs in filtering for a static state, also called system identification: the algorithm converges so fast that it converges to a wrong value. To avoid premature convergence, many system identification algorithms employ the idea of "persistent excitation", which adds an artificial noise to the static system dynamics to maintain a continuous exploration of the state space [48] [58]. The artificial noise is large at the beginning to prevent premature convergence to a wrong value, and gradually dies down as the estimate converges to the true value. Hence, in the state-space model (5.2), the state equation becomes

    x_k = x_{k-1} + α_k ω_k,    k = 1, 2, ...,        (5.21)

where

    α_k = α_1 β^{k-1},    β ∈ (0, 1).

From the optimization viewpoint, the propagation step moves each candidate solution around randomly with decreasing randomness at each iteration, which allows for more exploration at the beginning. This idea is very effective in preventing premature convergence in optimization algorithms, as shown in the next section.

5.6 Numerical Experiments

The standard CE method often suffers from premature convergence due to the quick convergence of the parameterized family of distributions f(·; θ_k) to a degenerate (Dirac) measure [26]. A smoothed parameter updating procedure has been used to address this problem [26], i.e., a smoothed version of the distribution parameter θ_k is computed at each iteration k according to

    θ̃_k = ν θ_k + (1 − ν) θ̃_{k-1},    ν ∈ (0, 1),

where ν
is the smoothing parameter, and θ̃_{k-1} is the smoothed distribution parameter at iteration k − 1.

Since the standard CE method can be included in the particle filtering framework, it is straightforward to apply "persistent excitation" to the standard CE method. Hence, instead of smoothing the parameter, at each iteration we randomly move the candidate solutions according to (5.21) before updating the parameterized density. In the following, we numerically compare our proposed method (abbreviated as CEA) and the CE method with smoothing on some benchmark problems, which have been previously studied in [67] [51] [41]. The test functions are:

(1) Rosenbrock function (n = 20)

    H(x) = Σ_{i=1}^{n-1} [100 (x_{i+1} − x_i^2)^2 + (x_i − 1)^2],

where x* = (1, ..., 1)^T, H(x*) = 0.

(2) Dejong's 5th function (n = 2)

    H(x) = [0.002 + Σ_{j=1}^{25} 1 / (j + Σ_{i=1}^{2} (x_i − a_{j,i})^6)]^{-1},

where a_{j,1} = {−32, −16, 0, 16, 32, −32, −16, 0, 16, 32, −32, −16, 0, 16, 32, −32, −16, 0, 16, 32, −32, −16, 0, 16, 32}, a_{j,2} = {−32, −32, −32, −32, −32, −16, −16, −16, −16, −16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 16, 32, 32, 32, 32, 32}, with 24 local minima and one global minimum at x* = (−32, −32)^T, H(x*) ≈ 0.998.

(3) Powell singular function (n = 20)

    H(x) = Σ_{i=2}^{n-2} [(x_{i-1} + 10 x_i)^2 + 5 (x_{i+1} − x_{i+2})^2 + (x_i − 2 x_{i+1})^4 + 10 (x_{i-1} − x_{i+2})^4],

where x* = (0, ..., 0)^T, H(x*) = 0.

(4) Pintér's function (n = 20)

    H(x) = Σ_{i=1}^{n} i x_i^2 + Σ_{i=1}^{n} 20 i sin^2(x_{i-1} sin x_i − x_i + sin x_{i+1})
           + Σ_{i=1}^{n} i log_10(1 + i (x_{i-1}^2 − 2 x_i + 3 x_{i+1} − cos x_i + 1)^2),

where x_0 = x_n, x_{n+1} = x_1, x* = (0, ..., 0)^T, H(x*) = 0.

(5) Shekel's function (n = 4)

    H(x) = −Σ_{i=1}^{5} ((x − a_i)^T (x − a_i) + c_i)^{-1},

where a_1 = (4, 4, 4, 4)^T, a_2 = (1, 1, 1, 1)^T, a_3 = (8, 8, 8, 8)^T, a_4 = (6, 6, 6, 6)^T, a_5 = (3, 7, 3, 7)^T, and c = (0.1, 0.2, 0.2, 0.4, 0.4)^T; x* ≈ (4, 4, 4, 4)^T, H(x*) ≈ −10.153.

Dejong's 5th function and Shekel's function are relatively low dimensional and have only a few local optima, but the optima are separated by plateaus. The Rosenbrock function and the Powell singular function are 20-dimensional badly scaled problems.
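The ingredients of the experiments can be sketched in code: CEA's random perturbation follows (5.21) with α_1 = 10 and β = 0.95 (the values used below), and two of the test functions follow directly from the formulas above. This is a hedged, illustrative sketch, not the code used to produce the reported results.

```python
import numpy as np

def excite(x, k, rng, alpha1=10.0, beta=0.95):
    """CEA's perturbation (5.21): x_k = x_{k-1} + alpha_k * omega_k, with
    standard Gaussian omega_k and decaying scale alpha_k = alpha1 * beta**(k-1)."""
    return x + alpha1 * beta ** (k - 1) * rng.standard_normal(x.shape)

def rosenbrock(x):
    """Rosenbrock: sum_i [100*(x_{i+1} - x_i^2)^2 + (x_i - 1)^2]."""
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (x[:-1] - 1.0) ** 2))

def pinter(x):
    """Pinter's function with wrap-around indexing x_0 = x_n, x_{n+1} = x_1."""
    n = x.size
    i = np.arange(1, n + 1)
    xm, xp = np.roll(x, 1), np.roll(x, -1)   # x_{i-1} and x_{i+1}
    a = xm * np.sin(x) - x + np.sin(xp)
    b = xm ** 2 - 2.0 * x + 3.0 * xp - np.cos(x) + 1.0
    return float(np.sum(i * x ** 2)
                 + np.sum(20.0 * i * np.sin(a) ** 2)
                 + np.sum(i * np.log10(1.0 + i * b ** 2)))
```

At the respective minimizers, rosenbrock(np.ones(20)) and pinter(np.zeros(20)) both evaluate to 0, matching H(x*) = 0 above.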
Pintér's function has many local optima, and the number of local optima increases exponentially with the dimension. Fig. 5.2 shows some of the functions in two dimensions.

Figure 5.2: Some benchmark problems in two dimensions: (a) Dejong's 5th function, −50 ≤ x_i ≤ 50, i = 1, 2; (b) Pintér's function, −5 ≤ x_i ≤ 5, i = 1, 2; (c) Rosenbrock function, −5 ≤ x_i ≤ 5, i = 1, 2.

For the CE method, we use the parameter values suggested by [51], with smoothing parameter ν = 0.7. We also found by trial and error that ν = 0.2 works very well for the CE method with smoothing. For CEA, the noise ω_k is chosen to be a standard Gaussian N(0, I_{n×n}), where n is the dimension of the solution space. In both methods, the parameterized family of distributions is chosen to be a multivariate Gaussian family. Table 5.1 and Fig. 5.3 show that CEA converges to better solutions than the CE method on these benchmark problems, and it converges faster than the CE method with ν = 0.2.

Table 5.1: Average performance of CEA and CE on some benchmark problems. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.

    Function     H(x*)     CEA (α_1 = 10, β = 0.95)   CE (ν = 0.7)     CE (ν = 0.2)
    Rosenbrock   0         16.84 (0.971)              19.41 (1.11)     17.39 (0.009)
    Dejong 5th   0.998     0.998 (1.8e-12)            1.01 (0.010)     0.998 (2.3e-15)
    Powell       0         2.4e-5 (2.0e-6)            420.8 (266.8)    2.5e-4 (2.4e-4)
    Pintér's     0         0.0068 (4.0e-4)            4.51 (0.15)      3.82 (0.055)
    Shekel's     -10.153   -10.153 (8.5e-9)           -9.033 (0.268)   -9.929 (0.128)

Note that on the Rosenbrock function, CEA can approach the optimum very closely with appropriately chosen parameters, whereas for any smoothing parameter values, the CE method with smoothing always gets stuck somewhere far from the optimum. Table 5.2 shows, on the Rosenbrock function, how the value of β, i.e.,
the decreasing rate of the artificial noise, affects the performance of CEA: a slower decrease of the noise yields better solutions, but, not surprisingly, takes more time to converge.

Table 5.2: Average performance of CEA with different parameter values of α and β on the Rosenbrock function. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.

    Function     H(x*)   α_1 = 10, β = 0.95   α_1 = 10, β = 0.98   α_1 = 10, β = 0.995
    Rosenbrock   0       16.84 (0.971)        11.90 (0.023)        0.505 (0.010)

Figure 5.3: Average performance of CEA and CE on some benchmark problems (function value versus total sample size on Dejong's 5th, 20-D Powell, 20-D Pintér, 4-D Shekel, and 20-D Rosenbrock).

5.7 Conclusions and Future Research

We transformed a global optimization problem into a filtering problem, so that many ideas and results from filtering can be adapted to global optimization. Based on particle filtering, we proposed a framework encompassing many simulation-based optimization algorithms. In particular, the framework unifies EDAs, the CE method, and MRAS, and provides new insight into the relationships between them. Moreover, the framework holds promise for developing new optimization algorithms through the choice of the observation noise and the sampling and resampling importance densities, as well as through a vast array of techniques for improving nonlinear filtering and particle filtering.
There are two important lines of future research that we will continue to pursue. First, we will further study how to develop new algorithms using the particle filtering framework, and the performance of these new algorithms. Second, we plan to investigate the convergence properties of the particle filtering framework for optimization. Although convergence has been proved for EDAs [93], the CE method [78], and MRAS [41] individually, we are interested in unifying the convergence results under the particle filtering framework. The analysis will be based on some existing stability results for nonlinear filters, such as [20].

Chapter 6 Conclusions and Future Research

6.1 Conclusions

This dissertation has proposed and developed new methods and results for solving problems in partially observable Markov decision processes and global optimization.

The first part of the dissertation focuses on continuous-state POMDPs. Continuous-state POMDPs provide a more natural mathematical model than finite-state POMDPs for many application problems. While there are a good number of numerical methods for finite-state POMDPs, there are only a few for continuous-state ones, and their convergence results are sparse. Existing algorithms (for finite-state POMDPs) are hard to extend to continuous-state POMDPs, mainly due to the infinite-dimensional belief space of a continuous-state POMDP, as opposed to the finite-dimensional belief space of a finite-state POMDP. Based on the idea of density projection with particle filtering, we have developed a numerical method for effective dimension reduction and scalable approximation of an arbitrary belief state, such as a multi-modal or heavy-tailed distribution. The idea of density projection is to orthogonally project an arbitrary density onto a parameterized family of densities, namely, an exponential family of densities in our algorithm, to yield a finite low-dimensional representation of that density.
Based on some existing results, we have shown that under certain mild conditions, the approximate density approaches the true density as the number of parameters increases, and hence we have shown the convergence of our algorithms. We have proved error bounds for our proposed POMDP-solving algorithm and online filtering algorithm. We have applied our method to an inventory control problem for which some analytical results are known for comparison, and obtained good numerical results that show the promise of our method. Although we have only proved error bounds for the infinite-horizon discounted cost criterion, the numerical results also indicate that our method works very well for the average cost criterion. In addition, with a little straightforward modification, our method can easily be applied to finite-horizon problems or (finite) large-state problems.

The second part of the dissertation is devoted to understanding and improving a class of simulation-based algorithms for global optimization. We have transformed a global optimization problem into a filtering problem, and hence many results in filtering can be adapted to global optimization. In particular, we have used a novel interpretation of particle filtering to develop a unifying framework for many simulation-based optimization algorithms, such as the cross-entropy method, the estimation of distribution algorithms, and model reference adaptive search. The framework reveals relationships between these algorithms and new, interesting insights. By better understanding these algorithms, we have proposed several promising directions under the same framework for new, improved algorithms, such as balancing exploitation and exploration, combining simulation-based global search with gradient-based local search, and preventing premature convergence.
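The unifying filtering view can be sketched generically: candidate solutions play the role of particles, a weighting of the objective values plays the role of the likelihood, and resampling plus perturbation with shrinking artificial noise concentrates the population. The code below is an illustrative toy of that view only; the name `pf_optimize`, the Boltzmann-style weights, and all parameter values are our own assumptions, not any one of the unified algorithms.

```python
import numpy as np

def pf_optimize(h, dim, n=300, noise0=2.0, decay=0.95, iters=150, seed=0):
    """Toy particle-filtering-style minimizer: weight, resample, perturb."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=(n, dim))   # initial particle population
    noise = noise0
    for _ in range(iters):
        vals = np.apply_along_axis(h, 1, x)
        w = np.exp(-(vals - vals.min()))          # Boltzmann-style "likelihood"
        w /= w.sum()
        idx = rng.choice(n, size=n, p=w)          # resample in proportion to weight
        x = x[idx] + rng.normal(0.0, noise, size=(n, dim))  # perturb survivors
        noise *= decay                            # shrink the artificial noise
    vals = np.apply_along_axis(h, 1, x)
    return x[vals.argmin()], float(vals.min())
```

In this view, the choice of weighting function corresponds to the observation noise, and the perturbation step to the sampling/resampling densities mentioned above.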
We explored the last direction and obtained a new, improved algorithm that shows better performance than the existing cross-entropy algorithm on some benchmark problems.

6.2 Future Research

Currently, I am applying the method for continuous-state POMDPs described in Chapter 4 to some stochastic control problems in finance. Many financial problems, such as portfolio optimization and hedging, can be modeled as stochastic control problems, where the control is the investment strategy. In many financial models, such as the classical Black-Scholes model [15], the price of a risky asset is modeled as a stochastic process, where the volatility of the risky asset is treated as a constant. However, to better model reality, more and more financial models assume that the volatility itself is also a stochastic process. While many analytical results have been derived for constant-volatility models, stochastic volatility models generally do not admit analytical solutions except in some rare cases, such as the Heston model for option pricing [38]. Hence, problems involving stochastic volatility often need to be solved numerically. A POMDP is a natural model for a portfolio optimization or hedging problem involving assets with stochastic volatility, where the volatility is the hidden state, the price observed at discrete times is a partial observation of the volatility, and the utility function is the objective function. Since the asset price and capital are discrete at least down to the penny, there is a huge number of states, so the algorithms for finite-state POMDPs are not applicable here. Our proposed method for continuous-state POMDPs provides an efficient numerical tool for these financial problems to find the best investment strategy and to study the effect of the amount of initial capital.

Another ongoing work is the study of the convergence of the particle filtering framework for global optimization.
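As a concrete instance of the partially observed setting just described, a standard discrete-time stochastic-volatility model treats log-volatility as a hidden AR(1) state that is observed only through returns. The simulation sketch below uses illustrative parameter values of our own, not any model calibrated in the dissertation.

```python
import numpy as np

def simulate_sv(T=250, mu=-1.0, phi=0.95, tau=0.2, seed=0):
    """Simulate a discrete-time stochastic-volatility model: the
    log-volatility h_k is a hidden AR(1) state, and the observed return
    y_k = exp(h_k / 2) * eps_k is only a partial (noisy) observation of
    it, so inferring h_k from y_k is a continuous-state filtering problem."""
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    y = np.empty(T)
    h[0] = mu                                   # start at the stationary mean
    for k in range(T):
        if k > 0:
            h[k] = mu + phi * (h[k - 1] - mu) + tau * rng.normal()
        y[k] = np.exp(h[k] / 2.0) * rng.normal()
    return h, y

h, y = simulate_sv()
```

In the POMDP formulation sketched above, such a hidden-volatility state would be tracked by the belief state, with the portfolio decision taken as a function of that belief.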
The hope is that the conditional density converges to a degenerate density concentrated on the optimal solution, regardless of the initial condition. This is closely related to stability results for nonlinear filtering [20].

There are two lines of future research that we plan to pursue. The first is to develop numerical methods for POMDPs in which the state and/or observation equation is a jump-diffusion process. The second is to extend the particle filtering framework to stochastic optimization, i.e., optimization problems in which the objective function cannot be evaluated exactly.

For POMDPs, the state equation (3.2) can be viewed either as a discrete-time state difference equation or as a discretized diffusion process. If it is a diffusion process, the discretization introduces error into the simulation of the propagation of the state. Some recent research has studied the exact simulation (without discretization error) of a diffusion process [14] [13], which could replace the usual Euler discretization of the state equation between observations. However, exact simulation has not yet been extended to jump-diffusion processes, and recent research on the inference of jump-diffusion processes still uses an Euler scheme [43]. Therefore, we need to further investigate efficient filtering methods for jump-diffusion processes and numerical solutions to the corresponding POMDPs. In financial engineering, many models with jump-diffusion processes have been proposed to model the financial market better than models with only diffusion processes. For example, Bates [8] added a Poisson process to the asset price process based on Heston's stochastic volatility model. In recent years, much research has focused on jump-diffusion process models for so-called ultra-high-frequency data [30], due to more frequent intra-day transactions. More examples of jump-diffusion processes in financial applications can be found in [35].
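The Euler scheme whose discretization error motivates the exact-simulation literature cited above can be sketched for a jump-diffusion. The code is a Merton-type illustration with hypothetical parameter values, not a scheme taken from the dissertation or from [43].

```python
import numpy as np

def euler_jump_diffusion(s0=100.0, mu=0.05, sigma=0.2,
                         lam=0.5, jump_mu=-0.1, jump_sig=0.15,
                         T=1.0, n_steps=252, seed=0):
    """Euler discretization of a jump-diffusion
        dS_t = mu*S_t dt + sigma*S_t dW_t + S_{t-} dJ_t,
    where J is a compound Poisson process whose jumps multiply the price
    by exp(Z), Z ~ N(jump_mu, jump_sig^2). Each Euler step introduces
    O(dt) discretization error in addition to the Monte Carlo error."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s = np.empty(n_steps + 1)
    s[0] = s0
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))            # Brownian increment
        n_jumps = rng.poisson(lam * dt)              # jumps in this step
        jump = np.expm1(rng.normal(jump_mu, jump_sig, n_jumps)).sum()
        s[k + 1] = s[k] * (1.0 + mu * dt + sigma * dw + jump)
    return s

path = euler_jump_diffusion()
```

Exact-simulation methods in the spirit of [14] [13] would remove the O(dt) bias of this scheme between observation times, but, as noted above, such methods are not yet available for the jump-diffusion case.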
Bibliography

[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice Hall, New Jersey, 1979.
[2] B. D. O. Anderson, J. B. Moore, and M. Eslami. Optimal filtering. IEEE Transactions on Systems, Man and Cybernetics, 12(2):235–236, 1982.
[3] C. Andrieu, J. F. G. de Freitas, and A. Doucet. Sequential MCMC for Bayesian model selection. IEEE Higher Order Statistics Workshop, 1999.
[4] S. Arulampalam, S. Maskell, N. J. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[5] B. Azimi-Sadjadi and P. S. Krishnaprasad. Approximate nonlinear filtering and its application in navigation. Automatica, 41(6):945–956, 2005.
[6] O. E. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, New York, 1978.
[7] A. R. Barron and C. Sheu. Approximation of density functions by sequences of exponential families. The Annals of Statistics, 19(3):1347–1369, 1991.
[8] D. S. Bates. Jumps and stochastic volatility: Exchange rate processes implicit in Deutsche mark options. The Review of Financial Studies, 9(1):69–107, 1996.
[9] D. P. Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.
[10] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[11] D. P. Bertsekas and D. A. Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6):589–598, 1989.
[12] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Optimization and Neural Computation Series. Athena Scientific, 1st edition, 1996.
[13] A. Beskos, O. Papaspiliopoulos, and G. Roberts. Computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion and reply from the authors).
Journal of the Royal Statistical Society, Series B, Statistical Methodology, 68:1–29, 2006.
[14] A. Beskos and G. Roberts. Exact simulation of diffusions. Annals of Applied Probability, 15:2422–2444, 2005.
[15] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81:637–659, 1973.
[16] R. W. Brockett. Remarks on finite dimensional nonlinear estimation. In C. Lobry, editor, Analyse des Systèmes, 1978.
[17] A. Brooks, A. Makarenko, S. Williams, and H. Durrant-Whyte. Parametric POMDPs for planning in continuous state spaces. Robotics and Autonomous Systems, 54(11):887–897, 2006.
[18] A. Brooks and S. Williams. A Monte Carlo update for parametric POMDPs. International Symposium of Robotics Research, Nov. 2007.
[19] A. Budhiraja, L. Chen, and C. Lee. A survey of numerical methods for nonlinear filtering problems. Physica D: Nonlinear Phenomena, 230:27–36, 2007.
[20] A. Budhiraja and D. Ocone. Exponential stability of discrete-time filters for bounded observation noise. Systems and Control Letters, 30:185–193, 1997.
[21] O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[22] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus. Simulation-based Algorithms for Markov Decision Processes. Communications and Control Engineering Series. Springer, New York, 1st edition, 2007.
[23] D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–746, 2002.
[24] M. H. A. Davis and S. I. Marcus. An introduction to nonlinear filtering. In M. Hazewinkel and J. C. Willems, editors, Stochastic Systems: The Mathematics of Filtering and Identification and Applications, pages 53–75, Amsterdam, The Netherlands, 1981. Reidel.
[25] J. F. G. de Freitas, M. Niranjan, and A. H. Gee.
Hierarchical Bayesian Kalman models for regularisation and ARD in sequential learning. Technical Report CUED/F-INFENG/TR 307, Cambridge University Engineering Department, 1998.
[26] P. T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134:19–67, 2005.
[27] D. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6), 2003.
[28] P. M. Djuric, J. H. Kotecha, F. Esteve, and E. Perret. Sequential parameter estimation of time-varying non-Gaussian autoregressive processes. EURASIP Journal on Applied Signal Processing, 8:865–875, 2002.
[29] A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.
[30] R. Engle. The econometrics of ultra-high-frequency data. Econometrica, 68:1–22, 2000.
[31] A. Gelb. Applied Optimal Estimation. The M.I.T. Press, 1974.
[32] W. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society, 63(1):127–146, 2001.
[33] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.
[34] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113, 1993.
[35] F. B. Hanson. Applied Stochastic Processes and Control for Jump-Diffusions: Modeling, Analysis and Computation, chapter Financial Engineering Applications. SIAM Books: Advances in Design and Control Series. Society for Industrial and Applied Mathematics, U.S., 2007.
[36] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–95, 2000.
[37] O. Hernandez-Lerma and J. B. Lasserre.
Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York, 1996.
[38] S. L. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies, 6:327–343, 1993.
[39] R. A. Howard. Dynamic Programming and Markov Processes. The M.I.T. Press, Cambridge, 1960.
[40] G. Q. Hu and S. S.-T. Yau. Finite-dimensional filters with nonlinear drift XV: New direct method for construction of universal finite-dimensional filter. IEEE Transactions on Aerospace and Electronic Systems, 38(1), 2002.
[41] J. Hu, M. C. Fu, and S. I. Marcus. A model reference adaptive search method for global optimization. Operations Research, 55:549–568, 2007.
[42] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proceedings of the European Conference on Computer Vision, pages 343–356, 1996.
[43] A. Jasra, D. Stephens, A. Doucet, and T. Tsagaris. Inference for Lévy driven stochastic volatility models via adaptive SMC. Preprint, 2008.
[44] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[45] G. Kallianpur. Stochastic Filtering Theory. Springer-Verlag, New York, 1980.
[46] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82:33–45, 1960.
[47] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83:95–108, 1961.
[48] G. Kitagawa. A self-organizing state-space model. Journal of the American Statistical Association, 93(443):1203–1215, 1998.
[49] D. Koller and U. Lerner. Sequential Monte Carlo Methods in Practice, chapter 10: Sampling in factored dynamic systems, pages 445–464. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2001.
[50] J. H. Kotecha and P. M. Djuric. Gaussian sum particle filtering. IEEE Transactions on Signal Processing, 51(10):2602–2612, 2003.
[51] D. P. Kroese, S. Porotsky, and R. Y. Rubinstein. The cross-entropy method for continuous multiextremal optimization. Methodology and Computing in Applied Probability, 8:383–407, 2006.
[52] H. J. Kushner. On the differential equations satisfied by conditional probability densities of Markov processes. SIAM Journal of Control, 2:106–119, 1964.
[53] P. Larranaga, R. Etxeberria, J. A. Lozano, and J. M. Pena. Optimization by learning and simulation of Bayesian and Gaussian networks. Technical Report EHU-KZAA-IK-4/99, Department of Computer Science and Artificial Intelligence, University of the Basque Country, 1999.
[54] F. Le Gland and N. Oudjane. Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filters. The Annals of Applied Probability, 14(1):144–187, 2004.
[55] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, 1998.
[56] F. L. Lewis. Optimal Estimation: With an Introduction to Stochastic Control Theory. Wiley-Interscience, 1986.
[57] M. L. Littman. The witness algorithm: Solving partially observable Markov decision processes. Technical Report CS-94-40, Department of Computer Science, Brown University, Providence, RI, 1994.
[58] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice, New York, 2001. Springer-Verlag.
[59] J. A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, editors. Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Springer-Verlag, New York, 2006.
[60] P. Naor and M. Resh. An inventory problem with discrete time review and replenishment by batches of fixed size. Management Science, 10(1):109–118, 1963.
[61] S. I. Marcus. Nonlinear estimation. In Systems Control Encyclopedia, pages 3293–3304, Oxford, 1987. Pergamon Press.
[62] S. I. Marcus.
Low dimensional filters for a class of finite state estimation problems with Poisson observations. Systems Control Letters, 1:237–241, January 1982.
[63] S. I. Marcus and A. S. Willsky. Algebraic structure and finite dimensional nonlinear estimation. SIAM Journal on Mathematical Analysis, 9:312–327, April 1978.
[64] H. Muhlenbein and G. Paass. From recombination of genes to the estimation of distributions: I. Binary parameters. In H. M. Voigt, W. Ebeling, I. Rechenberg, and H. P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN IV, pages 178–187, Berlin, Germany, 1996. Springer-Verlag.
[65] C. Musso, N. Oudjane, and F. Le Gland. Improving regularized particle filters. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice, New York, 2001. Springer-Verlag.
[66] K. Nummiaro, E. Koller-Meier, and L. Van Gool. An adaptive color-based particle filter. Image and Vision Computing, 21(1):99–110, 2003.
[67] J. D. Pintér. Global Optimization in Action. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1996.
[68] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
[69] J. M. Porta, M. T. J. Spaan, and N. Vlassis. Robot planning in partially observable continuous domains. In Robotics: Science and Systems, Cambridge, MA, 2005. MIT.
[70] J. M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367, 2006.
[71] P. Poupart and C. Boutilier. Value-directed compression of POMDPs. Advances in Neural Information Processing Systems, 15:1547–1554, 2003.
[72] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley, New York, 2007.
[73] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley & Sons, New York, 1994.
[74] C. P. Robert and G. Casella.
Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, New York, 2004.
[75] N. Roy. Finding Approximate POMDP Solutions through Belief Compression. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2003.
[76] N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. Advances in Neural Information Processing Systems, 15:1635–1642, 2003.
[77] R. Y. Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99:89–112, 1997.
[78] R. Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 2:127–190, 1999.
[79] R. Y. Rubinstein and D. P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, New York, 2004.
[80] P. J. Schweitzer and A. Seidman. Generalized polynomial approximations in Markovian decision problems. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.
[81] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, 1973.
[82] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, Palo Alto, CA, 1971.
[83] E. J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978.
[84] R. L. Stratonovich. Conditional Markov Processes and Their Application to the Theory of Optimal Control. Elsevier, New York, 1968.
[85] R. S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3(1):9–44, 1988.
[86] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[87] S. Thrun. Monte Carlo POMDPs.
Advances in Neural Information Processing Systems, 12:1064–1070, 2000.
[88] R. van der Merwe, N. de Freitas, A. Doucet, and E. Wan. The unscented particle filter. Advances in Neural Information Processing Systems, 13, 2001.
[89] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[90] S. S.-T. Yau and G. Q. Hu. Classification of finite dimensional estimation algebras of maximal rank with arbitrary state-space dimension and Mitter conjecture. International Journal of Control, 78(10), 2005.
[91] H. J. Yu. Approximate Solution Methods for Partially Observable Markov and Semi-Markov Decision Processes. PhD thesis, M.I.T., Cambridge, MA, 2006.
[92] M. Zakai. On the optimal filtering of diffusion processes. Probability Theory and Related Fields, 11:230–243, 1969.
[93] Q. Zhang and H. Muhlenbein. On the convergence of a class of estimation of distribution algorithms. IEEE Transactions on Evolutionary Computation, 8:127–136, 2004.
[94] E. Zhou, M. C. Fu, and S. I. Marcus. A fading memory particle filter. In preparation for submission.
[95] E. Zhou, M. C. Fu, and S. I. Marcus. Solving continuous-state POMDPs via density projection. IEEE Transactions on Automatic Control. Conditionally accepted, full paper.
[96] E. Zhou, M. C. Fu, and S. I. Marcus. A density projection approach to dimension reduction for continuous-state POMDPs. Proceedings of the 47th IEEE Conference on Decision and Control, pages 5576–5581, 2008.
[97] E. Zhou, M. C. Fu, and S. I. Marcus. A particle filtering framework for randomized optimization algorithms. Proceedings of the 2008 Winter Simulation Conference, pages 647–654, 2008.
[98] M. Zlochin, M. Birattari, N. Meuleau, and M. Dorigo. Model-based search for combinatorial optimization: A critical survey. Annals of Operations Research, 131:373–395, 2004.