ABSTRACT

Title of dissertation: PARTICLE FILTERING FOR STOCHASTIC CONTROL AND GLOBAL OPTIMIZATION

Enlu Zhou, Doctor of Philosophy, 2009

Dissertation directed by: Professor Steven I. Marcus, Department of Electrical and Computer Engineering, and Professor Michael C. Fu, Department of Decision, Operations, and Information Technologies

This thesis explores new algorithms and results in stochastic control and global optimization through the use of particle filtering. Stochastic control and global optimization are two areas that have many applications but are often difficult to solve. In stochastic control, an important class of problems, namely, partially observable Markov decision processes (POMDPs), provides an ideal paradigm to model discrete-time sequential decision making under uncertainty and partial observation. However, POMDPs usually do not admit analytical solutions, and are computationally very expensive to solve most of the time. While many efficient numerical algorithms have been developed for finite-state POMDPs, there are only a few proposed for continuous-state POMDPs, and even more sparse are relevant analytical results regarding convergence and error bounds. From the modeling viewpoint, many application problems are modeled more naturally by continuous-state POMDPs than by finite-state POMDPs. Therefore, one part of the thesis is devoted to developing a new efficient algorithm for continuous-state POMDPs and studying the performance of the algorithm both analytically and numerically. Based on the idea of density projection with particle filtering, the proposed algorithm reduces the infinite-dimensional problem to a finite low-dimensional one, and also has the flexibility and scalability for better approximation if given more computational power. Error bounds are proved for the algorithm, and numerical experiments are carried out on an inventory control problem.
In global optimization, many problems are very difficult to solve due to the presence of multiple local optima or badly scaled objective functions. Many approximate solution methods have been developed and studied. Among them, a recent class of simulation-based methods share the common characteristic of repeatedly drawing candidate solutions from an intermediate probability distribution and then updating the distribution using these candidate solutions, until the probability distribution becomes concentrated on the optimal solution. The efficiency and accuracy of these algorithms depend very much on the choice of the intermediate probability distributions and the updating schemes. Using a novel interpretation of particle filtering, these algorithms are unified under one framework, and hence, many new insights are revealed. By better understanding these existing algorithms, the framework also holds the promise for developing new improved algorithms. Some directions for new improved algorithms are proposed, and numerical experiments are carried out on a few benchmark problems.

PARTICLE FILTERING FOR STOCHASTIC CONTROL AND GLOBAL OPTIMIZATION

by Enlu Zhou

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2009

Advisory Committee: Professor Steven I. Marcus, Chair/Advisor; Professor Michael C. Fu, Co-Chair/Co-Advisor; Professor P. S. Krishnaprasad; Professor André L. Tits; Professor S. Raghavan

© Copyright by Enlu Zhou 2009

Dedication

In memory of my grandfather Mr. Mengquan Yu.

Acknowledgments

First of all, I would like to express my sincerest gratitude to my advisors Professor Steven I. Marcus and Professor Michael C. Fu. They have guided me in research and in career, provided me with tremendous support, and always had faith in me.
I regard them as role models for my career because of their dedication to the quality and integrity of research, their love and patience for students, and their being wonderful human beings. I have become a better person because of the five years of interaction with them. I will always cherish my experience as a graduate student under their supervision. I would also like to thank Professor P. S. Krishnaprasad. I took five graduate courses taught by him, and often went to ask him questions or discuss problems. He is also the first person who introduced me to the field of particle filtering. His lectures are insightful, inspiring, and challenging. I have been greatly influenced by his great passion for research and pursuit of new knowledge. I also want to thank Professor André L. Tits. He is not only a great teacher but also a good friend. I have learned a lot from him in class, and especially benefited from his technical rigor. I also had a lot of fun with him skiing, playing ping pong, and chatting at coffee hours. I cannot express enough thanks to my mother Zhilin Yu and father Huaishen Zhou, who always give me unconditional love and support, and are the source of my courage in hard times. I also want to thank Peng Qiu, for the continuous happy time spent together and the discrete times listening to my complaints.

Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
1.1 Particle Filtering
1.2 Partially Observable Markov Decision Processes
1.3 Contributions
1.4 Outline

2 Nonlinear Filtering
2.1 Extended Kalman Filter
2.2 Weighted Extended Kalman Filter
2.3 Particle Filtering
2.4 Fading Memory Particle Filter

3 Stochastic Control
3.1 Markov Decision Processes
3.1.1 Value Iteration
3.1.2 Policy Iteration
3.1.3 Simulation-Based Methods
3.2 Partially Observable Markov Decision Processes
3.2.1 Belief MDPs
3.2.2 Finite-State vs. Continuous-State

4 Solving Continuous-State POMDPs
4.1 Related Work and Motivation
4.2 Dimension Reduction
4.2.1 Density Projection
4.2.2 Projected Belief MDP
4.3 Projection Particle Filtering
4.4 Analysis of Projected Belief MDP
4.4.1 Main Idea
4.4.2 Error Bound
4.5 Analysis of Projection Particle Filtering
4.5.1 Notations
4.5.2 Main Idea
4.5.3 Error Bound
4.6 Validation of Assumptions
4.7 Numerical Experiments
4.7.1 Scalability and Computational Issues
4.7.2 Simulation Results
4.8 Conclusions
5 Particle Filtering Framework for Optimization
5.1 Related Work and Motivation
5.2 Filtering for Optimization
5.3 Particle Filtering Framework for Optimization
5.4 Interpretation of EDAs, CE, MRAS
5.4.1 Estimation of Distribution Algorithms
5.4.2 Cross Entropy Method
5.4.3 Model Reference Adaptive Search
5.5 Implication for New Algorithms
5.5.1 Balancing Exploration and Exploitation
5.5.2 Combining Global Search with Local Search
5.5.3 Overcoming Premature Convergence
5.6 Numerical Experiments
5.7 Conclusions and Future Research

6 Conclusions and Future Research
6.1 Conclusions
6.2 Future Research

Bibliography

List of Tables

2.1 Performances under different real system parameters θ_r, with model system parameter θ_m = 0.5 and system noise u_k ~ Γ(3,1). Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.
2.2 Performances under different system noises, with system parameter θ_r = θ_m = 0.5. Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.
4.1 Notations of Different Conditional Distributions
4.2 Optimal average cost estimates for the inventory control problem using different methods. Each entry represents the average cost of a run of horizon 10^5.
4.3 Optimal discounted cost estimates for the inventory control problem using different methods. Each entry represents the discounted cost (mean, standard error in parentheses) based on 1000 independent runs of horizon 40.
4.4 Table 4.3, continued
5.1 Average performance of CEA and CE on some benchmark problems. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.
5.2 Average performance of CEA with different parameter values of α and β on the Rosenbrock function. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.

List of Figures

2.1 A graphic illustration of Table 2.1: Performances (mean and standard error of MSEs based on 100 independent runs) under different values of the real system parameter θ_r.
2.2 A graphic illustration of Table 2.2: Performances (mean and standard error of MSEs based on 100 independent runs) under different system noises.
2.3 A typical run of the estimated states tracking the true state, when the system noise u_k ~ Γ(3,2) and the system parameter θ_r = θ_m = 0.5.
2.4 A typical run of the estimated states tracking the true state when the system model is inaccurate. The real system parameter θ_r = 0.6, the model system parameter θ_m = 0.5, and the system noise u_k ~ Γ(3,2).
4.1 Our algorithm: actions taken for different inventory levels under different observation noise variances.
5.1 At the first three iterations, the conditional density b_k becomes more and more concentrated on the optimal solution.
5.2 Some benchmark problems in two dimensions.
5.3 Average performance of CEA and CE on some benchmark problems.

List of Abbreviations

CE Cross Entropy Method
DP Dynamic Programming
EDA Estimation of Distribution Algorithm
EKF Extended Kalman Filter
MDP Markov Decision Process
MRAS Model Reference Adaptive Search
KF Kalman Filter
KL Kullback-Leibler
PF Particle Filter
POMDP Partially Observable Markov Decision Process

Chapter 1
Introduction

Stochastic control and global optimization are fields that share two characteristics: (i) their models can be used to formulate problems in many application areas; (ii) algorithms for solving problems in these fields are usually intractable and difficult to analyze. Attacking difficult problems of the size found in most applications requires significant new methodologies. This dissertation attempts to solve problems in stochastic control and global optimization through the use of particle filtering. This chapter gives a brief introduction to particle filtering and to an important class of problems in stochastic control, namely, partially observable Markov decision processes (POMDPs).

1.1 Particle Filtering

The goal of filtering is to estimate the unobserved states of a dynamic system given a noisy observation process of the states. The classic problem is to estimate the conditional density of the current state given the history of observations [24]. The conditional density usually evolves according to some equation, for instance, in the case of a diffusion process the normalized Kushner-Stratonovich equation [52] [84] or the unnormalized Zakai equation [92].
However, these equations usually do not admit an analytical form of the solution except in some special cases, such as linear Gaussian systems and finite-state-space hidden Markov models. For more general systems, many approximate filtering methods have been developed. One approach to approximation is to modify the system in such a way that exact filters can be applied to the modified system. For example, a well-known method is the extended Kalman filter (EKF) (pp. 182-190, [31]) for nonlinear/non-Gaussian systems. The EKF linearizes the system equation and observation equation, and approximates the system noise and observation noise by their first two moments, i.e., by Gaussian random variables. Hence, the approximated linear Gaussian system can be filtered using the standard Kalman filter [46] [47]. The EKF is computationally efficient, but the convergence of the filter is not guaranteed and the performance can be poor in many situations. Another method is to discretize the state space and transform the system into a finite-state hidden Markov model. The disadvantage of this method is that the number of grid points has to be sufficiently large to obtain a good approximation, and the number of grid points increases dramatically as the dimension of the state space increases. A recent and powerful method for nonlinear filtering is particle filtering. It is a class of Monte Carlo simulation-based filtering methods for nonlinear/non-Gaussian systems, a setting where traditional methods often lead to computational intractability. Particle filtering was first introduced by Gordon et al. [34], and is also called bootstrap filtering [34], sampling importance resampling [4], the condensation algorithm [42], and the sequential Monte Carlo method [4]. A good tutorial can be found in [4], a good survey of the field and recent trends can be found in [21], and more details can be found in the book [29].
The idea of particle filtering is to approximate the conditional density by a finite number of particles/samples and let these particles propagate in a certain way to mimic the evolution of the conditional density. The approximated conditional density converges to the true conditional density in a certain sense as the number of particles increases to infinity [23], [54]. The plain particle filter consists of three steps at each time/iteration: importance sampling, weighting, and resampling. In the importance sampling step, i.i.d. particles are sampled from an importance density. In the weighting step, particles are weighted according to the ratio of the conditional density to the importance density, and also according to Bayes' rule using the new observation of the current state. These weighted particles represent a discrete distribution with support points equal to the locations of the particles and the associated probabilities equal to the weights of the particles. This discrete distribution is an approximation to the true conditional distribution. In the last step, new particles are resampled from the old particles according to their weights. These new particles go through the same steps at the next time/iteration. The importance distribution is often chosen as the distribution of the current state given the previous state. The benefit of this choice is twofold. First, this importance distribution is easy to sample, since sampling from it is equivalent to propagating the particles through the system equation. Thus, the importance sampling step is also called the propagation or prediction step in this case. Second, the weight of each particle is proportional to the likelihood of the current observation given that particle, and hence is very easy to calculate. However, this is not the optimal choice of importance distribution. The simulation can be very inefficient if the importance distribution is very different from the target distribution.
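The three steps just described can be sketched in a short program. The following is a minimal bootstrap filter, i.e., the importance distribution is the state transition distribution, for a generic additive-Gaussian model; the particular functions f and h, the noise levels, and the standard-normal prior are illustrative placeholders, not a system from this thesis.

```python
import numpy as np

def bootstrap_particle_filter(y, f, h, sigma_u, sigma_v, n_particles=500, rng=None):
    """Bootstrap particle filter sketch for x_k = f(x_{k-1}) + u_k, y_k = h(x_k) + v_k,
    with scalar state, Gaussian noises, and a standard-normal prior (all assumptions).
    Returns the posterior-mean estimates of x_k given y_{0:k}."""
    rng = np.random.default_rng(0) if rng is None else rng
    particles = rng.normal(0.0, 1.0, n_particles)  # samples from the prior p0
    estimates = []
    for yk in y:
        # Importance sampling / propagation: push particles through the dynamics.
        particles = f(particles) + rng.normal(0.0, sigma_u, n_particles)
        # Weighting: likelihood of the new observation under each particle (Bayes' rule).
        w = np.exp(-0.5 * ((yk - h(particles)) / sigma_v) ** 2)
        w /= w.sum()
        estimates.append(np.sum(w * particles))
        # Resampling: draw new particles according to the weights.
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(estimates)
```

With this choice of importance distribution, the weighting step reduces to evaluating the observation likelihood at each particle, which is exactly the second benefit noted above.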
For instance, if the conditional distribution is peaked and the importance distribution is flat, it is very likely that most of the particles sampled from the importance distribution lie in an area of probability close to zero according to the conditional distribution. Therefore, many improved particle filters have been proposed based on a better choice of the importance distribution, such as the extended Kalman particle filter [25], the auxiliary particle filter [68], and the unscented particle filter [88]. The plain particle filter often suffers from the problem of sample impoverishment or loss of diversity, meaning that all particles collapse to a small number of particles within a few iterations. Since new particles are resampled from the old particles according to their weights/probabilities, old particles with large weights/probabilities will have more copies in the next iteration. The problem of sample impoverishment is especially severe when the system noise is small, because the copies of the same particle cannot be dispersed far enough through propagation. Therefore, improving the resampling step is also a goal of many improved particle filters, such as the regularized particle filter [65] and the particle filter with MCMC steps [3].

1.2 Partially Observable Markov Decision Processes

POMDPs provide a powerful paradigm for modeling discrete-time optimal decision making under uncertainty and partial observation. They have been used to model application problems arising in areas such as manufacturing systems, artificial intelligence, financial engineering, and operations research. A POMDP model consists of a time-indexed state equation that models the system dynamics, a time-indexed observation equation that models the observation or measurement of the state, and an objective function that models the cost (or reward) that needs to be minimized (or maximized).
A decision maker interacts with the environment over a finite or infinite time horizon that is divided into stages or periods. At each stage, the decision maker receives a partial observation of the current state, takes an action based on this piece of information along with the history of observations and actions, and then transitions to a new state probabilistically at the next stage. The action taken incurs a one-step cost (or reward). The objective is to minimize (or maximize) a cost (or reward) function, where the one-step costs (or rewards) are accrued at each stage. The difference between a POMDP and a Markov decision process (MDP) is in the observation of the state. The state can be fully recovered from the observation in an MDP, whereas the observation provides only partial information about the state in a POMDP. A POMDP can be transformed into a so-called belief MDP, which is an MDP whose states are conditional probability distributions of the true states of the POMDP, called belief states. This transformation allows us to utilize the various existing techniques for solving MDPs. However, it also creates new difficulties. The first difficulty is how to accurately and efficiently estimate the distribution of the states, which is the goal of a filtering problem. The second difficulty is that although the transformed problem is an MDP, its state space is generally continuous and may be infinite-dimensional. The reason is as follows. Consider a POMDP with a finite state space. The estimate of the state is a discrete probability distribution, which sits in a finite-dimensional probability simplex. Hence, the corresponding belief MDP has a finite-dimensional continuous state space, which suffers from the curse of dimensionality of the continuous state space of the MDP. Now consider a POMDP that has a continuous state space.
The probability distribution of the state sits in an infinite-dimensional space of continuous probability distributions. Hence, the corresponding belief MDP has an infinite-dimensional continuous state space, which makes the problem even more difficult to solve. Therefore, efficient numerical solution of POMDPs with large or continuous state spaces is a challenging research area. Past research on numerical solutions of POMDPs has mostly focused on finite-state problems. For finite-state POMDPs, it is proved that the value function is a piecewise linear convex function after a finite number of iterations, provided that the one-step cost function is piecewise linear and convex [81]. This feature has been exploited in various exact and approximate algorithms such as those found in [82], [81], [83], [57], and [36]. The algorithm in [81] and the Witness algorithm in [57] carry out exact value iterations by finding and updating the minimum set of linear functions that determine the value function at each iteration for a finite-horizon problem. Because the number of such linear functions grows exponentially with the number of iterations, these algorithms are computationally very expensive and are limited to very simple problems in practice. Howard [39] and Sondik [83] use policy iteration to solve infinite-horizon discounted-cost problems. Hauskrecht [36] summarizes several value function approximation methods in a nice framework of modifying the value iteration equation by changing the order of summations and maximization, including approximation with a fully observable MDP, approximation with Q-functions, the fast informed bound method, and approximation with an unobservable MDP. [36] also summarizes many grid-based methods with various techniques for choosing the set of grid points and approximating the value function.
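For a finite-state POMDP, the belief state is simply a probability vector, and its transition is one application of Bayes' rule on the simplex; this is the belief-MDP state update underlying the algorithms cited above. A minimal sketch, where the two-state transition matrix and observation likelihoods are illustrative numbers rather than a model from this thesis:

```python
import numpy as np

def belief_update(b, T, O, obs):
    """One belief-MDP state transition for a finite-state POMDP.

    b:   current belief, a probability vector over states
    T:   T[i, j] = P(next state j | state i) under the chosen action
    O:   O[j, o] = P(observation o | state j)
    obs: index of the observation just received
    """
    predicted = b @ T                      # prior over the next state
    unnormalized = predicted * O[:, obs]   # Bayes' rule with the new observation
    return unnormalized / unnormalized.sum()

# Illustrative two-state example: the updated belief stays on the simplex.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
O = np.array([[0.7, 0.3], [0.4, 0.6]])
b = belief_update(np.array([0.5, 0.5]), T, O, obs=0)
```

The update costs only a matrix-vector product per stage; the difficulty discussed above is that for a continuous state space the belief is a density, and no such finite-dimensional recursion is available in general.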
Despite the abundance of algorithms for finite-state POMDPs, the aforementioned infinite dimensionality of continuous-state POMDPs suggests that simple generalizations of many of the discrete-state algorithms to continuous-state models are not appropriate or applicable. Therefore, some researchers have been motivated to look for efficient algorithms for continuous-state POMDPs [87] [69] [75] [17]. However, compared to finite-state POMDPs, the research on continuous-state POMDPs is more recent and still sparse. Therefore, it is a field worth exploring.

1.3 Contributions

My doctoral research has yielded new solution methods for partially observable Markov decision processes and global optimization through the use of particle filtering. To the best of our knowledge, application of particle filtering to POMDPs is relatively sparse, and it has never before been applied to the field of global optimization. The first line of our research aims to establish a framework for solving large or continuous-state POMDPs. Most of the existing approximate solutions for continuous-state POMDPs involve some kind of dimension reduction of the belief space, i.e., reducing the infinite dimensionality to a finite (low) dimensionality. Dimension reduction is a very broad idea, and it is worth studying how to do dimension reduction in an efficient and meaningful way. On the one hand, we want to reduce the dimensionality as much as possible so that the computational demand can be lowered. On the other hand, we do not want to lose too much information during dimension reduction, so that the approximation remains useful. Moreover, we would like to have a mechanism that allows us to control the tradeoff between complexity and accuracy: we can increase the accuracy of the solution by increasing the computational power, and vice versa.
These considerations have motivated our research to investigate a new dimension reduction technique to develop efficient numerical solutions for large/continuous-state POMDPs. Based on the idea of density projection with particle filtering, we have developed a theoretically sound method that effectively reduces the dimension of the belief state and has the flexibility to represent arbitrary belief states, such as multimodal or heavy-tailed distributions. We have proved rigorous convergence results and error bounds for the algorithm. We have also applied the approach to an inventory control problem and to portfolio optimization problems in financial engineering, obtaining good numerical results. The development and analysis of the approach appear in our papers [95] [96]. Applications of the approach to financial engineering are to appear in our working paper [94]. As a second line of our research, we have proposed a filtering approach to optimization, and in particular, have developed a framework based on particle filtering. Global optimization problems can be extremely difficult to solve, due to the presence of multiple local optimal solutions in the general setting. A class of simulation-based methods for global optimization has been introduced recently, which includes the estimation of distribution algorithms (EDAs), the cross-entropy (CE) method, and model reference adaptive search (MRAS). In these algorithms, new solutions are generated from an intermediate probability distribution that is updated or induced from the previously generated solutions. Algorithms that fall into this category differ in the choice and updating of the intermediate probability distribution, which plays a key role in determining the effectiveness of the algorithm. This has motivated us to look for a unifying and systematic approach to such simulation-based methods for optimization.
We have introduced a framework based on particle filtering that unifies EDAs, the CE method, and MRAS, as well as combining simulation-based global search with gradient-based local search in a nice way. This flexible framework holds the promise of generating new improved algorithms by incorporating many of the vast array of techniques that have been developed to improve particle filters, as well as other recent results in nonlinear filtering. We are currently developing and testing new algorithms under this framework, and analyzing convergence properties and error bounds of the framework. Preliminary results of this work appear in our paper [97]. A more complete and developed treatment appears in our paper [96]. Besides the two main lines of research, we have also developed a fading memory particle filter. It is an improved particle filter for situations where the system model is inaccurate and simulation is insufficient.

1.4 Outline

The rest of the dissertation is organized as follows. Chapter 2 provides the necessary background and literature review on nonlinear filtering, and proposes a fading memory particle filter. We describe the problem formulation of nonlinear filtering, and describe in detail some approximate nonlinear filters, including the extended Kalman filter, the weighted extended Kalman filter, and particle filtering in general. At the end of the chapter, we present the derivation of a fading memory particle filter and results from numerical experiments. Chapter 3 provides the necessary background and literature review on stochastic control.
In particular, we focus on Markov decision processes and partially observable Markov decision processes, and describe the problem formulation of each. For MDPs, we describe in detail the value iteration method and the policy iteration method, and briefly describe simulation-based methods. For POMDPs, we describe the transformation to belief MDPs, and discuss the numerical complexity of finite-state vs. continuous-state POMDPs. Chapter 4 presents our work on a new efficient numerical method for solving continuous-state POMDPs. Section 4.1 reviews related work and explains our motivation for this work; section 4.2 describes the density projection technique, and uses it to develop the projected belief MDP; section 4.3 develops the projection particle filter; section 4.4 derives error bounds for the projected belief MDP; section 4.5 derives an error bound for the projection particle filter; section 4.6 presents a validation of the assumptions used in the proofs of the error bounds; section 4.7 discusses scalability and computational issues of the method, and applies the method to a simulation example of an inventory control problem; section 4.8 concludes and discusses the work. Chapter 5 presents a particle filtering framework for simulation-based optimization algorithms.
Section 5.1 reviews related work and explains our motivation for this work; section 5.2 transforms a global optimization problem into a filtering problem; section 5.3 applies particle filtering to the transformed filtering problem and develops a general framework for simulation-based optimization algorithms; section 5.4 uses the framework to interpret some existing algorithms and reveals some new insights; section 5.5 discusses directions for developing new improved algorithms under this framework; section 5.6 presents numerical results of a new improved algorithm in comparison with the cross-entropy method on some benchmark problems; section 5.7 concludes our current work and discusses some future research directions. Chapter 6 concludes the dissertation and outlines some future research.

Chapter 2
Nonlinear Filtering

The nonlinear filtering problem [24] [61] [45] involves the estimation of a stochastic process x (called the state process) which cannot be observed directly. Information concerning x is obtained from observations of a related process y (the observation process). The objective is the computation, for each t, of least-squares estimates of functions of the state x_t given the "past" observations \{y_s, 0 \leq s \leq t\}, i.e., to compute quantities of the form E[\ell(x_t) \mid y_s, 0 \leq s \leq t], or perhaps to compute the entire conditional distribution of the state x_t given \{y_s, 0 \leq s \leq t\}. When the observations are received sequentially, it is also desired that this computation be done recursively. Except in some special cases such as the linear Gaussian case, there do not exist statistics of the conditional distribution that can be computed recursively with finite-dimensional filters [16] [63] [62] [40] [90]. Hence, there have been many attempts to develop recursive finite-dimensional suboptimal filters, and significant effort has been dedicated to developing numerical methods for solving nonlinear filtering problems (see [19] for a recent survey).
In this thesis, we focus on a discrete-time model
\[
x_k = f(x_{k-1}, u_{k-1}), \quad k = 1, 2, \ldots,
\]
\[
y_k = h(x_k, v_k), \quad k = 0, 1, \ldots, \tag{2.1}
\]
where for all $k$, $x_k \in \mathbb{R}^{n_x}$ is the state, $y_k \in \mathbb{R}^{n_y}$ is the observation, the random disturbances $u_k \in \mathbb{R}^{n_u}$ and $v_k \in \mathbb{R}^{n_v}$ form sequences of i.i.d. continuous random vectors, and $n_x$, $n_y$, $n_u$, and $n_v$ are the dimensions of $x_k$, $y_k$, $u_k$, and $v_k$, respectively. Assume that $\{u_k\}$ and $\{v_k\}$ are independent of each other, and independent of $x_0$, which follows a distribution $p_0$. The goal of filtering is to estimate the conditional density
\[
p(x_k \mid y_{0:k}), \quad k = 0, 1, \ldots, \tag{2.2}
\]
where $y_{0:k} = \{y_0, \ldots, y_k\}$ denotes all the observations from time $0$ to $k$.

2.1 Extended Kalman Filter

In (2.1), if $f$ and $h$ are linear functions in $x$, $u$ and $v$, $u$ and $v$ are Gaussian noises, and the initial condition $x_0$ is Gaussian, then the conditional density $p(x_k \mid y_{0:k})$ is Gaussian for all time $k$, and there exists a finite-dimensional optimal filter, namely, the Kalman filter (KF) [46] [47]. If (2.1) takes the form
\[
x_k = f(x_{k-1}) + u_{k-1}, \quad k = 1, 2, \ldots,
\]
\[
y_k = h(x_k) + v_k, \quad k = 0, 1, \ldots, \tag{2.3}
\]
where $f$ and $h$ are nonlinear functions, then the Kalman filter can be applied to (2.3) by linearizing the system equations, and approximating the system noise and observation noise as Gaussian noises. This approach is referred to as the extended Kalman filter (EKF) [31] [1] [2]. More specifically, the true conditional density $p(x_k \mid y_{0:k})$ is approximated by a Gaussian density with mean $\hat{x}_{k|k}$ and covariance $P_{k|k}$ that are defined as
\[
\hat{x}_{k|k} \triangleq E[x_k \mid y_{0:k}], \quad P_{k|k} \triangleq E[(x_k - \hat{x}_{k|k})(x_k - \hat{x}_{k|k})^T \mid y_{0:k}].
\]
The estimates $\hat{x}_{k|k}$ and $P_{k|k}$ are often called the posterior mean and covariance of the state $x_k$. Similarly, the predicted mean and covariance at time $k$ are defined as
\[
\hat{x}_{k|k-1} \triangleq E[x_k \mid y_{0:k-1}], \quad P_{k|k-1} \triangleq E[(x_k - \hat{x}_{k|k-1})(x_k - \hat{x}_{k|k-1})^T \mid y_{0:k-1}].
\]
To linearize the system (2.3), define
\[
F_k = \left.\frac{\partial f(x)}{\partial x}\right|_{x = \hat{x}_{k|k}}, \quad H_k = \left.\frac{\partial h(x)}{\partial x}\right|_{x = \hat{x}_{k|k-1}}.
\]
Without loss of generality, we assume that $u_k$ and $v_k$ are zero mean.
Let $Q_k$ and $R_k$ denote the covariance matrices of $u_k$ and $v_k$, respectively. Then the linearized and Gaussianized system of (2.3) is
\[
x_k = F_{k-1} x_{k-1} + u_{k-1},
\]
\[
y_k = H_k x_k + v_k, \tag{2.4}
\]
where $u_{k-1} \sim N(0, Q_{k-1})$ and $v_k \sim N(0, R_k)$. For (2.4), if the initial condition $x_0$ follows a Gaussian distribution $N(\bar{x}_0, P_0)$, the conditional density is Gaussian for every time $k$, and the optimal filter is the Kalman filter, which recursively propagates the predicted mean $\hat{x}_{k|k-1}$ and the predicted covariance matrix $P_{k|k-1}$, and updates the posterior mean $\hat{x}_{k|k}$ and the posterior covariance matrix $P_{k|k}$ using the new observation $y_k$. Mathematically, the Kalman filter consists of the following recursive equations
\[
\hat{x}_{k|k-1} = F_{k-1} \hat{x}_{k-1|k-1},
\]
\[
P_{k|k-1} = F_{k-1} P_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P_{k|k-1} H_k^T (R_k + H_k P_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - H_k \hat{x}_{k|k-1}),
\]
\[
P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1}, \tag{2.5}
\]
with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P_{0|-1} = P_0$. The EKF extends the KF to the nonlinear system (2.3) as follows:
\[
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}),
\]
\[
P_{k|k-1} = F_{k-1} P_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P_{k|k-1} H_k^T (R_k + H_k P_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - h(\hat{x}_{k|k-1})),
\]
\[
P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1}, \tag{2.6}
\]
with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P_{0|-1} = P_0$. The EKF (2.6) looks almost the same as the KF (2.5), except that $f(\hat{x}_{k-1|k-1})$ replaces $F_{k-1} \hat{x}_{k-1|k-1}$ and $h(\hat{x}_{k|k-1})$ replaces $H_k \hat{x}_{k|k-1}$. The EKF is a suboptimal filter, and its convergence is not guaranteed.

2.2 Weighted Extended Kalman Filter

Sometimes the system model is not an accurate model of the actual system. The model inaccuracy degrades the value of past information [44]. Therefore, we sometimes want to discount old observations, or equivalently, to put more weight on the more recent observations. Fading memory of old observations can compensate for the model inaccuracy, and the physical justification is that "old observations, when predicted over long time arcs through an erroneous system, can be valueless" (pp. 305, [44]).
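For concreteness, one step of the EKF recursion (2.6) can be sketched in code before we modify it. This is a minimal sketch for the additive-noise model (2.3); the function names, the Jacobian arguments, and the interface are our own illustrative choices, not from the text.

```python
import numpy as np

def ekf_step(x_post, P_post, y, f, h, F_jac, H_jac, Q, R):
    """One step of the EKF recursion (2.6) for the additive-noise model
    x_k = f(x_{k-1}) + u_{k-1}, y_k = h(x_k) + v_k (a sketch; the
    interface is illustrative)."""
    # Prediction: propagate the posterior mean through f and the
    # covariance through the Jacobian F_{k-1} evaluated at the old mean.
    x_pred = f(x_post)
    F = F_jac(x_post)
    P_pred = F @ P_post @ F.T + Q
    # Update: Kalman gain and correction using the new observation y,
    # with the Jacobian H_k evaluated at the predicted mean.
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new
```

For linear $f$ and $h$ this reduces to one step of the KF (2.5).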
This idea of fading memory of old observations has been proposed and studied theoretically and empirically for several decades for the Kalman filter [1] [56] [31]. One approach often used is exponential data weighting, or in other words, exponentially discounting the old observations. Anderson and Moore derived the Kalman filter with exponential data weighting (KF-EDW) (pp. 135-138, [1]), by observing that the KF estimate $\hat{x}_{k|k-1}$ is equal to $x_k^*$, the last component of the solution $(x_0^*, \ldots, x_k^*)$ to the minimization problem
\[
\min_{x_0, \ldots, x_k} J_k = \frac{1}{2}(x_0 - \bar{x}_0)^T P_0^{\dagger} (x_0 - \bar{x}_0) + \frac{1}{2} \sum_{i=0}^{k-1} (y_i - H_i^T x_i)^T R_i^{\dagger} (y_i - H_i^T x_i) + \frac{1}{2} \sum_{i=0}^{k-1} u_i^T Q_i^{\dagger} u_i, \tag{2.7}
\]
\[
\text{s.t.} \quad x_i = F_{i-1} x_{i-1} + u_{i-1}, \quad i = 1, \ldots, k,
\]
where the superscript $\dagger$ denotes pseudo-inverse. The function (2.7) can be viewed as a total cost function on the estimation errors from time $0$ to $k-1$. With this interpretation, placing more emphasis on the more recent observations is equivalent to penalizing the more recent estimation errors more heavily. This suggests increasing the weighting matrices $R_i^{\dagger}$ and $Q_i^{\dagger}$ in (2.7) for larger values of $i$, such as replacing $R_i$ and $Q_i$ by $\alpha^{-2i} R_i$ and $\alpha^{-2(i+1)} Q_i$ in (2.7), yielding
\[
J_k = \frac{1}{2}(x_0 - \bar{x}_0)^T P_0^{\dagger} (x_0 - \bar{x}_0) + \frac{1}{2} \sum_{i=0}^{k-1} (y_i - H_i^T x_i)^T \alpha^{2i} R_i^{\dagger} (y_i - H_i^T x_i) + \frac{1}{2} \sum_{i=0}^{k-1} u_i^T \alpha^{2(i+1)} Q_i^{\dagger} u_i, \tag{2.8}
\]
where $\alpha$ is some constant greater than $1$. In view of the relationship between (2.7) and the Kalman filter, there could exist a similar relationship between (2.8) and a designed filter. More specifically, we can replace the actual covariance matrices $R_k$ and $Q_{k-1}$ in (2.5) by the design values $\alpha^{-2k} R_k$ and $\alpha^{-2k} Q_{k-1}$, to obtain the designed filter that achieves the minimum of (2.8). The resultant designed filter is the KF-EDW:
\[
\hat{x}_{k|k-1} = F_{k-1} \hat{x}_{k-1|k-1},
\]
\[
P^{\alpha}_{k|k-1} = \alpha^2 F_{k-1} P^{\alpha}_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P^{\alpha}_{k|k-1} H_k^T (R_k + H_k P^{\alpha}_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - H_k \hat{x}_{k|k-1}),
\]
\[
P^{\alpha}_{k|k} = P^{\alpha}_{k|k-1} - K_k H_k P^{\alpha}_{k|k-1}, \tag{2.9}
\]
where $P^{\alpha}_{k|k-1} = \alpha^{2k} P_{k|k-1}$ and $P^{\alpha}_{k|k} = \alpha^{2k} P_{k|k}$, with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P^{\alpha}_{0|-1} = P_0$.
The parameter $\alpha$ is called the forgetting factor, because it indicates how fast the observations are forgotten. Notice that the main difference between (2.9) and (2.5) is the coefficient $\alpha^2$ in the second equation. Also notice that while $P_{k|k}$ in the KF is the error covariance $E[(x_k - \hat{x}_{k|k})(x_k - \hat{x}_{k|k})^T \mid y_{0:k}]$, the estimate $P^{\alpha}_{k|k}$ in the KF-EDW is not. Although the KF-EDW is not optimal, the exponential data weighting promotes exponential stability of the filter. In a similar spirit to how the EKF extends the KF, we extend the KF-EDW to the nonlinear system (2.3) as follows to obtain the weighted extended Kalman filter (WEKF):
\[
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}),
\]
\[
P^{\alpha}_{k|k-1} = \alpha^2 F_{k-1} P^{\alpha}_{k-1|k-1} F_{k-1}^T + Q_{k-1},
\]
\[
K_k = P^{\alpha}_{k|k-1} H_k^T (R_k + H_k P^{\alpha}_{k|k-1} H_k^T)^{-1},
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - h(\hat{x}_{k|k-1})),
\]
\[
P^{\alpha}_{k|k} = P^{\alpha}_{k|k-1} - K_k H_k P^{\alpha}_{k|k-1}, \tag{2.10}
\]
with initialization $\hat{x}_{0|-1} = \bar{x}_0$, $P^{\alpha}_{0|-1} = P_0$. Unlike the optimality difference between the KF and the KF-EDW, the EKF and the WEKF are both suboptimal, and neither of the estimates $P_{k|k}$ and $P^{\alpha}_{k|k}$ is the error covariance. Like the EKF, the convergence of the WEKF is not guaranteed either.

2.3 Particle Filtering

Particle filtering is a class of filters that utilize Monte Carlo simulation and importance sampling techniques to estimate the conditional density of the state given the past observations [4] [29] [21]. It is also called bootstrap filtering [34], sampling importance resampling [4], the condensation algorithm [42], and the sequential Monte Carlo method [21]. Particle filtering approximates the conditional density $p(x_k \mid y_{0:k})$ using a finite number of particles, mimicking the evolution of the conditional density through the propagation of these particles. More specifically, the particle filter approximates $p(x_k \mid y_{0:k})$ by a probability mass function
\[
\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i), \tag{2.11}
\]
where $\delta$ denotes the Dirac delta function, $\{x_k^i, i = 1, \ldots, N\}$ are the random support points, and $\{w_k^i, i = 1, \ldots, N\}$ are the associated weights satisfying $w_k^i \geq
0$, $i = 1, \ldots, N$, and $\sum_{i=1}^N w_k^i = 1$. Since $p(x_k \mid y_{0:k})$ is unknown, we opt to generate the particles by sampling from another known density $q(x_k \mid y_{0:k})$, and adjust the weights of the samples to get an estimate of $p(x_k \mid y_{0:k})$. This approach is known as importance sampling, and the density $q(x_k \mid y_{0:k})$ is referred to as the importance density. The rationale of importance sampling can be observed from the following transformation:
\[
E_p[\ell(x)] = \int \ell(x) p(x)\, dx = \int \ell(x) \frac{p(x)}{q(x)} q(x)\, dx = E_q\!\left[\ell(x) \frac{p(x)}{q(x)}\right], \tag{2.12}
\]
where $\ell$ is an arbitrary integrable function, $p$ is the target density, and $q$ is the importance density. Hence, from (2.12) it is easy to see that in order to approximate $p(x_k \mid y_{0:k})$, for samples $\{x_k^i, i = 1, \ldots, N\}$ drawn i.i.d. from $q(x_k \mid y_{0:k})$, their weights should be
\[
w_k^i \propto \frac{p(x_k^i \mid y_{0:k})}{q(x_k^i \mid y_{0:k})}, \tag{2.13}
\]
where $\propto$ means "proportional to", and the weights should be normalized. To carry out the estimation recursively, we use Bayes' rule to derive the following recursive equation for the conditional density:
\[
\begin{aligned}
p(x_k \mid y_{0:k}) &= \frac{p(x_k, y_k \mid y_{0:k-1})}{p(y_k \mid y_{0:k-1})} \\
&= \frac{p(y_k \mid y_{0:k-1}, x_k)\, p(x_k \mid y_{0:k-1})}{p(y_k \mid y_{0:k-1})} \\
&= \frac{p(y_k \mid x_k) \int p(x_k \mid y_{0:k-1}, x_{k-1})\, p(x_{k-1} \mid y_{0:k-1})\, dx_{k-1}}{p(y_k \mid y_{0:k-1})} \\
&\propto p(y_k \mid x_k) \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid y_{0:k-1})\, dx_{k-1},
\end{aligned} \tag{2.14}
\]
where $p(y_k \mid y_{0:k-1}, x_k) = p(y_k \mid x_k)$ and $p(x_k \mid y_{0:k-1}, x_{k-1}) = p(x_k \mid x_{k-1})$ both follow from the Markovian property of model (2.1), the denominator $p(y_k \mid y_{0:k-1})$ does not explicitly depend on $x_k$, and $\propto$ means $p(x_k \mid y_{0:k})$ is the normalized version of the right-hand side. The state transition density $p(x_k \mid x_{k-1})$ is induced from the state equation in (2.1) and the distribution of the system noise $u_{k-1}$, and the likelihood $p(y_k \mid x_k)$ is induced from the observation equation in (2.1) and the distribution of the observation noise $v_k$.
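The identity (2.12) is easy to check numerically. In the sketch below (an illustrative example, not from the text), the target $p$ is $N(0,1)$, the importance density $q$ is $N(1,2)$, and $\ell(x) = x^2$, so that $E_p[\ell(x)] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

ell = lambda x: x ** 2                     # E_p[x^2] = 1 for p = N(0,1)
x = rng.normal(1.0, 2.0, size=200_000)     # draw from the importance density q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)   # weights p(x)/q(x)
est = np.mean(w * ell(x))                  # importance-sampling estimate of E_p[ell(x)]
print(est)                                 # close to 1.0
```

In the filtering context the normalizing constant of the target is unknown, which is why the weights in (2.13) are specified only up to proportionality and must be normalized.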
Substituting (2.14) into (2.13), we get
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid y_{0:k})}\, p(x_{k-1}^i \mid y_{0:k-1}). \tag{2.15}
\]
If the importance density $q(x_k \mid y_{0:k})$ is chosen to be factored as
\[
q(x_k \mid y_{0:k}) = q(x_k \mid x_{k-1}, y_k)\, q(x_{k-1} \mid y_{0:k-1}), \tag{2.16}
\]
then
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, y_k)} \frac{p(x_{k-1}^i \mid y_{0:k-1})}{q(x_{k-1}^i \mid y_{0:k-1})} \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, y_k)}\, w_{k-1}^i. \tag{2.17}
\]
Moreover, to avoid sample degeneracy, new samples are resampled i.i.d. from the approximate conditional density $\hat{p}(x_k \mid y_{0:k})$ at each step; hence the weights are reset to $w_{k-1}^i = 1/N$, and (2.17) reduces to
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, y_k)}, \quad i = 1, \ldots, N. \tag{2.18}
\]
In the plain particle filter, the importance density $q(x_k \mid x_{k-1}^i, y_k)$ is chosen to be the state transition density $p(x_k \mid x_{k-1}^i)$, which is independent of the current observation $y_k$, yielding
\[
w_k^i \propto p(y_k \mid x_k^i), \quad i = 1, \ldots, N. \tag{2.19}
\]
The plain particle filter recursively propagates the support points and updates the associated weights. The algorithm is as follows:

Algorithm 2.1 Plain Particle Filter

1. Initialization: Sample $x_0^1, \ldots, x_0^N$ i.i.d. from an initial p.d.f./p.m.f. $p_0$. Set $k = 1$.

2. Importance Sampling/Propagation: Sample $x_k^i$ from $p(x_k \mid x_{k-1}^i)$, $i = 1, \ldots, N$.

3. Bayes' Updating: Receive the new observation $y_k$. The conditional density is approximated by $\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i)$, where $w_k^i$ is computed according to (2.19) and normalized.

4. Resampling: Sample $x_k^1, \ldots, x_k^N$ i.i.d. from $\hat{p}(x_k \mid y_{0:k})$.

5. $k \leftarrow k + 1$ and go to step 2.

For the resampling step, we can similarly use the importance sampling technique to resample from an importance density $g_{k-1}(x_{k-1} \mid y_{0:k-1})$, which we will refer
Hence, in general, the weights are updated according to wik / p(ykjx i k)p(x i kjx i k?1)p(x i k?1jy0:k?1) qk(xikjxik?1;yk)gk?1(xik?1jy0:k?1) ; (2.20) The plain particle fllter (Algorithm 2.2) is a special case of the general particle fllter, with the particular choice of the importance density qk(xkjxk?1;yk) and the resampling importance density gk(xkjy0:k). Algorithm 2.2 General Particle Filter 1. Initialization: Sample xi0;:::;xN0 i.i.d. from an initial p.d.f./p.m.f. p0. Set k = 1. 2. Importance Sampling: Sample xik from qk(xkjxik?1;yk), i = 1;:::;N. 3. Bayes? Updating: Receive new observation yk. The conditional density is ap- proximated by ^p(xkjy0:k) = PNi=1 wik?(xk ?xik), where wik is computed according to (2.20) and normalized. 4. Importance Resampling: Sample x1k;:::;xNk i.i.d. from gk(xkjy0:k). 5. k ? k +1 and go to step 2. It has been proved that ^p(xkjy0:k) converges to p(xkjy0:k) as the sample size N increases to inflnity [23] [54]. However, uniform convergence in time has only been proved for the special case where the system dynamics has a mixing kernel that ensures that any error is forgotten (exponentially) in time. Usually, the particle fllter needs an increasing number of samples as time k increases to ensure a given precision of the approximation ^p(xkjy0:k) for all k. 22 2.4 Fading Memory Particle Filter Particle flltering weights equally all the observations when estimating the cur- rent state. However, if, for example, the system model is inaccurate, the prediction based on all the past observations and the system model may not be a good one, which we have already mentioned in Section 2.2. Bad predictions can be also caused by the simulation in the particle flltering, when the system noise is large and the number of samples is not su?cient. This can also be viewed as model inaccuracy, since the system noise is inaccurately represented by the samples due to simulation. The model inaccuracy degrades the value of past information [44]. 
Therefore, we sometimes want to discount old observations, or equivalently, to put more weight on the more recent observations. This idea of fading memory of old observations has been proposed and studied theoretically and empirically for several decades for the Kalman filter, but rarely in particle filtering. Related work on particle filtering with fading memory [25] [28] [66] addresses the issue by incorporating the fading memory into the model rather than the filter, and the models are tailored for special cases and do not address the general setting.

Our approach is to incorporate the weighted extended Kalman filter to generate the importance density in the particle filter. The weighted extended Kalman filter is based on the Kalman filter with exponential data weighting. The exponential data weighting weights the more recent observations more heavily, and it promotes exponential stability of the Kalman filter. Since the fading memory only affects the locations of the generated particles, the convergence property of the particle filter is preserved. Moreover, we expect that the fading memory can enhance the convergence of the particle filter, as it does for the Kalman filter. Like many improved particle filters, such as the auxiliary particle filter [68], the extended Kalman particle filter [25], and the unscented particle filter [88], the fading memory particle filter also incorporates the current observation into generating the importance density and thus holds the promise of generating better importance densities than the plain particle filter, which takes $p(x_k \mid x_{k-1})$ as the importance density. The fading memory particle filter looks similar to the extended Kalman particle filter [25] [88]; however, the fading memory particle filter uses the weighted extended Kalman filter to generate the importance density instead of the unweighted one.
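Since the WEKF is the building block of the proposed filter, one step of the recursion (2.10) can be sketched as follows (a minimal sketch for the additive-noise model (2.3); the interface is illustrative, not from the text):

```python
import numpy as np

def wekf_step(x_post, P_post, y, f, h, F_jac, H_jac, Q, R, alpha=1.2):
    """One step of the WEKF (2.10): identical to the EKF except that the
    predicted covariance is inflated by alpha^2, where alpha > 1 is the
    forgetting factor, which exponentially discounts old observations."""
    x_pred = f(x_post)
    F = F_jac(x_post)
    P_pred = alpha ** 2 * F @ P_post @ F.T + Q   # the only change vs. the EKF
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new
```

With $\alpha > 1$ the gain $K_k$ stays larger, so the filter leans more on the current observation than on predictions through the (possibly erroneous) model.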
With the simple addition of a forgetting factor, the weighted extended Kalman filter has the advantages of giving the filter a prescribed degree of stability and curing many error problems. Retaining the convergence property of the particle filter, we incorporate the WEKF into the particle filter through the importance density. At time $k$, the WEKF generates a Gaussian approximation, denoted as $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$. Let the importance density be
\[
q(x_k \mid x_{k-1}^i, y_k) = N(\hat{x}_{k|k}, P^{\alpha}_{k|k}), \quad i = 1, \ldots, N, \tag{2.21}
\]
from which the random support points $\{x_k^i\}_{i=1}^N$ are drawn. Substituting (2.21) into the weight updating equation (2.20), and assuming resampling is applied at time $k-1$, we obtain
\[
w_k^i \propto \frac{p(y_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{N(x_k^i \mid \hat{x}_{k|k}, P^{\alpha}_{k|k})}, \quad i = 1, 2, \ldots, N. \tag{2.22}
\]
Now that an approximate conditional density $\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i)$ is obtained, it can be used to update the WEKF mean $\hat{x}_{k|k}$ and variance $P^{\alpha}_{k|k}$ according to
\[
\hat{x}_{k|k} = E_{\hat{p}_k}[x], \quad P^{\alpha}_{k|k} = \mathrm{Var}_{\hat{p}_k}(x),
\]
where $\hat{p}_k$ is short for $\hat{p}(x_k \mid y_{0:k})$. The algorithm is as follows:

Algorithm 2.3 Fading Memory Particle Filter (FMPF)

1. Initialization: Set the forgetting factor $\alpha > 1$. Sample $x_0^1, \ldots, x_0^N$ i.i.d. from the initial density $p_0$. Set $\hat{x}_0 = E_{p_0}[x]$, $P^{\alpha}_0 = \mathrm{Var}_{p_0}(x)$. Set $k = 1$.

2. Importance Sampling:
- Generate the importance density: compute $\hat{x}_{k|k}$ and $P^{\alpha}_{k|k}$ by the WEKF (2.10), using $\hat{x}_{k-1|k-1}$, $P^{\alpha}_{k-1|k-1}$ and $y_k$.
- Sample $x_k^1, \ldots, x_k^N$ i.i.d. from $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$.

3. Bayes' Updating: Receive the new observation $y_k$. The conditional density is approximated by $\hat{p}(x_k \mid y_{0:k}) = \sum_{i=1}^N w_k^i \delta(x_k - x_k^i)$, where $w_k^i$ is computed according to (2.22) and normalized.

4. Resampling: Sample $x_k^1, \ldots, x_k^N$ i.i.d. from $\hat{p}(x_k \mid y_{0:k})$.

5. Parameter Updating: $\hat{x}_{k|k} = E_{\hat{p}_k}[x]$, $P^{\alpha}_{k|k} = \mathrm{Var}_{\hat{p}_k}(x)$.

6. $k \leftarrow k + 1$ and go to step 2.

The FMPF can be viewed from two viewpoints. The obvious viewpoint is that the WEKF is incorporated to improve the PF, since it is used to generate a better importance density for the PF.
From another viewpoint, the PF is incorporated to improve the WEKF, since particles sampled from the WEKF approximated density $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$ are updated to yield an empirical density $\hat{p}(x_k \mid y_{0:k})$, which is used to tune the WEKF parameters $\hat{x}_{k|k}$ and $P^{\alpha}_{k|k}$. Therefore, we expect that the FMPF retains the advantages of both the PF and the WEKF, namely the asymptotic convergence of the PF and the stability of the WEKF. Fading memory is not only a technique of weighting the more recent data more heavily, but also a way to prevent the divergence of the filter (see pp. 137-138, [1] for details). As the exponential data weighting promotes exponential stability of the Kalman filter, our hope is that fading memory also enhances uniform convergence of the particle filter.

Note that the FMPF only needs to compute one importance density $N(\hat{x}_{k|k}, P^{\alpha}_{k|k})$ at time $k$, whereas the extended Kalman particle filter (EKPF) needs to compute a different importance density $N(\hat{x}^i_{k|k}, P^i_{k|k})$ in order to draw each particle $x_k^i$ (see [88], [4] and [25] for details on the EKPF). Hence, the EKPF is much more computationally expensive than the FMPF, especially when the state is multi-dimensional, since the computation of the EKF involves matrix inversions.

We test the FMPF, EKPF, and PPF numerically on the example in [88]:
\[
x_{k+1} = 1 + \sin(0.04\pi k) + \phi x_k + u_k, \quad x_0 = 0,
\]
\[
y_k = \begin{cases} 0.2 x_k^2 + v_k, & k \leq 30, \\ 0.5 x_k - 2 + v_k, & k > 30, \end{cases}
\]
where $\phi$ is a scalar system parameter. The system noise $u_k$ follows a Gamma distribution $\Gamma(3, \lambda)$, where $\lambda$ is the scale parameter. The observation noise $v_k$ follows a Gaussian distribution $N(0, 10^{-5})$. A larger $\lambda$ means a more spread-out distribution and hence implies a larger system noise. We compare the performance of the PPF, EKPF, and FMPF in terms of tracking the true state $x_k$ by the filtered state mean $E[x_k \mid y_{0:k}]$ in two cases:

- Inaccurate models of the system dynamics, i.e., the real system parameter $\phi_r$ is not equal to the model system parameter $\phi_m$.
- Different values of the system noise parameter $\lambda$, with other parameters fixed and $\phi_r = \phi_m$.

For each case, we simulate 100 independent runs of length 60 with random initializations, calculate the mean square error (MSE) of the estimated states for each run, and then calculate the means and standard errors of the MSEs over the 100 runs. All three filters use 200 particles and stratified resampling [74]. The forgetting factor in the FMPF is chosen as $\alpha = 1.2$.

Table 2.1 lists the means and standard errors of the MSEs under different real system parameters $\phi_r$, when the model system parameter is $\phi_m = 0.5$ and the system noise is $u_k \sim \Gamma(3, 1)$ (see Fig. 2.1 for a graphical illustration). As we can see, when the model is accurate, i.e., $\phi_r = 0.5 = \phi_m$, the MSEs are smallest for all three filters, which is consistent with our intuition. The PPF is very sensitive to the inaccuracy of the model parameter, while the EKPF and FMPF are much more robust. The FMPF performs best among all three.

Table 2.2 lists the means and standard errors of the MSEs for different values of the system noise, with the system parameter $\phi_r = \phi_m = 0.5$ (see Fig. 2.2 for a graphical illustration). As the system noise increases, the PPF and EKPF deteriorate quickly, and their performance at $\lambda = 2$ is probably unacceptable, whereas the FMPF performs well in all cases, with only slightly increasing MSEs.

Fig. 2.3 shows a typical run of the estimated states generated by the three filters tracking the true state, with $\lambda = 2$ and $\phi_r = \phi_m = 0.5$. Fig. 2.4 shows a typical run of the estimated states tracking the true state when the system model is inaccurate, where the real system parameter is $\phi_r = 0.6$ and the other parameters are the same as in Fig. 2.3. The PPF apparently misses many points before time $k = 30$ in Fig. 2.3, and even beyond $k = 30$ in Fig. 2.4. The EKPF tends to overshoot when there is a big upward change in the trajectory, such as at times $k = 2, 19, 22, 29$ in Fig. 2.3, and $k = 4, 9, 11, 13, 17, 22, 26, 30$ in Fig. 2.4.
In contrast, the FMPF tracks the true states very closely in both cases, and shows its robustness with respect to the inaccuracy in the model system parameter. All three filters perform better after time $k = 30$ than before $k = 30$, because the observation function is not uniquely invertible before $k = 30$.

Table 2.1: Performances under different real system parameters $\phi_r$, with model system parameter $\phi_m = 0.5$ and system noise $u_k \sim \Gamma(3, 1)$. Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.

      | $\phi_r = 0.3$  | $\phi_r = 0.4$   | $\phi_r = 0.5$   | $\phi_r = 0.6$
PPF   | 3.77 (0.111)    | 1.63 (0.0801)    | 1.48 (0.102)     | 3.62 (0.207)
EKPF  | 0.323 (0.0305)  | 0.290 (0.0294)   | 0.290 (0.0392)   | 0.344 (0.0219)
FMPF  | 0.112 (0.0092)  | 0.0696 (0.0069)  | 0.0630 (0.0124)  | 0.0854 (0.0092)

Table 2.2: Performances under different system noises, with system parameter $\phi_r = \phi_m = 0.5$. Each entry shows the mean and standard error (in parentheses) of the MSEs based on 100 independent runs.

      | $\lambda = 0.5$  | $\lambda = 1$    | $\lambda = 1.5$  | $\lambda = 2$
PPF   | 0.226 (0.025)    | 1.48 (0.102)     | 4.10 (0.271)     | 8.71 (0.412)
EKPF  | 0.0087 (0.0010)  | 0.290 (0.0392)   | 1.38 (0.123)     | 5.18 (0.440)
FMPF  | 0.0086 (0.0010)  | 0.0630 (0.0124)  | 0.192 (0.0265)   | 0.421 (0.0614)

Figure 2.1: A graphical illustration of Table 2.1: performances (mean and standard error of MSEs based on 100 independent runs) under different values of the real system parameter $\phi_r$.

Figure 2.2: A graphical illustration of Table 2.2: performances (mean and standard error of MSEs based on 100 independent runs) under different system noises.
Figure 2.3: A typical run of the estimated states tracking the true state, when the system noise is $u_k \sim \Gamma(3, 2)$ and the system parameter is $\phi_r = \phi_m = 0.5$.

Figure 2.4: A typical run of the estimated states tracking the true state when the system model is inaccurate. The real system parameter is $\phi_r = 0.6$, the model system parameter is $\phi_m = 0.5$, and the system noise is $u_k \sim \Gamma(3, 2)$.

Chapter 3 Stochastic Control

The stochastic control problem involves designing a controller for a stochastic process in order to minimize (or maximize) some cost (or reward) function. In this thesis, we focus on discrete-time problems, namely, Markov decision processes and partially observable Markov decision processes.

3.1 Markov Decision Processes

Consider a stationary discrete-time system model:
\[
x_k = f(x_{k-1}, a_{k-1}, u_{k-1}), \quad k = 1, 2, \ldots, \tag{3.1}
\]
where for all $k$, the state $x_k$ is in a state space $S \subseteq \mathbb{R}^{n_x}$, the action $a_k$ is in an action space $A \subseteq \mathbb{R}^{n_a}$, and the random disturbances $u_k \in D \subseteq \mathbb{R}^{n_u}$ form a sequence of i.i.d. random vectors with known distributions. Assume that $\{u_k\}$ is independent of $x_0$, which follows a distribution $p_0$. For simplicity, we assume that all actions in $A$ are admissible for each state $x \in S$, and that $D$ is a countable set, to avoid mathematical complications. Given an initial state $x_0$, the objective is to find a policy $\pi$ that consists of a sequence of functions $\pi = \{\pi_0, \pi_1, \ldots\}$, where $\pi_k : S \to A$, such that $\pi$ minimizes the infinite-horizon discounted cost function
\[
J_{\pi}(x_0) = \lim_{H \to \infty} E_{\{u_k\}_{k=0}^{H}} \left\{ \sum_{k=0}^{H} \gamma^k g(x_k, \pi_k(x_k), u_k) \right\},
\]
subject to the system equation (3.1).
The one-step cost function $g : S \times A \times D \to \mathbb{R}$ is given, $\gamma \in (0, 1)$ is the discount factor, and $E_{\{u_k\}_{k=0}^{H}}$ denotes the expectation with respect to the joint distribution of $u_0, \ldots, u_H$. For simplicity, we assume that the above limit defining $J_{\pi}(x_0)$ exists. Denote the set of all admissible policies by $\Pi$. The optimal value function is defined by
\[
J^*(x) = \min_{\pi \in \Pi} J_{\pi}(x), \quad \forall x \in S.
\]
An optimal policy, denoted by $\pi^*$, is an admissible policy that achieves $J^*$. A stationary policy is an admissible policy of the form $\pi = \{\mu, \mu, \ldots\}$ (i.e., $\pi_k$ is independent of $k$), referred to as the stationary policy $\mu$ for brevity, and its corresponding value function is denoted by $J_{\mu}$. For any function $J : S \to \mathbb{R}$, the dynamic programming (DP) mapping applied to $J$ is defined as
\[
(TJ)(x) = \min_{a \in A} E_u[g(x, a, u) + \gamma J(f(x, a, u))], \quad x \in S.
\]
Hence, the mapping $T$ transforms the function $J$ on $S$ into the function $TJ$ on $S$. For a given stationary policy $\mu$, the mapping $T_{\mu}$ is defined as
\[
(T_{\mu}J)(x) = E_u[g(x, \mu(x), u) + \gamma J(f(x, \mu(x), u))], \quad x \in S.
\]
Throughout this section, we assume the following:

Assumption 3.1 The one-step cost function $g$ satisfies
\[
|g(x, a, u)| \leq M < \infty, \quad \forall (x, a, u) \in S \times A \times D.
\]
Note that this is the simplest type of infinite-horizon discounted cost MDP. Nonetheless, the boundedness of the one-step cost function is not as restrictive as it seems. The boundedness is satisfied if $S$, $A$, and $D$ are of finite cardinality. Even if they are not, the boundedness is satisfied in many computational methods, because these sets are approximated by finite sets. MDPs involving unbounded one-step cost functions require more complicated technical treatment, which we will not delve into. The following proposition (cf. prop. 1.2.2 [10]) shows that the value function $J^*$ is the unique solution of Bellman's equation.

Proposition 1 The optimal value function $J^*$ satisfies
\[
J^*(x) = \min_{a \in A} E_u[g(x, a, u) + \gamma J^*(f(x, a, u))], \quad x \in S,
\]
or equivalently, $J^* = TJ^*$. Furthermore, $J^*$ is the unique solution of this equation within the class of bounded functions.
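Proposition 1 suggests computing $J^*$ as the fixed point of $T$. For a small finite MDP, where the expectation over $u$ is folded into one transition matrix and one expected-cost vector per action, the DP mapping and its fixed-point iteration can be sketched as follows (the representation and the MDP data are illustrative assumptions of this sketch):

```python
import numpy as np

def dp_mapping(J, P, g, gamma):
    """(TJ)(x) = min_a E_u[g(x, a, u) + gamma * J(f(x, a, u))], with the
    expectation encoded by transition matrices P[a] and cost vectors g[a]."""
    return np.min([g[a] + gamma * P[a] @ J for a in range(len(P))], axis=0)

def fixed_point(P, g, gamma, tol=1e-10):
    """Iterate T starting from J = 0 until numerical convergence to J*."""
    J = np.zeros(P[0].shape[0])
    while True:
        J_next = dp_mapping(J, P, g, gamma)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
```

The resulting $J$ satisfies $J = TJ$ up to the tolerance, i.e., Bellman's equation.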
3.1.1 Value Iteration

Value iteration is an iterative method to compute the value function $J^*$. It is basically the dynamic programming algorithm, and its performance is based on the convergence of dynamic programming, as shown in the next proposition (cf. prop. 1.2.1 in [10]).

Proposition 2 For any bounded function $J : S \to \mathbb{R}$, the value function satisfies
\[
J^*(x) = \lim_{k \to \infty} (T^k J)(x), \quad \forall x \in S.
\]
Value iteration starts with an arbitrary bounded function $J_0(x)$, $\forall x \in S$, and then at each iteration $k$ applies the DP mapping $T$ to the old function $J_k(x)$ to get a new function $J_{k+1}(x)$ for all $x \in S$. More specifically, at each iteration $k$,
\[
J_{k+1}(x) = (TJ_k)(x) = \min_{a \in A} E_u[g(x, a, u) + \gamma J_k(f(x, a, u))], \quad \forall x \in S.
\]
In view of Proposition 2, the value iteration method usually needs an infinite number of iterations to achieve convergence. Hence, it is desirable to have an estimate of the convergence rate. The following convergence rate [10] [73] has been established for any bounded function $J$:
\[
\max_{x \in S} |(T^k J)(x) - J^*(x)| \leq \gamma^k \max_{x \in S} |J(x) - J^*(x)|, \quad k = 0, 1, \ldots.
\]
The value iteration method may be implementable only approximately if the state space is continuous or has a finite but large number of states. Instead of computing $(TJ)(x)$ for all states $x \in S$, $(TJ)(x)$ can be computed for only some of the states and estimated for the other states. The expectation in the DP mapping may also be computed through approximation or simulation.

3.1.2 Policy Iteration

As an alternative to value iteration, the policy iteration method works directly with policies to find an optimal stationary policy. It starts with an arbitrary stationary policy $\mu_0$; at each iteration $k$, it evaluates the current policy $\mu_k$ to obtain the associated value function $J_{\mu_k}$, and then computes a new policy $\mu_{k+1}$ based on $J_{\mu_k}$. The evaluation of the policy $\mu_k$ is based on the following corollary (cf. cor. 1.2.2.1 in [10]), which is derived from Proposition 2.
Corollary 1 For every stationary policy $\mu$, the associated value function satisfies
\[
J_{\mu}(x) = E_u[g(x, \mu(x), u) + \gamma J_{\mu}(f(x, \mu(x), u))], \quad \forall x \in S,
\]
or, equivalently, $J_{\mu} = T_{\mu} J_{\mu}$. Furthermore, $J_{\mu}$ is the unique solution of this equation within the class of bounded functions.

Therefore, the detailed algorithm for the policy iteration method is as follows.

Algorithm 3.1 Policy Iteration Algorithm

1. Initialization: Guess an initial stationary policy $\mu_0$.

2. Policy Evaluation: Evaluate the value function $J_{\mu_k}$ associated with the current stationary policy $\mu_k$ by solving $J_{\mu_k} = T_{\mu_k} J_{\mu_k}$.

3. Policy Improvement: Compute a new stationary policy $\mu_{k+1}$ by applying the DP mapping $T$ to the value function $J_{\mu_k}$, i.e., $T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k}$.

4. If $J_{\mu_k} = T J_{\mu_k}$, stop; otherwise return to the policy evaluation step and repeat the process.

It can be proved that $J_{\mu_{k+1}} \leq J_{\mu_k}$, and the strict inequality $J_{\mu_{k+1}}(x) < J_{\mu_k}(x)$ holds for at least one $x \in S$ if $\mu_k$ is not optimal. Hence, $\mu_{k+1}$ is strictly better than $\mu_k$ if $\mu_k$ is not optimal. For finite state and action spaces, the number of admissible policies is finite, and hence, the policy iteration algorithm terminates after a finite number of iterations to yield an optimal policy.

Similar to the value iteration method, the policy iteration method is implementable only approximately if the state space is continuous or has a finite but large number of states. The policy evaluation step may be approximated using a finite number of value iterations or linear programming. The policy improvement step may be carried out for only some of the states and approximated for the remaining states. In addition to variations of the value iteration and policy iteration methods, there are other approximation methods for solving large-state MDPs, such as the state aggregation approach [11], approximation using basis functions and linear programming [80], and approximation using a post-decision state variable [72].
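Algorithm 3.1 can be sketched for a finite MDP as follows, with the policy evaluation step solved exactly as the linear system $J_{\mu} = g_{\mu} + \gamma P_{\mu} J_{\mu}$ given by Corollary 1 (the representation of the MDP by per-action transition matrices and expected-cost vectors is an illustrative assumption of this sketch):

```python
import numpy as np

def policy_iteration(P, g, gamma):
    """Algorithm 3.1 for a finite MDP, as a sketch.  P[a] is the transition
    matrix and g[a] the expected one-step cost vector under action a."""
    n = P[0].shape[0]
    mu = np.zeros(n, dtype=int)                     # step 1: initial policy
    while True:
        # Step 2, policy evaluation: solve J = g_mu + gamma * P_mu J.
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = np.array([g[mu[x]][x] for x in range(n)])
        J = np.linalg.solve(np.eye(n) - gamma * P_mu, g_mu)
        # Step 3, policy improvement: greedy action with respect to J_mu.
        Q = np.array([g[a] + gamma * P[a] @ J for a in range(len(P))])
        mu_next = np.argmin(Q, axis=0)
        # Step 4: stop when the policy is unchanged, i.e. J_mu = T J_mu.
        if np.array_equal(mu_next, mu):
            return mu, J
        mu = mu_next
```

At termination the returned $J_{\mu}$ satisfies Bellman's equation $J_{\mu} = T J_{\mu}$, so $\mu$ is optimal.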
3.1.3 Simulation-Based Methods

The above methods are applicable only when the system model and cost function are available. However, sometimes the system model is not available, but instead, the system and cost structure can be simulated. The above methods can still be applied in this case, with the help of Monte Carlo simulation to approximately calculate the transition probabilities and cost functions. Another branch of simulation-based methods is so-called reinforcement learning [86] or neuro-dynamic programming, such as temporal difference learning [85] and Q-learning [89]. Reinforcement learning methods learn the system model (and sometimes a policy) by repeatedly simulating the system using the current policy and improving the current policy. Some of the recent simulation-based methods concern efficient allocation of the simulation budget by focusing on the more promising actions or policies, such as the multi-stage adaptive sampling algorithm and the population-based evolutionary approaches, both described in [22].

3.2 Partially Observable Markov Decision Processes

Consider a stationary discrete-time continuous-state system model:
\[
x_k = f(x_{k-1}, a_{k-1}, u_{k-1}), \quad k = 1, 2, \ldots, \tag{3.2}
\]
\[
y_k = h(x_k, a_{k-1}, v_k), \quad k = 1, 2, \ldots, \qquad y_0 = h_0(x_0, v_0), \tag{3.3}
\]
where for all $k$, the state $x_k$ is in a continuous state space $S \subseteq \mathbb{R}^{n_x}$, the action $a_k$ is in an action space $A \subseteq \mathbb{R}^{n_a}$, the observation $y_k$ is in a continuous observation space $O \subseteq \mathbb{R}^{n_y}$, and the random disturbances $u_k \in \mathbb{R}^{n_u}$ and $v_k \in \mathbb{R}^{n_v}$ form sequences of i.i.d. random vectors with known distributions. Assume that $\{u_k\}$ and $\{v_k\}$ are independent of each other, and independent of $x_0$, which follows a distribution $p_0$. Also assume that $f(x, a, u)$ is continuous in $x$ for every $a \in A$ and $u \in \mathbb{R}^{n_u}$, $h(x, a, v)$ is continuous in $x$ for every $a \in A$ and $v \in \mathbb{R}^{n_v}$, and $h_0(x, v)$ is continuous in $x$ for every $v \in \mathbb{R}^{n_v}$. Eq. (3.2) is often called the state equation, and (3.3) the observation equation.
All the information available to the decision maker at time $k$ can be summarized by means of an information vector $I_k$, defined as
\[ I_k = (y_0, y_1, \ldots, y_k, a_0, a_1, \ldots, a_{k-1}), \quad k = 1, 2, \ldots; \qquad I_0 = y_0. \]
The objective is to find a policy $\pi$ consisting of a sequence of functions $\pi = \{\mu_0, \mu_1, \ldots\}$, where each function $\mu_k$ maps the information vector $I_k$ into the action space $A$, that minimizes the cost function
\[ J_\pi = \lim_{H \to \infty} E_{x_0, \{u_k\}_{k=0}^{H-1}, \{v_k\}_{k=0}^{H}} \left\{ \sum_{k=0}^{H} \gamma^k g(x_k, \mu_k(I_k)) \right\}, \]
where $g : S \times A \to \mathbb{R}$ is the one-step cost function, $\gamma \in (0,1)$ is the discount factor, and $E_{x_0, \{u_k\}_{k=0}^{H-1}, \{v_k\}_{k=0}^{H}}$ denotes the expectation with respect to the joint distribution of $x_0, u_0, \ldots, u_{H-1}, v_0, \ldots, v_H$. For simplicity, we assume that the above limit defining $J_\pi$ exists. The value function is defined by
\[ J^* = \min_{\pi \in \Pi} J_\pi, \]
where $\Pi$ is the set of all admissible policies. An optimal policy, denoted by $\pi^*$, is an admissible policy that achieves $J^*$. A stationary policy is an admissible policy of the form $\pi = \{\mu, \mu, \ldots\}$, referred to as the stationary policy $\mu$ for brevity, and its corresponding value function is denoted by $J_\mu$.

3.2.1 Belief MDPs

The information vector $I_k$ grows as the history expands. The standard approach to encoding historical information is the use of the belief state, which is the conditional probability density of the current state $x_k$ given the past history, i.e., $b_k : S \to [0, \infty)$:
\[ b_k(x) \triangleq p(x_k = x \mid I_k). \]
Given our assumptions on (3.2) and (3.3), $b_k$ exists, and can be computed recursively via Bayes' rule:
\begin{align*}
b_k(x) &= p(x_k = x \mid I_{k-1}, a_{k-1}, y_k) \\
&\propto p(y_k \mid x_k = x, a_{k-1}) \int_S p(x_k = x \mid I_{k-1}, a_{k-1}, x_{k-1}) \, p(x_{k-1} \mid I_{k-1}, a_{k-1}) \, dx_{k-1} \\
&\propto p(y_k \mid x_k = x, a_{k-1}) \int_S p(x_k = x \mid a_{k-1}, x_{k-1}) \, b_{k-1}(x_{k-1}) \, dx_{k-1}, \tag{3.4}
\end{align*}
where the second line follows from the Markovian property; $\propto$ means "proportional to," because the denominator $p(y_k \mid I_{k-1}, a_{k-1})$ does not depend on $x$; and the third line follows from the Markovian property of $\{x_k\}$ and the fact that $a_{k-1}$ is a function of $I_{k-1}$ given a policy.
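On a finite grid of states, one step of the Bayes recursion (3.4) is a matrix-vector product followed by a pointwise multiplication and normalization. The sketch below assumes the transition density and likelihood have already been evaluated on the grid; it is an illustration of the recursion, not an algorithm from the thesis.

```python
import numpy as np

def belief_update(b_prev, trans, lik_y):
    """One step of the Bayes recursion (3.4) on a state grid.

    b_prev : belief over grid points at time k-1 (sums to 1)
    trans  : trans[i, j] = p(x_k = grid[j] | x_{k-1} = grid[i], a_{k-1})
    lik_y  : lik_y[j]    = p(y_k | x_k = grid[j], a_{k-1})
    """
    predicted = b_prev @ trans               # integral over x_{k-1}
    unnormalized = lik_y * predicted         # multiply by the likelihood
    return unnormalized / unnormalized.sum() # divide by p(y_k | I_{k-1}, a_{k-1})
```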
The right-hand side of (3.4) can be expressed in terms of $b_{k-1}$, $a_{k-1}$ and $y_k$. Hence,
\[ b_k = \tau(b_{k-1}, a_{k-1}, y_k), \tag{3.5} \]
where $y_k$ is characterized by the time-homogeneous conditional distribution $P_Y(y_k \mid b_{k-1})$ that is induced by (3.2) and (3.3), and does not depend on $\{y_0, \ldots, y_{k-1}\}$.

A POMDP can be converted to an MDP by conditioning on the information vectors, and the converted MDP is called the belief MDP (Chapter 5, [10]). The states of the belief MDP are the belief states, which follow the system dynamics (3.5), where $y_k$ can be seen as the system noise with distribution $P_Y$. The state space of the belief MDP is the belief space, denoted by $B$, which is the set of all belief states, i.e., a set of probability densities. A policy $\pi$ is a sequence of functions $\pi = \{\mu_0, \mu_1, \ldots\}$, where each function $\mu_k$ maps the belief state $b_k$ into the action space $A$. Notice that
\[ E_{x_0, \{u_i\}_{i=0}^{k-1}, \{v_i\}_{i=0}^{k}} \{ g(x_k, a_k) \} = E\{ E_{x_k}\{ g(x_k, a_k) \mid I_k \} \}; \]
thus the one-step cost function can be written in terms of the belief state as the belief one-step cost function
\[ \tilde{g}(b_k, a_k) \triangleq E_{x_k}\{ g(x_k, a_k) \mid I_k \} = \int_{x \in S} g(x, a_k) b_k(x) \, dx \triangleq \langle g(\cdot, a), b \rangle. \]
Assuming there exists a stationary optimal policy, the optimal value function is given by
\[ J^*(b) = \lim_{k \to \infty} T^k J(b), \quad \forall b \in B, \]
where $T$ is the dynamic programming (DP) mapping, which operates on any bounded function $J : B \to \mathbb{R}$ according to
\[ TJ(b) = \min_{a \in A} \left[ \langle g(\cdot, a), b \rangle + \gamma E_Y\{ J(\tau(b, a, Y)) \} \right], \tag{3.6} \]
where $E_Y$ denotes the expectation with respect to the distribution $P_Y$.

3.2.2 Finite-State vs. Continuous-State

For finite-state POMDPs, the belief state $b$ is a vector with each entry being the probability of being at one of the states, and hence the belief space $B$ is a finite-dimensional probability simplex. Past research on numerical solutions of POMDPs has mostly focused on finite-state problems. For finite-state POMDPs, it is proved that the value function is a piecewise linear convex function after a finite number of iterations, provided that the one-step cost function is piecewise linear and convex [81].
This feature has been exploited in various exact and approximate algorithms, such as those found in [82], [81], [83], [57], and [36]. The algorithm in [81] and the Witness algorithm in [57] carry out exact value iteration by finding and updating the minimum set of linear functions that determine the value function at each iteration for a finite-horizon problem. Because the number of such linear functions grows exponentially with the number of iterations, these algorithms are computationally very expensive and are limited to very simple problems in practice. Howard [39] and Sondik [83] use policy iteration to solve infinite-horizon discounted-cost problems. Hauskrecht [36] summarizes several value function approximation methods in a nice framework of modifying the value iteration equation by changing the order of summations and maximization, including approximation with the fully observable MDP, approximation with Q-functions, the fast informed bound method, and approximation with the unobservable MDP. [36] also summarizes many grid-based methods with various techniques for choosing the set of grid points and approximating the value function.

For continuous-state POMDPs, the belief state $b$ is a continuous density, and thus the belief space $B$ is an infinite-dimensional space that contains all sorts of continuous densities. For continuous-state POMDPs, the value function preserves convexity [91], but value iteration algorithms are not directly applicable because the belief space is infinite dimensional. The infinite-dimensionality of the belief space also creates difficulties in applying the approximate algorithms that were developed for finite-state POMDPs. For example, one straightforward and commonly used approach is to approximate a continuous-state POMDP by a finite-state one via discretization of the state space.
In practice, this can lead to computational difficulties, resulting either in a belief space of huge dimension or in a solution that is not sufficiently accurate. In addition, note that even for a relatively nice prior distribution $b_k$ (e.g., a Gaussian distribution), the exact evaluation of the posterior distribution $b_{k+1}$ is computationally intractable; moreover, the update $b_{k+1}$ may not have any structure, and therefore can be very difficult to handle. Past research is relatively sparse on numerically solving continuous-state POMDPs.

Chapter 4
Solving Continuous-State POMDPs

4.1 Related Work and Motivation

As described at the end of the last chapter, a POMDP can be converted to a continuous-state Markov decision process (MDP) by introducing the notion of the belief state [10], which is the conditional distribution of the current state given the history of observations and actions. For a finite-state POMDP, the belief space is finite dimensional (i.e., a probability simplex), whereas for a continuous-state POMDP, the belief space is an infinite-dimensional space of continuous probability distributions. This difference suggests that simple generalizations of many of the finite-state algorithms to continuous-state models are not appropriate or applicable. For example, discretization of the continuous state space may result in a finite-state POMDP of dimension either too large to solve computationally or not sufficiently precise. As another example, many algorithms for solving finite-state POMDPs (see [36] for a survey) are based on discretization of the finite-dimensional probability simplex; however, it is usually not feasible to discretize an infinite-dimensional space of probability distributions. Throughout this chapter, when we use the word "dimension" or "dimensional," we refer to the dimension of the belief space/state.
Despite the abundance of algorithms for finite-state POMDPs, the aforementioned difficulty has motivated some researchers to look for efficient algorithms for continuous-state POMDPs [69], [70], [87], [75], [17], [18]. Assuming discrete observation and action spaces, Porta et al. [69] showed that the optimal finite-horizon value function is defined by a finite set of "α-functions," and modeled all functions of interest by Gaussian mixtures. In later work [70], they extended their result and method to continuous observation and action spaces using sampling strategies. However, the number of Gaussian mixture components used to represent belief states and α-functions grows exponentially as the number of value iterations increases. Thrun [87] addressed continuous-state POMDPs using particle filtering to simulate the propagation of belief states, representing each belief state by a finite number of samples. The number of samples determines the dimension of the belief space, and the dimension may need to be very high in order to approximate the belief states closely. Roy [75] and Brooks et al. [17] used sufficient statistics to reduce the dimension of the belief space, an approach often referred to as belief compression in the artificial intelligence literature. Roy [75] proposed an augmented MDP (AMDP), using the maximum likelihood state and entropy to characterize belief states; these are usually not sufficient statistics except for a linear Gaussian model. As shown by Roy himself, the algorithm fails in a simple robot navigation problem, since the two statistics are not sufficient for distinguishing between a unimodal distribution and a bimodal distribution. Brooks et al. [17] proposed a parametric POMDP, representing the belief state as a Gaussian distribution so as to convert the POMDP to a problem of computing the value function over a two-dimensional continuous space, and using the extended Kalman filter to estimate the transition of the approximated belief state.
The restriction to the Gaussian representation has the same problem as the AMDP. The algorithm recently proposed by Brooks and Williams [18] is similar to ours, in that they also approximate the belief state by a parameterized density and solve the approximate belief MDP on the parameter space using Monte Carlo simulation-based methods. However, they did not specify how to calculate the parameters except for Gaussian densities, whereas we explicitly provide an analytical way to calculate the parameters for exponential families of densities. Moreover, we develop rigorous theoretical error bounds for our algorithm. There are some other belief compression algorithms designed for finite-state POMDPs, such as value-directed compression [71] and exponential family principal components analysis (E-PCA) belief compression [76]. They are not suitable for generalization to continuous-state models, since they are based on a fixed set of support points.

Motivated by the work of [87], [75] and [17], we develop a computationally tractable algorithm that effectively reduces the dimension of the belief state and has the flexibility to represent arbitrary belief states, such as multimodal or heavy-tailed distributions. The idea is to project the original high-/infinite-dimensional belief space onto a low-dimensional family of parameterized distributions by minimizing the Kullback-Leibler (KL) divergence between the belief state and its projection on that family of distributions. For an exponential family, the minimization of the KL divergence can be carried out in analytical form, making the method very easy to implement. The projected belief MDP can then be solved on the parameter space by using simulation-based algorithms, or can be further approximated by a finite-state MDP via a suitable discretization of the parameter space and thus solved by using standard solution techniques such as value iteration and policy iteration.
Our method can be viewed as a generalization of the AMDP in [75] and the parametric POMDP in [17], which consider only the family of Gaussian distributions. In addition, we provide theoretical results on the error bound of the value function and the performance of the near-optimal policy generated by our method. We also develop a projection particle filter for online filtering and decision making, by incorporating the density projection technique into particle filtering. The projection particle filter we propose here is a modification of the projection particle filter in [5]. Unlike in [5], where the predicted conditional density is projected, we project the updated conditional density, so as to ensure that the projected belief state remains in the given family of densities. Although this is a seemingly small modification of the algorithm, we prove a similar bound on the error between our projection particle filter and the exact filter under much less restrictive assumptions.

4.2 Dimension Reduction

4.2.1 Density Projection

Density projection is a useful idea for approximating an arbitrary (most likely, infinite-dimensional) density as accurately as possible by a density in a chosen family that is characterized by only a few parameters. A projection mapping from the belief space $B$ to a family of parameterized densities $\Omega$, denoted as $\mathrm{Proj}_\Omega : B \to \Omega$, is defined by
\[ \mathrm{Proj}_\Omega(b) \triangleq \arg\min_{f \in \Omega} D_{KL}(b \,\|\, f), \quad b \in B, \tag{4.1} \]
where $D_{KL}(b \,\|\, f)$ denotes the Kullback-Leibler (KL) divergence (or relative entropy) between $b$ and $f$, which is
\[ D_{KL}(b \,\|\, f) \triangleq \int b(x) \log \frac{b(x)}{f(x)} \, dx. \tag{4.2} \]
Hence, the projection of $b$ on $\Omega$ has the minimum KL divergence from $b$ among all the densities in $\Omega$. When $\Omega$ is an exponential family of densities, the minimization (4.1) has an analytical solution and can be carried out easily. The exponential families include many common families of densities, such as Gaussian, binomial, Poisson, Gamma, etc.
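For the Gaussian family, the analytical solution of (4.1) reduces to matching the expectations of the sufficient statistics $(x, x^2)$, i.e., matching the mean and variance of $b$. A minimal sketch for a weighted empirical density in one dimension (an illustration, not the thesis's code):

```python
import numpy as np

def project_to_gaussian(samples, weights):
    """KL projection of a weighted empirical density onto the 1-D Gaussian family.

    For a Gaussian exponential family the sufficient statistics are
    c(x) = (x, x^2), so minimizing the KL divergence amounts to matching
    the first two moments of b. Returns the matched (mean, variance).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mean = np.sum(weights * samples)                 # match E_b[x]
    var = np.sum(weights * (samples - mean) ** 2)    # match E_b[x^2] via the variance
    return mean, var
```

Even when $b$ is multimodal, the projection is well defined; it simply returns the member of the family with the same matched moments.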
An exponential family of densities is defined as follows [6]:

Definition 4.1 Let $\{c_1(\cdot), \ldots, c_m(\cdot)\}$ be affinely independent scalar functions defined on $\mathbb{R}^n$, i.e., for distinct points $x_1, \ldots, x_{m+1}$, $\sum_{i=1}^{m+1} \lambda_i c(x_i) = 0$ and $\sum_{i=1}^{m+1} \lambda_i = 0$ imply $\lambda_1 = \cdots = \lambda_{m+1} = 0$, where $c(x) = [c_1(x), \ldots, c_m(x)]^T$. Assuming that $\Theta_0 = \{\theta \in \mathbb{R}^m : \psi(\theta) = \log \int \exp(\theta^T c(x)) \, dx < \infty\}$ is a convex set with a nonempty interior, then $\Omega$ defined by
\[ \Omega = \{ f(\cdot, \theta) : \theta \in \Theta \}, \qquad f(x, \theta) = \exp[\theta^T c(x) - \psi(\theta)], \]
where $\Theta \subseteq \Theta_0$ is open, is called an exponential family of probability densities, with $\theta$ its parameter and $c(x)$ its sufficient statistic.

Substituting $f(x) = f(x, \theta)$ into (4.2) and expressing it further as
\[ D_{KL}(b \,\|\, f(\cdot, \theta)) = \int b(x) \log b(x) \, dx - \int b(x) \log f(x, \theta) \, dx, \]
we can see that the first term does not depend on $f(\cdot, \theta)$; hence minimizing $D_{KL}(b \,\|\, f(\cdot, \theta))$ is equivalent to
\[ \max_\theta \int b(x) \log f(x, \theta) \, dx, \]
which by Definition 4.1 is the same as
\[ \max_\theta \int \left( \theta^T c(x) - \psi(\theta) \right) b(x) \, dx. \tag{4.3} \]
Recall the fact that the log-likelihood $l(\theta) = \theta^T c(x) - \psi(\theta)$ is strictly concave in $\theta$ [55]; therefore, $\int (\theta^T c(x) - \psi(\theta)) b(x) \, dx$ is also strictly concave in $\theta$. Hence, (4.3) has a unique maximum, and the maximum is achieved when the first-order optimality condition is satisfied, i.e.,
\[ \int \left( c_j(x) - \frac{\int c_j(x) \exp(\theta^T c(x)) \, dx}{\int \exp(\theta^T c(x)) \, dx} \right) b(x) \, dx = 0. \]
Therefore, $b$ and its projection $f(\cdot, \theta)$ are related by
\[ E_b[c_j(X)] = E_\theta[c_j(X)], \quad j = 1, \ldots, m, \tag{4.4} \]
where $E_b$ and $E_\theta$ denote the expectations with respect to $b$ and $f(\cdot, \theta)$, respectively.

4.2.2 Projected Belief MDP

Using the idea of density projection, we can transform the belief MDP to another MDP confined to a low-dimensional belief space, and then solve this MDP problem. We call such an MDP the projected belief MDP. Its state is the projected belief state $b^p_k \in \Omega$
that satisfies the system dynamics
\[ b^p_0 = \mathrm{Proj}_\Omega(b_0); \qquad b^p_k = \tau(b^p_{k-1}, a_{k-1}, y_k)^p, \quad k = 1, 2, \ldots, \]
where $\tau(b^p_{k-1}, a_{k-1}, y_k)^p = \mathrm{Proj}_\Omega(\tau(b^p_{k-1}, a_{k-1}, y_k))$, and the dynamic programming mapping on the projected belief MDP is
\[ T^p J(b^p) = \min_{a \in A} \left[ \langle g(\cdot, a), b^p \rangle + \gamma E_Y \{ J(\tau(b^p, a, Y)^p) \} \right]. \tag{4.5} \]
For the projected belief MDP, a policy is denoted as $\pi^p = \{\mu^p_0, \mu^p_1, \ldots\}$, where each function $\mu^p_k$ maps the projected belief state $b^p_k$ into the action space $A$. Similarly, a stationary policy is denoted as $\mu^p$; an optimal stationary policy is denoted as $\mu^{p*}$; and the optimal value function is denoted as $J^{p*}(b^p)$.

The projected belief MDP is in fact a low-dimensional continuous-state MDP, and can be solved in numerous ways. One common approach is to use value iteration or policy iteration by converting the projected belief MDP to a discrete-state MDP problem via a suitable discretization of the projected belief space (i.e., the parameter space) and then estimating the one-step cost function and transition probabilities on the discretized mesh. The effect of the discretization procedure on dynamic programming has been studied in [9]. We describe this approach in detail below.

Discretization of the projected belief space $\Omega$ is equivalent to discretization of the parameter space $\Theta$, which yields a set of grid points, denoted by $G = \{\theta^i, i = 1, \ldots, N\}$. Let $\tilde{g}(\theta^i, a)$ denote the one-step cost function associated with taking action $a$ at the projected belief state $b^p = f(\cdot, \theta^i)$. Let $\tilde{P}(\theta^i, a)(\theta^j)$ denote the transition probability from the current projected belief state $b^p_k = f(\cdot, \theta^i)$ to the next projected belief state $b^p_{k+1} = f(\cdot, \theta^j)$ under action $a$. Estimation of $\tilde{P}(\theta^i, a)(\theta^j)$ is done using a variation of the projection particle filtering algorithm, to be described in the next section. $\tilde{g}(\theta^i, a)$ can be estimated by its sample mean:
\[ \tilde{g}(\theta^i, a) = \frac{1}{N} \sum_{j=1}^{N} g(x^j, a), \tag{4.6} \]
where $x^1, \ldots, x^N$ are sampled i.i.d. from $f(\cdot, \theta^i)$.
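Once $\tilde{g}$ and $\tilde{P}$ have been estimated on the grid, the discretized projected belief MDP can be solved by standard value iteration. A minimal sketch (the array layout is an assumption for illustration):

```python
import numpy as np

def solve_projected_mdp(g_tilde, P_tilde, gamma, n_iter=500):
    """Value iteration on the discretized projected belief MDP.

    g_tilde[i, a]    : estimated one-step cost at grid point theta^i, eq. (4.6)
    P_tilde[a][i, j] : estimated transition probability theta^i -> theta^j under a
    Returns the value function on the grid and a greedy stationary policy.
    """
    n_grid, n_actions = g_tilde.shape
    J = np.zeros(n_grid)
    for _ in range(n_iter):
        # One application of the mapping T^p restricted to the grid, cf. (4.5).
        Q = g_tilde + gamma * np.stack(
            [P_tilde[a] @ J for a in range(n_actions)], axis=1)
        J = Q.min(axis=1)
    return J, Q.argmin(axis=1)
```

Since $\gamma < 1$, the iteration is a contraction and converges geometrically to the fixed point on the grid.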
Remark 4.1 The approach for solving the projected belief MDP described here is probably the most intuitive, but not necessarily the most computationally efficient. Other more efficient techniques for solving continuous-state MDPs can be used to solve the projected belief MDP, such as the linear programming approach [27], neuro-dynamic programming methods [12], and simulation-based methods [22].

4.3 Projection Particle Filtering

Solving the projected belief MDP gives us a policy, which tells us what action to take at each projected belief state. In an online implementation, at each time $k$, the decision maker receives a new observation $y_k$, estimates the belief state $b_k$, and then chooses the action $a_k$ according to $b_k$ and that policy. Hence, implementing our approach requires addressing the problem of estimating the belief state. Estimation of $b_k$, simply called filtering, does not have an analytical solution in most cases except for linear Gaussian systems, but it can be approached using many approximation methods, such as the extended Kalman filter and particle filtering. Here we focus on particle filtering, because 1) it outperforms the extended Kalman filter in many nonlinear/non-Gaussian systems [4], and 2) we will develop a projection particle filter to be used in conjunction with the projected belief MDP.

To obtain a reasonable approximation of the belief state, particle filtering needs a large number of samples/particles. Since the number of samples/particles is the dimension of the approximate belief state $\hat{b}$, particle filtering is not very helpful in reducing the dimensionality of the belief space. Moreover, particle filtering does not give us an approximate belief state in the projected belief space $\Omega$; hence the policy we obtained by solving the projected belief MDP is not immediately applicable. We incorporate the idea of density projection into particle filtering, so as to approximate the belief state by a density in $\Omega$.
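One step of this idea (predict, reweight by the likelihood, project by moment matching, resample from the projected density) can be sketched for a scalar Gaussian family. The functions `f_dyn` and `lik` are placeholders for the problem's state equation and observation likelihood; this is an illustration under those assumptions, not the thesis's implementation.

```python
import numpy as np

def ppf_step(theta_hat, y, f_dyn, lik, N=500, rng=None):
    """One projection-particle-filtering step for a scalar Gaussian family.

    theta_hat : (mean, var) of the current projected belief f(., theta_{k-1})
    y         : new observation y_k
    f_dyn     : propagates an array of states through the state equation
    lik       : lik(y, x) evaluates p(y_k | x_k) on an array of states
    """
    rng = rng or np.random.default_rng(0)
    mean, var = theta_hat
    # Resampling from the projected density (Step 5 of the previous iteration).
    x = rng.normal(mean, np.sqrt(var), size=N)
    # Prediction: propagate particles through the state equation.
    x_pred = f_dyn(x, rng)
    # Bayes' updating: importance weights from the likelihood.
    w = lik(y, x_pred)
    w = w / w.sum()
    # Projection: match the Gaussian sufficient statistics, cf. (4.4).
    new_mean = np.sum(w * x_pred)
    new_var = np.sum(w * (x_pred - new_mean) ** 2)
    return new_mean, new_var
```

Because the resampling draws from the continuous projected density rather than from a discrete empirical measure, repeated particles do not accumulate from step to step.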
The projection particle filter we propose here is a modification of the one in [5]. Their projection particle filter projects the empirical predicted belief state, not the empirical updated belief state, onto a parametric family of densities, and hence the approximate belief state might not be in that family after Bayes' updating. We will directly project the empirical updated belief state onto a parametric family. In addition, we will need much less restrictive assumptions than [5] to obtain similar error bounds. Since resampling is from a continuous distribution instead of an empirical (discrete) one, the proposed projection particle filter also overcomes the difficulty of sample impoverishment [4] that occurs in the bootstrap filter.

Applying the density projection technique described in the last section, projecting the empirical belief state $\hat{b}_k$ onto an exponential family $\Omega$ involves finding an $f(\cdot, \theta)$ with the parameter $\theta$ satisfying (4.4). Hence, letting $b = \hat{b}_k$ in (4.4) and plugging in (2.11), $\theta$ should satisfy
\[ \sum_{i=1}^{N} w^i c_j(x^i_{k|k-1}) = E_\theta[c_j], \quad j = 1, \ldots, m, \]
which constitutes the projection step in the projection particle filter.

Algorithm 4.1 (Projection particle filtering for an exponential family of densities (PPF))

- Input: a (stationary) policy $\mu^p$ on the projected belief MDP; an exponential family of densities $\Omega = \{f(\cdot, \theta) : \theta \in \Theta\}$; a sequence of observations $y_1, y_2, \ldots$ arriving sequentially at times $k = 1, 2, \ldots$.
- Output: a sequence of approximate belief states $f(\cdot, \hat{\theta}_1), f(\cdot, \hat{\theta}_2), \ldots$.
- Step 1. Initialization: Sample $x^1_0, \ldots, x^N_0$ i.i.d. from the approximate initial belief state $f(\cdot, \hat{\theta}_0)$. Set $k = 1$.
- Step 2. Prediction: Compute $x^1_{k|k-1}, \ldots, x^N_{k|k-1}$ by propagating $x^1_{k-1}, \ldots, x^N_{k-1}$ according to the system dynamics (3.2) using the action $a_{k-1} = \mu^p(f(\cdot, \hat{\theta}_{k-1}))$ and randomly generated noise $\{u^i_{k-1}\}_{i=1}^N$, i.e., sample $x^i_{k|k-1}$ from $p(\cdot \mid x^i_{k-1}, a_{k-1})$, $i = 1, \ldots, N$.
- Step 3. Bayes' updating: Receive a new observation $y_k$.
Calculate the weights as
\[ w^i_k = \frac{p(y_k \mid x^i_{k|k-1}, a_{k-1})}{\sum_{i=1}^{N} p(y_k \mid x^i_{k|k-1}, a_{k-1})}, \quad i = 1, \ldots, N. \]
- Step 4. Projection: The approximate belief state is $f(\cdot, \hat{\theta}_k)$, where $\hat{\theta}_k$ satisfies the equations
\[ \sum_{i=1}^{N} w^i_k c_j(x^i_{k|k-1}) = E_{\hat{\theta}_k}[c_j], \quad j = 1, \ldots, m. \]
- Step 5. Resampling: Sample $x^1_k, \ldots, x^N_k$ from $f(\cdot, \hat{\theta}_k)$.
- Step 6. $k \leftarrow k + 1$ and go to Step 2.

In an online implementation, at each time $k$, the PPF approximates $b_k$ by $f(\cdot, \hat{\theta}_k)$, and then decides an action $a_k$ according to $a_k = \mu^p(f(\cdot, \hat{\theta}_k))$, where $\mu^p$ is an optimal policy solved for the projected belief MDP. As mentioned in the last section, the PPF can be varied slightly for estimating the transition probabilities of the discretized projected belief MDP, as follows:

Algorithm 4.2 (Estimation of the transition probabilities)

- Input: $\theta^i$, $a$, $N$.
- Output: $\tilde{P}(\theta^i, a)(\theta^j)$, $j = 1, \ldots, N$.
- Step 1. Sampling: Sample $x^1, \ldots, x^N$ from $f(\cdot, \theta^i)$.
- Step 2. Prediction: Compute $\tilde{x}^1, \ldots, \tilde{x}^N$ by propagating $x^1, \ldots, x^N$ according to the system dynamics (3.2) using the action $a$ and randomly generated noise $\{u^i\}_{i=1}^N$.
- Step 3. Sampling observations: Compute $y^1, \ldots, y^N$ from $\tilde{x}^1, \ldots, \tilde{x}^N$ according to the observation equation (3.3) using randomly generated noise $\{v^i\}_{i=1}^N$.
- Step 4. Bayes' updating: For each $y^k$, $k = 1, \ldots, N$, the updated belief state is
\[ \tilde{b}^k = \sum_{i=1}^{N} w^k_i \, \delta(x - \tilde{x}^i), \quad \text{where } w^k_i = \frac{p(y^k \mid \tilde{x}^i, a)}{\sum_{i=1}^{N} p(y^k \mid \tilde{x}^i, a)}, \quad i = 1, \ldots, N. \]
- Step 5. Projection: For $k = 1, \ldots, N$, project each $\tilde{b}^k$ onto the exponential family, i.e., find the $\tilde{\theta}^k$ that satisfies (4.4).
- Step 6. Estimation: For $k = 1, \ldots, N$, find the nearest neighbor of $\tilde{\theta}^k$ in $G$. For each $\theta^j \in G$, set $\tilde{P}(\theta^i, a)(\theta^j) = (\text{number of } \tilde{\theta}^k \text{ whose nearest neighbor is } \theta^j)/N$.

4.4 Analysis of the Projected Belief MDP

4.4.1 Main Idea

Our method solves the projected belief MDP instead of the original belief MDP, and that raises two questions: How well does the optimal value function of the projected belief MDP approximate the optimal value function of the original belief MDP?
How well does the optimal policy obtained by solving the projected belief MDP perform on the original belief MDP? To answer these questions, we first need to rephrase them mathematically. Here we assume perfect computation of the belief states and the projected belief states. We also assume the existence of an optimal policy that is stationary, as stated below.

Assumption 4.1 There exist a stationary optimal policy for the belief MDP, denoted by $\mu^*$, and a stationary optimal policy for the projected belief MDP, denoted by $\mu^{p*}$.

Assumption 4.1 holds under some mild conditions [10], [37]. Using the stationarity, and the dynamic programming mappings on the belief MDP and the projected belief MDP given by (3.6) and (4.5), the optimal value function $J^*(b)$ for the belief MDP can be obtained by
\[ J^*(b) \triangleq J_{\mu^*}(b) = \lim_{k \to \infty} T^k J_0(b), \]
and the optimal value function for the projected belief MDP by
\[ J^{p*}(b^p) \triangleq J^p_{\mu^{p*}}(b^p) = \lim_{k \to \infty} (T^p)^k J_0(b^p). \]
Therefore, the questions posed at the beginning of this section can be formulated mathematically as:

1. How well the optimal value function of the projected belief MDP approximates the true optimal value function can be measured by $|J^*(b) - J^{p*}(b^p)|$.

2. How well the optimal policy $\mu^{p*}$ for the projected belief MDP performs on the original belief space can be measured by $|J^*(b) - J_{\bar{\mu}^{p*}}(b)|$, where $\bar{\mu}^{p*}(b) \triangleq \mu^{p*} \circ \mathrm{Proj}_\Omega(b) = \mu^{p*}(b^p)$.

4.4.2 Error Bound

The next assumption bounds the difference between the belief state $b$ and its projection $b^p$, and also the difference between their one-step evolutions $\tau(b, a, y)$ and $\tau(b^p, a, y)^p$. It is an assumption on the projection error.

Assumption 4.2 There exists $\epsilon_1 > 0$ such that for all $a \in A$, $y \in O$ and $b \in B$,
\[ |\langle g(\cdot, a), b - b^p \rangle| \le \epsilon_1, \qquad |\langle g(\cdot, a), \tau(b, a, y) - \tau(b^p, a, y)^p \rangle| \le \epsilon_1. \]
The following assumption can be seen as a continuity property of the value function.

Assumption 4.3 Given $\epsilon > 0$ such that $|\langle g(\cdot, a), b - b' \rangle| \le \epsilon$, there exists $\delta > 0$ such that $|J_k(b) - J_k(b')| \le \delta$, $\forall b, b' \in B$, $\forall k$, and there exists $\tilde{\delta} > 0$ such that $|J_\mu(b) - J_\mu(b')| \le \tilde{\delta}$, $\forall b, b' \in B$, $\forall \mu \in \Pi$.

Now we present our main result.

Theorem 4.1 Under Assumptions 4.1, 4.2 and 4.3, for all $b \in B$,
\[ |J^*(b) - J^{p*}(b^p)| \le \frac{\epsilon_1 + \gamma \epsilon_2}{1 - \gamma}, \tag{4.7} \]
\[ |J^*(b) - J_{\bar{\mu}^{p*}}(b)| \le \frac{2\epsilon_1 + \gamma(\epsilon_2 + \epsilon_3)}{1 - \gamma}, \tag{4.8} \]
where $\epsilon_1$ is the constant in Assumption 4.2, and $\epsilon_2$, $\epsilon_3$ are the constants $\delta$ and $\tilde{\delta}$, respectively, in Assumption 4.3 corresponding to $\epsilon = \epsilon_1$.

Proof: Denote $J_k(b) \triangleq T^k J_0(b)$, $J^p_k(b^p) \triangleq (T^p)^k J_0(b^p)$, $k = 0, 1, \ldots$, and define
\begin{align*}
Q_k(b, a) &= \langle g(\cdot, a), b \rangle + \gamma E_Y\{ J_{k-1}(\tau(b, a, Y)) \}, & \mu_k(b) &= \arg\min_{a \in A} Q_k(b, a), \\
Q^p_k(b, a) &= \langle g(\cdot, a), b^p \rangle + \gamma E_Y\{ J^p_{k-1}(\tau(b^p, a, Y)^p) \}, & \mu^p_k(b) &= \arg\min_{a \in A} Q^p_k(b, a).
\end{align*}
Hence,
\[ J_k(b) = \min_{a \in A} Q_k(b, a) = Q_k(b, \mu_k(b)); \qquad J^p_k(b^p) = \min_{a \in A} Q^p_k(b, a) = Q^p_k(b, \mu^p_k(b)). \]
Denote $\mathrm{err}_k \triangleq \max_{b \in B} |J_k(b) - J^p_k(b^p)|$, $k = 1, 2, \ldots$.

We consider the first iteration. Initialize with $J_0(b) = J^p_0(b^p) = 0$. By Assumption 4.2, $\forall a \in A$,
\[ |Q_1(b, a) - Q^p_1(b, a)| = |\langle g(\cdot, a), b - b^p \rangle| \le \epsilon_1, \quad \forall b \in B. \tag{4.9} \]
Hence, with $a = \mu^p_1(b)$, the above inequality yields $Q_1(b, \mu^p_1(b)) \le J^p_1(b^p) + \epsilon_1$. Using $J_1(b) \le Q_1(b, \mu^p_1(b))$, we get
\[ J_1(b) \le J^p_1(b^p) + \epsilon_1, \quad \forall b \in B. \tag{4.10} \]
With $a = \mu_1(b)$, (4.9) yields $Q^p_1(b, \mu_1(b)) - \epsilon_1 \le J_1(b)$. Using $J^p_1(b^p) \le Q^p_1(b, \mu_1(b))$, we get
\[ J^p_1(b^p) - \epsilon_1 \le J_1(b), \quad \forall b \in B. \tag{4.11} \]
From (4.10) and (4.11), we conclude $|J_1(b) - J^p_1(b^p)| \le \epsilon_1$, $\forall b \in B$. Taking the maximum over $b$ on both sides of the above inequality yields
\[ \mathrm{err}_1 \le \epsilon_1. \tag{4.12} \]
Now we consider the $(k+1)$th iteration. For a fixed $Y = y$, by Assumption 4.2,
\[ |\langle g(\cdot, a), \tau(b, a, y) - \tau(b^p, a, y)^p \rangle| \le \epsilon_1. \]
Let $\epsilon_1$ be the $\epsilon$ in Assumption 4.3 and denote the corresponding $\delta$ by $\epsilon_2$. Then
\[ |J_k(\tau(b, a, y)) - J_k(\tau(b^p, a, y)^p)| \le \epsilon_2, \quad \forall b \in B. \tag{4.13} \]
Therefore, $\forall a \in A$,
\begin{align*}
|Q_{k+1}(b, a) - Q^p_{k+1}(b, a)| &\le |\langle g(\cdot, a), b - b^p \rangle| + \gamma E_Y\{ |J_k(\tau(b, a, Y)) - J^p_k(\tau(b^p, a, Y)^p)| \} \\
&\le \epsilon_1 + \gamma E_Y\{ |J_k(\tau(b, a, Y)) - J_k(\tau(b^p, a, Y)^p)| + |J_k(\tau(b^p, a, Y)^p) - J^p_k(\tau(b^p, a, Y)^p)| \} \\
&\le \epsilon_1 + \gamma(\epsilon_2 + \mathrm{err}_k), \quad \forall b \in B.
\end{align*}
The third inequality follows from (4.13) and the definition of $\mathrm{err}_k$. Using an argument similar to that used to prove (4.12) from (4.9), we conclude that
\[ \mathrm{err}_{k+1} \le \epsilon_1 + \gamma(\epsilon_2 + \mathrm{err}_k). \tag{4.14} \]
Using induction on (4.14) with the initial condition (4.12) and taking $k \to \infty$, we obtain
\[ |J^*(b) - J^{p*}(b^p)| \le \sum_{k=0}^{\infty} \gamma^k \epsilon_1 + \sum_{k=1}^{\infty} \gamma^k \epsilon_2 = \frac{\epsilon_1 + \gamma \epsilon_2}{1 - \gamma}. \tag{4.15} \]
Therefore, (4.7) is proved.

Fixing a policy $\mu$ on the original belief MDP, define the mappings under policy $\mu$ on the belief MDP and the projected belief MDP as
\[ T_\mu J(b) = \langle g(\cdot, \mu(b)), b \rangle + \gamma E_Y\{ J(\tau(b, \mu(b), Y)) \}, \tag{4.16} \]
\[ T^p_\mu J(b) = \langle g(\cdot, \mu(b)), b^p \rangle + \gamma E_Y\{ J(\tau(b^p, \mu(b), Y)^p) \}, \tag{4.17} \]
respectively. Since $\mu^{p*}$ is a stationary policy for the projected belief MDP, $\bar{\mu}^{p*} = \mu^{p*} \circ \mathrm{Proj}_\Omega$ is stationary for the original belief MDP. Hence,
\[ J^{p*}(b^p) = T^p_{\bar{\mu}^{p*}} J^{p*}(b^p), \qquad J_{\bar{\mu}^{p*}}(b) = T_{\bar{\mu}^{p*}} J_{\bar{\mu}^{p*}}(b). \]
Subtracting both sides of the above two equations, and substituting in the definitions of $T^p_\mu$ and $T_\mu$ (i.e., (4.17) and (4.16)) on the right-hand sides, we get
\[ J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b) = \langle g(\cdot, \mu^{p*}(b^p)), b^p - b \rangle + \gamma E_Y\left\{ J^{p*}(\tau(b^p, \mu^{p*}(b^p), Y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), Y)) \right\}. \tag{4.18} \]
For a fixed $Y = y$,
\[ |J^{p*}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))| \le |J^{p*}(\tilde{b}) - J_{\bar{\mu}^{p*}}(\tilde{b})| + |J_{\bar{\mu}^{p*}}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))|, \]
where $\tilde{b} = \tau(b^p, \mu^{p*}(b^p), y)^p \in B$. By Assumption 4.2,
\[ |\langle g(\cdot, a), \tau(b^p, \mu^{p*}(b^p), y)^p - \tau(b, \mu^{p*}(b^p), y) \rangle| \le \epsilon_1. \]
Letting $\epsilon = \epsilon_1$ in Assumption 4.3 and denoting the corresponding $\tilde{\delta}$ by $\epsilon_3$, we get for the second term
\[ |J_{\bar{\mu}^{p*}}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))| \le \epsilon_3. \]
Denoting $\mathrm{err} \triangleq \max_{b \in B} |J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b)|$, we obtain
\[ |J^{p*}(\tau(b^p, \mu^{p*}(b^p), y)^p) - J_{\bar{\mu}^{p*}}(\tau(b, \mu^{p*}(b^p), y))| \le \mathrm{err} + \epsilon_3. \]
Therefore, (4.18) becomes
\[ |J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b)| \le \epsilon_1 + \gamma(\mathrm{err} + \epsilon_3). \]
Taking the maximum over $b$ on both sides of the above inequality yields $\mathrm{err} \le \epsilon_1 + \gamma(\mathrm{err} + \epsilon_3)$. Hence,
\[ \mathrm{err} \le \frac{\epsilon_1 + \gamma \epsilon_3}{1 - \gamma}. \tag{4.19} \]
With (4.15) and (4.19), we obtain
\[ |J^*(b) - J_{\bar{\mu}^{p*}}(b)| \le |J^*(b) - J^{p*}(b^p)| + |J^{p*}(b^p) - J_{\bar{\mu}^{p*}}(b)| \le \frac{2\epsilon_1 + \gamma(\epsilon_2 + \epsilon_3)}{1 - \gamma}, \quad \forall b \in B. \]
Therefore, (4.8) is proved. $\square$

Remark 4.2 In (4.7) and (4.8), $\epsilon_1$ is a projection error, and $\epsilon_2$ and $\epsilon_3$ decrease as the projection error $\epsilon_1$ decreases.
Therefore, as the projection error decreases, the optimal value function of the projected belief MDP $J^{p*}(b^p)$ converges to the true optimal value function $J^*(b)$, and the corresponding policy $\mu^{p*}$ converges to the true optimal policy $\mu^*$. Roughly speaking, the projection error decreases as the number of sufficient statistics in the chosen exponential family increases (for details, please see Section 4.6: Validation of the Assumptions).

4.5 Analysis of Projection Particle Filtering

In the above analysis, we assumed perfect computation of the belief states and the projected belief states. In this section, we consider the filtering error, and compute an error bound on the approximate belief state generated by the projection particle filter (PPF).

4.5.1 Notations

Let $C_b(\mathbb{R}^n)$ be the set of all continuous bounded functions on $\mathbb{R}^n$. Let $B(\mathbb{R}^n)$ be the set of all bounded measurable functions on $\mathbb{R}^n$. Let $\|\cdot\|$ denote the supremum norm on $B(\mathbb{R}^n)$, i.e., $\|\phi\| \triangleq \sup_{x \in \mathbb{R}^n} |\phi(x)|$, $\phi \in B(\mathbb{R}^n)$. Let $M^+(\mathbb{R}^n)$ and $P(\mathbb{R}^n)$ be the sets of nonnegative measures and probability measures on $\mathbb{R}^n$, respectively. If $\nu \in M^+(\mathbb{R}^n)$ and $\phi : \mathbb{R}^n \to \mathbb{R}$ is an integrable function with respect to $\nu$, then
\[ \langle \nu, \phi \rangle \triangleq \int \phi \, d\nu. \]
Moreover, if $\nu \in P(\mathbb{R}^n)$,
\[ E_\nu[\phi] = \langle \nu, \phi \rangle, \qquad \mathrm{Var}_\nu(\phi) = \langle \nu, \phi^2 \rangle - \langle \nu, \phi \rangle^2. \]
We will use the representations on the two sides of the above equalities interchangeably in the sequel.

The belief state and the projected belief state are probability densities; however, we will prove our results in terms of their corresponding probability measures, which we refer to as "conditional distributions" (belief states are conditional densities). The two representations are essentially the same once we assume the probability measures admit probability densities. Therefore, the notations used for probability densities before are used to denote their corresponding probability measures from now on.
Namely, we use $b$ to denote a probability measure on $\mathbb{R}^{n_x}$ and assume it admits a probability density with respect to the Lebesgue measure, which is the belief state. Similarly, we use $f(\cdot, \theta)$ to denote a probability measure on $\mathbb{R}^{n_x}$ and assume it admits a probability density with respect to the Lebesgue measure in the chosen exponential family with parameter $\theta$.

A probability transition kernel $K(E, x)$, with $E$ a Borel set and $x \in \mathbb{R}^{n_x}$, acts on a measure $\nu$ by
\[ K\nu(E) \triangleq \int_{\mathbb{R}^{n_x}} \nu(dx) K(E, x), \]
where $E$ is a set in the Borel $\sigma$-algebra on $\mathbb{R}^{n_x}$. For $\phi : \mathbb{R}^{n_x} \to \mathbb{R}$, an integrable function with respect to $K(\cdot, x)$,
\[ K\phi(x) \triangleq \int_{\mathbb{R}^{n_x}} \phi(x') K(dx', x). \]
Let $K_k(dx_k, x_{k-1})$ denote the probability transition kernel of the system (3.2) at time $k$, which satisfies
\[ b_{k|k-1}(dx_k) = K_k b_{k-1}(dx_k) = \int_{\mathbb{R}^{n_x}} b_{k-1}(dx_{k-1}) K_k(dx_k, x_{k-1}). \]
We let $\psi_k$ denote the likelihood function associated with the observation equation (3.3) at time $k$, and assume that $\psi_k \in C_b(\mathbb{R}^{n_x})$. Hence,
\[ b_k = \frac{\psi_k b_{k|k-1}}{\langle b_{k|k-1}, \psi_k \rangle}. \]

4.5.2 Main Idea

The exact filter (EF) at time $k$ can be described as
\[ b_{k-1} \xrightarrow{\text{prediction}} b_{k|k-1} = K_k b_{k-1} \xrightarrow{\text{updating}} b_k = \frac{\psi_k b_{k|k-1}}{\langle b_{k|k-1}, \psi_k \rangle}. \]
The PPF at time $k$ can be described as
\[ \hat{f}(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{prediction}} \hat{b}_{k|k-1} = K_k f(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{updating}} \hat{b}_k = \frac{\psi_k \hat{b}_{k|k-1}}{\langle \hat{b}_{k|k-1}, \psi_k \rangle} \xrightarrow{\text{projection}} f(\cdot, \hat{\theta}_k) \xrightarrow{\text{resampling}} \hat{f}(\cdot, \hat{\theta}_k). \]
To facilitate our analysis, we introduce a conceptual filter (CF), which at each time $k$ is reinitialized with $f(\cdot, \hat{\theta}_{k-1})$, performs exact prediction and updating to yield $b'_{k|k-1}$ and $b'_k$, respectively, and then performs projection to obtain $f(\cdot, \theta'_k)$. It can be described as
\[ f(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{prediction}} b'_{k|k-1} = K_k f(\cdot, \hat{\theta}_{k-1}) \xrightarrow{\text{updating}} b'_k = \frac{\psi_k b'_{k|k-1}}{\langle b'_{k|k-1}, \psi_k \rangle} \xrightarrow{\text{projection}} f(\cdot, \theta'_k). \]
The CF serves as a bridge connecting the EF and the PPF, as we describe below.

We are interested in the difference between the true conditional distribution $b_k$ and the PPF-generated projected conditional distribution $f(\cdot, \hat{\theta}_k)$ for each time $k$.
The total error between the two can be decomposed as follows:
\[ b_k - f(\cdot;\hat{\theta}_k) = (b_k - b'_k) + (b'_k - f(\cdot;\theta'_k)) + (f(\cdot;\theta'_k) - f(\cdot;\hat{\theta}_k)). \tag{4.20} \]
The first term $(b_k - b'_k)$ is the error due to the inexact initial condition of the CF compared to the EF, i.e., $(b_{k-1} - f(\cdot;\hat{\theta}_{k-1}))$, which is also the total error at time $k-1$.

Table 4.1: Notations for the different conditional distributions
\[
\begin{array}{ll}
b_k & \text{exact conditional distribution} \\
\hat{b}_k & \text{PPF conditional distribution before projection} \\
f(\cdot;\hat{\theta}_k) & \text{PPF projected conditional distribution} \\
b'_k & \text{CF conditional distribution before projection} \\
f(\cdot;\theta'_k) & \text{CF projected conditional distribution}
\end{array}
\]

The second term $(b'_k - f(\cdot;\theta'_k))$ evaluates the minimum deviation from the exponential family generated by one step of exact filtering, since $f(\cdot;\theta'_k)$ is the projection of $b'_k$. The third term $(f(\cdot;\theta'_k) - f(\cdot;\hat{\theta}_k))$ is purely due to Monte Carlo simulation, since $f(\cdot;\theta'_k)$ and $f(\cdot;\hat{\theta}_k)$ are obtained by applying the same steps to $f(\cdot;\hat{\theta}_{k-1})$ and its empirical version $\hat{f}(\cdot;\hat{\theta}_{k-1})$, respectively. We will find error bounds on each of the three terms, and finally find the total error at time $k$ by induction.

4.5.3 Error Bound

We shall look at the case in which the observation process has an arbitrary but fixed value $y_{0:k} = \{y_0, \ldots, y_k\}$. Hence, all the expectations $E$ in this section are with respect to the sampling in the algorithm only. We consider test functions $\phi \in B(\mathbb{R}^{n_x})$. It can be seen that $K\phi \in B(\mathbb{R}^{n_x})$ and $\|K\phi\| \le \|\phi\|$, since
\[ |K\phi(x)| = \left| \int_{\mathbb{R}^{n_x}} \phi(x') K(dx'; x) \right| \le \int_{\mathbb{R}^{n_x}} |\phi(x')| K(dx'; x) \le \|\phi\| \int_{\mathbb{R}^{n_x}} K(dx'; x) = \|\phi\|. \]
Since $\psi_k \in C_b(\mathbb{R}^{n_x})$, we know that $\psi_k \in B(\mathbb{R}^{n_x})$ and $\psi_k \phi \in B(\mathbb{R}^{n_x})$. We also need the following assumptions.

Assumption 4.4 All the projected distributions are in a compact subset of the given exponential family. In other words, there exists a compact set $\Theta_0 \subset \Theta$ such that $\hat{\theta}_k \in \Theta_0$ and $\theta'_k \in \Theta_0$, $\forall k$.

Assumption 4.5 For all $k \in \mathbb{N}$,
\[ \langle b_{k|k-1}, \psi_k \rangle > 0; \qquad \langle b'_{k|k-1}, \psi_k \rangle > 0, \ \text{w.p.1}; \qquad \langle \hat{b}_{k|k-1}, \psi_k \rangle > 0, \ \text{w.p.1}. \]

Assumption 4.5 guarantees that the normalizing constant in the Bayes' updating is nonzero, so that the conditional distribution is well defined. Under Assumption 4.4, the second inequality in Assumption 4.5 can be strengthened using the compactness of $\Theta_0$. Since $f(\cdot; a_k, u_k)$ in (3.2) is continuous in $x$, $K_k$ is weakly continuous (pp. 175-177, [37]). Hence, $\langle b'_{k|k-1}, \psi_k \rangle = \langle K_k f(\cdot;\hat{\theta}_{k-1}), \psi_k \rangle = \langle f(\cdot;\hat{\theta}_{k-1}), K_k \psi_k \rangle$ is continuous in $\hat{\theta}_{k-1}$, where $\hat{\theta}_{k-1} \in \Theta_0$. Since $\Theta_0$ is compact, there exists a constant $\Delta > 0$ such that for each $k$
\[ \langle b'_{k|k-1}, \psi_k \rangle \ge \frac{1}{\Delta}, \ \text{w.p.1}. \tag{4.21} \]

The assumption below guarantees that the conditional distribution stays close to the given exponential family after one step of exact filtering if the initial distribution is in the exponential family. Recall that starting with initial distribution $f(\cdot;\hat{\theta}_{k-1})$, one step of exact filtering yields $b'_k$, which is then projected to yield $f(\cdot;\theta'_k)$, where $\hat{\theta}_{k-1} \in \Theta_0$, $\theta'_k \in \Theta_0$.

Assumption 4.6 There exists a constant $\delta > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$ and all $k \in \mathbb{N}$,
\[ E[|\langle b'_k, \phi \rangle - \langle f(\cdot;\theta'_k), \phi \rangle|] \le \delta \|\phi\|. \]

Remark 4.3 Assumption 4.6 is our main assumption, which essentially imposes a bound on the projection error. Our assumptions are much less restrictive than the assumptions in [5], while our conclusion is similar to but slightly different from that in [5], as will be seen later. Although Assumption 4.6 appears similar to Assumption 3 in [5], it is essentially different. Assumption 3 in [5] says that the optimal conditional density stays close to the given exponential family for all time, whereas Assumption 4.6 only assumes that if the exact filter starts in the given exponential family, after one step the conditional distribution stays close to the family. Moreover, we do not need any assumption like the restrictive Assumption 4 in [5].

Lemma 4.1 considers the bound on the first term in (4.20).

Lemma 4.1 For each $k \in \mathbb{N}$, suppose $E[|\langle b_{k-1} - f(\cdot;\hat{\theta}_{k-1}), \phi \rangle|] \le e_{k-1} \|\phi\|$, $\forall \phi \in B(\mathbb{R}^{n_x})$, where $e_{k-1}$ is a positive constant.
Then under Assumptions 4.4 and 4.5, for each $k \in \mathbb{N}$, there exists a constant $\alpha_k > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$
\[ E[|\langle b_k - b'_k, \phi \rangle|] \le \alpha_k e_{k-1} \|\phi\|. \tag{4.22} \]

Proof: $E[|\langle b_{k-1} - f(\cdot;\hat{\theta}_{k-1}), \phi \rangle|]$ is the error from time $k-1$, which is also the initial error for time $k$. Hence, the prediction step yields
\[ E[|\langle b_{k|k-1} - b'_{k|k-1}, \phi \rangle|] = E[|\langle K_k(b_{k-1} - f(\cdot;\hat{\theta}_{k-1})), \phi \rangle|] = E[|\langle b_{k-1} - f(\cdot;\hat{\theta}_{k-1}), K_k \phi \rangle|] \le e_{k-1} \|K_k \phi\| \le e_{k-1} \|\phi\|. \tag{4.23} \]
Under Assumptions 4.4 and 4.5, we have shown (4.21). Using (4.21) and (4.23), the Bayes' updating step yields
\begin{align*}
E[|\langle b_k - b'_k, \phi \rangle|]
&= E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b_{k|k-1}, \psi_k \rangle} - \frac{\langle b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b_{k|k-1}, \psi_k \rangle} - \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
 + E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} - \frac{\langle b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le E\left[\left| \frac{\langle b_{k|k-1}, \psi_k \phi \rangle \, \langle b_{k|k-1} - b'_{k|k-1}, \psi_k \rangle}{\langle b_{k|k-1}, \psi_k \rangle \, \langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
 + E\left[\left| \frac{\langle b_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le \Delta \frac{|\langle b_{k|k-1}, \psi_k \phi \rangle|}{\langle b_{k|k-1}, \psi_k \rangle} E[|\langle b_{k|k-1} - b'_{k|k-1}, \psi_k \rangle|]
 + \Delta E[|\langle b_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle|] \\
&\le \Delta e_{k-1} \|\phi\| \|\psi_k\| + \Delta e_{k-1} \|\psi_k \phi\|
 \le 2\Delta e_{k-1} \|\psi_k\| \|\phi\| = \alpha_k e_{k-1} \|\phi\|,
\end{align*}
where $\alpha_k = 2\Delta \|\psi_k\|$. $\Box$

Lemma 4.2 considers the bound on the third term in (4.20) before projection.

Lemma 4.2 Under Assumptions 4.4 and 4.5, for each $k \in \mathbb{N}$, there exists a constant $\gamma_k > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$
\[ E[|\langle \hat{b}_k - b'_k, \phi \rangle|] \le \gamma_k \frac{\|\phi\|}{\sqrt{N}}. \]

Proof: This lemma uses essentially the same proof technique as Lemmas 3 and 4 in [23]. However, it is not quite obvious how those lemmas imply our lemma here; therefore, we state the proof to make this chapter more accessible. After the resampling step, $\hat{f}(\cdot;\hat{\theta}_{k-1}) = \frac{1}{N} \sum_{i=1}^N \delta(x - x^i_{k-1})$, where $x^i_{k-1}, i = 1, \ldots, N$, are i.i.d. samples from $f(\cdot;\hat{\theta}_{k-1})$. Hence,
\begin{align*}
\left( E\left[ \langle \hat{f}(\cdot;\hat{\theta}_{k-1}) - f(\cdot;\hat{\theta}_{k-1}), \phi \rangle^2 \right] \right)^{1/2}
&= \left( E\left[ \left( \frac{1}{N} \sum_{i=1}^N \left( \phi(x^i_{k-1}) - \langle f(\cdot;\hat{\theta}_{k-1}), \phi \rangle \right) \right)^{\!2} \right] \right)^{1/2} \\
&= \frac{1}{\sqrt{N}} \left( E\left[ \frac{1}{N} \sum_{i=1}^N \left( \phi(x^i_{k-1}) - \langle f(\cdot;\hat{\theta}_{k-1}), \phi \rangle \right)^2 \right] \right)^{1/2} \\
&= \frac{1}{\sqrt{N}} \left( \langle f(\cdot;\hat{\theta}_{k-1}), \phi^2 \rangle - \langle f(\cdot;\hat{\theta}_{k-1}), \phi \rangle^2 \right)^{1/2} \\
&\le \frac{1}{\sqrt{N}} \langle f(\cdot;\hat{\theta}_{k-1}), \phi^2 \rangle^{1/2} \le \frac{\|\phi\|}{\sqrt{N}}. \tag{4.24}
\end{align*}
The Bayes' updating step yields
\begin{align*}
E[|\langle \hat{b}_k - b'_k, \phi \rangle|]
&= E\left[\left| \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle}{\langle \hat{b}_{k|k-1}, \psi_k \rangle} - \frac{\langle b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right] \\
&\le E\left[\left| \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle \, \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \rangle}{\langle \hat{b}_{k|k-1}, \psi_k \rangle \, \langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
 + E\left[\left| \frac{\langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right].
\end{align*}
Under Assumptions 4.4 and 4.5, we have shown (4.21). Using the Cauchy-Schwarz inequality, (4.21) and (4.24), the first term can be bounded as
\begin{align*}
E\left[\left| \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle \, \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \rangle}{\langle \hat{b}_{k|k-1}, \psi_k \rangle \, \langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
&\le \Delta \left( E\left[ \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle^2}{\langle \hat{b}_{k|k-1}, \psi_k \rangle^2} \right] \right)^{1/2} \left( E\left[ \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \rangle^2 \right] \right)^{1/2} \\
&= \Delta \left( E\left[ \frac{\langle \hat{b}_{k|k-1}, \psi_k \phi \rangle^2}{\langle \hat{b}_{k|k-1}, \psi_k \rangle^2} \right] \right)^{1/2} \left( E\left[ \langle \hat{f}(\cdot;\hat{\theta}_{k-1}) - f(\cdot;\hat{\theta}_{k-1}), K_k \psi_k \rangle^2 \right] \right)^{1/2} \\
&\le \Delta \|\phi\| \frac{\|\psi_k\|}{\sqrt{N}},
\end{align*}
and the second term as
\begin{align*}
E\left[\left| \frac{\langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle}{\langle b'_{k|k-1}, \psi_k \rangle} \right|\right]
&\le \Delta \left( E\left[ \langle \hat{b}_{k|k-1} - b'_{k|k-1}, \psi_k \phi \rangle^2 \right] \right)^{1/2} \\
&= \Delta \left( E\left[ \langle \hat{f}(\cdot;\hat{\theta}_{k-1}) - f(\cdot;\hat{\theta}_{k-1}), K_k (\psi_k \phi) \rangle^2 \right] \right)^{1/2} \\
&\le \Delta \frac{\|\psi_k \phi\|}{\sqrt{N}} \le \Delta \frac{\|\psi_k\| \|\phi\|}{\sqrt{N}}.
\end{align*}
Therefore, adding these two terms yields
\[ E[|\langle \hat{b}_k - b'_k, \phi \rangle|] \le 2\Delta \|\psi_k\| \frac{\|\phi\|}{\sqrt{N}} = \gamma_k \frac{\|\phi\|}{\sqrt{N}}, \]
where $\gamma_k = 2\Delta \|\psi_k\|$. $\Box$

Lemma 4.3 considers the bound on the third term in (4.20), based on the result of Lemma 4.2. The key idea of the proof is to connect the errors before and after projection through (4.4), which we derived for the density projection that minimizes the KL divergence.

Lemma 4.3 Let $c_j, j = 1, \ldots, m$, be the sufficient statistics of the exponential family as defined in Definition 4.1, and assume $c_j \in B(\mathbb{R}^{n_x}), j = 1, \ldots, m$. Then under Assumptions 4.4 and 4.5, for each $k \in \mathbb{N}$, there exists a constant $d_k > 0$ such that for all $\phi \in B(\mathbb{R}^{n_x})$
\[ E[|\langle f(\cdot;\hat{\theta}_k) - f(\cdot;\theta'_k), \phi \rangle|] \le d_k \frac{\|\phi\|}{\sqrt{N}}. \tag{4.25} \]

Proof: The key idea of the proof of Lemma 4 in [5] is used here. From (4.4), we know that $E_{\hat{\theta}_k}[c_j(X)] = E_{\hat{b}_k}[c_j(X)]$ and $E_{\theta'_k}[c_j(X)] = E_{b'_k}[c_j(X)]$. Hence, we obtain
\[ E[|E_{\hat{\theta}_k}[c_j(X)] - E_{\theta'_k}[c_j(X)]|] = E[|\langle \hat{b}_k - b'_k, c_j \rangle|], \quad j = 1, \ldots, m. \]
Summing over $j$, we obtain
\[ E\left[ \sum_{j=1}^m |E_{\hat{\theta}_k}[c_j(X)] - E_{\theta'_k}[c_j(X)]| \right] = \sum_{j=1}^m E[|\langle \hat{b}_k - b'_k, c_j \rangle|]. \]
Since $c_j \in B(\mathbb{R}^{n_x})$, we apply Lemma 4.2 with $\phi = c_j$ and thus obtain
\[ E[|\langle \hat{b}_k - b'_k, c_j \rangle|] \le \gamma_k \frac{\|c_j\|}{\sqrt{N}}, \quad j = 1, \ldots, m. \]
Therefore,
\[ E\left[ \left\| E_{\hat{\theta}_k}[c(X)] - E_{\theta'_k}[c(X)] \right\|_1 \right] \le \frac{\tilde{\gamma}_k}{\sqrt{N}}, \]
where $\|\cdot\|_1$ denotes the $L^1$ norm, $c = [c_1, \ldots, c_m]^T$, and $\tilde{\gamma}_k = \gamma_k \sum_{j=1}^m \|c_j\|$. Since $\Theta_0$ is compact and the Fisher information matrix
\[ \left[ E_\theta[c_i(X) c_j(X)] - E_\theta[c_i(X)] E_\theta[c_j(X)] \right]_{ij} \]
is positive definite, there exists a constant $\kappa > 0$ such that (cf. Fact 2 in [5] for a detailed proof)
\[ \|\hat{\theta}_k - \theta'_k\|_1 \le \kappa \left\| E_{\hat{\theta}_k}[c(X)] - E_{\theta'_k}[c(X)] \right\|_1. \]
Taking the expectation on both sides yields
\[ E[\|\hat{\theta}_k - \theta'_k\|_1] \le \kappa E\left[ \left\| E_{\hat{\theta}_k}[c(X)] - E_{\theta'_k}[c(X)] \right\|_1 \right] \le \kappa \tilde{\gamma}_k \frac{1}{\sqrt{N}}. \]
On the other hand, taking the derivative of $E_\theta[\phi(X)]$ with respect to $\theta_i$ yields
\[ \left| \frac{d}{d\theta_i} E_\theta[\phi(X)] \right| = |E_\theta[c_i(X)\phi(X)] - E_\theta[c_i(X)] E_\theta[\phi(X)]| \le \sqrt{\mathrm{Var}_\theta(c_i)\,\mathrm{Var}_\theta(\phi)} \le \sqrt{E_\theta(c_i^2)\,E_\theta(\phi^2)} \le \|c_i\| \|\phi\|. \]
Hence,
\[ \left\| \frac{d}{d\theta} E_\theta[\phi(X)] \right\|_1 \le \left( \sum_{i=1}^m \|c_i\| \right) \|\phi\|. \]
Since $\Theta_0$ is compact, there exists a constant $\lambda > 0$ such that $E_\theta[\phi(X)]$ is Lipschitz over $\theta \in \Theta_0$ with Lipschitz constant $\lambda \|\phi\|$ (cf. the proof of Fact 2 in [5]), i.e.,
\[ |E_{\hat{\theta}_k}[\phi] - E_{\theta'_k}[\phi]| \le \lambda \|\phi\| \|\hat{\theta}_k - \theta'_k\|_1. \]
Taking expectation on both sides yields
\[ E[|\langle f(\cdot;\hat{\theta}_k) - f(\cdot;\theta'_k), \phi \rangle|] \le \lambda \|\phi\| E[\|\hat{\theta}_k - \theta'_k\|_1] \le \lambda \|\phi\| \kappa \tilde{\gamma}_k \frac{1}{\sqrt{N}} = d_k \frac{\|\phi\|}{\sqrt{N}}, \]
where $d_k = \kappa \lambda \tilde{\gamma}_k$. $\Box$

Now we present our main result on the error bound of the projection particle filter.

Theorem 4.2 Suppose $E[|\langle b_0 - f(\cdot;\hat{\theta}_0), \phi \rangle|] \le e_0 \|\phi\|$, $e_0 \ge 0$, $\forall \phi \in B(\mathbb{R}^{n_x})$. Under Assumptions 4.4, 4.5 and 4.6, and assuming that $c_j \in B(\mathbb{R}^{n_x}), j = 1, \ldots, m$, there exist $\alpha_i > 0, d_i > 0, i = 1, \ldots, k$, such that for all $\phi \in B(\mathbb{R}^{n_x})$,
\[ E[|\langle b_k - f(\cdot;\hat{\theta}_k), \phi \rangle|] \le e_k \|\phi\|, \quad k = 1, 2, \ldots, \]
where
\[ e_k = \alpha_1^k e_0 + \left( \sum_{i=2}^k \alpha_i^k + 1 \right) \delta + \left( \sum_{i=2}^k \alpha_i^k d_{i-1} + d_k \right) \frac{1}{\sqrt{N}}, \tag{4.26} \]
$\alpha_i^k = \prod_{j=i}^k \alpha_j$ for $k \ge i$, $\alpha_i^k = 0$ for $k < i$, and $\delta$ is the constant in Assumption 4.6.

Proof: Applying Lemma 4.1, Assumption 4.6, and Lemma 4.3, we have that for all $\phi \in B(\mathbb{R}^{n_x})$ and $k \in \mathbb{N}$, there exist $\alpha_i > 0, d_i > 0, i = 1, \ldots, k$, such that
\begin{align*}
E[|\langle b_k - f(\cdot;\hat{\theta}_k), \phi \rangle|]
&\le E[|\langle b_k - b'_k, \phi \rangle|] + E[|\langle b'_k - f(\cdot;\theta'_k), \phi \rangle|] + E[|\langle f(\cdot;\theta'_k) - f(\cdot;\hat{\theta}_k), \phi \rangle|] \\
&\le \left( \alpha_k e_{k-1} + \delta + d_k \frac{1}{\sqrt{N}} \right) \|\phi\| = e_k \|\phi\|.
\end{align*}
It is easy to deduce (4.26) by induction. $\Box$

Remark 4.4 As we mentioned in Remark 4.2, the projection errors $e_0$ and $\delta$ decrease as the number of sufficient statistics in the chosen exponential family, $m$, increases. The error $e_k$ decreases at the rate $1/\sqrt{N}$ as we increase the number of samples in the projection particle filter. However, notice that the coefficient in front of $1/\sqrt{N}$ grows with time, so we have to use an increasing number of samples as time goes on in order to ensure a uniform error bound with respect to time.

4.6 Validation of the Assumptions

Assumptions 4.2 and 4.6 are the main assumptions of our analysis. They are assumptions on the projection error, asserting that the density projection introduces a "small" error. We will show that in certain cases these assumptions hold, and the projection error converges to 0 as the number of sufficient statistics, $m$, goes to infinity. We will first state a convergence result from [7]. However, as their convergence result is in the sense of KL divergence, we will further show convergence in the sense employed in our assumptions by using an intermediate result in [7].

Consider a probability density function $b$ defined on a bounded interval, and approximate it by $b_p$, a density in an exponential family whose sufficient statistics consist of polynomials, splines, or trigonometric series. The following theorem is proved in [7].
Theorem 4.3 If $\log b$ has $r$ square-integrable derivatives, i.e., $\int |D^r \log b|^2 < \infty$, then $D_{KL}(b \| b_p)$ converges to 0 at rate $m^{-2r}$ as $m \to \infty$.

Theorem 4.3 says that the projected density $b_p$ converges to $b$ in the sense of KL divergence as $m$ goes to infinity. An intermediate result (see (6.6) in [7]) is
\[ \|\log(b / b_p)\| \le \epsilon_m, \]
where $\epsilon_m$ is a constant that depends on $m$, and $\epsilon_m \to 0$ as $m \to \infty$. Since $b$ is bounded and $\log(\cdot)$ is a continuously differentiable function, there exists a constant $\zeta$ such that $\|b - b_p\| \le \zeta \|\log b - \log b_p\|$. Hence, with the intermediate result above,
\[ |\langle \phi, b - b_p \rangle| \le \|\phi\| \int |b - b_p| \, dx \le \|\phi\| \int \zeta \left| \log \frac{b}{b_p} \right| dx \le \|\phi\| \zeta l \epsilon_m, \]
where $l$ is the length of the bounded interval on which $b$ is defined. Since $\epsilon_m$ can be made arbitrarily small by taking $m$ large enough, it is easy to see that Assumptions 4.2 and 4.6 hold in the cases that we consider.

4.7 Numerical Experiments

4.7.1 Scalability and Computational Issues

Estimation of the one-step cost function (4.6) and of the transition probabilities (Algorithm 4.2) is executed for every belief-action pair in the discretized mesh $G$ and the action space $A$. Hence, these computations scale as $O(|G||A|N)$ and $O(|G||A|N^2)$, respectively, where $|G|$ is the number of grid points, $|A|$ is the number of actions, and $N$ is the number of samples specified in the algorithms. In our implementation, we found that most of the computation time is spent on executing Algorithm 4.2 over all belief-action pairs. However, the estimation of the cost functions and transition probabilities can be precomputed and stored, and hence only needs to be done once. The advantage of the algorithms is that their scalability is independent of the size of the actual state space, since $G$ is a grid mesh on the parameter space of the projected belief space. That is exactly what is desired by employing density projection. However, to obtain a better approximation, more parameters in the exponential family should be used, and that leads to a higher-dimensional parameter space to discretize.
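As a back-of-the-envelope illustration of these scaling laws, the count below assumes a two-parameter Gaussian grid, a binary action space, and $N = 200$ samples per estimate; the sizes are illustrative, chosen to match the order of magnitude of the experiments reported later in Section 4.7.2.

```python
# Illustrative operation counts for the offline precomputation, assuming a
# two-parameter Gaussian grid (mean x standard deviation), a binary action
# space, and N samples per estimate.
n_mean, n_std = 31, 26           # grid points per parameter (illustrative)
G = n_mean * n_std               # |G|: grid points on the projected belief space
A = 2                            # |A|: number of actions
N = 200                          # samples per estimate

cost_ops = G * A * N             # O(|G||A|N): one-step cost estimation
transition_ops = G * A * N ** 2  # O(|G||A|N^2): transition-probability estimation
```

The $N^2$ term dominates by two orders of magnitude, which is consistent with the observation above that executing Algorithm 4.2 over all belief-action pairs accounts for most of the computation time.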
Increasing the number of parameters in the exponential family also makes sampling more difficult. Sampling from a general exponential family is usually not easy, and may require advanced techniques such as Markov chain Monte Carlo (MCMC) [33], and hence more computation time. This difficulty can be avoided by resampling from the discrete particles instead of from the projected density, which is equivalent to using the plain particle filter and then doing the projection outside the filter; this may, however, lead to sample degeneracy. The trade-off between a better approximation and less computation time is complicated and needs more research. We plan to study how to appropriately choose the exponential family and how to improve the simulation efficiency in the future.

4.7.2 Simulation Results

Since most of the benchmark POMDP problems in the literature assume a discrete state space, it is difficult to compare against the state of the art. Here we consider an inventory control problem obtained by adding a partial observation equation to a fully observable inventory control problem. The fully observable problem has an optimal threshold policy [60], which allows us to verify our method in the limiting case by setting the observation noise very close to 0. In our inventory control problem, the inventory level is reviewed at discrete times, and the observations are noisy because of, e.g., inventory spoilage, misplacement, or distributed storage. In each period, the inventory is either replenished by an order of a fixed amount or not replenished. Customer demands arrive randomly with a known distribution. The demand is filled if there is enough inventory remaining. Otherwise, in the case of a shortage, the excess demand is not satisfied and a penalty is charged on the lost sales amount. We assume that the demand and the observation noise are both continuous random variables; hence the state, i.e., the inventory level, and the observation are continuous random variables.
Let $x_k$ denote the inventory level at period $k$, $u_k$ the i.i.d. random demand at period $k$, $a_k$ the replenishment decision at period $k$ (i.e., $a_k = 0$ or $1$), $Q$ the fixed order amount, $y_k$ the observation of the inventory level $x_k$, $v_k$ the i.i.d. observation noise, $h$ the per-period per-unit inventory holding cost, and $s$ the per-period per-unit inventory shortage penalty cost. The system equations are as follows:
\begin{align*}
x_{k+1} &= \max(x_k + a_k Q - u_k, 0), \quad k = 0, 1, \ldots, \\
y_k &= x_k + v_k, \quad k = 0, 1, \ldots.
\end{align*}
The cost incurred in period $k$ is
\[ g_k(x_k, a_k, u_k) = h \max(x_k + a_k Q - u_k, 0) + s \max(u_k - x_k - a_k Q, 0). \]
We consider two objective functions, the average cost per period and the discounted total cost, given by
\[ \limsup_{H \to \infty} \frac{E\left[ \sum_{k=0}^H g_k \right]}{H}, \qquad \lim_{H \to \infty} E\left[ \sum_{k=0}^H \gamma^k g_k \right], \]
where $\gamma \in (0, 1)$ is the discount factor.

In the simulation, we first choose an exponential family and specify a grid mesh on its parameter space, then implement (4.6) and Algorithm 4.2 on the grid mesh, and use value iteration to solve for a policy. These steps are done offline. In an online run, Algorithm 4.1 (PPF) is used for filtering and making decisions with the policy obtained offline. We also consider a small variation of this method: instead of using the PPF, we use Algorithm 2.1 (PF) and do the density projection outside the filter each time.

We compare our two methods (called "Ours 1" and "Ours 2", respectively) to four other algorithms: (1) certainty equivalence using the mean estimate (CE-Mean); (2) certainty equivalence using the maximum likelihood estimate (CE-MLE); (3) the EKF-based parametric POMDP (EKF-PPOMDP) of [17]; (4) the greedy policy. Certainty equivalence methods treat the state estimate as the true state in the solution to the full-observation problem. We use the bootstrap filter to obtain the mean estimate and the MLE of the states for the certainty equivalence methods. The EKF-PPOMDP approximates the belief state by a Gaussian distribution, and uses the extended Kalman filter to estimate the transition of the belief state.
Similar to our method, it also solves a discretized MDP defined on the parameter space of the Gaussian density. The greedy policy chooses an action $a_k$ that attains the minimum in the expression $\min_{a_k \in A} E_{x_k, u_k}[g_k(x_k, a_k, u_k) \mid I_k]$.

Numerical experiments are carried out with the following settings:

- Problem parameters: initial inventory level $x_0 = 5$, holding cost $h = 1$, shortage penalty cost $s = 10$, fixed order amount $Q = 10$, random demand $u_k \sim \exp(5)$, discount factor $\gamma = 0.9$, inventory observation noise $v_k \sim N(0, \sigma^2)$ with $\sigma$ ranging from 0.1 to 3.3 in steps of 0.2.

- Algorithm parameters: the number of particles in both the plain particle filter and the projection particle filter is $N = 200$; the exponential family in the projection particle filter is chosen as the Gaussian family; the set of grid points on the projected belief space is $G = \{\text{mean} = [0:0.5:15], \text{standard deviation} = [0:0.2:5]\}$ for both our methods and the EKF-PPOMDP; one run of horizon length $H = 10^5$ for each average cost criterion case, and 1000 independent runs of horizon length $H = 40$ for each discounted total cost criterion case; nearest neighbor is used as the value function approximator in both our methods and the EKF-PPOMDP.

- Simulation issues: we use common random variables across the different policies and the different $\sigma$'s.

In order to implement the certainty equivalence methods, we use Monte Carlo simulation to find the optimal threshold policy for the fully observed problem (i.e., $y_k = x_k$): if the inventory level is below the threshold $L$, the store/warehouse should order to replenish its inventory; otherwise, if the inventory level is above $L$, the store/warehouse should not order. That is,
\[ a_k = \begin{cases} 0, & \text{if } x_k > L, \\ 1, & \text{if } x_k < L. \end{cases} \tag{4.27} \]
The simulation results indicate that both the average and discounted cost functions are convex in the threshold, and the minimum is achieved at $L = 7.7$ for both.
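The fully observed baseline can be checked with a short simulation. The sketch below estimates the long-run average cost of a threshold policy under the problem parameters listed above; the horizon, seed, and helper name are arbitrary choices, not from the text.

```python
import numpy as np

def avg_cost_threshold(L, horizon=100_000, Q=10.0, h=1.0, s=10.0, x0=5.0, seed=0):
    """Average cost per period of the fully observed inventory problem
    under the threshold policy (4.27): order iff the inventory is below L."""
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(horizon):
        a = 1 if x < L else 0                 # threshold policy
        u = rng.exponential(5.0)              # demand u_k with mean 5
        level = x + a * Q - u                 # inventory after demand
        total += h * max(level, 0.0) + s * max(-level, 0.0)  # holding + shortage
        x = max(level, 0.0)                   # excess demand is lost
    return total / horizon

cost_at_optimum = avg_cost_threshold(L=7.7)
```

With these parameters the estimate should land near the small-noise average costs of about 12.8 reported in Table 4.2 (a consistency check, not a proof), while a degenerate threshold such as $L = 0$ (never order) drives the cost toward $s \cdot E[u_k] = 50$ per period.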
Table 4.2 and Table 4.3 list the simulated average costs and discounted total costs using the different policies under different observation noises, respectively. Our methods generally outperform all the other algorithms under all observation noise levels. CE-Mean also performs very well, and slightly outperforms CE-MLE. The EKF-PPOMDP performs better in the average cost case than in the discounted cost case. The greedy policy is much worse than all the other algorithms. While our methods and the EKF-PPOMDP involve offline computation, the more critical online computation time of all the simulated methods is approximately the same.

For all the algorithms, the average cost/discounted cost increases as the observation noise increases. That is consistent with the intuition that we cannot perform better with less information. Fig. 4.1 shows 1000 actions taken by our method versus the true inventory levels in the average cost case (the discounted total cost case is similar and is omitted here). The dotted vertical line is the optimal threshold $L$ under full observation. Our algorithm yields a policy that picks actions very close to those of the optimal threshold policy when the observation noise is small (cf. Fig. 4.1(a)), indicating that our algorithm is indeed finding the optimal policy. As the observation noise increases, more of the actions picked by our policy violate the optimal threshold, which again shows the value of information in determining the actions.

The performances of our two methods are very close, with one slightly better than the other. Solely for the purpose of filtering, doing the projection outside the filter is easier to implement if we want to use a general exponential family, and also gives a better estimate of the belief state, since the projection error does not accumulate.
However, for solving POMDPs, we conjecture that the PPF works better in conjunction with the policy solved from the projected belief MDP, since the projected belief MDP assumes that the transition of the belief state is also projected; nevertheless, we do not know for certain which one is better. Our method outperforms the EKF-PPOMDP, mainly because the projection particle filter used in our method is better than the extended Kalman filter used in the EKF-PPOMDP for estimating the belief transition probabilities. This agrees with the results in [18], which also observed that Monte Carlo simulation of the belief transitions is better than the EKF estimate. Although the performance of the certainty equivalence method is comparable to that of our methods for this particular example, certainty equivalence is generally a suboptimal policy except in some special cases (cf. Section 6.1 in [10]), and it does not have a theoretical error bound. Moreover, using a certainty equivalence method requires solving the full-observation problem, which is itself very difficult in many cases.

[Figure 4.1: Our algorithm: actions taken for different inventory levels under different observation noise variances. Panels (a)-(d) plot the (state, action) pairs yielded by our method against the optimal threshold under full observation, for observation noise $\sigma$ = 0.1, 1.1, 2.1, and 3.1.]
In contrast, our method has a proven error bound on its performance, and works with the POMDP directly without having to solve the MDP problem under full observation.

4.8 Conclusions

We developed a method that effectively reduces the dimension of the belief space via the orthogonal projection of the belief states onto a parameterized family of probability densities. For an exponential family, the density projection has an analytical form and can be carried out efficiently. The exponential family is fully represented by a finite (small) number of parameters; hence the belief space is mapped to a low-dimensional parameter space, and the resultant belief MDP is called the projected belief MDP. The projected belief MDP can then be solved in numerous ways, such as by standard value iteration or policy iteration, to generate a policy. This policy is used in conjunction with the projection particle filter for online decision making. We analyzed the performance of the policy generated by solving the projected belief MDP in terms of the difference between the value function associated with this policy and the optimal value function of the POMDP. We also provided a bound on the error between our projection particle filter and exact filtering.

We applied our method to an inventory control problem, and it generally outperformed the other methods. When the observation noise is small, our algorithm yields a policy that picks actions very close to those of the optimal threshold policy for the fully observed problem. Although we only proved theoretical results for discounted cost problems, the simulation results indicate that our method also works well on average cost problems. We should point out that our method is also applicable to finite-horizon problems, and is suitable for large-state POMDPs in addition to continuous-state POMDPs.

Table 4.2: Optimal average cost estimates for the inventory control problem using different methods.
Each entry represents the average cost of a run of horizon $10^5$.

sigma   Ours 1   Ours 2   CE-Mean  CE-MLE   PPOMDP   Greedy
0.1     12.849   12.849   12.842   12.837   12.941   25.454
0.3     12.845   12.837   12.857   12.861   12.950   25.467
0.5     12.864   12.862   12.867   12.884   12.964   25.457
0.7     12.881   12.884   12.882   12.890   12.990   25.452
0.9     12.904   12.918   12.908   12.940   13.006   25.450
1.1     12.938   12.943   12.945   12.969   13.059   25.428
1.3     12.973   12.986   12.977   12.993   13.105   25.356
1.5     13.016   13.017   13.034   13.029   13.141   25.293
1.7     13.066   13.067   13.100   13.117   13.182   25.324
1.9     13.110   13.105   13.159   13.172   13.214   25.343
2.1     13.123   13.140   13.183   13.227   13.255   25.332
2.3     13.210   13.201   13.263   13.292   13.307   25.355
2.5     13.250   13.246   13.314   13.333   13.380   25.402
2.7     13.323   13.324   13.382   13.454   13.441   25.428
2.9     13.374   13.384   13.458   13.497   13.491   25.478
3.1     13.444   13.459   13.527   13.580   13.553   25.553

Table 4.3: Optimal discounted cost estimates for the inventory control problem using different methods. Each entry represents the discounted cost (mean, with standard error in parentheses) based on 1000 independent runs of horizon 40.
sigma   Ours 1         Ours 2         CE-Mean        CE-MLE         PPOMDP         Greedy
0.1     126.79 (1.64)  127.26 (1.63)  129.12 (1.81)  129.09 (1.81)  137.41 (1.65)  241.67 (2.99)
0.3     126.86 (1.63)  126.95 (1.63)  129.17 (1.78)  129.11 (1.78)  137.64 (1.62)  242.08 (2.98)
0.5     126.61 (1.63)  126.35 (1.62)  129.12 (1.77)  129.16 (1.78)  138.16 (1.60)  242.66 (2.98)
0.7     126.42 (1.62)  126.99 (1.61)  129.30 (1.77)  129.62 (1.79)  141.78 (1.55)  243.33 (2.98)
0.9     126.78 (1.63)  126.86 (1.63)  129.59 (1.76)  129.76 (1.78)  138.23 (1.60)  244.00 (2.97)
1.1     127.49 (1.64)  127.74 (1.63)  130.19 (1.77)  130.23 (1.75)  140.86 (1.57)  244.81 (2.97)
1.3     128.74 (1.65)  128.30 (1.64)  130.49 (1.76)  130.54 (1.72)  146.02 (1.52)  245.67 (2.96)
1.5     129.30 (1.68)  129.45 (1.66)  130.74 (1.75)  131.09 (1.77)  144.88 (1.52)  246.71 (2.95)

Table 4.4: Continuation of Table 4.3.

sigma   Ours 1         Ours 2         CE-Mean        CE-MLE         PPOMDP         Greedy
1.7     129.71 (1.67)  128.93 (1.67)  130.95 (1.76)  131.45 (1.77)  146.80 (1.52)  247.70 (2.96)
1.9     130.11 (1.69)  129.85 (1.67)  131.29 (1.75)  131.60 (1.73)  147.16 (1.56)  248.55 (2.93)
2.1     130.67 (1.69)  130.17 (1.67)  131.76 (1.74)  132.24 (1.79)  144.67 (1.54)  249.45 (2.95)
2.3     130.96 (1.68)  130.36 (1.67)  132.22 (1.75)  132.76 (1.78)  145.35 (1.55)  250.07 (2.97)
2.5     131.90 (1.68)  130.86 (1.68)  132.54 (1.76)  133.47 (1.78)  145.06 (1.58)  250.49 (2.96)
2.7     131.81 (1.68)  131.66 (1.68)  133.18 (1.75)  133.98 (1.78)  148.39 (1.54)  250.76 (2.96)
2.9     132.36 (1.68)  131.78 (1.68)  133.61 (1.75)  134.56 (1.83)  146.27 (1.57)  250.81 (2.96)
3.1     132.95 (1.70)  133.51 (1.70)  134.09 (1.76)  135.83 (1.79)  147.96 (1.54)  250.89 (2.95)
3.3     133.08 (1.69)  132.76 (1.69)  134.81 (1.76)  136.12 (1.84)  145.32 (1.60)  250.77 (2.94)

Chapter 5

Particle Filtering Framework for Optimization

5.1 Related Work and Motivation

We consider the global optimization problem
\[ x^* = \arg\max_{x \in X} H(x), \]
and assume that it has a unique global optimal solution $x^*$.
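As a toy instance of this problem class, the objective below (an illustrative construction, not taken from the text) has many local maxima but a unique global maximizer, which a dense grid search can locate for reference:

```python
import numpy as np

def H(x):
    """Illustrative 1-D multimodal objective: the cos(5x) term creates many
    local maxima, and the quadratic penalty makes x* = 0 the unique global one."""
    return np.cos(5 * x) - 0.1 * x ** 2

# Brute-force reference solution on a dense grid over the solution space X
xs = np.linspace(-10.0, 10.0, 200_001)
x_star = xs[np.argmax(H(xs))]
```

Any local ascent started outside the central basin stalls on one of the side peaks; this is exactly the difficulty that the model-based methods reviewed next attempt to overcome.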
Many simulation-based global optimization methods, such as the estimation of distribution algorithms (EDAs) [64] [59], the cross-entropy (CE) method [78] [79], and the model reference adaptive search (MRAS) method [41], fall into the category of "model-based methods" as classified by [98]. They share the common characteristic of iteratively repeating the following two steps:

- Generate candidate solutions from an intermediate distribution over the solution space;
- Update the intermediate distribution using the candidate solutions.

This intermediate distribution is often referred to as a probabilistic model, since it often imposes a model on the relationship between the components that are needed to represent a solution. The choice and updating of the probabilistic model (or intermediate distribution) play a key role in determining the efficiency and accuracy of the algorithm.

EDAs were first proposed in [64] in the field of evolutionary computation, with the goal of eliminating the mutation and cross-over operations of genetic algorithms (GAs) in order to avoid partial solutions. EDAs generate offspring by sampling from a distribution over the solution space that is estimated from the candidate solutions of the previous iteration. The estimation of this distribution is often based on a probabilistic model that explicitly expresses the relationship between the underlying variables [53].

The cross-entropy (CE) method was originally introduced for estimating the probabilities of rare events in complex stochastic networks [77], and was later modified slightly to solve combinatorial and continuous optimization problems [78]. A key idea of the CE method is to minimize the Kullback-Leibler (KL) divergence between a desired density (the optimal importance sampling density) and a family of parameterized densities, in particular an exponential family, since an analytical solution can be computed in this case.

The MRAS method was introduced in [41].
MRAS implicitly constructs a sequence of reference distributions and uses this sequence to facilitate and guide the updating of the parameters associated with a family of parameterized distributions. At each iteration, candidate solutions are sampled from the distribution (in the prescribed family) that has the minimum KL divergence with respect to the reference distribution of the previous iteration.

The aforementioned various ways of updating the probabilistic model motivate us to look for a unifying and systematic approach to probabilistic model-based methods for optimization. Our main idea is to transform the optimization problem into a filtering problem. The goal of filtering is to estimate the unobserved state in a dynamic system through a sequence of noisy observations of the state. The unobserved state corresponds to the optimal solution to be estimated; the noisy observations in filtering bring randomization into the optimization algorithm; and the conditional distribution of the unobserved state is a distribution over the solution space, which approaches a delta function concentrated on the optimal solution as the system evolves. Hence, the task of searching for the optimal solution is carried out through the procedure of estimating the conditional density sequentially. This idea is only conceptual, since filtering can rarely be solved analytically, and approximate filtering methods are needed. We apply particle filtering to solve the transformed filtering problem. Based on particle filtering, we propose a plain particle filtering framework and a general particle filtering framework for optimization. The former framework is a special case of the latter and more intuitive, while the latter is a generalization and hence provides more opportunities for developing new algorithms. To the best of our knowledge, this is the first work applying particle filtering to the field of optimization.
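The two-step loop shared by these model-based methods can be sketched in CE style: sample candidates from a parameterized (here Gaussian) model, keep an elite subset, and refit the model to the elites; for an exponential family this refit is exactly the minimum-KL update mentioned above, which reduces to moment matching. The toy objective, elite count, smoothing, and iteration budget below are illustrative choices, not prescribed by the text.

```python
import numpy as np

def H(x):
    # illustrative multimodal objective with unique global maximum at x* = 0
    return np.cos(5 * x) - 0.1 * x ** 2

rng = np.random.default_rng(3)
mu, sigma = 0.0, 10.0             # parameters of the Gaussian probabilistic model
n, n_elite, alpha = 500, 50, 0.7  # samples per iteration, elite count, smoothing

x_best, h_best = None, -np.inf    # best candidate seen so far
for _ in range(60):
    xs = rng.normal(mu, sigma, size=n)      # step 1: generate candidate solutions
    hs = H(xs)
    order = np.argsort(hs)
    if hs[order[-1]] > h_best:
        x_best, h_best = xs[order[-1]], hs[order[-1]]
    elites = xs[order[-n_elite:]]           # best-performing candidates
    # Step 2: update the model. For a Gaussian family, the minimum-KL fit to
    # the elite samples is moment matching, smoothed to avoid premature collapse.
    mu = alpha * elites.mean() + (1 - alpha) * mu
    sigma = alpha * elites.std() + (1 - alpha) * sigma
```

The model concentrates around a high-quality solution as `sigma` shrinks; tracking the best candidate ever sampled guards against the model collapsing onto a local peak.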
The particle filtering framework unifies EDAs, the CE method, and MRAS, and provides new insights into the three optimization methods from another viewpoint. More specifically, EDAs and the CE method fit in the plain particle filtering framework, and in particular, the CE method corresponds to the projection particle filtering described in section 4.3. EDAs and the CE method differ only in their ways of constructing an approximation for the conditional density based on the samples/particles. The MRAS method fits in the general particle filtering framework, with a specific way to construct the resampling importance density.

The particle filtering framework also sheds light on developing new improved algorithms for global optimization. The possibilities for new algorithms come from the freedom in the particle filtering framework, as well as the vast array of techniques for improving nonlinear filtering and particle filtering. We focus on three promising directions: adjusting the trade-off between exploration and exploitation in the search by appropriately choosing the importance densities in particle filtering; incorporating gradient-based local search into simulation-based global search by including the gradient term of the objective function in the state-space model of the filtering problem; and preventing premature convergence in simulation-based optimization methods by introducing the idea of "persistent excitation" to optimization.

5.2 Filtering for Optimization

We consider the global optimization problem:

    x* = argmax_{x ∈ X} H(x),        (5.1)

where the solution space X is a nonempty set in R^n, and H(·) : X → R is a real-valued function that is bounded, i.e., there exist M_1 > −∞ and M_2 < ∞ such that M_1 ≤ H(x) ≤ M_2 for all x ∈ X. We assume that (5.1) has a unique global optimal solution, i.e., there exists x* ∈ X such that H(x) < H(x*) for all x ≠ x*, x ∈ X.

The optimization problem (5.1) can be transformed into a filtering problem by constructing an appropriate state-space model.
Let the state-space model be

    x_k = x_{k-1},          k = 1, 2, ...,
    y_k = H(x_k) − v_k,     k = 0, 1, ...,        (5.2)

where x_k ∈ R^n is the unobserved state to be estimated, y_k ∈ R is the observation, v_k ∈ R is the observation noise, which is an i.i.d. sequence, and the initial state is x_0 = x*, which is unobserved. We assume that v_k has a p.d.f. φ(·) that is nondecreasing on its support.

As shown before, the conditional density b_k(x_k) ≜ p(x_k | y_{1:k}) is updated recursively according to

    b_k(x_k) ∝ p(y_k | x_k) ∫ p(x_k | x_{k-1}) b_{k-1}(x_{k-1}) dx_{k-1}.        (5.3)

For the above state-space model (5.2), the transition density is

    p(x_k | x_{k-1}) = δ(x_k − x_{k-1}),        (5.4)

where δ denotes the Dirac delta function. The likelihood function is

    p(y_k | x_k) = φ(H(x_k) − y_k).        (5.5)

Substituting (5.4) and (5.5) into (5.3), we obtain

    b_k(x_k) = φ(H(x_k) − y_k) b_{k-1}(x_k) / ∫ φ(H(x_k) − y_k) b_{k-1}(x_k) dx_k.        (5.6)

The connection between the filtering problem (5.2) and the optimization problem (5.1) appears in several places. The underlying value of the unobserved state x_k is x*, and hence, the goal of filtering (to estimate x_k) is the same as that of optimization (to find x*). In filtering, at each time k, we estimate a conditional density b_k of x_k. In optimization, b_k is a density over the solution space, interpreted as our belief about the possible values that x* might take, and it serves as the intermediate distribution for drawing new candidate solutions at the next iteration. At each iteration k, b_k is updated according to (5.6) with an incoming new observation y_k, which reveals new information about x*. The noise in y_k, or in other words, a nondegenerate p.d.f. φ, brings randomization into the optimization algorithm. We can choose φ to determine how the conditional density (i.e., b_{k-1}) is tuned by the performance of the solutions to yield a new conditional density (i.e., b_k) at the next iteration.
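As a concrete illustration, the exact recursion (5.6) can be run numerically on a discretized one-dimensional solution space, where the integral in the denominator reduces to a sum. This is a minimal sketch, not from the thesis: the objective H, the noise density φ, and the observation values are illustrative assumptions.

```python
import numpy as np

# Discretized solution space; the integral in (5.6) becomes a sum over X.
X = np.linspace(-2.0, 2.0, 401)          # grid over the solution space
H = lambda x: -(x - 0.7) ** 2            # toy objective with maximizer x* = 0.7
phi = lambda u: np.exp(u)                # an increasing phi favors better solutions

b = np.full(X.shape, 1.0 / X.size)       # b_0: uniform prior over X
for y in (-1.0, -0.5, -0.1):             # a nondecreasing observation sequence {y_k}
    b = phi(H(X) - y) * b                # numerator of (5.6)
    b /= b.sum()                         # normalization = the integral in (5.6)

x_hat = float(X[np.argmax(b)])           # the mode of b_k concentrates near x* = 0.7
```

Each pass through the loop reweights the density by the likelihood φ(H(x) − y_k), so mass accumulates where H is largest, exactly as the text describes.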
For example, if φ(x) is an increasing function of x, the likelihood function (5.5) assigns more weight to the candidate solutions that have better performance; if φ(x) = 0 for x < 0, then (5.5) discards inferior candidate solutions whose performance is worse than y_k. It should be expected that if y_k increases with k, the conditional density b_k will get closer to the density of x_k, i.e., a Dirac delta function concentrated on x*. From the viewpoint of filtering, b_k is the posterior density of x_k that approaches the density of x_k. From the optimization viewpoint, b_k is a density defined on the solution space that becomes more and more concentrated on the optimal solution as k increases. Fig. 5.1 illustrates how b_k changes in the first three iterations: at k = 0, there is no prior knowledge of x* (the true value of x_k), and hence, b_0 is uniform over the solution space; at k = 1, as observation y_0 provides new information about x*, b_1 has more weight on the region of the solution space centered around x*; at k = 2, b_2 becomes even more concentrated on x*.

Figure 5.1: At the first three iterations, the conditional density b_k becomes more and more concentrated on the optimal solution.

In summary, solving the maximization problem (5.1) is equivalent to recursively estimating b_k of the model (5.2) while constructing an increasing sequence of observations {y_k}.

5.3 Particle Filtering Framework for Optimization

The filtering idea is only conceptual, and the resulting problem is not analytically solvable. Hence, we need approximation methods to solve the above filtering problem. We apply the plain particle filter and the general particle filter to the filtering problem, with a slight tweak to adapt them to the optimization problem. The application of particle filtering yields a framework encompassing many simulation-based optimization algorithms.
In the following, we present the plain particle filter framework for optimization (PPFO), and then the general particle filter framework for optimization (GPFO). The former framework is a special case of the latter and provides more intuition, while the latter framework is more general and allows more variations in the development of new algorithms.

Algorithm 5.1 Plain Particle Filter Framework for Optimization (PPFO)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. b_0 that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from b_0. Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. Set b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where the weights are calculated according to w_k^i ∝ φ(H(x_k^i) − y_k), i = 1, 2, ..., N, and normalized.

4. Resampling. Construct a continuous approximation b̃_k(x_k) from b̂_k(x_k). Sample {x_{k+1}^i}_{i=1}^N i.i.d. from b̃_k(x_k).

5. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

At initialization, the PPFO algorithm draws samples from an initial distribution b_0 that is defined on X. A parameter ρ is specified to determine the (1−ρ)-quantile samples that will be used to construct a nondecreasing observation sequence {y_k}. The requirement of nondecreasingness is to ensure that the observation sequence does not become worse as the algorithm proceeds. Since the transition probability is 1, the importance sampling step is omitted, with a suitable change of the indices. The Bayes' updating step assigns weights to the samples according to their performance. Slightly different from the plain particle filter, the resampling step here first constructs a continuous density b̃_k, since the discrete approximation b̂_k does not provide any new samples. The new samples drawn from b̃_k are more concentrated in the promising areas than the old samples.
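The steps of Algorithm 5.1 can be sketched in code. This is a hedged illustration, not the thesis implementation: the continuous approximation b̃_k is taken to be a single Gaussian fitted to the weighted particles, and φ, N, ρ, and the iteration budget are illustrative choices.

```python
import numpy as np

def ppfo(H, n, N=200, rho=0.1, iters=60, seed=0):
    """Sketch of Algorithm 5.1 (PPFO) for maximizing H over R^n."""
    rng = np.random.default_rng(seed)
    phi = lambda u: np.exp(np.minimum(u, 0.0))   # a nondecreasing weight function
    x = rng.normal(0.0, 2.0, size=(N, n))        # step 1: sample from b_0
    y_prev = -np.inf
    for _ in range(iters):
        h = H(x)
        y = max(np.quantile(h, 1.0 - rho), y_prev)    # step 2: nondecreasing y_k
        w = phi(h - y)                                # step 3: Bayes' updating
        w /= w.sum()
        mu = w @ x                                    # step 4: fit a Gaussian b~_k
        cov = (w[:, None] * (x - mu)).T @ (x - mu) + 1e-9 * np.eye(n)
        x = rng.multivariate_normal(mu, cov, size=N)  # ...and resample from it
        y_prev = y
    return mu

# e.g., maximizing H(x) = -||x - 1||^2 drives the particles toward (1, ..., 1)
x_best = ppfo(lambda x: -np.sum((x - 1.0) ** 2, axis=1), n=2)
```

Fitting a single Gaussian is only one choice of b̃_k; as discussed later, kernel density estimates or Gaussian mixtures fit the same slot with more exploration.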
To adapt to optimization, a stopping step is added to the algorithm, whereas in filtering, the algorithm keeps running as the real system operates.

Similarly, by applying the general particle filter, we introduce the general particle filtering framework for optimization as follows:

Algorithm 5.2 A General Particle Filtering Framework for Optimization (GPFO)

1. Initialization. Specify ρ ∈ (0, 1], a nonnegative nonincreasing sequence {ε_k}, and an initial p.d.f./p.m.f. b_0 that is defined on X. Sample {x_0^i}_{i=1}^N i.i.d. from b_0. Set k = 1.

2. Importance Sampling. Sample x_k^i from q_k(x_k | x_{k-1}^i, y_k), i = 1, ..., N.

3. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

4. Bayes' Updating. Set b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where the normalized weights are calculated as

    w_k^i ∝ φ(H(x_k^i) − y_k) b_{k-1}(x_{k-1}^i) / [ q_k(x_k^i | x_{k-1}^i, y_k) g_{k-1}(x_{k-1}^i | y_{0:k-1}) ].

5. Importance Resampling. Sample {x_k^i}_{i=1}^N i.i.d. from g_k(x_k | y_{0:k}).

6. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

5.4 Interpretation of EDAs, CE, and MRAS

In this section, we use the particle filtering framework to interpret some of the existing optimization algorithms: estimation of distribution algorithms (EDAs), the cross-entropy (CE) method, and model reference adaptive search (MRAS). A class of EDAs fits in the plain particle filtering framework. The CE method can be viewed as projection particle filtering, which also fits in the plain particle filtering framework with a specific way to construct a continuous approximation of the conditional density, namely the density projection approach. The MRAS method fits in the general particle filtering framework, where the resampling importance density is constructed via density projection. This interpretation also provides another insight into the relationships between the three methods, in addition to the view in [41].
Specifically, the main difficulty in EDAs is to estimate a distribution from the samples, and this difficulty is resolved in the CE method by projecting the empirical distribution of the samples to obtain an approximate continuous density. However, the density projection introduces an error, which is corrected in the MRAS method by taking the projected density as a resampling importance density and hence weighting the samples differently from the CE method.

In all three methods, the most common sample selection scheme is the so-called truncated selection [93], which selects the elite samples whose performance is above a threshold. In the following, we will focus on the truncated selection scheme.

Recall that in the optimization problem (5.1), the objective function H(x) is bounded: M_1 ≤ H(x) ≤ M_2. In the state-space model (5.2), let the observation noise v_k follow a uniform distribution U(0, M_2 − M_1). Since v_k = H(x_k) − y_k, the likelihood function is

    p(y_k | x_k) = 1/(M_2 − M_1),  if 0 ≤ H(x_k) − y_k ≤ M_2 − M_1;
                   0,              otherwise.        (5.7)

Since y_k = H(x) for some x ∈ X, the inequality H(x_k) − y_k ≤ M_2 − M_1 always holds. Hence, (5.7) can be written more compactly as

    p(y_k | x_k) = (1/(M_2 − M_1)) I{H(x_k) ≥ y_k},        (5.8)

where I{·} denotes the indicator function. Substituting (5.8) into (5.6), we obtain

    b_k(x_k) = I{H(x_k) ≥ y_k} b_{k-1}(x_k) / ∫ I{H(x_k) ≥ y_k} b_{k-1}(x_k) dx_k.        (5.9)

With i.i.d. samples {x_k^i}_{i=1}^N drawn from b_{k-1}, b_k(x_k) can be approximated by

    b̂_k(x_k) = Σ_{i=1}^N I{H(x_k^i) ≥ y_k} δ(x_k − x_k^i) / Σ_{i=1}^N I{H(x_k^i) ≥ y_k}.        (5.10)

Thus, (5.9) is equivalent to selecting the elite solutions to tune the sampling distribution of the previous iteration, and (5.10) is the Monte Carlo version of (5.9). These two equations are the cornerstone of the rest of this section.

5.4.1 Estimation of Distribution Algorithms

EDAs are a class of optimization algorithms based on the key idea of iteratively performing the following two steps:

1. Select elite samples from a pool of samples that are generated from a probability distribution;

2.
Estimate the probability distribution of the selected samples and generate new samples from it.

With the truncated selection scheme, one class of EDAs can be viewed as an instantiation of the plain particle filtering framework as follows:

Algorithm 5.3 Instantiation 1 of the Plain Particle Filter Framework for Optimization (PPFO1)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. b_0 that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from b_0. Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. The discrete approximation b̂_k(x_k) is given by (5.10).

4. Resampling. Estimate a continuous approximation b̃_k(x_k) from b̂_k(x_k). Sample {x_{k+1}^i}_{i=1}^N i.i.d. from b̃_k(x_k).

5. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

It is obvious that the observation construction and Bayes' updating steps essentially select the elite samples according to the truncated selection scheme, corresponding to step 1 in EDAs; and the resampling step corresponds to step 2 in EDAs.

The main difficulty in EDAs is to estimate the distribution of the selected samples. In doing so, EDAs often take into account the interaction between the underlying variables that represent a solution, and express the interaction explicitly through the use of different probabilistic models. One way is to use a dynamic Bayesian network (DBN) to represent such a probabilistic model and infer the underlying probability distribution [53]. Put in our context, the relationship between the components of the state vector x_k is expressed through the use of a DBN, and the estimate of the joint distribution of the components is b̃_k(x_k). Interestingly, there is a particular particle filter designed especially for DBNs [49], which samples x_k according to the relationship between its components so that the sampling is more efficient.
This particle filter can be adopted to improve EDAs that use the DBN representation.

5.4.2 Cross-Entropy Method

The standard CE method (we use the word "standard" to distinguish it from the extended version of standard CE [26]) for the optimization problem (5.1) is as follows:

Algorithm 5.4 Standard CE Algorithm for Optimization

1. Choose an initial p.d.f./p.m.f. f(·; θ_0), θ_0 ∈ Θ. Specify the parameter ρ ∈ (0, 1], and set k = 1.

2. Generate samples {x_k^i}_{i=1}^N from the density f(·; θ_{k-1}) and compute the sample (1−ρ)-quantile y_k of the performances {H(x_k^i)}_{i=1}^N.

3. Compute the new parameter according to

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ).        (5.11)

4. If a stopping criterion is satisfied, then terminate; else, set k = k + 1 and go to step 2.

Equation (5.11) comes from the density projection of the optimal importance sampling density onto a parameterized family of densities {f(·; θ), θ ∈ Θ}. Projection particle filtering [95] also uses the density projection technique, but for a very different reason. It projects the discrete approximation b̂_k onto the parameterized family {f(·; θ), θ ∈ Θ} in order to obtain a continuous approximation b̃_k that is characterized by only a few parameters, which is very useful in reducing the complexity of dynamic programming in a decision making problem. Specifically, projection particle filtering chooses a value of the parameter θ such that the Kullback-Leibler (KL) divergence between b̂_k and f(·; θ) is minimized. The KL divergence between b̂_k and f(·; θ) is

    D_KL(b̂_k ‖ f(·; θ)) = ∫ b̂_k log (b̂_k / f(·; θ)) = ∫ b̂_k log b̂_k − ∫ b̂_k log f(·; θ).

Since the first term does not depend on θ, minimizing the above expression is equivalent to solving the maximization problem

    max_{θ ∈ Θ} E_{b̂_k}[log f(·; θ)].

Since b̂_k(x_k) satisfies (5.10), the above maximization problem can be approximated by

    max_{θ ∈ Θ} Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ) / Σ_{i=1}^N I{H(x_k^i) ≥ y_k},

which is equivalent to

    max_{θ ∈ Θ}
(1/N) Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ).        (5.12)

Therefore, the optimization algorithm adapted from projection particle filtering is as follows:

Algorithm 5.5 Instantiation 2 of the Plain Particle Filter Framework for Optimization (PPFO2)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. f(x_0; θ_0) that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from f(x_0; θ_0). Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. The discrete approximation b̂_k(x_k) is given by (5.10).

4. Resampling. Construct a continuous approximation b̃_k(x_k) = f(x_k; θ_k), where

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N I{H(x_k^i) ≥ y_k} log f(x_k^i; θ).        (5.13)

Sample {x_{k+1}^i}_{i=1}^N i.i.d. from b̃_k(x_k).

5. Stopping. If a stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

It is easy to see that this algorithm is essentially the same as the standard CE algorithm (note that (5.11) and (5.13) are exactly the same). Compared with EDAs, the CE method avoids complicated estimation of the density b_k through the use of density projection, without the need to specify the relationships among the components of x_k. However, from a filtering viewpoint, the projection particle filtering introduces a projection error that accumulates over iterations. The reason can be seen by scrutinizing the one-step evolution of the approximate density. Since the samples {x_k^i}_{i=1}^N are drawn from b̃_{k-1} = f(·; θ_{k-1}), the density that the algorithm actually tries to approximate at iteration k is

    b_k(x_k) = I{H(x_k) ≥ y_k} f(x_k; θ_{k-1}) / ∫ I{H(x_k) ≥ y_k} f(x_k; θ_{k-1}) dx_k.

Compared with the original equation (5.9) for b_k, b_{k-1} is replaced by its approximation f(·; θ_{k-1}), which introduces a projection error that is carried over to the next iteration. This projection error can be corrected by taking f(·; θ_{k-1}) as an importance density, so that it is taken care of by the weights of the samples.
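The contrast between the CE update (5.13) and the importance-density correction just described can be made concrete for a diagonal Gaussian family f(·; θ) = N(μ, diag(σ²)): with the indicator weights of (5.10), the maximizer of (5.13) is simply the sample mean and variance of the elite particles, while the corrected (MRAS-style) update additionally divides each elite sample by its previous sampling density. This is a hedged sketch with illustrative names and toy values, not the thesis code.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Diagonal Gaussian density f(x; theta) with theta = (mu, sigma2)."""
    z = ((x - mu) ** 2 / sigma2).sum(axis=-1)
    return np.exp(-0.5 * z) / np.sqrt((2.0 * np.pi) ** x.shape[-1] * np.prod(sigma2))

def ce_projection(x, h, y):
    """CE update (5.13): equal weights on elites -> elite mean and variance."""
    elite = x[h >= y]
    return elite.mean(axis=0), elite.var(axis=0)

def corrected_projection(x, h, y, mu_prev, s2_prev):
    """Projection with importance correction: elites divided by f(.; theta_{k-1})."""
    w = (h >= y).astype(float) / gaussian_pdf(x, mu_prev, s2_prev)
    w /= w.sum()
    mu = w @ x
    s2 = w @ (x - mu) ** 2
    return mu, s2

x = np.array([[0.0], [1.0], [2.0], [3.0]])
h = np.array([0.0, 1.0, 2.0, 3.0])             # e.g. H(x) = x on these samples
mu_ce, s2_ce = ce_projection(x, h, y=2.0)      # elites are x = 2 and x = 3
mu_c, s2_c = corrected_projection(x, h, y=2.0,
                                  mu_prev=np.zeros(1), s2_prev=np.ones(1))
```

In this toy example the elite at x = 3 was unlikely under the previous density N(0, 1), so the corrected update up-weights it and pulls the new mean beyond the plain elite average.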
This leads to another instantiation of the particle filtering framework, which coincides with the instantiation of MRAS developed in [41].

5.4.3 Model Reference Adaptive Search

As in the CE method, at each iteration the MRAS method projects a desired density onto a family of parameterized densities to yield a density from which the candidate solutions are drawn. In the CE method the target distribution is a single optimal importance sampling density, whereas in the MRAS method, the parameter updating is guided by an implicit sequence of distributions called the reference distributions. Here, we consider the Monte Carlo version of the MRAS_0 algorithm with the truncated selection scheme presented in [41]:

Algorithm 5.6 Model Reference Adaptive Search Method with Truncated Selection Scheme

1. Choose an initial p.d.f./p.m.f. f(·; θ_0), θ_0 ∈ Θ. Specify the parameter ρ ∈ (0, 1], and set k = 1.

2. Generate samples {x_k^i}_{i=1}^N from the density f(·; θ_{k-1}) and compute the sample (1−ρ)-quantile y_k of the performances {H(x_k^i)}_{i=1}^N.

3. Compute the new parameter according to

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N [I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1})] log f(x_k^i; θ).        (5.14)

4. If a stopping criterion is satisfied, then terminate; else, set k = k + 1 and go to step 2.

Equation (5.14) comes from the projection of the implicit reference distribution onto the family of parameterized densities. In the particle filtering framework, we will see that the sequence of reference distributions is the sequence of approximated conditional densities {b̂_k}, which guide the design of the resampling importance densities {f(·; θ_k)}. Specifically, let the initial p.d.f./p.m.f. be

    b_0(x_0) = I{H(x_0) ≥ y_0} / ∫ I{H(x_0) ≥ y_0} dx_0.

Since {y_k} is a nondecreasing sequence, using the recursive equation (5.9), it can be shown by induction that

    b_k(x_k) = I{H(x_k) ≥ y_k} / ∫ I{H(x_k) ≥ y_k} dx_k.        (5.15)

Suppose the resampling importance density g_{k-1} is

    g_{k-1}(x_{k-1} | y_{0:k-1}) = f(x_{k-1}; θ_{k-1}),

from which the i.i.d.
samples {x_k^i}_{i=1}^N are drawn; then the weight of each x_k^i is

    w_k^i = b_k(x_k^i) / g_{k-1}(x_{k-1}^i | y_{0:k-1}) ∝ I{H(x_k^i) ≥ y_k} / f(x_{k-1}^i; θ_{k-1}).        (5.16)

As shown before, projection of b̂_k(x_k) onto the parameterized family of densities {f(·; θ), θ ∈ Θ} is equivalent to the maximization problem

    max_{θ ∈ Θ} E_{b̂_k}[log f(·; θ)].        (5.17)

Since b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where w_k^i satisfies (5.16), and x_{k-1}^i = x_k^i, (5.17) can be approximated by

    max_{θ ∈ Θ} Σ_{i=1}^N [I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1})] log f(x_k^i; θ).

Thus, the algorithm is as follows:

Algorithm 5.7 Instantiation 1 of the General Particle Filter Framework for Optimization (GPFO1)

1. Initialization. Specify ρ ∈ (0, 1], and an initial p.d.f./p.m.f. f(x_0; θ_0) that is defined on X. Sample {x_1^i}_{i=1}^N i.i.d. from f(x_0; θ_0). Set k = 1.

2. Observation Construction. Let y_k be the sample (1−ρ)-quantile of {H(x_k^i)}_{i=1}^N. If k > 1 and y_k ≤ y_{k-1}, then set y_k = y_{k-1}.

3. Bayes' Updating. Set b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_k^i), where the weights are

    w_k^i ∝ I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1}).

4. Resampling. Construct a resampling importance density g_k(x_k | y_{0:k}) = f(x_k; θ_k), where

    θ_k = argmax_{θ ∈ Θ} (1/N) Σ_{i=1}^N [I{H(x_k^i) ≥ y_k} / f(x_k^i; θ_{k-1})] log f(x_k^i; θ).        (5.18)

Sample {x_{k+1}^i}_{i=1}^N i.i.d. from g_k(x_k | y_{0:k}).

5. Stopping. If a specified stopping criterion is satisfied, then stop; else, set k ← k + 1 and go to step 2.

Note that (5.18) is exactly the same as the updating equation (5.14) in MRAS, and Algorithm 5.7 and Algorithm 5.6 are essentially the same.

5.5 Implications for New Algorithms

The particle filtering framework for optimization not only provides a unifying framework for some of the existing algorithms, but also opens up the possibility for new algorithms. There is a considerable amount of freedom in the framework, such as the choices of the observation noise, the observation sequence, and the sampling and resampling importance densities, each of which can lead to a different algorithm.
In addition, many of the techniques that have been used to improve particle filtering can also be adapted to optimization, such as the improvement of EDAs with the particle filter for DBNs mentioned in the last section. In the rest of this section, we will focus on three promising directions for developing new improved algorithms under this framework.

5.5.1 Balancing Exploration and Exploitation

Proper sampling and resampling importance densities can be chosen to adjust the trade-off between exploitation and exploration. Construction of the resampling importance density using the kernel method for density estimation [65], or approximation with a Gaussian mixture [50], is very easy to implement, and the obtained continuous distributions are easy to sample from. They add more exploration of the solution space, compared to the single Gaussian density that is often used in CE and MRAS. A Markov chain Monte Carlo (MCMC) step can be added after resampling [32] to further adjust the trade-off between exploitation and exploration, or to add some local search.

5.5.2 Combining Global Search with Local Search

We can incorporate local/gradient search into global search systematically by including the gradient of the objective function in the state-space model. First, assume that the objective function H(x) is differentiable with respect to x, and denote its gradient by

    G(x) ≜ ∇H(x).        (5.19)

Let the state-space model be

    x_k = x_{k-1} + λ_k G(x_{k-1}),    k = 1, 2, ...,
    y_k = H(x_k) − v_k,                k = 0, 1, ...,        (5.20)

where {λ_k} is a sequence of properly chosen step sizes. The intuition behind model (5.20) is that the unobserved state x_k has a stationary underlying value x* if the initial condition is x_0 = x*. Since G(x*) = 0, the state stays at x* for all time k if it starts at x*. For this model, the propagation step in the particle filtering framework moves each sample along its gradient.
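The propagation step implied by model (5.20) simply moves every particle along the gradient before reweighting. A minimal sketch, with an illustrative objective and step size:

```python
import numpy as np

def propagate(x, G, lam):
    """One propagation step of model (5.20): x_k = x_{k-1} + lam * G(x_{k-1})."""
    return x + lam * G(x)

# e.g., for H(x) = -||x||^2 the gradient is G(x) = -2x, so each particle
# takes a gradient-ascent step toward the maximizer x* = 0
G = lambda x: -2.0 * x
x = np.array([[1.0, 1.0], [-2.0, 0.5]])
x_next = propagate(x, G, lam=0.25)
```

With lam = 0.25 each coordinate is halved, i.e., every particle moves halfway toward the stationary point of this toy objective.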
Therefore, the resulting algorithm incorporates a gradient-based local search into the simulation-based global search.

5.5.3 Overcoming Premature Convergence

Many simulation-based optimization algorithms suffer from the problem of premature convergence, i.e., they converge so fast that they get stuck at a local optimum. This problem is especially severe if the objective function has many local optima. A similar phenomenon often occurs in filtering for a static state, also called system identification: the algorithm converges so fast that it converges to a wrong value. To avoid premature convergence, many system identification algorithms employ the idea of "persistent excitation", which adds an artificial noise to the static system dynamics to maintain a continuous exploration of the state space [48] [58]. The artificial noise is large at the beginning to prevent premature convergence to a wrong value, and gradually dies down as the estimate converges to the true value. Hence, in the state-space model (5.2), the state equation becomes

    x_k = x_{k-1} + α_k ω_k,    k = 1, 2, ...,        (5.21)

where

    α_k = α_1 β^{k-1},    β ∈ (0, 1).

From the optimization viewpoint, the propagation step moves each candidate solution around randomly with decreasing randomness at each iteration, which allows for more exploration at the beginning. This idea is very effective in preventing premature convergence in optimization algorithms, as shown in the next section.

5.6 Numerical Experiments

The standard CE method often suffers from premature convergence due to the quick convergence of the parameterized family of distributions f(·; θ_k) to a degenerate (Dirac) measure [26]. A smoothed parameter updating procedure has been used to address this problem [26], i.e., a smoothed version of the distribution parameter θ_k is computed at each iteration k according to

    θ̃_k = ν θ_k + (1 − ν) θ̃_{k-1},    ν ∈ (0, 1),

where ν
is the smoothing parameter, and θ̃_{k-1} is the smoothed distribution parameter at iteration k − 1.

Since the standard CE method can be included in the particle filtering framework, it is straightforward to apply "persistent excitation" to the standard CE method. Hence, instead of smoothing the parameter, at each iteration we randomly move the candidate solutions according to (5.21) before updating the parameterized density. In the following, we numerically compare our proposed method (abbreviated as CEA) and the CE method with smoothing on some benchmark problems, which have been previously studied in [67] [51] [41]. The test functions are:

(1) Rosenbrock function (n = 20)

    H(x) = Σ_{i=1}^{n-1} [100 (x_{i+1} − x_i^2)^2 + (x_i − 1)^2],

where x* = (1, ..., 1)^T, H(x*) = 0.

(2) Dejong's 5th function (n = 2)

    H(x) = [0.002 + Σ_{j=1}^{25} 1 / (j + Σ_{i=1}^{2} (x_i − a_{j,i})^6)]^{-1},

where a_{j,1} = {−32, −16, 0, 16, 32, −32, −16, 0, 16, 32, −32, −16, 0, 16, 32, −32, −16, 0, 16, 32, −32, −16, 0, 16, 32}, a_{j,2} = {−32, −32, −32, −32, −32, −16, −16, −16, −16, −16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 16, 32, 32, 32, 32, 32}, with 24 local minima and one global minimum at x* = (−32, −32)^T, H(x*) ≈ 0.998.

(3) Powell singular function (n = 20)

    H(x) = Σ_{i=2}^{n-2} [(x_{i-1} + 10 x_i)^2 + 5 (x_{i+1} − x_{i+2})^2 + (x_i − 2 x_{i+1})^4 + 10 (x_{i-1} − x_{i+2})^4],

where x* = (0, ..., 0)^T, H(x*) = 0.

(4) Pintér's function (n = 20)

    H(x) = Σ_{i=1}^{n} i x_i^2 + Σ_{i=1}^{n} 20 i sin^2(x_{i-1} sin x_i − x_i + sin x_{i+1})
           + Σ_{i=1}^{n} i log_10(1 + i (x_{i-1}^2 − 2 x_i + 3 x_{i+1} − cos x_i + 1)^2),

where x_0 = x_n, x_{n+1} = x_1, x* = (0, ..., 0)^T, H(x*) = 0.

(5) Shekel's function (n = 4)

    H(x) = −Σ_{i=1}^{5} ((x − a_i)^T (x − a_i) + c_i)^{-1},

where a_1 = (4, 4, 4, 4)^T, a_2 = (1, 1, 1, 1)^T, a_3 = (8, 8, 8, 8)^T, a_4 = (6, 6, 6, 6)^T, a_5 = (3, 7, 3, 7)^T, and c = (0.1, 0.2, 0.2, 0.4, 0.4)^T; x* ≈ (4, 4, 4, 4)^T, H(x*) ≈ −10.153.

Dejong's 5th function and Shekel's function are relatively low dimensional and have only a few local optima, but the optima are separated by plateaus. The Rosenbrock function and the Powell singular function are 20-dimensional badly scaled problems.
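The ingredients of the experiments can be sketched in code: CEA's random perturbation follows (5.21) with α_1 = 10 and β = 0.95 (the values used below), and two of the test functions follow directly from the formulas above. This is a hedged, illustrative sketch, not the code used to produce the reported results.

```python
import numpy as np

def excite(x, k, rng, alpha1=10.0, beta=0.95):
    """CEA's perturbation (5.21): x_k = x_{k-1} + alpha_k * omega_k, with
    standard Gaussian omega_k and decaying scale alpha_k = alpha1 * beta**(k-1)."""
    return x + alpha1 * beta ** (k - 1) * rng.standard_normal(x.shape)

def rosenbrock(x):
    """Rosenbrock: sum_i [100*(x_{i+1} - x_i^2)^2 + (x_i - 1)^2]."""
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (x[:-1] - 1.0) ** 2))

def pinter(x):
    """Pinter's function with wrap-around indexing x_0 = x_n, x_{n+1} = x_1."""
    n = x.size
    i = np.arange(1, n + 1)
    xm, xp = np.roll(x, 1), np.roll(x, -1)   # x_{i-1} and x_{i+1}
    a = xm * np.sin(x) - x + np.sin(xp)
    b = xm ** 2 - 2.0 * x + 3.0 * xp - np.cos(x) + 1.0
    return float(np.sum(i * x ** 2)
                 + np.sum(20.0 * i * np.sin(a) ** 2)
                 + np.sum(i * np.log10(1.0 + i * b ** 2)))
```

At the respective minimizers, rosenbrock(np.ones(20)) and pinter(np.zeros(20)) both evaluate to 0, matching H(x*) = 0 above.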
Pintér's function has many local optima, and the number of local optima increases exponentially with the dimension. Fig. 5.2 shows some of the functions in two dimensions.

Figure 5.2: Some benchmark problems in two dimensions: (a) Dejong's 5th function, −50 ≤ x_i ≤ 50, i = 1, 2; (b) Pintér's function, −5 ≤ x_i ≤ 5, i = 1, 2; (c) Rosenbrock function, −5 ≤ x_i ≤ 5, i = 1, 2.

For the CE method, we use the parameter values suggested by [51], with smoothing parameter ν = 0.7. We also found by trial and error that ν = 0.2 works very well for the CE method with smoothing. For CEA, the noise ω_k is chosen to be a standard Gaussian N(0, I_{n×n}), where n is the dimension of the solution space. In both methods, the parameterized family of distributions is chosen to be a multivariate Gaussian family. Table 5.1 and Fig. 5.3 show that CEA converges to better solutions than the CE method on these benchmark problems, and it converges faster than the CE method with ν = 0.2.

Table 5.1: Average performance of CEA and CE on some benchmark problems. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.

    Function     H(x*)     CEA (α_1 = 10, β = 0.95)   CE (ν = 0.7)     CE (ν = 0.2)
    Rosenbrock   0         16.84 (0.971)              19.41 (1.11)     17.39 (0.009)
    Dejong 5th   0.998     0.998 (1.8e-12)            1.01 (0.010)     0.998 (2.3e-15)
    Powell       0         2.4e-5 (2.0e-6)            420.8 (266.8)    2.5e-4 (2.4e-4)
    Pintér's     0         0.0068 (4.0e-4)            4.51 (0.15)      3.82 (0.055)
    Shekel's     -10.153   -10.153 (8.5e-9)           -9.033 (0.268)   -9.929 (0.128)

Note that on the Rosenbrock function, CEA can approach the optimum very closely with appropriately chosen parameters, whereas for any smoothing parameter values, the CE method with smoothing always gets stuck somewhere far from the optimum. Table 5.2 shows, on the Rosenbrock function, how the value of β, i.e.,
the decreasing rate of the artificial noise, affects the performance of CEA: a slower decrease of the noise yields better solutions, but, not surprisingly, takes more time to converge.

Table 5.2: Average performance of CEA with different parameter values of α and β on the Rosenbrock function. Each entry presents the mean of H(x*) with standard error in parentheses, based on 100 independent runs.

    Function     H(x*)   α_1 = 10, β = 0.95   α_1 = 10, β = 0.98   α_1 = 10, β = 0.995
    Rosenbrock   0       16.84 (0.971)        11.90 (0.023)        0.505 (0.010)

Figure 5.3: Average performance of CEA and CE on some benchmark problems (function value versus total sample size on Dejong's 5th, 20-D Powell, 20-D Pintér, 4-D Shekel, and 20-D Rosenbrock).

5.7 Conclusions and Future Research

We transformed a global optimization problem into a filtering problem, so that many ideas and results from filtering can be adapted to global optimization. Based on particle filtering, we proposed a framework encompassing many simulation-based optimization algorithms. In particular, the framework unifies EDAs, the CE method, and MRAS, and provides new insight into the relationships between them. Moreover, the framework holds promise for developing new optimization algorithms through the choice of the observation noise and the sampling and resampling importance densities, as well as through a vast array of techniques for improving nonlinear filtering and particle filtering.
There are two important lines of future research that we will continue to pursue. First, we will further study how to develop new algorithms using the particle filtering framework, and the performance of these new algorithms. Second, we plan to investigate the convergence properties of the particle filtering framework for optimization. Although convergence has been proved for EDAs [93], the CE method [78], and MRAS [41] individually, we are interested in unifying the convergence results under the particle filtering framework. The analysis will be based on some existing stability results for nonlinear filters, such as [20].

Chapter 6 Conclusions and Future Research

6.1 Conclusions

This dissertation has proposed and developed new methods and results for solving problems in partially observable Markov decision processes and global optimization.

The first part of the dissertation focuses on continuous-state POMDPs. Continuous-state POMDPs provide a more natural mathematical model than finite-state POMDPs for many application problems. While there are a good number of numerical methods for finite-state POMDPs, there are only a few for continuous-state ones, and their convergence results are sparse. Existing algorithms (for finite-state POMDPs) are hard to extend to continuous-state POMDPs, mainly due to the infinite-dimensional belief space of a continuous-state POMDP, as opposed to the finite-dimensional belief space of a finite-state POMDP. Based on the idea of density projection with particle filtering, we have developed a numerical method for effective dimension reduction and scalable approximation of an arbitrary belief state, such as a multi-modal or heavy-tailed distribution. The idea of density projection is to orthogonally project an arbitrary density onto a parameterized family of densities, namely, an exponential family of densities in our algorithm, to yield a finite low-dimensional representation of that density.
Based on some existing results, we have shown that under certain mild conditions, the approximate density approaches the true density as the number of parameters increases, and hence we have shown the convergence of our algorithms. We have proved error bounds for our proposed POMDP-solving algorithm and online filtering algorithm. We have applied our method to an inventory control problem for which some analytical results are known for comparison, and obtained good numerical results that show the promise of our method. Although we have only proved error bounds for the infinite-horizon discounted cost criterion, the numerical results also indicate that our method works very well for the average cost criterion. In addition, with a little straightforward modification, our method can easily be applied to finite-horizon problems or (finite) large-state problems.

The second part of the dissertation is devoted to understanding and improving a class of simulation-based algorithms for global optimization. We have transformed a global optimization problem into a filtering problem, and hence many results in filtering can be adapted to global optimization. In particular, we have used a novel interpretation of particle filtering to develop a unifying framework for many simulation-based optimization algorithms, such as the cross-entropy method, the estimation of distribution algorithms, and model reference adaptive search. The framework reveals relationships between these algorithms and new, interesting insights. By better understanding these algorithms, we have proposed several promising directions under the same framework for new, improved algorithms, such as balancing exploitation and exploration, combining simulation-based global search with gradient-based local search, and preventing premature convergence.
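The unifying filtering view can be sketched generically: candidate solutions play the role of particles, a weighting of the objective values plays the role of the likelihood, and resampling plus perturbation with shrinking artificial noise concentrates the population. The code below is an illustrative toy of that view only; the name `pf_optimize`, the Boltzmann-style weights, and all parameter values are our own assumptions, not any one of the unified algorithms.

```python
import numpy as np

def pf_optimize(h, dim, n=300, noise0=2.0, decay=0.95, iters=150, seed=0):
    """Toy particle-filtering-style minimizer: weight, resample, perturb."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=(n, dim))   # initial particle population
    noise = noise0
    for _ in range(iters):
        vals = np.apply_along_axis(h, 1, x)
        w = np.exp(-(vals - vals.min()))          # Boltzmann-style "likelihood"
        w /= w.sum()
        idx = rng.choice(n, size=n, p=w)          # resample in proportion to weight
        x = x[idx] + rng.normal(0.0, noise, size=(n, dim))  # perturb survivors
        noise *= decay                            # shrink the artificial noise
    vals = np.apply_along_axis(h, 1, x)
    return x[vals.argmin()], float(vals.min())
```

In this view, the choice of weighting function corresponds to the observation noise, and the perturbation step to the sampling/resampling densities mentioned above.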
We explored the last direction and obtained a new, improved algorithm that shows better performance than the existing cross-entropy algorithm on some benchmark problems.

6.2 Future Research

Currently, I am applying the method for continuous-state POMDPs described in Chapter 4 to some stochastic control problems in finance. Many financial problems, such as portfolio optimization and hedging, can be modeled as stochastic control problems, where the control is the investment strategy. In many financial models, such as the classical Black-Scholes model [15], the price of a risky asset is modeled as a stochastic process, where the volatility of the risky asset is treated as a constant. However, to better model reality, more and more financial models assume that the volatility itself is also a stochastic process. While many analytical results have been derived for constant-volatility models, stochastic volatility models generally do not admit analytical solutions except in some rare cases, such as the Heston model for option pricing [38]. Hence, problems involving stochastic volatility often need to be solved numerically. A POMDP is a natural model for a portfolio optimization or hedging problem involving assets with stochastic volatility, where the volatility is the hidden state, the price observed at discrete times is a partial observation of the volatility, and the utility function is the objective function. Since the asset price and capital are discrete at least down to the penny, there is a huge number of states, so the algorithms for finite-state POMDPs are not applicable here. Our proposed method for continuous-state POMDPs provides an efficient numerical tool for these financial problems to find the best investment strategy and to study the effect of the amount of initial capital.

Another ongoing work is the study of the convergence of the particle filtering framework for global optimization.
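As a concrete instance of the partially observed setting just described, a standard discrete-time stochastic-volatility model treats log-volatility as a hidden AR(1) state that is observed only through returns. The simulation sketch below uses illustrative parameter values of our own, not any model calibrated in the dissertation.

```python
import numpy as np

def simulate_sv(T=250, mu=-1.0, phi=0.95, tau=0.2, seed=0):
    """Simulate a discrete-time stochastic-volatility model: the
    log-volatility h_k is a hidden AR(1) state, and the observed return
    y_k = exp(h_k / 2) * eps_k is only a partial (noisy) observation of
    it, so inferring h_k from y_k is a continuous-state filtering problem."""
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    y = np.empty(T)
    h[0] = mu                                   # start at the stationary mean
    for k in range(T):
        if k > 0:
            h[k] = mu + phi * (h[k - 1] - mu) + tau * rng.normal()
        y[k] = np.exp(h[k] / 2.0) * rng.normal()
    return h, y

h, y = simulate_sv()
```

In the POMDP formulation sketched above, such a hidden-volatility state would be tracked by the belief state, with the portfolio decision taken as a function of that belief.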
The hope is that the conditional density converges to a degenerate density concentrated on the optimal solution, regardless of the initial condition. This is closely related to stability results for nonlinear filtering [20].

There are two lines of future research that we plan to pursue. The first is to develop numerical methods for POMDPs in which the state and/or observation equation is a jump-diffusion process. The second is to extend the particle filtering framework to stochastic optimization, i.e., optimization problems in which the objective function cannot be evaluated exactly.

For POMDPs, the state equation (3.2) can be viewed either as a discrete-time state difference equation or as a discretized diffusion process. If it is a diffusion process, the discretization introduces error into the simulation of the propagation of the state. Some recent research has studied the exact simulation (without discretization error) of a diffusion process [14] [13], which could replace the usual Euler discretization of the state equation between observations. However, exact simulation has not yet been extended to jump-diffusion processes, and recent research on the inference of jump-diffusion processes still uses an Euler scheme [43]. Therefore, we need to further investigate efficient filtering methods for jump-diffusion processes and numerical solutions to the corresponding POMDPs. In financial engineering, many models with jump-diffusion processes have been proposed to model the financial market better than models with only diffusion processes. For example, Bates [8] added a Poisson process to the asset price process based on Heston's stochastic volatility model. In recent years, much research has focused on jump-diffusion process models for so-called ultra-high-frequency data [30], due to more frequent intra-day transactions. More examples of jump-diffusion processes in financial applications can be found in [35].
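The Euler scheme whose discretization error motivates the exact-simulation literature cited above can be sketched for a jump-diffusion. The code is a Merton-type illustration with hypothetical parameter values, not a scheme taken from the dissertation or from [43].

```python
import numpy as np

def euler_jump_diffusion(s0=100.0, mu=0.05, sigma=0.2,
                         lam=0.5, jump_mu=-0.1, jump_sig=0.15,
                         T=1.0, n_steps=252, seed=0):
    """Euler discretization of a jump-diffusion
        dS_t = mu*S_t dt + sigma*S_t dW_t + S_{t-} dJ_t,
    where J is a compound Poisson process whose jumps multiply the price
    by exp(Z), Z ~ N(jump_mu, jump_sig^2). Each Euler step introduces
    O(dt) discretization error in addition to the Monte Carlo error."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s = np.empty(n_steps + 1)
    s[0] = s0
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))            # Brownian increment
        n_jumps = rng.poisson(lam * dt)              # jumps in this step
        jump = np.expm1(rng.normal(jump_mu, jump_sig, n_jumps)).sum()
        s[k + 1] = s[k] * (1.0 + mu * dt + sigma * dw + jump)
    return s

path = euler_jump_diffusion()
```

Exact-simulation methods in the spirit of [14] [13] would remove the O(dt) bias of this scheme between observation times, but, as noted above, such methods are not yet available for the jump-diffusion case.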
Bibliography

[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice Hall, New Jersey, 1979.
[2] B. D. O. Anderson, J. B. Moore, and M. Eslami. Optimal filtering. IEEE Transactions on Systems, Man and Cybernetics, 12(2):235–236, 1982.
[3] C. Andrieu, J. F. G. de Freitas, and A. Doucet. Sequential MCMC for Bayesian model selection. IEEE Higher Order Statistics Workshop, 1999.
[4] S. Arulampalam, S. Maskell, N. J. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[5] B. Azimi-Sadjadi and P. S. Krishnaprasad. Approximate nonlinear filtering and its application in navigation. Automatica, 41(6):945–956, 2005.
[6] O. E. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, New York, 1978.
[7] A. R. Barron and C. Sheu. Approximation of density functions by sequences of exponential families. The Annals of Statistics, 19(3):1347–1369, 1991.
[8] D. S. Bates. Jumps and stochastic volatility: Exchange rate processes implicit in Deutsche mark options. The Review of Financial Studies, 9(1):69–107, 1996.
[9] D. P. Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.
[10] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[11] D. P. Bertsekas and D. A. Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6):589–598, 1989.
[12] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Optimization and Neural Computation Series. Athena Scientific, 1st edition, 1996.
[13] A. Beskos, O. Papaspiliopoulos, and G. Roberts. Computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion and reply from the authors).
Journal of the Royal Statistical Society, Series B, Statistical Methodology, 68:1–29, 2006.
[14] A. Beskos and G. Roberts. Exact simulation of diffusions. Annals of Applied Probability, 15:2422–2444, 2005.
[15] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81:637–659, 1973.
[16] R. W. Brockett. Remarks on finite dimensional nonlinear estimation. In C. Lobry, editor, Analyse des Systèmes, 1978.
[17] A. Brooks, A. Makarenko, S. Williams, and H. Durrant-Whyte. Parametric POMDPs for planning in continuous state spaces. Robotics and Autonomous Systems, 54(11):887–897, 2006.
[18] A. Brooks and S. Williams. A Monte Carlo update for parametric POMDPs. International Symposium of Robotics Research, Nov. 2007.
[19] A. Budhiraja, L. Chen, and C. Lee. A survey of numerical methods for nonlinear filtering problems. Physica D: Nonlinear Phenomena, 230:27–36, 2007.
[20] A. Budhiraja and D. Ocone. Exponential stability of discrete-time filters for bounded observation noise. Systems and Control Letters, 30:185–193, 1997.
[21] O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[22] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus. Simulation-based Algorithms for Markov Decision Processes. Communications and Control Engineering Series. Springer, New York, 1st edition, 2007.
[23] D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–746, 2002.
[24] M. H. A. Davis and S. I. Marcus. An introduction to nonlinear filtering. In M. Hazewinkel and J. C. Willems, editors, Stochastic Systems: The Mathematics of Filtering and Identification and Applications, pages 53–75, Amsterdam, The Netherlands, 1981. Reidel.
[25] J. F. G. de Freitas, M. Niranjan, and A. H. Gee.
Hierarchical Bayesian Kalman models for regularisation and ARD in sequential learning. Technical Report CUED/F-INFENG/TR 307, Cambridge University Engineering Department, 1998.
[26] P. T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134:19–67, 2005.
[27] D. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6), 2003.
[28] P. M. Djuric, J. H. Kotecha, F. Esteve, and E. Perret. Sequential parameter estimation of time-varying non-Gaussian autoregressive processes. EURASIP Journal on Applied Signal Processing, 8:865–875, 2002.
[29] A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.
[30] R. Engle. The econometrics of ultra-high-frequency data. Econometrica, 68:1–22, 2000.
[31] A. Gelb. Applied Optimal Estimation. The M.I.T. Press, 1974.
[32] W. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society, 63(1):127–146, 2001.
[33] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.
[34] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113, 1993.
[35] F. B. Hanson. Applied Stochastic Processes and Control for Jump-Diffusions: Modeling, Analysis and Computation, chapter Financial Engineering Applications. SIAM Books: Advances in Design and Control Series. Society for Industrial and Applied Mathematics, U.S., 2007.
[36] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–95, 2000.
[37] O. Hernandez-Lerma and J. B. Lasserre.
Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York, 1996.
[38] S. L. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies, 6:327–343, 1993.
[39] R. A. Howard. Dynamic Programming and Markov Processes. The M.I.T. Press, Cambridge, 1960.
[40] G. Q. Hu and S. S.-T. Yau. Finite-dimensional filters with nonlinear drift XV: New direct method for construction of universal finite-dimensional filter. IEEE Transactions on Aerospace and Electronic Systems, 38(1), 2002.
[41] J. Hu, M. C. Fu, and S. I. Marcus. A model reference adaptive search method for global optimization. Operations Research, 55:549–568, 2007.
[42] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proceedings of the European Conference on Computer Vision, pages 343–356, 1996.
[43] A. Jasra, D. Stephens, A. Doucet, and T. Tsagaris. Inference for Lévy driven stochastic volatility models via adaptive SMC. Preprint, 2008.
[44] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[45] G. Kallianpur. Stochastic Filtering Theory. Springer-Verlag, New York, 1980.
[46] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82:33–45, 1960.
[47] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83:95–108, 1961.
[48] G. Kitagawa. A self-organizing state-space model. Journal of the American Statistical Association, 93(443):1203–1215, 1998.
[49] D. Koller and U. Lerner. Sequential Monte Carlo Methods in Practice, chapter 10: Sampling in factored dynamic systems, pages 445–464. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2001.
[50] J. H. Kotecha and P. M. Djuric. Gaussian sum particle filtering. IEEE Transactions on Signal Processing, 51(10):2602–2612, 2003.
[51] D. P. Kroese, S. Porotsky, and R. Y. Rubinstein. The cross-entropy method for continuous multiextremal optimization. Methodology and Computing in Applied Probability, 8:383–407, 2006.
[52] H. J. Kushner. On the differential equations satisfied by conditional probability densities of Markov processes. SIAM Journal of Control, 2:106–119, 1964.
[53] P. Larranaga, R. Etxeberria, J. A. Lozano, and J. M. Pena. Optimization by learning and simulation of Bayesian and Gaussian networks. Technical Report EHU-KZAA-IK-4/99, Department of Computer Science and Artificial Intelligence, University of the Basque Country, 1999.
[54] F. Le Gland and N. Oudjane. Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filters. The Annals of Applied Probability, 14(1):144–187, 2004.
[55] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, 1998.
[56] F. L. Lewis. Optimal Estimation: With an Introduction to Stochastic Control Theory. Wiley-Interscience, 1986.
[57] M. L. Littman. The witness algorithm: Solving partially observable Markov decision processes. Technical Report CS-94-40, Department of Computer Science, Brown University, Providence, RI, 1994.
[58] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice, New York, 2001. Springer-Verlag.
[59] J. A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, editors. Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Springer-Verlag, New York, 2006.
[60] P. Naor and M. Resh. An inventory problem with discrete time review and replenishment by batches of fixed size. Management Science, 10(1):109–118, 1963.
[61] S. I. Marcus. Nonlinear estimation. In Systems Control Encyclopedia, pages 3293–3304, Oxford, 1987. Pergamon Press.
[62] S. I. Marcus.
Low dimensional filters for a class of finite state estimation problems with Poisson observations. Systems Control Letters, 1:237–241, January 1982.
[63] S. I. Marcus and A. S. Willsky. Algebraic structure and finite dimensional nonlinear estimation. SIAM Journal on Mathematical Analysis, 9:312–327, April 1978.
[64] H. Muhlenbein and G. Paass. From recombination of genes to the estimation of distributions: I. Binary parameters. In H. M. Voigt, W. Ebeling, I. Rechenberg, and H. P. Schwefel, editors, Parallel Problem Solving from Nature - PPSN IV, pages 178–187, Berlin, Germany, 1996. Springer-Verlag.
[65] C. Musso, N. Oudjane, and F. Le Gland. Improving regularized particle filters. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice, New York, 2001. Springer-Verlag.
[66] K. Nummiaro, E. Koller-Meier, and L. Van Gool. An adaptive color-based particle filter. Image and Vision Computing, 21(1):99–110, 2003.
[67] J. D. Pintér. Global Optimization in Action. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1996.
[68] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
[69] J. M. Porta, M. T. J. Spaan, and N. Vlassis. Robot planning in partially observable continuous domains. In Robotics: Science and Systems, Cambridge, MA, 2005. MIT.
[70] J. M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367, 2006.
[71] P. Poupart and C. Boutilier. Value-directed compression of POMDPs. Advances in Neural Information Processing Systems, 15:1547–1554, 2003.
[72] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley, New York, 2007.
[73] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley & Sons, New York, 1994.
[74] C. P. Robert and G. Casella.
Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, New York, 2004.
[75] N. Roy. Finding Approximate POMDP Solutions through Belief Compression. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2003.
[76] N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. Advances in Neural Information Processing Systems, 15:1635–1642, 2003.
[77] R. Y. Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99:89–112, 1997.
[78] R. Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 2:127–190, 1999.
[79] R. Y. Rubinstein and D. P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, New York, 2004.
[80] P. J. Schweitzer and A. Seidman. Generalized polynomial approximations in Markovian decision problems. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.
[81] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, 1973.
[82] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, Palo Alto, CA, 1971.
[83] E. J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978.
[84] R. L. Stratonovich. Conditional Markov Processes and Their Application to the Theory of Optimal Control. Elsevier, New York, 1968.
[85] R. S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3(1):9–44, 1988.
[86] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[87] S. Thrun. Monte Carlo POMDPs.
Advances in Neural Information Processing Systems, 12:1064–1070, 2000.
[88] R. van der Merwe, N. de Freitas, A. Doucet, and E. Wan. The unscented particle filter. Advances in Neural Information Processing Systems, 13, 2001.
[89] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[90] S. S.-T. Yau and G. Q. Hu. Classification of finite dimensional estimation algebras of maximal rank with arbitrary state-space dimension and Mitter conjecture. International Journal of Control, 78(10), 2005.
[91] H. J. Yu. Approximate Solution Methods for Partially Observable Markov and Semi-Markov Decision Processes. PhD thesis, M.I.T., Cambridge, MA, 2006.
[92] M. Zakai. On the optimal filtering of diffusion processes. Probability Theory and Related Fields, 11:230–243, 1969.
[93] Q. Zhang and H. Muhlenbein. On the convergence of a class of estimation of distribution algorithms. IEEE Transactions on Evolutionary Computation, 8:127–136, 2004.
[94] E. Zhou, M. C. Fu, and S. I. Marcus. A fading memory particle filter. In preparation for submission.
[95] E. Zhou, M. C. Fu, and S. I. Marcus. Solving continuous-state POMDPs via density projection. IEEE Transactions on Automatic Control. Conditionally accepted, full paper.
[96] E. Zhou, M. C. Fu, and S. I. Marcus. A density projection approach to dimension reduction for continuous-state POMDPs. Proceedings of the 47th IEEE Conference on Decision and Control, pages 5576–5581, 2008.
[97] E. Zhou, M. C. Fu, and S. I. Marcus. A particle filtering framework for randomized optimization algorithms. Proceedings of the 2008 Winter Simulation Conference, pages 647–654, 2008.
[98] M. Zlochin, M. Birattari, N. Meuleau, and M. Dorigo. Model-based search for combinatorial optimization: A critical survey. Annals of Operations Research, 131:373–395, 2004.