ABSTRACT Title of dissertation: STOCHASTIC SYSTEMS WITH CUMULATIVE PROSPECT THEORY Kun Lin, Doctor of Philosophy, 2013 Dissertation directed by: Professor Steven I. Marcus Department of Electrical & Computer Engineering Stochastic control problems arise in many fields. Traditionally, the most widely used class of performance criteria in stochastic control problems is risk- neutral. More recent attempts at introducing risk-sensitivity into stochastic control problems include the application of utility functions. The decision theory com- munity has long debated the merits of using expected utility for modeling human behaviors, as exemplified by the Allais paradox. Substantiated by strong experimen- tal evidence, Cumulative Prospect Theory (CPT) based performance measures have been proposed as alternatives to expected utility based performance measures for evaluating human-centric systems. Our goal is to study stochastic control problems using performance measures derived from the cumulative prospect theory. The first part of this thesis solves the problem of evaluating Markov decision processes (MDPs) using CPT-based performance measures. A well-known method of solving MDPs is dynamic programming, which has traditionally been applied with an expected utility criterion. When the performance measure is CPT-inspired, several complications arise. Firstly, when solving a problem via dynamic program- ming, it is important that the performance criterion has a recursive structure, which is not true for all CPT-based criteria. Secondly, we need to prove the traditional op- timality criteria for the updated problems (i.e., MDPs with CPT-based performance criteria). The theorems stated in this part of the thesis answer the question: what are the conditions required on a CPT-inspired criterion such that the corresponding MDP is solvable via dynamic programming? The second part of this thesis deals with stochastic global optimization prob- lems. Using ideas from the cumulative prospect theory, we are able to introduce a novel model-based randomized optimization algorithm: Cumulative Weighting Optimization (CWO). The key contributions of our research are: 1) proving the convergence of the algorithm to an optimal solution given a mild assumption on the initial condition; 2) showing that the well-known cross-entropy optimization al- gorithm is a special case of CWO-based algorithms. To the best knowledge of the author, there is no previous convergence proof for the cross-entropy method. In practice, numerical experiments have demonstrated that a CWO-based algorithm can find a better solution than the cross-entropy method. Finally, in the future, we would like to apply some of the ideas from cumulative prospect theory to games. In this thesis, we present a numerical example where cumulative prospect theory has an unexpected effect on the equilibrium points of the classic prisoner?s dilemma game. STOCHASTIC SYSTEMS WITH CUMULATIVE PROSPECT THEORY by Kun Lin Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2013 Advisory Committee: Professor Steven I. Marcus, Chair/Advisor Professor Michael C. Fu Professor P.S. Krishnaprasad Professor Nuno Martins Professor Gang Qu ? Copyright by Kun Lin 2013 Dedication To my family. ii Acknowledgments Most importantly, I would like to express my most sincere gratitude to Pro- fessor Steven I. Marcus. Without his support and guidance, this dissertation would not have been possible. His integrity, dedication and patience have made a lasting impression on me. A graduate experience, riddled with set-backs and struggles, was made much more pleasant by his willingness to provide sound advice that led me back on the right track. I would like to thank my committee members for taking the time out of their busy schedules to read this manuscript and attend my defense. From Professor Michael C. Fu, I have gained many insights through our frequent research meetings. Professor P.S Krishnaprasad, Professor Nuno Martins, and Professor Gang Qu have all inspired me through their interesting lectures in their respective fields. In addition, I would like to express many thanks to my colleagues and office- mates, Enlu Zhou, Yongqiang Wang, James Ferlez, and Bhaskar Ramasubramanian, for the many interesting discussions on topics both technical and non-technical. For the many people not mentioned here, but who have definitely contributed to the completion of this thesis, I would like to express my deepest gratitude. iii Contents List of Tables vi List of Figures vi 1 Overview 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Stochastic Optimal Control Problems . . . . . . . . . . . . . 2 1.1.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Performance Criteria: From Expected Value to Prospect Theory 5 1.1.4 Stochastic Optimization with Probability Weighting Functions 10 1.1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2 Dynamic Programming with Non-Convex Risk-Sensitive Measures 12 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Discrete-Time Markov Control Model . . . . . . . . . . . . . . 15 2.2.2 Cumulative Prospect Theory (CPT) . . . . . . . . . . . . . . 18 2.2.3 Reward Transition Mappings . . . . . . . . . . . . . . . . . . 22 2.2.4 Generalized Markov Dynamic Reward Measures . . . . . . . 25 2.2.4.1 Markov Conditional Reward Measures . . . . . . . 29 2.3 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.1 Finite-Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.1.1 Application: Cumulative Prospect Theory Measures 35 2.3.2 Discounted Infinite-Horizon . . . . . . . . . . . . . . . . . . . 45 2.3.3 Transient Markov Control Model . . . . . . . . . . . . . . . . 55 2.3.4 The Organ Transplant Example: A Comparative Analysis . . 75 2.4 Reward Measures and Optimal Policies . . . . . . . . . . . . . . . . . 82 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3 Cumulative Weighting Optimization 92 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.3 Probability Weighting Functions . . . . . . . . . . . . . . . . . . . . . 95 3.4 Discrete Solution Space . . . . . . . . . . . . . . . . . . . . . . . . . 99 3.5 Polish Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 3.6 Numerical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 137 3.6.1 Numerical Examples: Asymmetric Traveling Salesman Prob- lems (ATSPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 3.6.1.1 Weight-Update Methods . . . . . . . . . . . . . . . 142 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 iv 4 Contributions and Future Work 149 4.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 A Prospect Theory 154 A.1 St. Petersburg Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 154 A.2 Axiomatization of Expected Utility: . . . . . . . . . . . . . . . . . . 154 B Multifunctions and Selectors 156 C Spaces of Probability Measures 159 C.1 Polish Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 C.2 The Prohorov Topology . . . . . . . . . . . . . . . . . . . . . . . . . 159 C.3 Compactness in PX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 C.4 Metrics on PX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Bibliography 162 v List of Tables 1.1.1 Example: Data-Equivalence . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 Transition probability matrix for becoming an entrepreneur . . . . . 41 2.3.2 Transition probability matrix for taking a normal job . . . . . . . . . 42 2.3.3 An optimal solution for Ex. 5 (a value function and an optimal policy) at time 0 and 1. . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.3.4 An optimal solution for Ex. 6 (a value function and an optimal policy) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.3.5 Transition Probabilities From State S . . . . . . . . . . . . . . . . . . 79 2.3.6 Organ transplant example: parameters for F (x). . . . . . . . . . . . 79 2.3.7 Organ Transplant Optimal Value and Policy Comparison . . . . . . . 81 3.6.1 Performance of CWO_T on various ATSP problems based on 30 independent replications . . . . . . . . . . . . . . . . . . . . . . . . . 144 3.6.2 CWO_U and CE performance Results . . . . . . . . . . . . . . . . . 146 4.2.1 Classic Prisoner?s Dilemma Problem . . . . . . . . . . . . . . . . . . 150 List of Figures 1.1.1 Data Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Organ Transplant State Transitions & Rewards . . . . . . . . . . . . 76 2.3.2 Optimal Policy Comparison of the Organ Transplant Example . . . . 76 3.6.1 Derivatives of Eq. 3.6.3 as ? ? ? . . . . . . . . . . . . . . . . . . . 145 3.6.2 CE vs. CWO_U Sorted Trial Runs . . . . . . . . . . . . . . . . . . . 147 3.6.3 One trial of CE vs. CWO_U . . . . . . . . . . . . . . . . . . . . . . 147 List of Algorithms 1 Generic CWO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 138 2 Tilted Weight Update . . . . . . . . . . . . . . . . . . . . . . . . . . 143 3 CWO_U Weight Update Algorithm . . . . . . . . . . . . . . . . . . . 145 vi Nomenclature ATSP Asymmetric Traveling Salesman Problems ATSP Asymmetric Traveling Salesman Problems CDF Cumulative Density Function CPT Cumulative Propsect Theory CWO Cumulative Weighting Optimization MDP Markov Decision Problem PDE Partial Differential Equation vii Chapter 1 Overview 1.1 Motivation Many relevant real-life problems can be modeled as stochastic systems (e.g., weather, traffic patterns, financial markets, communication systems). A system could be stochastic for many reasons. For one, the randomness could be introduced by inac- curate sensors (i.e., measurement error). Sometimes, we lack sufficient information about the system, and model our ignorance by intentionally incorporating random- ness in the model (i.e., model error). Quantum mechanics support the idea that uncertainty is part of the natural order of the universe. For whatever the reason might be, studying stochastic systems is important for solving many real-life prob- lems from various fields. This thesis will try to tackle a few problems in stochastic systems that are inspired by recent advances in decision theory. More specifically, we use some of the latest performance measures suggested by the decision theory community to evaluate the performances of stochastic systems. These novel prob- lems are particularly suited for studying human-centric systems (e.g., war games, consumer behaviors, medical decisions). 1 1.1.1 Stochastic Optimal Control Problems ?The concept of control can be described as the process of influencing the behavior of a dynamical system to achieve a desired goal. If the goal is to optimize some payoff function (or cost function) which depends on the control inputs to the system, then the problem is one of optimal control.? - Wendell H. Fleming and H. Mete Soner [38] If a stochastic system has a control input along with a performance criterion, then the resulting problem is a stochastic optimal control problem. Stochastic optimal control problems have many applications in engineering. The evidence of their successful applications can be found in a wide-range of fields (i.e., robotics, route planning, space exploration). In finance, the seminal paper by Black & Scholes in 1973 [14] provides insight into the management of risks, which leads to an equation for valuing options. There are three approaches for solving stochastic optimal control problems, namely dynamic programming (Hamilton-Jacobi-Bellman equation), the maximum principle, and the martingale and convex duality approach [68]. Dynamic pro- gramming, most popular in the analysis of controlled Markov processes, provides sufficient conditions for optimality in its verification theorem, stating that if there exists a policy satisfying the Hamilton-Jacobi-Bellman PDE, then it is an optimal policy. The essence of dynamic programming is Bellman?s optimality principle, which roughly says if one knows an optimal policy for an entire period, then start- ing from any time in that period and at a state along an optimal trajectory, the 2 same policy is still optimal. This insight leads to the realization that decomposing the optimal value function into two parts (i.e., immediate value and value-to-go) is the key to obtaining an optimal policy. Intuitively, one can obtain the Hamilton- Jacobi-Bellman PDE by first considering the discrete time Bellman equation with time step h, dividing both sides by h, and then taking h to zero. One key feature of dynamic programming is that a solution of the Hamilton-Jacobi-Bellman PDE is a function of the state. In practice, this means the computational complexity of dynamic programming grows exponentially as the number of states increases, leading Bellman to coin the term ?curse of dimensionality?. On the other hand, a state-feedback policy can be found easily once a solution is obtained, which can be implemented simply as a lookup-table. Fleming & Soner [38] and Bertsekas [8] are excellent textbooks for a comprehensive review of dynamic programming in the controlled Markov process setting. Alternatively, the maximum principle provides necessary conditions for optimality. The stochastic maximum principle is similar in spirit to its deterministic counterpart. For the interested reader, Haussmann [44], Peng [67], Yong and Zhou [89] are excellent sources for a further investigation into this topic. The deterministic maximum principle can be intuitively described as perturbing an optimal control over an interval . By taking the first order Taylor approximation of the corresponding value function with respect to  and sending  to zero, we obtain a variational inequality. Combining the variational inequality with the co-state equations, the deterministic maximum principle is complete. The stochastic maximum principle differs from its deterministic counterpart in its usage of forward-backward stochastic differential equations to describe the dynamics of 3 its state and adjoint variables. The martingale approach, which was originated by Pliska [70] and gained popularity in the mathematical finance community, divides the problem into two subproblems: 1) Find the optimizer for the problem at a fixed terminal time T. If the cost function is convex, then the problem of finding the optimizer for the problem can be reduced from an infinite-dimensional problem to a finite-dimensional problem by using convex duality. 2) Use the martingale Repre- sentation Theorem to extract the corresponding optimal control. In Pliska?s original paper, convexity of the cost function is an important assumption in proving the ex- istence of the solution to the dual static problem. Pham [68] has a recent discussion on this approach. 1.1.2 Dynamic Programming Of the three approaches we mention above, dynamic programming has proven to be the most popular method for solving dynamic stochastic optimization problems with controlled Markov processes. Numerically, dynamic programming can be applied using either value iteration or policy iteration. In value iteration, the algorithm could start from an infeasible1 value function and converges to a feasible one. On the other hand, applying policy iteration results in value functions that are feasible. Perhaps the most important reason for dynamic programming?s popularity is its production of a feedback policy, which is advantageous for storage, execution (i.e., a table lookup) and robustness. As shown in Bertsekas [8], the breadth of problems that can be solved using dynamic programming includes inventory control, deterministic 1A value function is feasible if it is yielded by a feasible policy. 4 scheduling problems, machine repair, reachability of ellipsoidal tubes, and pursuit- evasion games. Problems presented in his book all have the following flavor: 1) an underlying discrete time dynamic system; 2) a cost function that is additive over time. The natural question to ask for an inquisitive mind is: why do we want to evaluate performances using expectation? If our goal is to predict, in particular, what a human would do in many situations, then expected value contradicts much empirical evidence. Hence, in applying dynamic programming to human decision making processes, we need to have a model that agrees with empirical data. The discussion of the merits of various classes of performance criteria is the focus of the next section. This thesis deviates from the standard dynamic programming approach by updating the dynamic programming framework with a more general class of performance criteria. A host of issues arise from doing this: does a dynamic programming equation even exist? 1.1.3 Performance Criteria: From Expected Value to Prospect The- ory Using expected value as a performance criterion has been a long tradition in many engineering and scientific fields. Why is that? Is it only out of its mathematical convenience (i.e., linearity)? In this section, we will trace the development of the expected utility theory and highlight some of its deficiencies. For a more in-depth analysis of the development in the area of prospect theory, the interested reader can refer to [86], which the discussion below draws many facts and examples from. 5 When given a dynamic control problem, we would like to use an appropriate performance measure for each situation. Hence, understanding its implications is of paramount importance. Using the wrong performance measure might yield an ineffective policy. Each performance criterion encapsulates our preference order- ing of the potential outcomes2 due to our action/decision. In other words, if we know we prefer outcome L1 to L23 , then we must use a performance criterion that reflects this preference (i.e., ? (L1) > ? (L2)). On the other hand, we would also like the implications of such a preference ordering to be sensible for the problem at hand. For example, in the expected value case, de Finetti [27] (also see a survey by Fishburn [35] for the axiomatization of expected value) shows that the existence of subjective probabilities is equivalent to transitivity, monotonicity and additiv- ity. In addition, the existence of a certainty equivalent (i.e., deterministic) value for each possible outcome guarantees the no-arbitrage condition for the preference system. In other words, expected value performance criteria have properties that we deem rational (i.e., transitivity, monotonicity, and additivity), and lack some undesirable attributes (e.g., arbitrage). However, expected value does have some limitations. One particular limitations that spurred the search for its alternative, expected utility, is best demonstrated in the St. Petersburg paradox4. The paradox is an example of a game having an infinite certainty equivalent value (i.e., the price one is willing to pay) under expected value; However, in practice, people often are only willing to pay a finite amount for the game. The paradox was resolved by 2An element of the probability space is usually called an outcome. 3L1 is a short-hand notation for lottery 1, not to be confused with the function space. 4see appendix on prospect theory 6 Bernoulli in 1738 [7] by suggesting that people do not evaluate outcomes by their objective values, but rather by their utility. This realization started the new field of expected utility theory. Expected utility as a performance criterion still remains popular. Perhaps such success can be attributed to its axiomatization by von Neumann & Morgen- stern [85]. von Neumann provides the necessary and sufficient conditions for using maximization of expected utility as a function for ordering preferences. Fishburn [34] made updates to von Neumann?s work in the 1970s. These conditions include completeness, transitivity, continuity, and substitution. Despite many justifications for using expected utility, it suffers from a well known contradiction with empirical observations demonstrated by the Allais paradox [2], where most people violate the substitution axiom implied by expected utility theory. Lesser known but more re- cent discussions on the empirical violations of expected utility can be found in [82] and [71]. Cumulative prospect theory resolves all of the paradoxes mentioned above, and has stronger empirical support compared with expected utility theory. There is also a strong behavioral foundation found in its axiomatization [19]. A key feature of cumulative prospect theory is probabilistic sensitivity. This is different from the traditional approach of outcome sensitivity in the expected utility theory. We will first demonstrate, via an example, that risk-aversion can have an equivalent representation outside of expected utility theory. Example 1. This example is from [86], which demonstrates that risk-sensitivity 7 a b c d e % outcome is 100 0.10 0.30 0.50 0.70 0.90 % outcome is 0 0.90 0.70 0.50 0.30 0.10 Certainty Equivalent Value 1 9 25 49 81 Table 1.1.1: Example: Data-Equivalence can be represented either as outcome sensitivity or probability sensitivity. We are given five certainty equivalent (i.e., indifference) values and probability pairs. From the table above, we see that the value a person places on an outcome (e.g., a,b,c,d or e) might not be a linear evaluation. In other words, in prospect b, 0.30 ? 100 + 0.70 ? 0 = 30 6= 9. We can of course find a utility function U such that ? U (x) p (x) agrees with the values in the table above, where p is a probability mass function and U (?) is a utility function. This operation can be equivalently achieved by using a probability weighting function w, ? w (p) x, where instead of transforming the outcomes, we are now transforming the probability weights (see Figure 1.1.1). Remark 1. In the example above, we are only trying to demonstrate that risk- 8 ? ? ? ? ? H a L H b L H d L H c L H e L 20 40 60 80 $ 0.2 0.4 0.6 0.8 1.0 p (a) Normalized EV vs. Certainty Equiv- alent ? ? ? ? ? (a) (c) (b ) (d ) (e) 0.2 0.4 0.6 0.8 1.0 p 20 40 60 80 $ (b) Certainty Equivalent vs. Normalized EV Figure 1.1.1: Data Equivalence sensitivity can be expressed equivalently either as outcome sensitivity or probability sensitivity. We want to emphasize the fact that w(p) : x ? [0, 1], transforms the probabilities based on the entire probability mass function. In other words, w : P ? P , is a mapping from P , the space of probability mass functions, to P . The example above should convince the reader that every utility function has an equivalent probability weighting function. This, of course, is not the full story. If probability weighting is only equivalent to expected utility, then we would not be so interested in it. After all, if probability weighting functions can predict only as well as expected utility, we will not need to advance risk-sensitive performance measures beyond utility theory. Several sources have demonstrated that in many cases probability weighting gives different predictions than those given by outcome weighting (i.e., expected utility). Onay and ?nc?ler [66] demonstrate the predic- tions offered by outcome based risk-aversion are different from that of probability weighting. More importantly, the predictive power of the probability weighting approach is confirmed by their experiments. 9 1.1.4 Stochastic Optimization with Probability Weighting Functions Stochastic optimization is another field where a novel approach is inspired by cu- mulative prospect theory (CPT). In CPT, the probability weighting function has the effect of weighting rare good-news events more than other events. In stochastic optimization, if we can apply this shift in weights iteratively to sampled distribu- tions, we can intuitively understand how we might converge to an optimal value. There are a few desirable properties for stochastic optimization algorithms: 1) we would like the algorithms to increase, in expected value, monotonically; 2) once an optimal solution is obtained, we would like it to be robust again perturbations. As the reader will see in Chapter 3, our method, cumulative weighting optimization (CWO) exhibits both of these properties. We also will develop CWO-based numeri- cal algorithms and present their simulated results in the same chapter. Interestingly, the well-known cross-entropy method is a special case of CWO-based algorithms. In fact, we are able to improve the performance of the cross-entropy method by viewing it as such. 1.1.5 Outline The thesis is organized by chapters. Chapter 2 will prove the suitability of dynamic programming equations for non-convex performance measures, which include CPT- based criteria. We will present both the finite horizon and infinite horizon cases. In addition, we will also analyze the structure of optimal policies yield by CPT- based criteria and compare them with other risk-sensitive performance criteria. In 10 Chapter 3, we will provide convergence proofs for cumulative weighting optimization methods. Numerical examples will be provided to demonstrate the performance of our algorithms. In the last chapter, we will discuss our contributions and future work. 11 Chapter 2 Dynamic Programming with Non-Convex Risk-Sensitive Measures Dynamic programming with risk-sensitive performance measures has applications in many fields (e.g., operations research, finance, control systems). Historically, risk- sensitive performance measures are represented using expected utility functions. More recently, literature on dynamic performance (i.e., risk or reward) measures has inspired an alternative approach to risk-sensitive performance evaluation. The dynamic performance measure framework is a generalization of the classical work using expected value. One limitation of this approach is that it has only been devel- oped for coherent performance measures, which exclude a large class of important non-convex performance measures (e.g., cumulative prospect theory (CPT) based performance measures). We remedy this limitation by proving the optimality of the dynamic programming equation for non-convex performance measures. 2.1 Introduction Dynamic programming, introduced by Bellman [5], is a dynamic optimization method. It has been the subject of intense research in the past five decades; see for ex- ample [6, 10, 15, 32, 63, 48, 73]. Dynamic optimization problems modeled by controlled Markov processes and solved via dynamic programming are commonly referred to as Markov decision processes (MDPs). Researchers have developed tech- 12 niques to lift MDPs? curse-of-dimensionality (e.g., approximate dynamic program- ming [11, 72, 9]), which enable the application of dynamic programming in many fields (e.g., operations research, finance, control systems). In many applications, risk-sensitive measures are more appropriate than risk- neutral measures [52, 56, 57]. In standard MDPs, the performance measures are frequently expressed as expected utility functions that are risk-sensitive [23, 17, 36, 37, 33, 46, 47, 24, 25]. For example, many problems evaluate their outcomes by using the performance measure E [u(X)], where u is a risk-sensitive utility function (e.g., exponential), and X is a random variable representing the reward1. A notable feature of optimal policies, induced by the risk-sensitive performance measures, is their robustness with respect to modeling errors [30]. An important class of risk-sensitive performance measures is coherent risk measures, of which E [u (X)], where u (?) is a convex function, is a special case [3, 28, 40, 41, 64, 79, 78]. Other well known examples include mean-semideviation and conditional value-at-risk. An important property of coherent risk measures is convexity. Recently, their dynamic counterparts have received great interests in the literature [74, 20, 31, 39, 43, 22, 21, 4, 60]. In many problems, convex performance measures are not the best option for measuring the desirability of outcomes. A well-known example of a non-convex performance measure is suggested by Tversky and Kahneman in the cumulative prospect theory (CPT) [83]. Although CPT had its beginning in the 1990s, its incorporation into dynamic systems is still nascent. Recently, He and Zhou have studied [45] a portfolio choice problem with a non- 1Reward is often the sum of per-stage rewards. 13 convex performance measure. The problem maximizes the terminal wealth of a self- financing portfolio2 driven by a financial market3 that is uncontrollable from the perspective of the investor (see [45], Eq. 3). Often, the financial market is assumed to be a Markov semimartingale and has a nonempty set of equivalent martingale measures. Under these assumptions, one can apply the martingale approach (see [68], Chapter 7) to arrive at the desired analytical results. These results become more difficult, if not impossible, to obtain if these assumptions are eliminated. This chapter will study both convex and non-convex performance measures (e.g., CPT- inspired reward measures) when the underlying model is a discrete-time controlled Markov process. The goal of this chapter is to address, when the underlying system is mod- eled as a controlled Markov process, the question: How can we generalize dynamic programming to both convex and non-convex performance measures? An approach, suggested by Ruszczy?ski [76], is based on dynamic risk measures and risk transi- tion mappings (see [77, Definition 5]). Assuming a sequence of time-consistent4 risk measures is given (see [76, Theorem 1] ), he concludes that if the corresponding one- step dynamic risk measures satisfy the four assumptions of coherent performance measures, namely convexity, monotonicity, translation equivalence, and positive homogeneity, and an equivalent Markov risk transition mapping exists for each one- step dynamic risk measure, then a dynamic programming equation exists for the dynamic optimization problem. Unfortunately, since CPT-inspired measures have 2This is just a constraint on the action space of the MDP. 3A special case of a controlled semi-martingale Markov process. 4Time-consistency is key for rewriting the risk measures into their nested forms, which can be easily optimized via dynamic programming. 14 nonlinear weighting functions, they do not satisfy some of these assumptions. We derive the dynamic programming equation for a class of non-convex reward measures (e.g., CPT-inspired reward measures). Our work has many parallels with that of Ruszczy?ski; however our goal is to generalize his approach to non-convex reward measures. Before we proceed, we will review some background material in controlled Markov processes, CPT, reward transition mappings, and dynamic reward measures. 2.2 Background In the following sections, we use the following notations: ? (?)+ := max(0, ?); (?)? := ?min(0, ?); ? P(?): the set of probability measures defined on ?. 2.2.1 Discrete-Time Markov Control Model We are interested in the case when the underlying system dynamics can be modeled as a discrete-time controlled Markov process. Let us first review the necessary technical background for our discussion. We restate the definition from [48] for the reader?s convenience. A Markov control model is a five-tuple, (X,A, {A(x)|x ? X}, Q, r), consisting of: ? a Polish space X, called the state space and whose elements are referred to as states; 15 ? a Polish space A, called the control or action set; ? a family {A(x) ? A|x ? X} of nonempty measurable subsets A(x) of A, where A(x) denotes the set of feasible controls or actions when the system is in state x ? X, and with the property that the set K := {(x, a)|x ? X, a ? A(x)} (2.2.1) of feasible state-action pairs is a measurable subset with respect to the product ?-algebra of X ? A (i.e., ?(X ? A)); ? a stochastic kernel5 Q(?|x, a) on X, where (x, a) ? K; ? a measurable function r:K ? X ? R called the per-stage reward function. Remark 2. We can make A(x) and r time-varying, denoted by At(x) and rt, by considering the state space X? := X ? [0, . . . , T ]. Polish spaces include finite-dimensional real spaces, which are important for many real-life applications (e.g. dynamic pricing). The following definition is useful for describing the set of feasible deterministic and randomized policies. Definition 1. We denote by F the set of all measurable functions f : X ? A such that f(x) ? A(x) for all x ? X. In addition, we let ? denote the set of all 5A stochastic kernel on X given Y is a function P (?|?) such that 1. P (?|y) is a probability measure on X for each fixed y ? Y ; 2. P (B|?) is a measurable function on Y for each fixed B ? B(X). 16 stochastic kernels ? in P(A|X), the set of probability measures on A given X, such that ?(A(x)|x) = 1 for every x ? X. We track a system?s history by doing the following: for each t = 0, 1, . . . , define the space Ht of admissible histories up to time t as H0 := X, and Ht := Kt ? X = K ?Ht?1, ?t = 1, 2, . . . . The most general policies we investigate are randomized policies, which are defined below. Definition 2. A randomized policy is a sequence pi = {pit, t = 0, 1, . . . } of stochastic kernels pit ? P (A|Ht) satisfying the constraint pit(A(xt)|ht) = 1, ?xt ? X, ht ? Ht, t = 0, 1, . . . . The set of all randomized policies is denoted by ?. A special class of randomized polices is the class of randomized Markov poli- cies. Definition 3. A randomized policy, pi ? ?, is a randomized Markov policy if there exists a sequence of stochastic kernels ?t ? ? such that pit(?|ht) = ?t(?|xt) = 1 ?ht ? Ht, t = 0, 1, . . . . 17 The policy pi is a randomized stationary policy if there is a ? ? ? such that pit (?|ht) = ? (?|xt) ?ht ? Ht, t = 0, 1, . . . . We denote the sets of randomized Markov policies and randomized stationary polices by ?RM and ?RS, respectively. Furthermore, if there exists a sequence ft ? F such that ?t(?|xt) is the Dirac measure concentrated at f(xt) for all t = 0, 1, . . . , then pi is a deterministic Markov policy, and pit := ft ? F. We denote the sets of all deterministic Markov policies and deterministic stationary policies by ?DM and ?DS, respectively. By fixing a Markov control model, an initial probability distribution v (e.g., a known initial state x0), and a randomized policy pi, we obtain the probability distribution evolution of a discrete-time Markov process. We denote the resulting discrete-time Markov process and action sequence by {xpit } and {apit } (i.e., apit is a random variable with probability distribution pit(?| {x0, . . . , xt})) respectively. For ease of notation, we drop the process?s dependence on its initial condition, as it is fixed unless stated otherwise. For the rest of the discussion, we are given a fixed Markov control model. 2.2.2 Cumulative Prospect Theory (CPT) Before introducing cumulative prospect theory, we will first introduce a useful defi- nition. 18 Definition 4. A good-news distribution, ?F , of a random variable is defined as ?F (x) := 1 ? F (x) , where F (x) is the cumulative distribution function (CDF). Other names for this distribution are: survival distribution, complementary CDF and reliability distribu- tion. Remark 3. The above definition should be altered if we are given a minimization problem. In that case, a good-news distribution function should be the cumulative distribution function itself, because smaller values are more favorable. Another important element of CPT is probability weighting functions, which are defined below. Definition 5. A probability weighting function, w, is a continuous function from [0, 1] to [0, 1]. Prospect theory was suggested in the 1970s by Kahneman and Tversky [59]. They were unsatisfied with the theory and suggested its improved version, cumula- tive prospect theory (CPT), in the 1990s [83]. CPT asserts that the human decision making process can be modeled by a utility function with the following characteris- tics: ? The utility function has a reference point against which gains and losses are measured; 19 ? The utility function is concave on gains and convex on losses (i.e., horizontal S-shape); ? A probability weighting function that transforms the probability measure such that a small probability is inflated and a large probability is deflated. For example, a typical weighting function w : [0, 1] ? [0, 1] is w(y) : = y ? (y? + (1 ? y)?) 1? , where ? ? (0, 1) and y is usually the good-news distribution. This function was originally presented in [83]. Definition 6. Given a probability space (?,F ,P), and random variables R and B defined on it. A CPT performance measure has the following form: ?(R) : = ? ? 0 w+ ( P ( u+ ((R?B)+)> s))ds ? ? ? 0 w? ( P ( u ? ((R? B) ? ) > s )) ds, (2.2.2) where w+ : [0, 1] ? [0, 1] and w? : [0, 1] ? [0, 1] are two continuous non-decreasing functions. u+ : R+ ? R+ and u? : R+ ? R+ are two utility functions. The random variable B represents the benchmark we measure the performance against. The weighting functions used in a CPT performance measure are required to be non-decreasing, which is not necessarily true for a probability weighting function. We apply a CPT-inspired measure to evaluate the expected outcome of a game of 20 dice in the example below. Example 2. Cumulative Prospect Theory - Finite State Case with no Control: Consider a game of dice. You roll a die with six possible outcomes {1, 2, 3, 4, 5, 6}. If the outcome is even, you win an amount equal to the outcome; on the other hand, if the outcome is odd, then you lose an amount equal to the outcome. Thus, the payoffs are {-5, -3, -1, 2, 4, 6}. Furthermore, the payoffs are organized into gains and losses. The probability of gains is derived by assuming the die is fair and written as {0 : 12 ; 2 : 16 ; 4 : 16 ; 6 : 16}, which is read as the probability of winning 0 is 12 , the probability of winning 2 is 16 and so on. On the down side, a similar calculation leads to {0 : 12 ; ?1 : 16 ; ?3 : 16 ; ?5 : 16}. Since the initial state of the die does not matter in this case, the CPT expected value calculation is as follows: ?V = u+ ((2)+)(w+(12) ? w+(13))+ u+((4)+)(w+(13) ? w+(16)) + u+ ((6)+)(w+(16) ? w+(0))? u?((?5)?)(w?(16) ? w?(0)) ? u ? ((?3) ? )( w ? (13) ? w?( 1 6) ) ? u ? ((?1) ? )( w ? (1 2 ) ? w ? (13) ) = u+ (2) ( w+(12) ? w+( 1 3) ) + u+ (4) ( w+(13) ? w+( 1 6) ) + u+ (6) ( w+(16) ? w+(0) ) ? u ? (5) ( w ? (16) ? w?(0) ) ? u ? (3) ( w ? (13) ? w?( 1 6) ) ? u ? (1) ( w ? (1 2 ) ? w ? (13) ) . Here, we use different probability weighting functions for gains and losses, namely w+ : [0, 1] ? [0, 1] and w? : [0, 1] ? [0, 1]. The two functions, u+ : R+ ? R+ and u ? : R+ ? R+, are two utility functions. In this example, the reference point 21 is assumed to be zero. Remark 4. We presented in Eq. 2.2.2 the most general form of a CPT reward measure. In the sequel, we will study various special cases of this reward measure. For example, we are interested in the case when the rewards are strictly positive. 2.2.3 Reward Transition Mappings In this section, we consider the Markov control model (X,A, {A(x)|x ? X} , Q, r) . The discrete-time Markov process resulting from applying the policy pi and the cor- responding action sequence are denoted by {xpit } and {apit }, respectively. A standard finite-horizon dynamic control problem has the following performance measure: max pi?? E [ T?1? t=0 r(xpit , apit , xpit+1) + rT (xpiT )|x0 ] , (2.2.3) where rT is a measurable terminal reward function. We would like to solve the optimization problem V ?T (x) := max pi?? VT (x, pi). From [48], the corresponding dynamic programming equation is vt(x) = max ??P(A(x)) ? X?A (r(x, a, y) + vt+1(y))Q(dy|x, a)?(da). (2.2.4) 22 Remark 5. Under some assumptions (i.e., the existence of a measurable deterministic selector), ? ? P(A(x)) in Eq. 2.2.4 can be replaced by ? ? A(x) to reflect the fact that a deterministic optimal policy exists. The right-hand side of Eq. 2.2.4 is a function of the current state x, the reward function parameterized by the current state x (i.e., gx(a, y) := r(x, ?, ?) + vt+1(?) : A ? X ? R), the transition probability Q, and the randomized control ?. Taking one step further, we can define a function ?t (r(x, ?, ?) + vt+1(?), x, ? ?Q(?|x, ?)) := ? X?A (r(x, a, y) + vt+1(y))Q(dy|x, a)?(da), and rewrite Eq. 2.2.4 as vt(x) = max ??P(A(x)) ?t (r(x, ?, ?) + vt+1(?), x, ? ?Q(?|x, ?)) . The sequence, {?t, t = 0, . . . , T ? 1}, is called the reward transition mappings for Eq. 2.2.4. Before we can provide the definition for reward transition mappings, we need to define the term ? ?Q(?|x, ?) in the equation above. Definition 7. Given a fixed current state x ? X and a randomized action ? ? P(A), we denote the one-step state-action measure (see [18]) with respect to the Markov control model by: [? ?Qx](Ba ?By) := ? Ba Q(By|x, a)?(da) By ? B(X) Ba ? B(A), (2.2.5) 23 where Qx(a) := Q(?|x, a) : A ? P(X) is the stochastic kernel parameterized by x ? X. Remark 6. The one-step state-action measure is a measure over X ? A, which rep- resents the uncertainty over the next state and the current action (i.e., we are interested in randomized policies). We need to define the space that contains ? ?Qx, which was also mentioned in ?avu? & Ruszczy?ski [18]. Given a probability space (X ? A,B(X ? A), P0), where P0 is some reference probability measure, the space of p-integrable random variables is denoted by V := Lp(X ? A,B(X ? A), P0), p ? [1,?). Its dual space, V ?, is the space of signed measures on (X ? A,B(X ? A)), which are absolutely continuous with respect to P0 with their densities in Lq(X ? A,B(X ? A), P0), where q satisfies the equation 1p + 1q = 1. The reference measure, P0, should be chosen such that all possible measures of the form ? ? Qx are in V ?. In the special case of a finite state and control space, P0 can always be chosen to be uniform. We denote the set of all probability measures in V ? by: M := {m ? V ?|m(X ? A) = 1, m ? 0} . Remark 7. The measure defined by Eq. 2.2.5 is an element of M. The space V ? (and thus M) is endowed with the Prokhorov topology (weak convergence). For p ? [1,?) we will endow V with the strong (i.e., norm) topology. 24 However, if p = ?, we will endow V with the topology induced by the form: ??,m? = ? X?A ?(x, a)m(dx, da), ? ? V , m ? V ?. Definition 8. A mapping ? : V ? X ? M ? R is a reward transition mapping if for every x ? X and every m ? M fixed (denote ?(?) := ?(?, x,m)), the following conditions are true: 1) if ? ? ? then ?(?) ? ?(?), ??, ? ? V ; 2) ?(??) = ??(?), ?? ? V , ? ? 0. This definition is more general than Definition 3.1 in [18]. Since CPT inspired performance measures are distorted by nonlinear probability weighting functions, they generally do not satisfy the convexity and translation invariant requirements satisfied by convex risk measures. By removing these two requirements, we are able to work with non-convex CPT inspired performance measures. 2.2.4 Generalized Markov Dynamic Reward Measures The definition of dynamic reward measures varies based on the objective of the analysis. The dynamic risk measure community, for example [76], defines dynamic performance measures based on coherent risk measures. Since our goal is to define a class of dynamic performance measures that contains non-convex performance measures (e.g., CPT inspired reward measures), we need to modify the definitions used by the dynamic risk measure community. However, because we are maximizing rewards rather than minimizing risks, our definitions are defined by switching the 25 direction of the analogous inequalities. Given a filtered probability space (?,F , {Ft}, P ) with F0 = {?, ?}, we define the spaces Lt = Lp(?,Ft, P ), p ? [1,?], t = 0, 1, . . . , T, and Lt,T = Lt ? ? ? ? ? LT . Remark 8. Given a Markov control model, ? can be thought of as HT and FT as ? (HT ) . Definition 9. A mapping ?t,T :Lt,T ? Lt, where 1 ? t ? T , is called a conditional reward measure, if it has the following monotonicity property: Z ? W implies ?t,T (Z) ? ?t,T (W ), ?Z,W ? Lt,T . Definition 9 was first presented by Ruszczy?ski in [76]. The inequality above is meant to be component-wise almost surely. Intuitively, the definition above says a conditional reward measure preserves the order of the rewards. Furthermore, taking R ? Lt,T to be a sequence of future rewards, ?t,T (R) gives the price, at time t, that one is willing to pay to obtain the payoff sequence R. Definition 10. A dynamic reward measure is a sequence of conditional reward measures {?t,T , t = 1, . . . , T}. In other words, a dynamic reward measure is a time-varying mapping that reflects the present value of a sequence of future rewards. It can be utilized as a performance measure in many real-life scenarios. One important concept in dynamic reward measures is time-consistency, which is defined below. 26 Definition 11. A dynamic reward measure {?t,T}Tt=1 is called time-consistent if for all 1 ? ? < ? ? T and all sequences Z,W ? L?,T the conditions Zk = Wk, k = ?, . . . , ? ? 1 and ??,T (Z?, . . . , ZT ) ? ??,T (W?, . . . ,WT ) imply that ??,T (Z? , . . . , ZT ) ? ??,T (W? , . . . ,WT ). In applications, a time-consistent dynamic reward measure can be more con- veniently represented by its corresponding sequence of one-step conditional reward measures, whose definition is given below. Definition 12. A mapping ?t : Lt+1 ? Lt is called a one-step conditional reward measure if ?t(Z) = ?t,t+1(0, Z), Z ? Lt+1. For this thesis, we are only interested in one-step conditional reward measures that satisfy the assumption below. Assumption 1. A one-step conditional reward measure satisfies the following con- ditions: 1. If Z ? W then ?t(Z) ? ?t(W ), ?Z,W ? Lt+1; 2. ?t(?Z) = ??t(Z), ?Z ? Lt+1, ? ? 0. Below are several one-step conditional rewards that satisfy Assumption 1. 27 Example 3. The following reward measures are both convex reward measures. Mean-semideviation model: ?t(Zt+1) = E [Zt+1|Ft] + ?E[((Zt+1 ? E [Zt+1|Ft])+)r |Ft]1r . Here, r ? [1, p] and ? ? [0, 1] may be any Ft-measurable random variables. Another interesting example is the Average Conditional Value at Risk: ?t(Zt+1) = inf U?Lt { U + 1 ? E [(Zt+1 ? U)+ |Ft]} , where the infimum is point-wise, and ? is any Ft-measurable function with values in an interval [?min, ?max] ? (0, 1). The next example is an example of a non-convex reward measure in Zt+1. Example 4. Cumulative Prospect Theory: ?t(Zt+1) = ? ? 0 w+t ( P ( u+t ((Zt+1) +) > s )) ds? ? ? 0 w?t ( P ( u?t ((Zt+1) ? ) > s )) ds, (2.2.6) where w+t , w+t , u?t , and u?t are Ft-measurable functions with values in the function spaces [0, 1] ? [0, 1], [0, 1] ? [0, 1], R+ ? R+ and R ? R+, respectively (see Eq. 2.2.2), and P is an appropriate probability measure. Here, the benchmark random variable B in Eq. 2.2.2 is zero. Remark 9. The performance measures in Example 3 satisfy the convexity and trans- 28 lation invariance assumptions in Ruszczy?ski?s work, whereas the performance mea- sure (i.e., Eq. 2.2.6) in Example 4 does not. Eq. 2.2.6 is the main motivation for us to generalize Ruszczy?ski?s approach. Applying a one-step conditional reward measure to a controlled Markov pro- cess, we ideally would like to obtain an optimal Markov policy. However, we cannot expect this to be true in general, because the one-step reward measure could depend on the past history of the underlying Markov process (i.e., ht). In order to over- come this difficulty, we follow Ruszczy?ski?s ([76]) definition of the one-step Markov conditional reward measure. 2.2.4.1 Markov Conditional Reward Measures As we mentioned in the previous section, one-step conditional reward measures might not be Markov. However, if a one-step conditional reward has a corresponding reward transition mapping, then it only depends on the current state of the system, hence it is Markov. The following condition is important for the integrability of Markov conditional reward measures. Definition 13. A function g is said to be b-bounded if ?C > 0 and b : X ? [1,?), b ? V and |g(x, a, y)| ? C (b(x) + b(y)) , ?x ? X, a ? A(x), y ? X. We denote the function g(x, a, y) : X ? A ? X with the x argument parame- terized by gx : A ? X ? R (i.e., gx(a, y) := g(x, a, y)). In addition, the notation pit,x 29 denotes the measure pit(?|x) ? P(A). We remind the reader that Qt,x, the transition probability at time t, is a mapping a ? Qt(?|x, a). We consider the filtered probability space (HT , ?(HT ),Ft,Ppi) , where Ft is the ?-field generated by the state-action trajectory (i.e., {xpi0 , api0 , . . . , xpit } ) of the controlled Markov process {xpit }. The space Lt in the definition below is defined with respect to the filtered probability space (HT , ?(HT ),Ft,Ppi) . More specifically, elements of Lt are functions of {xpi0 , api0 , . . . , xpit } . Definition 14. A one-step conditional reward measure ?t : Lt+1 ? Lt is a Markov reward measure with respect to a controlled Markov process {xpit } and its controls {apit }, if there exists a reward transition mapping ?t : V ?X?M ? R, such that for any b-bounded measurable functions g : X ? A ? X ? R, there is a feasible control pit : X ? P(A(x)) such that the following equation holds ?t ( g(xpit , apit , xpit+1) ) = ?t(gxpit , xpit , pit,xpit ?Qt,xt), a.s. (2.2.7) Remark 10. In the sequel, we use the term one-step Markov reward measure for both ?t and its corresponding ?t. Furthermore, the right-hand side of Eq. 2.2.7 can be thought of as a function parameterized by the current state xpit . Definition 15. A one-step conditional reward measure ?t is Markov, if ?t is a Markov reward measure with respect to all feasible controlled Markov processes and controls {{xpit } , {apit } |pi ? ?} and ?t is the same for all pi ? ?. Furthermore, a dynamic reward measure {?t} is Markov, if each of the one-step conditional reward measure ?t is Markov. 30 In other words, if a conditional reward measure ?t is Markov, then we can replace it with its Markov counterpart ?t when calculating the reward at time t. 2.3 Dynamic Programming 2.3.1 Finite-Horizon Given a time-consistent dynamic reward measure {?t,T}T?1t=0 and its corresponding one-step dynamic reward measure {?t}T?1t=0 , we can write the corresponding value function starting at x0 with a control policy pi ? ? and the resulting state-action trajectory {xpi0 , api0 , . . . , xpiT} as: VT (x0, pi) = ?0 (r(xpi0 , api0 , xpi1 ) + ?1 (r(xpi1 , api1, xpi2 ) +?2 (r(xpi2 , api2, xpi3 ) + ? ? ? + ?T?1 ( r(xpiT?1, apiT?1, xpiT ) + rT (xpiT ) ) ? ? ? ))) . The equation above is obtained by applying Definition 3 and Theorem 1 in [76] 6. We are interested in optimization problems of the form: V ?T (x0) := max pi?? VT (x0, pi). (2.3.1) In the rest of this section, we prove the optimality of the dynamic programming equation that solves this optimization problem. The state space X is extended with time variable (i.e., X ? [0, . . . T ]) to model the time-varying nature of the reward 6The policies considered in [76] are deterministic, but we are considering randomized policies. 31 functions rt, and action space constraint At(x). The extended state space is denoted by X?. Theorem 1. Assume the following conditions hold: 1) ?x ? X, the stochastic kernels Qt,x : a ? Qt(?|x, a) are continuous; 2) The one-step dynamic reward measure {?t}T?1t=0 is Markov (see Definition 15), and there exists a sequence of corresponding reward transition mappings ?t : m ? ?(?, x,m), t = 0, . . . , T ? 1 that are upper semi-continuous; 3) The functions {rt(?, ?, ?)}T?1t=0 are b-bounded, measurable, and a ? rt(?, a, ?) is upper semi-continuous; 4) For every x ? X and t ? [0, . . . , T ? 1] the set At(x) is compact; 5) The function rT (?) is b-bounded and measurable; Then a maximizer for the dynamic programing equation: vt(x) = max ??P(A(x)) ?t (rt(x, ?, ?) + vt+1(?), x, ? ?Qt,x) vT (x) = rT (x) x ? X, t = 1, . . . , T ? 1, (2.3.2) exists. Furthermore, an optimal policy, pi? := { pi?0, pi ? 1, ? ? ? pi ? T?1 } exists and each pi?t (x) is a maximizer for the right-hand side of Eq. 2.3.2 at time t for all x ? X; In addition, every measurable solution of Eq. 2.3.2 at time 0, v0, is an optimal solution for Eq. 2.3.1. Proof. Let pi denote an arbitrary randomized policy and {xpi0 , api0 , . . . , xpiT}, the re- sulting state-action trajectory of the controlled Markov process. We denote the 32 reward-to-go function by : Rt(x, pi) : = ?t(r(xpit , apit , xpit+1)+ ?t+1 ( r(xpit+1, apit+1, xpit+2) + ? ? ? + ?T?1 ( r(xpiT?1, apiT?1, xpiT ) + rT (xpiT ) ) ? ? ? )) RT (x, pi) : = rT (x). This is the total reward from time t onwards when the policy pi is applied at the initial state x. In particular, we know VT (x, pi) = R0(x, pi). We first prove that a solution to Eq. 2.3.2 exists. By assumption 1, Qt,x is a continuous stochastic kernel, which implies ? ? Qt,x : P (A) ? M is continuous in ?. Here, Qt,x : A ? P (X) is the stochastic kernel parameterized by x at time t (see Definition 7). By assumption 2, we know that ? ? ?(?, x, ? ? Qt,x) is upper semi- continuous in ?. Assumptions 3, 4, 5 imply that the set P(A(x)) is weakly-compact, hence a maximizer exists for Eq. 2.3.2. We denote an optimal policy by pi? = { pi?0, . . . , pi ? T?1 } , where pi?t (x) is a maximizer for Eq. 2.3.2 for all x ? X. We need to show that for 33 t = 0, . . . T , Rt(x, pi) ? vt(x), (2.3.3) and with equality if pi = pi?, i.e., Rt(x, pi?) = vt(x). (2.3.4) In particular, if Eq. 2.3.3 is true, we have VT (x, pi) = R0(x, pi) ? v0(x) and VT (x, pi?) = R0(x, pi?) = v0(x), which prove the statement regarding v0(x) being the solution for the optimization problem stated in Eq. 2.3.1. We show Eq. 2.3.3 to be true by backward induction. We first note the fact that RT (x, pi) = vT (x) = rT (x). Assuming the induction hypothesis that for some t = T ? 1, . . . , 0, Rt+1(x, pi) ? vt+1(x), x ? X, the reward-to-go equation at time t satisfies the following inequalities: Rt(x, pi) = ?t(r(x, apit , xpit+1) + Rt+1(xpit+1, pi)) ? ?t ( r(x, apit , xpit+1) + vt+1(xpit+1) ) ? max ??P(A(x)) ?t (rt(x, ?, ?) + vt+1(?), x, ? ?Qx) := vt(x). 34 The second line in the equation above is due to part 1 of Assumption 1 (i.e., monotonicity) and the induction hypothesis. The third line in the equation above is true by the virtue of ?t being Markov (the second assumption of this theorem). This proves Eq. 2.3.3. If we assume Rt+1(x, pi?) ? vt+1(x), x ? X, then we conclude Rt(x, pi?) ? vt(x) using a similar induction argument as above, which proves Eq. 2.3.4. It should be easy to see that RT?1(x, pi?) = vT?1(x), since pi? is in ? by definition. Repeating the same steps as above for T ? 2, T ? 3, . . . , 0, we obtain the desired result V ?T (x) = VT (x, pi?) = R0(x, pi?) = v0(x). 2.3.1.1 Application: Cumulative Prospect Theory Measures We assume that we are given a one-step dynamic reward measure of the form in Eq. 2.2.6, where u+(x) = x and u?(x) = x. We would like to evaluate the performance of the random variable ?x at each time t, assuming the dynamic reward measure is Markov, and the following reward transition mapping satisfies Eq. 2.2.7: ?t(?x, x,m) = ? ? 0 w+ ( m ((?x)+ > s))ds ? ? ? 0 w? ( m ((?x) ? > s )) ds, (2.3.5) where ? is a B(K ? X)-measurable random variable (e.g., ? = r + v), and m ? M. We denote the function ?(x, ?, ?) by ?x ? Lt+1, which is a B(X ? A)-measurable 35 random variable. w+ : [0, 1] ? [0, 1] and w? : [0, 1] ? [0, 1] are two continuous monotonically non-decreasing functions. We would like to apply Theorem 1 to Eq. 2.2.6, given the assumption that it is Markov, by proving that Eq. 2.3.5 is a reward transition mapping. Theorem 2. ?t defined by equation 2.3.5 is a reward transition mapping. Further- more, it is continuous in m. Proof. First we need to show that Eq. 2.3.5 satisfies the two properties in Definition 14. 1) prove: if ?x ? ?x then ?(?x) ? ?(?x), ??x, ?x ? V ; We need to break ?t into two parts, ? ? 0 w +(m((?x)+ > s))ds and ? ? 0 w ? ( m ((?x) ? > s )) ds. We first look at ? ?0 w+ ( m ((?x)+ > s))ds. Since ?x ? ?x, we have m ((?x)+ > s)? m((?x)+ > s). Using the fact that w+ is a monotonically non-decreasing function, we have w+ ( m ((?x)+ > s))? w+(m((?x)+ > s)), which implies ? ? 0 w+ ( m ((?x)+ > s))ds ? ? ? 0 w+ ( m ((?x)+ > s))ds. Similarly, we have m ((?x) ? > s ) ? m ((?x) ? > s ) , using the fact the w? is a mono- tonically non-decreasing function, we have w? ( m ((?x) ? > s )) ? w? ( m ((?x) ? > s )) , 36 which implies ? ? 0 w? ( m ((?x) ? > s )) ds ? ? ? 0 w? ( m ((?x) ? > s )) ds. Conclusion 1 follows from the previous inequality. 2) prove: ?(??x) = ??(?x), ??x ? V , ? ? 0; Since ? ? 0 w+ ( m ((??x)+ > s))ds? ? ? 0 w? ( m ((??x) ? > s )) ds =? ? 0 w+ ( m ( (?x)+ > s? )) ds? ? ? 0 w? ( m ( (?x) ? > s ? )) ds, we do a change of variable with z = s? . Rewriting the equation in terms of z we have ? (? ? 0 w+ ( m ((?x)+ > z))dz ? ? ? 0 w? ( m ((?x) ? > z )) dz ) . To prove the continuity of ?t, we explicitly prove that ? ? 0 w+ ( m ((??x)+ > s))ds is continuous in m, because the proof for the second part of the equation will follow similarly. We prove the continuity of ?t by appealing to the fact that the sum of two continuous functions is continuous. 37 We denote the Prokhorov metric (see [16, Section 2.1]) by: d(?, ?) := inf {|?(A) ? ?(A) + , ?(A) ? ?(A) +  ?A ? B(X ? A)} . For the purpose of readability, we define the function f ?,??x (s) = ??w+ ( ? ((?x)+ > s))? w+(?((?x)+ > s))?? , and its associated sets B?1 = { s : s ? [0,M ], f ?,??x (s) ? ?1 } , and ?B?1 = { s : s ? [0,M ], f ?,??x (s) > ?1 } . Since the total reward is the sum of a finite number (i.e., finite-horizon) of per-stage rewards, it is bounded by M ? R. Given an arbitrary  > 0, we need to find a ?1 such that it satisfies the following 38 equations: ???? ? M 0 w+ ( ? ((?x)+ > s))ds? ? M 0 w+ ( ? ((?x)+ > s))ds???? ? ? M 0 ??w+ ( ? ((?x)+ > s))? w+(?((?x)+ > s))?? ds ?? M 0 1B?1?1ds+ ? M 0 1 ?B?1 ds = ?1 ?M ? ? M 0 1B?1 M ds+M ? ? M 0 1 ?B?1 M ds = ?1 ?M ? ? M 0 1B?1 M ds+M ? ( 1 ? ? M 0 1B?1 M ds ) ? 2 ? ?1 ?M = . By letting 1 > ?1 = 2M > 0, the last line in the equation above holds. Our goal is to prove that given an arbitrary ?1, there always exists a ? ? ?1 such that all ? in the ?-neighborhood of ? satisfy the following ? M 0 1B?1 M ds ? 1 ? ?1, which implies ?1 ?M ? ? M 0 1B?1 M ds+M ? ( 1 ? ? M 0 1B?1 M ds ) ? . 39 Since w+ is continuous, for any ?1 > 0 there exists a ?2 such that ??? ((?x)+ > s)? ?((?x)+ > s)?? ? ?2 =? ??w+ ( ? ((?x)+ > s))? w+(?((?x)+ > s))?? ? ?1. Hence, for any ?1 > 0 we can always find a ?2 such that ? M 0 1|?((?x)+>s)??((?x)+>s)|??2 M ds ? 1 ? ?1 =?? M 0 1B?1 M ds ? 1 ? ?1. (2.3.6) From the Markov inequality, we have ? M 0 1|?((?x)+>s)??((?x)+>s)|??2 1 M ds ? 1 ? ?M 0 ??? ((?x)+ > s)? ?((?x)+ > s)?? 1M ds ?2 . (2.3.7) Next, we need to find a ? such that the following equations hold: ? M 0 ??? ((?x)+ > s)? ?((?x)+ > s)?? 1Mds ?? M 0 d(v, ?)2 1 M ds ? ?2 = ?1 ? ?2. Finally, letting ? := +??1 ? ?2, it is true that for any v in the ?-neighborhood of ?, 40 the following equations hold: ? M 0 ??? ((?x)+ > s)? ?((?x)+ > s)?? 1Mds ? ?1 ? ?2 =? 1 ? ?M 0 ??? ((?x)+ > s)? ?((?x)+ > s)?? 1M ds ?2 ? 1 ? ?1 ? ?2 ?2 = 1 ? ?1 =? ? M 0 1|?((?x)+>s)??((?x)+>s)|??2 1 M ds ? 1 ? ?1 =? ? M 0 1B?1 M ds ? 1 ? ?1 (Eq. 2.3.6). The second implication is due to Eq. 2.3.7 and the third implication is due to Eq. 2.3.6. The second assertion of the theorem is proved. Below is an example where we use a Markov CPT dynamic reward mea- sure. Example 5. The following example attempts to explain why people become en- trepreneurs. We assume a person could be in several states {poor, middle, upper- middle, super-rich}. If one decides to become an entrepreneur, one has the following transition probability matrix. poor middle upper-middle super-rich poor .999 0 0 .001 middle .999 0 0 .001 upper-middle .999 0 0 .001 super-rich .001 0 0 .999 Table 2.3.1: Transition probability matrix for becoming an entrepreneur One could also choose to pursuit a normal job with the following transition 41 probabilities. poor middle upper-middle super-rich poor 0 1 0 0 middle 0 0 1 0 upper-middle 0 0 1 0 supper-rich 0 0 0 1 Table 2.3.2: Transition probability matrix for taking a normal job The action space is {entrepreneur (E), normal (N)}. We define the random variable xt to represent the current state of the controlled Markov process: xt(?) = ? ?????????????? ?????????????? 1 ? = poor 2 ? = middle 3 ? = upper-middle 4 ? = super-rich . In this example, the per-stage reward function is given as: r(x, a) := ? ?????????????? ?????????????? x? 1/x x ? 3 and a = E x x ? 3 and a = N 100 ? 1/x x > 3 and a = E 100 x > 3 and a = N , 42 and the terminal reward function is: r2(x) := ? ???? ???? x x ? 3 100 x > 3 . We want to solve the optimization problem stated in Eq. 2.3.1 given a dynamic reward measure of the form in Eq. 2.2.67: ?t(Zt+1) = ? ? 0 w+(P ((Zt+1|xt) > s))ds, with the nonlinear weighting function w+(F ):=e??(?ln(F ))? , where 0 < ? < 1 and ? > 0. For the purpose of this example, we take ? to be 0.9 and ? to be 0.5. Since we are only dealing with positive rewards in this example, w? and u ? need not be given. Furthermore, we note that the dynamic reward measure is Markov and has a sequence of transition mappings ?t of the form: ?t(r + vt+1, xt, ? ?Qxt) =? ? 0 w+ (? ?Qxt (r + vt+1(xt+1) > s)) ds. The table below shows the value function at times 0 and 1 by applying the 7For simplicity, we dropped u+ and u?. 43 dynamic programming equation vt(x) = max pE?[0,1] {? ? 0 w+ (Qx,E (r(x,E) + vt+1(xt+1) > s) pE +Qx,N (r(x,N) + vt+1(xt+1) > s) (1 ? pE)) ds} , where v2 = r2 and the variable pE is the probability of becoming an entrepreneur. Time=1 x1 pE v1(x1) poor 0.850471 7.1229 middle 0.787238 8.81843 upper-middle 0.808817 9.91617 super-rich 0 200 Time = 0 x0 pE v0(x0) poor 0.919031 18.6786 middle 0.886583 20.3504 upper-middle 0.896001 21.4663 super-rich 0 300 Table 2.3.3: An optimal solution for Ex. 5 (a value function and an optimal policy) at time 0 and 1. Since Table 2.3.3 above shows the likelihood of becoming an entrepreneur is higher if one is younger, it agrees with our intuition that one should pursue entrepreneurship while still young. For example, an individual, starting out poor, should be entrepreneurial almost 92% of the time; On the other hand, if the same individual is a year older, he or she should only be entrepreneurial 85% of the time. This result also agrees with our tendency to become more risk-averse as we grow older. Our approach yields an optimal randomized policy, which is different from 44 the standard approach (see Eq. 2.2.3), where an optimal solution is deterministic. Non-convex reward measures are useful for modeling many real-life problems. More specifically, CPT-inspired reward measures are derived from experimental data and have been proven to model several key characteristics of human behavior well. In this section, we proved the optimality of dynamic programming equations for the optimization problem described by Eq. 2.3.1. In addition, we provided a numerical example demonstrating the intuitiveness of the optimal policies obtained. In the next section, we will apply dynamic programming to infinite-horizon MDPs with non-convex reward measures. 2.3.2 Discounted Infinite-Horizon As in the finite-horizon case, we assume that we are given a time-consistent dynamic reward measure {?t,?}?t=0 and its corresponding time-invariant one-step dynamic reward measure {?}. Here, ? does not depend on t anymore. From Definition 3 and Theorem 1 in [76], we write the corresponding value function starting at x0 with a control policy pi and the resulting state-action trajectory {xpi0 , api0 , . . . , xpiT} as: V (x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 ) +?? (r(xpi2 , api2 , xpi3 ) + ? ? ? + ?? (r(xpi?, api?, xpi?)) ? ? ? ))) . 45 We would like to consider the following optimization problem with ? ? (0, 1): V ? (x0) := max pi?? V (x0, pi). (2.3.8) In this section, we are interested in the case when r : K ? X ? R is bounded, i.e., ?r? ? R+ such that |r| ? r?. We assume r to be a non-positive valued function. The non-negative case can be argued by symmetry. We denote the t-stage-reward function resulting from applying a policy pi by: Jt(x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 ) +?? (r(xpi2 , api2 , xpi3 ) + ? ? ? + ??(r(xpit?1, apit?1, xpit ))? ? ?))). (2.3.9) Since r ? 0 and Vt ? Vt?1 ? 0 , by the Monotone Convergence Theorem, we may write V (x0, pi) = lim t?? Jt (x0, pi) , ?pi ? ?. From Section 2.3.1, we know the solution to Eq. 2.3.9 can be obtained by iterating the following equation: vt (x) := max ??P(A(x)) ? (r(x, ?, ?) + ?vt?1(?), x, ? ?Qx) . In other words, vt (x) = max ? Jt (pi, x) , ?x ? X. 46 The equation above is the backward form of the finite-horizon dynamic pro- gramming equation; this is different from the forward equation we used in the pre- vious section. In the finite-horizon case, we are able to prove the optimality of the dynamic programming equation (i.e., Eq. 2.3.2) by backward induction; this ap- proach will not suffice in the infinite-horizon case. In the infinite-horizon case, we need to appeal to the Banach fixed-point theorem to prove the existence of a mea- surable function v? such that v? = Tv?. Lastly, we need to prove that the solution v? is indeed equal to V ? (x) := max ? V (x, pi) , ?x ? X. We used [48, 49] as the main technical references for the proofs below. Definition 16. Let M (X)? denote the cone of non-positive measurable functions on X. For every v ? M (X)?, Tv is defined as a mapping from X, i.e., Tv (x) := max ??P(A(x)) ? (rx + ?v, x, ? ?Qx) , ?x ? X, where rx is the reward function with x held fixed, i.e., rx := r (x, ?, ?) . The existence of a measurable selector8 is important in proving the optimality of the dynamic programming equation. Lemma 1. Assuming the function ? (rx + ?v, x, ? ?Qx) 8See appendix for the definition of a measurable selector. 47 is upper semi-continuous (u.s.c) in ?, r is a non-positive valued function, and P (A (x)) is compact-valued, then T maps M (X)?into itself, i.e., for every v in M (X)?, Tv is also in M (X)?, and moreover, there exists a measurable selector ? : X ? P (A) with ? (x) ? P (A (x)) such that ? (rx + ?v, x, ? (x) ?Qx) = max ??P(A(x)) ? (rx + ?v, x, ? ?Qx) , ?x ? X. Proof. This follows from Proposition 6 in the appendix. The lemma above is important for ensuring the value function is measurable. The reader might notice that ? can be used to construct a stationary policy pi = {?, ?, . . . } , which will be used to prove the optimality of stationary Markov polices. The following Lemma is used in the upcoming theorem. Lemma 2. If u ? M (X)? is such that u ? Tu, and r is a non-positive valued function, then u ? V ?. Proof. Assuming u ? Tu and using Lemma 1, we write the following inequality: u (x) ? ? (r (x, ?, ?) + ?u, x, ? (x) ?Qx) . 48 Iterating this inequality, we obtain u (x) ? ? (r(x, api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 ) + ?? (r(xpi2 , api2 , xpi3 ) + ? ? ? + ??(r(xpin?1, apin?1, xpin) + ?u (xpin))? ? ?))), ?n ? 1, x ? X, (2.3.10) where pi = {?, ?, . . . }. In the inequality above, we used the short-hand notation ? (r + ?u) := ? (r + ?u, x, ? (x) ?Qx) , and {xpit } is the resulting process from applying the policy pi. Applying the fact that ?u (xpin) ? 0 to Eq. 2.3.10, we conclude the following inequality: u (x) ? ? (r(x, api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 ) + ?? (r(xpi2 , api2 , xpi3 ) + ? ? ? + ??(r(xpin?1, apin?1, xpin))? ? ?))), ?n ? 1, x ? X. By letting n ? ?, the inequality above yields u (x) ? V (x, pi) ? V ? (x) ?x ? X. 49 Theorem 3. Assume the following conditions hold: 1) The stochastic kernels Qx : a ? Q(?|x, a) are continuous ?x ? X; 2) The one-step dynamic reward measure ? is Markov (see Definition 15), and a sequence of corresponding reward transition mappings ? : m ? ?(?, x,m) is upper semi-continuous; 3) The function r(?, ?, ?) is bounded, measurable, and a ? r(?, a, ?) is upper semi-continuous in a; 4) For every x ? X the set A(x) is compact; 5) ? ? (0, 1) . Then a maximizer for the dynamic programing equation v(x) = max ??P(A(x)) ? (r(x, ?, ?) + ?v(?), x, ? ?Qx) ?x ? X, (2.3.11) exists. Furthermore, an optimal policy, pi? := {??, ?? ? ? ? } exists and each ?? is a maximizer for the right-hand side of Eq. 2.3.11. In addition, every bounded measurable solution of Eq. 2.3.11 is an optimal solution for Eq. 2.3.8. Proof. Since A (x) is compact for every x, we know P (A (x)) is also compact-valued. We want to show that the operator Tv := max ??P(A(x)) ? (r + ?v, x, ? ?Qx) is a contraction on the space of bounded measurable functions endowed with the 50 supremum norm ?v? := sup X |v| . We first prove that a solution to Eq. 2.3.11 exists. By assumption 1, Qx is a continuous stochastic kernel, which implies ? ? Qx : P (A) ? M is continuous in ?. Here, Qx : A ? P (X) is the stochastic kernel parameterized by x at time t (see Definition 7). By assumption 2, we know that ? ? ?(?, x, ? ? Qx) is upper semi-continuous in ?. Assumptions 3 and 4 imply that the set P(A(x)) is weakly- compact; hence a maximizer exists for Eq. 2.3.11. This also proves the existence of pi? := {??, ??, ? ? ? }. Next, the mapping T : M (X)? ? M (X)? satisfies: Tv : = max ??P(A(x)) ? (r + ?v, x, ? ?Qx) = ? (r + ?v, x, ? (x) ?Qx) = ? (r + ? (v? + (v ? v?)) , x, ? (x) ?Qx) ? ? (r + ?v?, x, ? (x) ?Qx) + ? sup X |v ? v?| ? max ??P(A(x)) ? (r + ?v?, x, ? ?Qx) + ? sup X |v ? v?| = Tv? + ? sup X |v ? v?| =? Tv ? Tv? ? ? sup X |v ? v?| ?x ? X =? sup X |Tv ? Tv?| ? ? sup X |v ? v?| , where ? is an optimal measurable selector and its existence is ensured by Lemma 1. 51 Hence, by appealing to Banach?s Fixed-Point Theorem for contraction map- pings and assumptions 3 and 5 of this theorem, we conclude there exists a unique function v? ? V such that Tv? = v?. Finally, we need to prove that v? is a measurable solution to Eq. 2.3.8. We need to prove this fact in two steps: v? ? V ? and v? ? V ?. From the fact that Tv? = v? and Lemma 2 we conclude that v? ? V ?. To prove the inequality in the other direction, we know from the finite-horizon case with the reward-to-go function denoted by Rn, we have vn ? Rn (x, pi) ? V (x, pi) , ?n ? {0, 1, 2, . . . } , ?n, ?pi ? ?, ?x ? X, which implies vn ? Rn (x, pi?) ? V ? (x) ?n, ?x ? X, where pi? = {??, ??, . . . } is constructed using the maximizer, ??, of Eq. 2.3.11. The operator T is monotone, i.e., if u and u? are functions in M (X)? and u ? u?, then Tu ? Tu?. Since v0 := 0 and vn := Tvn?1 for n ? 1, vn form a non-increasing sequence in M (X)? converging to some function v? ? M (X)? . Since vn ? v? due to the monotone convergence theorem, and vn ? V ?, we conclude that v? ? V ?. The desired conclusion is reached given the fact the policy pi? = {??, ??, . . . } ? ?. If the per-stage reward is both positive and negative, we defer it to the later transient case; as the discounted infinite-horizon problem can be rewritten into a transient problem by adding an absorbing state. The requirement that the reward 52 function, r, be a bounded measurable function can be relaxed in the previous the- orem. Of course, the proof for the relaxed case will be different. Due to space limitations, we do not explore alternatives with the relaxed assumption on the re- ward function here. For the interested reader, the proof for the existence of an optimal policy with the relaxed assumption on the reward function can be found in [48] for the standard expected value measure, which can be adapted for our case. We present a numerical example, similar to the finite-horizon case, which will demonstrate the type of policies expected from the CPT-based reward measures. In the example below, we also compare an optimal solution of the expected value (standard reward measures) with that of the CPT-based reward measures. Example 6. As in the finite-horizon case of Example 5, its infinite discounted coun- terpart tries to explain why people become entrepreneurs. The major difference here is we are no longer given a terminal reward function. The transition probabilities and the per-stage reward function used for this numerical example can be found from Example 5. We calculate the discounted infinite-horizon counterpart with the nonlinear weighting function w+(F ):=e??(?ln(F ))? , where 0 < ? < 1 and ? > 0. For the purpose of this example, we take ? to be 0.9 and ? to be 0.5. Furthermore, we assume the discount factor, ?, to be 0.5. The tables below summarize our numerical results. We notice that the value function given by the CPT measures is higher than that of the expected value. In addition, using CPT measures will yield a more risk-seeking optimal policy. 53 x0 P(entrepreneur) v(x0) poor 0 3.49999 middle 0 4.99999 upper-middle 0 5.99999 super-rich 0 200 (a) Expected Value x0 P(entrepreneur) v(x0) poor 0.868488 11.5553 middle 0.861585 13.0742 upper-middle 0.881884 14.1858 super-rich 0 200 (b) CPT Expected Value Table 2.3.4: An optimal solution for Ex. 6 (a value function and an optimal policy) . Table 2.3.4 is similar to Table 2.3.3 in the finite-horizon case in the sense that they both produce randomized polices. This example suggests that a middle-class person should be the least likely to pursue entrepreneurship. On the other hand, an upper-middle class person is mostly likely to start his/her own business. However, the difference in the probability of entrepreneurship for the states poor, middle and upper-middle is very small, which suggests in the long run we should all be entrepreneurial regardless of our current state unless one is already super-rich. Of course, in practice we need to calibrate the underlying Markov model with empirical data. In the next section, we will examine the suitability of the dynamic program- ming method for the transient Markov control model case. 54 2.3.3 Transient Markov Control Model In this section, we prove the optimality of the dynamic programming equation for transient Markov control models. Our main technical references are [18], [55] and [49]. Transient Markov control models require further specification in addition to the definitions provided in Section 2.2.1. Before we introduce the definition of a transient Markov control model, we need to define a few notations first. Given a norm weight function w : X ? [1,?), the w weighted-norm is denoted by ???w . It is calculated for a substochastic kernel A as: ?A?w = sup X 1 w (x) ? X w (y)A (dy|x) . Similarly, a w-norm can be defined for a measurable function v : X ? R as: ?v?w = sup x?X |v (x)| w (x) . It is the standard operator norm in the space Bw (X,B (X)), of measurable functions v such that ?v?w < ?. The reader can refer to [48] for a more complete discussion on weighted norms. At this point, the reader may be confused by the three functions w, w+, and w?. The first function is used in defining a weighted norm, and the latter two functions are used in CPT-based measures. The function used should be clear from the context. Assumption 2. The function w ? V (i.e., the integrable function space) is fixed 55 with respect to the given Markov control model such that ?Q??w < ?, ?? ? ?. Furthermore, the per-stage reward function r(?, ?, ?) is measurable, w-bounded, i.e., there is a constant r? ? 0 such that sup A(x) |r (x, a, x?)| ? r?w (x?) , ?x, x? ? X, and r : a ? r(?, a, ?) is upper semi-continuous in a. The assumption above is assumed to hold throughout this section. A transient Markov model has some absorbing state xA ? X, such that Q ({xA} |xA, a) = 1 and r (xA, a, xA) = 0 for all a ? A (x). In other words, once an absorbing state is reached, no further rewards will be given. In addition, a transient Markov model reaches its absorbing state in finite amount of time, i.e., sup ?,X E [? pi0 |x] < ?, where ? pi0 := inf {t ? |xpit = xA} . Without loss of generality, we assume the model only has one absorbing state, because the case of multiple absorbing states can be easily reduced to the single absorbing state case. We introduce some additional notations for clarity. We denote the effective state space by ?X = X \ {xA}, and the effective controlled substochastic kernel by ?Q. The substochastic kernel ?Q restricts its arguments to only allow the 56 effective states (i.e., ?Q (B|x, a) = Q (B|x, a) , ?B ? B ( ?X ) , ?x ? ?X, ?a ? A (x)). We introduce the definition of a transient Markov control model below. Definition 17. A randomized Markov policy pi ? ?RM is transient with respect to a Markov control model, if there exists a constant k and a weight function w : X ? [1,?) such that ????? ?? t=0 ?Qtpi ????? w ? k, (2.3.12) where Qtpi := Q0Q1 . . . Qt?1 and Q0pi (?|x) := ?x (?) . If the inequality above is uniform for all Markov policies, then the model is called uniformly transient (i.e., Eq. 2.3.12 is true for all pi ? ?RM ). Since we are working with stationary transition probabilities (i.e., Q1 = Q2), Eq. 2.3.12 implies that Q ( ?X|x, a ) ? 1 for all x ? X and a ? A (x). Eq. 2.3.12 can also be written as: ????? ?? t=0 ?Qtpi ????? w = sup X w (x)?1 ?? t=0 ? X w (y) ?Qtpi (dy|x) = sup X w (x)?1 ?? t=0 E [w (xpit ) |x] . Hence, we can infer from Eq. 2.3.12 that E [w (xpit ) |x] ? 0 as t ? ?. Eq. 2.3.12 is also known as the Pliska condition [69]. One major contribution of ?avu? and Ruszczy?ski [18] is to suggest a generalized version of the Pliska condition for co- herent risk measures (i.e., convexity). Since we are interested in non-convex perfor- mance measures, we take a different approach to prove the optimality of dynamic programming equations. 57 We are interested in solving a more general version of the standard expected total reward problem, which searches for an optimal policy that maximizes the following expected value: E [ ?? t=0 r (xt, at, xt+1) ] . We know from [48] that we can apply dynamic programming to this problem and obtain an optimal stationary deterministic policy. In this section, we would like to explore the risk-sensitive version of this problem, especially when the conditional reward function ? is not convex. Our goal is to prove that dynamic programming can still be applied to the non-convex risk-sensitive version of the expected total reward problem. We are interested in finding the maximum of a total reward function of the form: V (x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ? (r(xpi1 , api1 , xpi2 ) +? (r(xpi2 , api2 , xpi3 ) + ? ? ? + ? (r(xpi?, api?, xpi?)) ? ? ? ))) . The corresponding optimization problem can be written as: V ? (x0) := max pi?? V (x0, pi). (2.3.13) Without loss of generality, we restrict ourselves to randomized Markov policies (see 58 [49], Theorem 9.4.5), i.e., V ? (x0) = max pi??RM V (x0, pi) . (2.3.14) To solve Eq. 2.3.14, we start by finding an optimal solution for the simpler case of randomized stationary policies, i.e., V ? (x0) := max pi??RS V (x0, pi). (2.3.15) Later, the sufficiency of randomized stationary Markov polices is proven, i.e., the left-hand sides of Equ 2.3.14 and Eq. 2.3.15 are equivalent. We denote the reward- to-go function at time t by: Rt(x0, pi) = ?(r(xpit , apit , xpit+1) + ?(r(xpit+1, apit+1, xpit+2) +? ( r(xpit+2, apit+2, xpit+3) + ? ? ? + ? (r(xpi?, api?, xpi?)) ? ? ? ))) , and the t-stage total reward function by: Jt(x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ? (r(xpi1 , api1 , xpi2 ) +? (r(xpi2 , api2 , xpi3 ) + ? ? ? + ? ( r(xpit , apit , xpit+1) ) ? ? ? ))) . 59 We denote the optimal t-stage-reward value function by J?t (x) := max pi?? Jt (x, pi) , and define the operator T (?) on Bw (X,B (X)) as T (?) v (x) := ? (rx + v, x, ? (x) ?Qx) . In addition, the dynamic programming operator T is denoted by Tv := max ??P(A(x)) ? (rx + v, x, ? ?Qx) . Since ? is monotone, the operator T (?) is also monotone, i.e., v ? v? =? T (?) v ? T (?) v?, ?v, v? ? Bw (X,B (X)) , ? ? ?. Given a Markov policy pi = {?t} ? ? and T (pi)0 = I, then for k = 1, 2, . . . , the iterated operator T k (pi) on Bw (X,B (X)) is defined by T k (pi) := T (?0)T (?1) ? ? ?T (?k?1) . We denote the total-reward function with respect to a policy pi as T ? ? by V (x0, pi) = lim T?? JT (x0, pi) ?x ? ?X. 60 For the rest of this section, we make the following assumption. Assumption 3. The following conditions hold. 1. There exists a k ? 1 such that T k (pi) is a contraction mapping for all transient stationary polices pi; i.e., ?? < 1 s.t. ??T k (pi) v ? T k (pi) v??? w ? ? ?v ? v??w ?v1, v2 ? Bw (X,B (X)) , ? transient pi ? ?RS; 2. The reward transition mapping ? : ? ? ? (?, x,m) is continuous; 3. V ? ? Bw (X,B (X)) ; 4. The Markov control model is uniformly transient. Condition 4 in Assumption 3 can be relaxed (see [55]) at the expense of addi- tional assumptions. From condition 3 in Assumption 3, we can trivially deduce the following lemma. Lemma 3. If Assumption 3(3) holds, then Jt (pi, x) and J?t (x) are both w-bounded for all x ? X, pi ? ?, t = 1, 2, . . . . Proof. Proof by contradiction: If Jt (pi, x) and J?t are not w-bounded, then V ? /? Bw (X,B (X)). This contradicts Assumption 3(3). The lemma above justifies for writing Jt (pi, ?) and J?t (?) as arguments of ? (?) . Assumption 3(1) ensures the convergence of the operator T k , which is stated in the following lemma. 61 Lemma 4. For any transient stationary policy pi ? ?RS, if Assumption 3(1) holds, then for any v ? Bw (X,B (X)) lim k?? T k (pi) v (x) = V (x, pi) = lim k?? Jk (x, pi) ?x ? X. Proof. Using Assumption 3(1) and the Banach fixed-point theorem, noting the fact that V (x, pi) = lim k?? T k (pi) 0 < ?, the proof follows. The following theorem proves the optimality criteria for Eq. 2.3.15. Theorem 4. Let Assumptions 2 and 3 hold. For a transient Markov controlled model, a Markov reward transition mapping ? (?, ?, ?), and a randomized stationary Markov policy pi = {?, ?, . . . }, a bounded measurable function v : ?X ? R (i.e., ?v?w < ?) satisfies the equation v (x) = ? (rx + v, x, ? (x) ?Qx) , x ? ?X v (xA) = 0, (2.3.16) if and only if v (x) = V (pi, x) for all x ? X. Proof. Let v (?) be a bounded measurable solution of Eq. 2.3.16. Since ?v?w < ? and w ? Bw (X,B (X)), we know that v ? Bw (X,B (X)). By assumption, r (?, ?, ?) is w-bounded, and thus rx ? Bw (X,B (X)). Consequently, the right-hand side of Eq. 62 2.3.16 is well defined and can be iterated, which results in the following equation v(x0) = ? (r(xpi0 , api0 , xpi1 ) + ? (r(xpi1 , api1 , xpi2 ) +? (r(xpi2 , api2 , xpi3 ) + ? ? ? + v (xT+1) ? ? ? ))) ?x0 ? ?X. Since v (?) is a w-bounded function, we conclude by evoking Lemma 4 and taking k to infinity that v (x) = lim k?? T k (pi) v (x) = V (pi, x) ?x ? ?X v (xA) = 0 = V (pi, xA) . The converse is proved by writing down the equation: JT (pi, x0) = ? (r (x0, ? (x0) , x1) + JT?1 (pi, x0)) . Taking the limit as T ? ? on both sides, we arrive at the following equation: lim T?? JT (pi, x0) = lim T?? ? (r (x0, ? (x0) , x1) + JT?1 (pi, x0)) . Since ? (?) is continuous by assumption, we conclude that lim T?? JT (pi, x0) = ? ( r (x0, ? (x0) , x1) + lim T?? JT?1 (pi, x0) ) . 63 Using the fact that lim T?? JT (pi, x0) = V (pi, x0) = v (x0) , ?x0 ? ?X as T ? ?, we rewrite the previous equation as: v (x0) = ? (r (x0, ? (x0) , xpi1 ) + v (xpi1 )) = ? (rx0 + v, x0, ? (x0) ?Qx0) , ?x0 ? ?X, which is the same as Eq. 2.3.16. Furthermore, V (pi, xA) = 0 = v (xA) by definition. Theorem 5. Assume the following conditions hold for a uniformly transient Markov control model: 1) The stochastic kernels Qx : a ? Q(?|x, a) are continuous ?x ? X ; 2) The one-step dynamic reward measure ? is Markov (see Definition 15), and a sequence of corresponding reward transition mappings ? : m ? ?(?, x,m) is upper semi-continuous; 3) The assumptions in Theorem 4 are satisfied; 4) For every x ? X the set A(x) is compact; Then a maximizer for the dynamic programing equation v(x) = max ??P(A(x)) ? (r(x, ?, ?) + v(?), x, ? ?Qx) ?x ? ?X v (xA) = 0, (2.3.17) 64 exists. Furthermore, an optimal randomized stationary policy, pi? := {??, ??, ? ? ? } exists and each ?? is a maximizer for the right-hand side of Eq. 2.3.17; In addition, a bounded measurable function v, (i.e., ?v?w < ?) is a solution of Eq. 2.3.17 if and only if it equals V ? in Eq. 2.3.15, i.e., v (x) = V ? (x) , ?x ? X. Proof. Since the set of all policy sequences of the form pi = {?, pi, pi, . . . } contains ?RM , we write down the inequality V ? (x0) ? sup ??P(A(x0)),pi??RM ? (r (x0, a0, x1) + V (pi, x1)) , where V ? is defined by Eq. 2.3.14. Because ? is monotone, we move the supremum operator inside: V ? (x0) ? sup ??P(A(x0)) ? ( r (x0, a0x1) + sup pi??RM V (pi, x1) ) ? sup ??P(A(x0)) ? ( r (x0, a0, x1) + V ? (x1)). By Assumption 3(3), i.e., ??V ? ?? w < ?, the right-hand side is well defined. Thus V ? satisfies the inequality V ? (x) ? sup ??P(A(x)) ? ( rx + V ?, x, ? ?Qx ) , x ? X. (2.3.18) Since the existence of a solution for Eq. 2.3.17 is assured by the semi-continuity of the mapping ? : ? ? ? (rx + v, x, ? ?Qx) and the weak compactness of the set 65 P (A (x)), we conclude that V ? (x) ? ?(rx + V ?, x, ?? (x) ?Qx), x ? X. Here, ?? is a solution to the optimization problem represented by the right-hand side of Eq. 2.3.18. By iterating the inequality above, appealing to the monotonicity property of ?, and applying the policy pi? = {??, ??, . . . } , we obtain the fact that V ? (x0) ? V (pi?, x0) . Since by assumption V ? (?) is the optimal value function, V ? (?) ? V (pi?, ?), which along with the previous inequality, imply V ? (?) = V (pi?, ?) . Using Theorem 4, we conclude V ? (?) satisfies the dynamic programming equation. To prove the converse, we first suppose v (?) satisfies Eq. 2.3.17, and ?v?w < ?. Since the mapping ? : ? ? ? (rx + v, x, ? ?Qx) is continuous and the set P (A (x)) is weakly compact, an optimal control function, ??, exists. Furthermore, ?? is the maximizer for the right-hand size of the dynamic programing equation. This enables us to write v (x) = ? ( rx + v, x, ?? (x) ?Qx ) , x ? X. (2.3.19) Using Theorem 4, we conclude that v (x) = V (pi?, x) ? V ? (x) , x ? X, (2.3.20) 66 where pi? = { ??, ??, . . . } . On the other hand, it follows from Eq. 2.3.17 that the control function ?? satisfies v (x) ? ? ( rx + v, x, ?? (x) ?Qx ) , x ? ?X. Using the monotonicity property of ?, we iterate the above inequality and arrive at v (x) ? ?0,T (0, Z1, . . . , ZT + v (xT )) , where Zt is the reward sequence resulting from applying the policy pi?. By taking T ? ? for the equation above, we conclude that v (x) ? V (pi?, x) = V ? (x) , x ? ?X. The last inequality, together with Eq. 2.3.20 and the fact that v (?) = V ? (?) im- ply the stationary policy ?? that satisfies Eq. 2.3.19 is optimal, i.e., V (pi?, x) = V (pi?, x) = V ? (x) , ?x ? X. In addition, we know v (xA) = V ? (xA) = 0 from the definition of transition Markov model. In the theorem above, we provide the optimality criteria for the case of ran- domized stationary Markov policies. Next, we prove the sufficiency of randomized stationary Markov policies as optimal policies. Theorem 6. Assume the assumptions in the Theorem 5 are satisfied. Then a w- bounded measurable function v : X ? R, with ?v?w < ?, satisfies Eq. 2.3.17 if and 67 only if v (x) = V ? (x) for all x ? X. Moreover, a maximizer ?? exists for Eq. 2.3.17 and defines an optimal randomized stationary Markov policy pi? = {??, ??, ??, . . . } . Proof. We denote a Markov policy by pi1 = {?1, ?2, . . . } . Given the monotonicity and continuity of ? (?), we have V ? (x0) = sup ?0,?1,... lim sup T?? ? ( r (x0, a0, x1) + JT?1(pi1, x1)) ? sup ?0,?1,... lim sup T?? ? ( r (x0, a0, x1) + sup ??T?1 J? ( pi1, x1 )) = sup ?0,?1,... lim T?? ? ( r (x0, a0, x1) + sup ??T?1 J? ( pi1, x1 )) = sup ?0,?1,... ? ( r (x0, a0, x1) + lim sup T?? JT?1 ( pi1, x1 )) = sup ?0,?1,... ? ( r (x0, a0, x1) + V (pi1, x1)). By appealing to the monotonicity property of ? (?), we can move the supremum inside the argument? V ? (x0) ? sup ?0 ? ( r (x0, a0, x1) + sup ?1 V ( pi1, x1 )) = sup ?1 ? (r (x0, a0, x1) + V ? (x2)) . Thus V ? (?) satisfies the inequality V ? (x) ? sup ??P(A(x)) ? (rx + V ?, x, ? ?Qx) , x ? X. (2.3.21) Appealing to the monotonicity property of ?, iterating the above inequality and 68 letting ?? be the maximizer from the equation above we conclude that V ? (x) ? V (pi?, x) , x ? X, where pi? = {??, ??, . . . } is a stationary Markov policy that maximizes Eq. 2.3.21. Therefore, optimization with respect to stationary Markov policies is sufficient, and the result follows from Theorem 5. We need to ensure the first three conditions in Assumption 3 are satisfied by all CPT-inspired reward measures, which have the form: ?t(rx + v, x, ? (x) ?Qx) = ? ? 0 w+ (? (x) ?Qx((rx + v)+ > s))ds ? ? ? 0 w? (? (x) ?Qx((rx + v) ? > s )) ds. (2.3.22) With this specific form, we can write down the operator T with respect to a transient stationary policy pi = {?, ?, . . . } as: T (pi) v (x) : = ? ? 0 w+ ( ? (x) ? ?Qx((rx + v)+ > s))ds ? ? ? 0 w? ( ? (x) ? ?Qx((rx + v) ? > s )) ds = ? ? 0 (rx + v)+ d (( ? (x) ? ?Qx )w+,rx+v) ? ? ? 0 (rx + v) ? d (( ? (x) ? ?Qx )w?,rx+v) . The second equality is due to the fact that the transformed measures ( ? (x) ? ?Qx )w+,rx+v 69 and ( ? (x) ? ?Qx )w?,rx+v are absolutely continuous with respect to ? (x) ? ?Qx; hence a Radon-Nikodym derivative exists. The lemma below will be used in Theorem 7. Lemma 5. If a uniformly transient Markov model is given, i.e., lim k?? ???? ?( pik (x) ? ?Qx )k???? w ? ?1, ?pi = {pi1, pi2, . . . } ? ?RM , then there exists a ?k > 0 such that ????? (( pik (x) ? ?Qx )w?,v)k????? w < 1, ?k ? ?k, ?pi = {pi1, pi2, . . . } ? ?RM , v ? V . Proof. Note that if lim k?? ???? ?( pik (x) ? ?Qx )k???? w is finite, then there exists a ?k such that for all k ? ?k ????? (( pik (x) ? ?Qx )w?,v)k????? w < 1. Since w? (p) equal to 1 if and only if p = 1, the assertion follows. A similar statement (Lemma 5) can be made about ????? (( pik (x) ? ?Qx )w+,v)k????? w . Theorem 7. Suppose the Markov control model is uniformly transient, and the following assumptions are satisfied: 70 1) w+ and w? are continuous non-decreasing functions; 2) There is a constant r? such that sup A(x) |r (x, a, x?)| ? r?w (x?) , ?x, x? ? X; Then Eq. 2.3.22 satisfies Assumption 3. Proof. 1) Letting pi = {?, ?, . . . }, we know the following ???T (pi)k v1 ? T (pi)k v2 ??? w ? ????? ? ? ?? (v1 ? v2) d ((( ? (x) ? ?Qx )w+,v1)k + (( ? (x) ? ?Qx )w?,v1)k)????? w +k ?v1 ? v2?w + ?k ?v1 ? v2?w ? (?????((? (x) ? ?Qx)w+,v1)k????? w + ????? (( ? (x) ? ?Qx )w?,v1)k????? w + k + ?k ) ? ? ? ?v1 ? v2?w , (2.3.23) where (?????((? (x) ? ?Qx)w+,v1)k????? w + ????? (( ? (x) ? ?Qx )w?,v1)k????? w + k + ?k ) can be chosen to be less than 1. The appropriate k is found by appealing to Lemma 5. Since the Markov control model is uniformly transient (i.e., the probability weight on the non-absorbing states decreases as k increases), k and ?k can be made ar- bitrarily small as k ? ?. Here, ?k captures the difference between the distorted measures, of the the previous finite number of per-stage rewards, induced by v1 and v2. As k increases, the measure distortions induced by v1 and v2 on the initial steps disappear. The first inequality in Eq. 2.3.23 is due to the fact that the trans- formed measures (( ? (x) ? ?Qx )w+,v1)k and ((? (x) ? ?Qx)w?,v1)k are absolutely continuous with respect to ( ? (x) ? ?Qx )k ; hence a Radon-Nikodym derivative ex- 71 ists. Furthermore, we used the fact that ? f1 (x1) dP (x1) ? ? f2 (x1) dQ (x1) = ? (f1 ? f2) (x1)dP (x1) + ? f2(x1) (dP (x1) ? dQ (x1)) , where f1 and f2 are two w-bounded measurable functions, and P and Q are two ?-finite measures. In Eq. 2.3.23, k represents the difference between the v1 and v2 distorted measures (i.e., k = ?v2?w?v1?v2?w ?P ?Q?w, where P is distorted by v1 and Q is distorted by v2). In other words, k captures the difference in the distorted measures at the k-th stage induced by v1 and v2. 2) To prove condition 2, we appeal to the continuity property of w+ and w?, which implies the following inequalities: ??t(z1, x,m) ? ?t(z2, x,m)?w = ???? ? ? 0 w+ ( m ((z1)+ > s))? w+(m((z2)+ > s))ds ? (? ? 0 w? ( m ((z1) ? > s )) ? w? ( m ((z2) ? > s )) ds )???? w ?  2 +  2 = , where for brevity the measure ? (x) ? ?Qx is denoted by m. From the continuity property of the functions w+ and w?, we know that there 72 exist ?1 and ?2 such that ??m ((z1)+ > s)?m((z2)+ > s)??w ? ?1 =? ??w+ ( m ((z1)+ > s))? w+(m((z2)+ > s))??w ? 2 and ??m ((z1) ? > s ) ?m ((z2) ? > s )?? w ? ?2 =? ??w+ ( m ((z1) ? > s )) ? w+ ( m ((z2) ? > s ))?? w ?  2 . Since ?z1 ? z2?w ? 0, implies ??m ((z1)+ > s)?m((z2)+ > s)??w ? 0 and ??m ((z1) ? > s ) ?m ((z2) ? > s )?? w ? 0, we conclude that there exists a ?3 such that ?z1 ? z2?w < ?3 ??m ((z1)+ > s)?m((z2)+ > s)??w ? ?1 ??m ((z1) ? > s ) ?m ((z2) ? > s )?? w ? ?2. 73 3) Now, we prove condition 3 of Assumption 3. Since sup A(x) |r (x, a, x?)| ? r?w (x?) , ?x, x? ? X, and the Markov control model is assumed to be uniformly transient, we can show, by induction, that V (pi, x) is also w-bounded for any transient policy pi = {?, ?, . . . } ? ?RM . We start by writing down the one-stage reward function: ? ( |rx| , x, ? ? ?Qx ) = ? ? 0 |rx| d (( ? ? ?Qx )w+,|rx|) ? r? ? ? 0 wd (( ? ? ?Qx )w+,|rx|) . The two-stage reward function is written as ? ( |rx| + ? ( |rx1 | , x1, ? ? ?Qx1 ) , x, ? ? ?Qx ) ? r? (? ? 0 wd (( ? ? ?Qx )w+,|rx|) + ? ? 0 wd (( ? ? ?Qx )w+,|rx|+|rx|)2) . By iterating the inequality above and appealing to Lemma 5, we arrive at the conclusion that V (pi, ?) ? r?kw (?) , where the k is found in Eq. 2.3.12. In the next section, we present a numerical example to explore the structure of optimal policies for the transient Markov case. 74 2.3.4 The Organ Transplant Example: A Comparative Analysis We will compare numerically the type of polices obtained from CPT-based measures against some of the other risk-sensitive approaches. Example 7. The following example is from [18], which is a simplified version of the organ transplant problem discussed in [1]. The problem considers the discrete- time absorbing Markov chain depicted in Fig. 2.3.1a. The initial state S (i.e., sick) represents a patient waiting on an organ transplant due to sickness. The state L (i.e., live) represents the state where the patient lives after a successful transplant. The state D, an absorbing state, represents death. There are two possible actions to take in state S: 1) one can wait (W), in which case the next state could either be D or S probabilistically; 2) one can choose to transplant (T), which concentrates the transition probability on states L and D (i.e., states L and D are the only two possible next states). The probability of death is lower for W than for T, but a successful transplant may result in a longer life. In other two states, only the action continue is allowed. The reward collected at each time step is months of life. In state S, a reward equal to 1 is collected if the control is W; otherwise, the immediate reward is 0. In state L, the reward r(L) is collected representing the certainty equivalent of the random length of life after the transplant. In state D the reward is 0. The states where there is only one possible action allowed have a deterministic reward function (i.e., L and D). In particular, the equivalent length of life at the state L is r (L) . However, this value is generated by taking on certain assumptions, which are the focus of the following discussion. The state L is in fact an aggregation 75 (a) Organ Transplant Transitions (b) The Survival Model Figure 2.3.1: Organ Transplant State Transitions & Rewards Optimum 20 40 60 80 100 p wait - 400 - 350 - 300 - 250 Value (a) Semideviation [18] Optimum 20 40 60 80 100 p wait - 150 - 100 - 50 50 100 Value (b) CPT-based Figure 2.3.2: Optimal Policy Comparison of the Organ Transplant Example of n states in a survival model representing months of life after the transplant, as depicted in Fig. 2.3.1b. At the state i, i = 1, . . . , n, the patient dies with probability pi and survives with probability 1 ? pi. The patient will die for sure in the state n (i.e., pn = 1). The reward collected at each state i is equal to 1. In ?avu? and Ruszczy?ski [18], the problem is stated as a minimization problem. However, we desire a maximization problem, thus we compare our results to that of ?avu? and Ruszczy?ski?s [18] by negating the rewards. In [18], r (L) is calculated from the survival model using the transition mapping 76 of the form: ? (?, i,m) = Em [?] ? ?? ? expected value +?E [(?? Em [?])+] ? ?? ? semideviation . (2.3.24) In Eq. 2.3.24, the measure m is the transition kernel at the current state i, and the function ? (?) is the reward collected at the current state and action plus the value function at the next state (i.e., cost-to-go). At each state i = 1, . . . , n?1, two transitions are possible: 1) transition to the state D with probability pi and ? = ?1; 2) transition to the state i + 1 with probability 1 ? pi and ? = ?1 + vi+1 (i+ 1) . At the state i = n, the transition to D occurs with probability 1, and ? = ?1. Therefore, vn (n) = ?1. The survival problem is now a finite-horizon problem, which can be expressed as in Eq. 2.3.1. Since there is only one action allowed, the minimization operation is eliminated. The equation has the form: vi (i) = ? (?, i, Qi) , i = 1, . . . , n? 1, with ? and Qi being the reward and the transition probability respectively. The values of ? and Qi are explained in the previous paragraph. By induction, vi (i) ? 0, for i = n? 1, n? 2, . . . , 1. The mean and semideviation components of Eq. 2.3.24 at the states i = 77 1, . . . , n? 1 can be calculated as follows: EQi [?] = ?pi + (1 ? pi) (?1 + vi+1 (i+ 1)) = ?1 + (1 ? pi) vi+1 (i+ 1) , EQi [(?? EQi [?])+]= EQi [(?+ 1 ? (1 ? pi) vi+1 (i+ 1))] = pi (?1 + 1 ? (1 ? pi) vi+1 (i+ 1))+ + (1 ? pi) (?1 + vi+1 (i+ 1) + 1 ? (1 ? pi) vi+1 (i+ 1))+ = pi (? (1 ? pi) vi+1 (i+ 1))+ + (1 ? pi) (pivi+1 (i+ 1))+ = ?pi (1 ? pi) vi+1 (i+ 1) , where the last equality in the equation above is implied by the fact that vi+1 (i+ 1) ? 0. For i = 1, . . . , n? 1, the dynamic programming equation for the optimization problem stated in Eq. 2.3.1 takes the form: vi (i) = ?1 + (1 ? pi) vi+1 (i+ 1) ? ?? ? expected value ?? pi (1 ? pi) vi+1 (i+ 1) ? ?? ? semideviation , i = n? 1, n? 2, . . . , 1. The value v (1) is the negative of the risk-adjusted length of life with the new transplanted organ. For ? = 0, the above formula gives the negative of the expected length of life with the new organ. In the calculations below, we use the transition data from Table 2.3.5. They have been chosen for illustrative purposes only, and do not correspond to any real-life medical data. For the survival model, the distribution function, F (x), of lifetime of the American population is suggested by Jasiulewicz 78 Action S L D W 0.99882 0 0.00118 T 0 0.90782 0.09218 Table 2.3.5: Transition Probabilities From State S Distribution Parameters Weights Weibull ? = 0.297, ? = 0.225 w1 = 0.0170 Lognormal m = 3.11, ? = 0.218 w2 = 0.0092 Gompertz b = 0.0000812, ? = 0.0844 w3 = 0.9737 Table 2.3.6: Organ transplant example: parameters for F (x). [58]. It is a mixture of Weibull, lognormal, and Gompertz distribution: F (x) := w1 ( 1 ? e?(x? )? ) + w2? (logx?m ? ) + w3 ( 1 ? e? b a (e?x?1) ) , x ? 0. The values of the parameters and weights, provided by Jasiulewicz [58], are given in Table 2.3.6. Using the information provided above, the probability of dying in the k-th month can be calculated using the equation: pk = F (k 12 + 1 24 ) ? F (k 12 ? 1 24 ) 1 ? F (k 12 ? 1 24 ) , k = 1, 2, . . . . The maximum lifetime of the patient is assumed to be 1200 months, and the post-transplant survival probabilities for the patient starts from k = 300. Hence, a total of 900 steps, n=900, is used in the survival model to calculate r (L) . If we let ? = (?W , ?T ) be a randomized policy in the state S and let ? = {? ? R2 : ?W + ?T = 1, ? ? 0}, then the dynamic programming equation at the 79 state S has the form: v (S) = min ??? {?W [qS,S (W ) (v (S) ? 1) + qS,D (W ) (v (D) ? 1)] +?T [qS,L (T ) v (L) + qS,D (T ) v (D)] +? ( ?W [ qS,S (W ) (v (S) ? 1 ? ?)+ + qS,D (W ) (v (D) ? 1 ? ?)+ ] +?T [ qS,L (T ) (v (L) ? ?)+ + qS,D (T ) (v (D) ? ?)+ ])} . Here, ? = 1. If ? is held fixed in the equation above, then we can solve for v (S) . By varying ? ? (0, 1), we obtain Fig. 2.3.2a. We can compute the value function of the CPT-based reward measure as follows: v (S) = max ??? ? ? 0 w+ ( ?W ( qS,S (W ) 1{(v (S) + 1 ? ?)+ > s} +qS,D (W ) 1{(v (D) + 1 ? ?)+ > s}) +?T ( qS,L (T ) 1{(v (L) ? ?)+ > s} +qS,D (T ) 1{(v (D) ? ?)+ > s}))ds ? ? ? 0 w ? ( ?W ( qS,S (W ) 1{(v (S) + 1 ? ?) ? > s } +qS,D (W ) 1{(v (D) + 1 ? ?) ? > s }) +?T ( qS,L (T ) 1{(v (L) ? ?) ? > s } +qS,D (T ) 1{(v (D) ? ?) ? > s })) ds. In the equation above, w+ (?) = w? (?) = exp (?0.5 (? ln (?))) , and ? is the expected value without probability weighting. The numerical results of the three 80 Method r(L) Optimum Value Optimal ?W Expected Value 610.46 846.611 1.000000 Semideviation 515.33 426.139 0.987236 CPT 702.32 104.438 0.868232 Table 2.3.7: Organ Transplant Optimal Value and Policy Comparison solution methods (i.e., expected value, semideviation, and CPT) 9 are listed in Table 2.3.7. Furthermore, the value functions of the semideviation and the CPT performance measures are plotted in Fig. 2.3.2. We calculated r (L) for the CPT method using the following equation: ? (?, i,m) = ? ? 0 w+(m (?+ > s))ds. As is evident from Table 2.3.7, the CPT performance measure produces a more randomized optimal policy than the other two approaches. The ?W value of 0.99 is very close to the deterministic policy of W (i.e., to wait). In fact, obtaining a very randomized policy is difficult using semideviation. The ease with which the CPT performance measure is able to obtain an optimal randomized policy can be ex- plained by the fact that the probability weighting function is applied to the control. Intuitively, the need for randomized policies stems from the nonlinear transforma- tion of the uncertainty in the system, which renders deterministic optimal policies insufficient. 9In the table, some values is negated to be positive for the purpose of comparison. 81 2.4 Reward Measures and Optimal Policies From the previous sections, we have learned that we can solve finite-horizon non- convex optimization problems with reward functions of the form: Vt(x0, pi) = ?(rt(xpit , apit , xpit+1) + ?(rt+1(xpit+1, apit+1, xpit+2) + ? ( rt+2(xpit+2, apit+2, xpit+3) + ? ? ? + ? ( rT?1(xpiT?1, apiT?1, xpiT ) + rT (xpiT ) ) ? ? ? ))) , where Vt is known as the reward-to-go function. The equation is analogous to the expanded form of the standard expected value measure: ?Vt (xt, pi) = E[rt(xt, apit , xpit+1)+ E[rt+1(xpit+1, apit+1, xpit+2) + E [ rt+2 ( xpit+2, a pi t+2, x pi t+3 ) + ? ? ? + E [ rT?1 ( xpiT?1, a pi T?1, x pi T ) + rT (xpiT ) |xpiT?1 ] ? ? ? |xpit+2 ]|xpit+1]xt]. One of the advantages of the standard expected value measure is that it can be written more compactly as ?Vt (x0, pi) = E [ T?1? i=t ri ( xpii , a pi i , x pi i+1 ) + rT (xpiT ) ] . We would like to write down a similar simplified counterpart in the CPT Markov conditional reward measure case for Vt. In other words, we would like to 82 study the reward measures: Vt (xt, pi) = ? ? 0 wt (pit (xt) ?Qxt,t (rt (xt, apit , xt+1) + Vt+1 (xt+1, pi)) > s) dst, where VT (xT , pi) = rT (xT ) . (2.4.1) For simplicity, we only consider bounded non-negative rewards (i.e., rt ? 0 and rt ? M, M > 0 ?t ? 0). The inclusion of negative rewards is a straightforward exercise. We can see that Eq. 2.4.1 is a complicated sequence of nested integrals. We would like to simplify this expression by introducing some new notations. We note that the transformed measure on the measurable space (X ? A,B (X ? A)) Pw,? (? > s) := w (P (? > s)) (2.4.2) is absolutely continuous with respect to P (? > s) . By the Radon-Nikodym theorem, there exists a measurable function such that Pw,? (B) := ? B dPw,? dP dP, where dPw,?dP is a Radon-Nikodym derivative. Using Radon-Nikodym derivatives of the form dPw,?dP , we can rewrite Eq.2.4.1 83 as Vt(xt, pi) := ? X?A rt (xt, apit , xt+1) + ? X?A rt+1 ( xpit+1, a pi t+1, x pi t+1 ) ? ? ? d ( pi (xt+1, t+ 1) ?Qxt+1,t+1 ) wt+1,rt+1+vt+2 dpi (xt+1, t+ 1) ?Qxt+1,t+1 dpi (xt+1, t+ 1) ?Qxt+1,t+1 d (pi (xt, t) ?Qxt,t)wt,rt+vt+1 dpi (xt, t) ?Qxt,t dpi (xt, t) ?Qxt,t, where vt is recursively defined as vt (x) = ? ? 0 wt (pit (x) ?Qx,t (rt + vt+1 > s)) ds vT (x) = rT (x) . In the equation above, the mapping rt+vt+1 : X?A?X ? R is used to transform the probability measure pit (x) ?Qx,t, which is a probability measure on the probability space (X ? A,B (X ? A)). According to Eq. 2.4.2, the function used to transform the measure pit (x) ? Qx,t should be a B (X ? A)-measurable function. It is obvious that rt + vt+1 is a B (X ? A)-measurable function if x is held fixed (i.e., rt + vt+1 is treated as rt (x, ?) + v (?) in the equation above). In the sequel, whether or not x is held fixed for the reward function rt + vt+1 should be obvious from the context. Using the Radon-Nikodym derivative notation, Vt can be written more com- pactly as: 84 Vt (xt, pi) = Epit [( T?1? i=t rt ( xpii , a pi i , x pi i+1 ) + rT (xpiT ) ) T?1? i=t d (pi (xi, i) ?Qxi,i) wi,ri+vi+1 dpi (xi, i) ?Qxi,i ] vt (x) = ? ? 0 wt (pit (x) ?Qx,t (rt + vt+1 > s)) ds vT (x) = rT (x) . (2.4.3) Now, we can easily see that the difficulty of solving the optimization problem stated in Eq. 2.3.1 is due to the appearance of the value functions {vi}T?1t in the calculation of Vt. The following proposition aggregates several fundamental properties of Eq. 2.4.3. Proposition 1. The value function, Vt, in Eq. 2.4.3 has the following properties: 1) As sup x |wt (x) ? x| ? 0 ?t ? [0, T ], a solution pi?? for the standard (i.e., risk-neutral) optimization problem: max pi ?Vt (xt, pi) , also solves the optimization problem: max pi Vt (xt, pi) ; 2) If wt is such that it puts all weights on the highest possible reward value, then an 85 optimal policy for the optimization problem: max pi Vt (xt, pi) is obtained by considering only deterministic policies: pi?t (x) = arg max a?A(x) rt(x, a) + vt+1 (xt+1) . Proof. The proof for (1) is trivial since ?Vt (xt, pi) ? Vt (xt, pi) point-wise when sup x |wt (x) ? x| ? 0 ?t ? [0, T ]. The proof for (2) is also straightforward. By placing all probability weights on the highest reward value, it is always optimal to pick the deterministic action with the highest reward value. Remark 11. When there exists an action a? such that Qx,t (?x,a? > s|x, a?) ? Qx,t (?x,a > s|x, a) ?s ? [0,?), ?a ? A (x) , ?x ? X, ?t ? [0, T ], where ?x,a denotes the reward function with x and a fixed, then there exists a deter- ministic optimal policy (i.e., the policy that takes action a? for state x at time t). De- terministic optimal policies is obtained trivially when the complement CDF of an ac- tion dominates all other complement CDFs (i.e., If 1?Fa1 (x) ? 1?Fa2 (x) ?a1 6= a2, 86 then a1 is an optimal deterministic action). The next example demonstrates some of the difficulties in analyzing even the simplest probability weighting function. We prove the uniqueness of the optimal policy for some special cases afterwards. Example 8. For this example, we will use the probability weighting function: w (x) = 1 ? (1 ? x)b b > 1. For simplicity, we deal with a two-state-two-action problem. We write the probability transition matrix as Q1 := ? ??? Q [1, 1, 1] Q [1, 1, 2] Q [2, 1, 1] Q [2, 1, 2] ? ??? and similarly we denote the transition probability matrix of taking action 2 (i.e., a2) as: Q2 := ? ??? Q [1, 2, 1] Q [1, 2, 2] Q [2, 2, 1] Q [2, 2, 2] ? ??? . In other words, Q [1, 2, 1] is the probability of starting and arriving at state 1 by taking action 2. Using the reward function ? (x, a, x) := x+ a+ x, we study the optimization problem: max p1 ? ? 0 1 ? (1 ? (P (? > s|x, a1) ? p1 + P (? > s|x, a1) ? (1 ? p1)))2 ds. (2.4.4) 87 By differentiating Eq. 2.4.4 with respect to p1, we obtain ?Q[1, 2, 2]w? (?Q[1, 2, 2] (?1 + p1)) +(Q[1, 1, 2] ?Q[1, 2, 1] ?Q[1, 2, 2]) ?w? (?(Q[1, 2, 1] +Q[1, 2, 2]) (?1 + p1) +Q[1, 1, 2]p1) +3(Q[1, 1, 1] +Q[1, 1, 2] ?Q[1, 2, 1] ?Q[1, 2, 2]) ?w? (?(Q[1, 2, 1] +Q[1, 2, 2]) (?1 + p1) + (Q[1, 1, 1] +Q[1, 1, 2])p1) . Substituting w? (p) = 2 (1 ? p), we obtain an affine equation in p1: g (p1) = ? 2 (?3Q[1, 1, 1] ? 4Q[1, 1, 2] + 4Q[1, 2, 1] + 3Q[1, 1, 1]Q[1, 2, 1] + 4Q[1, 1, 2]Q[1, 2, 1] ? 4Q[1, 2, 1]2 + 5Q[1, 2, 2] + 3Q[1, 1, 1]Q[1, 2, 2] +4Q[1, 1, 2]Q[1, 2, 2] ? 8Q[1, 2, 1]Q[1, 2, 2] ? 5Q[1, 2, 2]2) ?2 ( 3Q[1, 1, 1]2 + 6Q[1, 1, 1]Q[1, 1, 2] + 4Q[1, 1, 2]2 ? 6Q[1, 1, 1]Q[1, 2, 1] ? 8Q[1, 1, 2]Q[1, 2, 1] + 4Q[1, 2, 1]2 ? 6Q[1, 1, 1]Q[1, 2, 2] ? 8Q[1, 1, 2]Q[1, 2, 2] +8Q[1, 2, 1]Q[1, 2, 2] + 5Q[1, 2, 2]2)p1. Although the equation above is affine in p1, its coefficients depend on the transition probabilities (Q1 and Q2) and the reward function ?. This dependence makes any generalized results on the structure of the optimal policies difficult. 88 We observe from the example above that any optimal solution of the problem stated in Eq. 2.4.4 must satisfy g (p1) = 0, which has a unique solution p?1. In the next theorem, we will prove that this is true in general. Theorem 8. Assume wt is a strictly concave function, then the function ? ? 0 wt (? ?Qx,t (rt + vt+1 > s)) ds is strictly concave in ?. Furthermore, there is a unique maximizer for the optimization problem: max ??P(A(x)) ? ? 0 wt (? ?Qx,t (rt + vt+1 > s)) ds. Proof. We prove the concavity of the function of interest: ? ? 0 wt ((??1 + (1 ? ?) ?2) ?Qx,t (rt + vt+1 > s)) ds = ? ? 0 wt (? Qx,t (rt + vt+1 > s) d (??1 + (1 ? ?) ?2) ) ds > ? (? ? 0 wt (?1 ?Qx,t (rt + vt+1 > s)) ) ds + (1 ? ?) (? ? 0 wt (?x ?Qx,t (rt + vt+1 > s)) ) ds. Since the right hand side of the second equation in the theorem is strictly concave, the uniqueness of maximizer follows. We are only able to prove the uniqueness of the optimal policy in the positive reward case. For the more general case of both positive and negative rewards, where 89 we have a convex-concave measure, we were unable to prove the uniqueness of the optimal policy. 2.5 Conclusion Non-convex reward measures are useful for modeling many real-life problems. More specifically, CPT reward measures are derived from experimental data and have been proven to model several key characteristics of human behaviors well. This inspired us to start building a rigorous theoretical foundation for their application to dynamic problems. Our effort has resulted in proving the applicability of dynamic programming equations for non-convex reward measures. In relation to ?avu? and Ruszczy?ski [18], we relaxed their assumptions on the performance measures to monotonicity and positive homogeneity. In the finite- horizon case, the monotonicity assumption is important in the proof for the suit- ability of the dynamic programming method for the non-convex case. One of our contributions in the finite-horizon case is the assumptions on the weighting functions such that the monotonicity assumption of the performance measures is satisfied. In the discounted infinite-horizon case, the suitability of the dynamic programming method for non-convex performance measures is proven using the monotonicity as- sumption and the fact that they are contractions. In the transient case, the proofs are more difficult and require the utilization of k-step contractions. In this chapter, we have established a rigorous mathematical foundation for using dynamic programming to solve Markov Decision Problems with CPT-based 90 reward measures. In our new framework, CPT-based reward measures, unlike the existing work where CPT-based reward measures are only applied statically or to special cases, can be applied to a larger class of problems. The development of these new MDPs is especially useful for modeling dynamic human decision making processes. Through numerical examples, we demonstrated the properties of optimal policies obtained by solving these problems. The optimal polices obtained from these problems are different from the standard polices; they are randomized rather than deterministic. This finding, perhaps not too surprising due to previous work done by Ruszczy?ski, suggests that deterministic optimal policies are insufficient for obtaining the optimal value function when humans are involved. It is our hope that these new proposed models will be utilized to model human risk-sensitive behaviors in dynamic settings. 91 Chapter 3 Cumulative Weighting Optimization Global optimization problems are relevant in many fields (e.g., control systems, operations research, economics). There are many approaches to solving these prob- lems. One particular approach is model-based methods, which are a class of random search methods. A model-based method iteratively updates its probability density function. At each step, additional weight is given to solution subspaces that are more likely to yield an optimal objective value. Model-based methods can be an- alyzed by writing down a corresponding system of differential equations similar to the well known Fokker-Planck equation, which models the evolution of probability density functions for diffusions. We propose an innovative model-based method, Cumulative Weighting Optimization (CWO), which can be proven to converge to an optimal solution. Using this rigorous theoretical foundation, we design a class of CWO-based numerical algorithms for solving global optimization problems. In- terestingly, the well-known cross-entropy method is a special case of CWO-based numerical algorithms. 3.1 Introduction Many problems in engineering and science can be formulated as global optimiza- tion problems. These problems are challenging when their objective functions are 92 nonlinear (e.g., non-convex, multi-modal, or badly scaled). If we are only interested in finding their local extrema and they are differentiable, then the standard local optimization method (i.e., first derivative being zero) would suffice. If there are only a few local extrema, then we can easily find a global optimal solution by eval- uating all of them. However, this approach does not work on objective functions with absence of structural information (e.g., non-differentiable), or in the presence of many local extrema. Approaches developed to solve these problems can be di- vided into two categories: deterministic and random search algorithms. Random search algorithms can be further divided into instance-based (e.g., simulated anneal- ing, genetic algorithm, tabu search, nested partitions, generalized hill climbing, and evolutionary programming) and model-based algorithms (e.g., annealing-adaptive search, cross-entropy (CE), model reference adaptive search (MRAS), and estima- tion of distribution algorithms (EDAs)). A more recent addition to model-based algorithms is model-based evolutionary optimization [88]. For the interested reader, Hu et al. have a recent survey paper on model-based methods [54], which also contains references to instance-based methods mentioned in this paragraph We propose a new addition, inspired by Cumulative Prospect Theory (CPT), to the class of model-based methods. The new CWO-based algorithms have an intuitive connection with the risk-sensitive nature of the human decision making process. In the rest of this chapter, we will proceed in the following sequence. In Section 3.2, we present the problem statement. In Section 3.3, we introduce the concept of 93 probability weighting functions. In Section 3.4, we will work with the case when X = {1, 2, 3, 4, . . . } and provide the reader some insight into the construction of our probability updating equation. Later in the same section, we will prove the convergence properties for the equation. In Section 3.5, we will generalize the results of Section 3.4 to Polish spaces. Finally, in Section 3.6, we will outline our numerical algorithms and present some simulation results. 3.2 Problem In many engineering applications, we are looking for a ?best? solution based on some criterion. For example, in the well known traveling salesman problem (TSP), we are looking for the cheapest route that visits all cities and terminates at the starting point. Problems of this nature can be formulated as the following optimization problem: x? ? arg max x?X H(x), (3.2.1) where x? is an optimal solution to the problem and X is a non-empty, often compact, solution space (in many applications X ? Rn). H : X ? R, the objective function, is a bounded deterministic measurable function with many local extrema. In the rest of this chapter we assume the following. Assumption 4. There exists a global optimal solution to Eq. 3.2.1, i.e., ?x? ? X 94 such that H(x) ? H(x?) ?x 6= x?, ?x ? X . In practice, this assumption is true for many optimization problems. For example, the assumption holds trivially when X is a finite discrete solution space. Generally, we do not assume any other structural information about the objective function (i.e., convexity, differentiability). We can introduce a measurable strictly increasing fitness function, ? : R ? R+, and reformulate Eq. 3.2.1 as: x? ? arg max x?X ? (H(x)) . (3.2.2) Since the reformulated problem guarantees the range of the new fitness-objective function (i.e., ? (H (?))) will always be non-negative, and it is equivalent to the original problem, we will solve Eq. 3.2.2 in place of Eq. 3.2.1. A similar problem statement can be found in Hu et al. [54]. 3.3 Probability Weighting Functions Probability weighting functions have many applications in science and engineering. In this thesis, we are most concerned with using them to re-weight the probabili- ties of outcomes. Weighting is suggested by Cumulative Prospect Theory (CPT) as an important part of the human decision making process. Prospect Theory (PT), the predecessor to CPT, was suggested in the 1970s by Kahneman and Tversky [59]. They were unsatisfied with PT due to its violation of second order stochastic 95 dominance, which was remedied by CPT in the 1990s [83]. CPT improves PT by re-weighting the outcome cumulative probability function instead of the outcome probability density function. This new approach can also be useful for global op- timization problems. The purpose of this section is to familiarize the reader with probability weighting functions, which will be used later for iteratively updating probability measures. We first introduce several definitions to assist us in our dis- cussion. Definition 18. A weighting function, w : [0, 1] ? [0, 1], is a monotonically increas- ing and Lipschitz continuous function with w(0) = 0 and w(1) = 1. We are interested in weighting functions with the additional property of optimal- seeking. Definition 19. A weighting function, w : [0, 1] ? [0, 1], is optimal-seeking if w (?x+ (1 ? ?) y) > ?w (x) + (1 ? ?)w(y), ?? ? (0, 1), x 6= y ? [0, 1]. Optimal-seeking is called risk-seeking in fields modeling risk-sensitivity. From the definitions above, we can prove the following proposition. Proposition 2. An optimal-seeking weighting function satisfies the inequality w(z) > z, ?z ? (0, 1) . Proof. Let x = 1, y = 0, and ? = z in the definition for optimal-seeking weighting functions. The proof follows trivially. 96 Next, we will be more concrete and provide several examples of risk-seeking probability weighting functions. Assumption 5. w is an optimal-seeking weighting function. Example 9. A simple polynomial weighting function has the form: w (p) = 1 ? (1 ? p)b , b > 1. (3.3.1) Another more complicated weighting function involving exponentials has the form: w(p) = ecp ? 1 ec ? 1 , (3.3.2) where c < 0. There are other parametric weighting functions, which can be found in [29]. An optimal-seeking weighting function tends to place more weight on highly unlikely, yet highly rewarding outcomes. In particular, it is used to overweight the probabilities of unlikely events (i.e., events with small probabilities) and underweight the probabilities of highly likely events. More specifically, we can apply an optimal- seeking weighting function to the cumulative distribution function of the outcomes, as in the example below. Example 10. We are given a die and asked to roll it once. We are told that we will be given a payoff that is equivalent to the outcome of the roll. For example, if we rolled a 1, then we would be given a $1 reward. We assume the die is fair, and 97 calculate the expected payoff for both the risk-neutral and optimal-seeking cases. We use R to denote the random variable associated with the outcomes of this game. The risk-neutral expected payoff is calculated as: E [R] = 6? n=1 1 ? F (n) = 6? n=1 n 6 = 21 6 ? 3.5, where F (n) is the cumulative distribution function at outcome n. Using the poly- nomial weighting function in Eq. 3.3.1 with b = 2, we have our optimal-seeking re-weighted expected payoff: Ew [R] = 6? n=1 w (1 ? F (n)) = 6? n=1 w(n6 ) = 6? n=1 1 ? (1 ? n6 ) 2 = 161 36 ? 4.47222. Remark 12. The reader should note that the re-weighting is applied to the good- news function1 (i.e., the probability of the outcome exceeding a threshold). In other words, the unlikely events are events whose payoff exceeded some threshold. It should be noted that the optimal-seeking re-weighted expected payoff is greater than that of the risk-neutral. This will be an important feature in proving the convergence of the CWO method. 1In other fields, the good-news function is also known as the survival function or the reliability function. 98 3.4 Discrete Solution Space We are trying to find an solution for Eq. 3.2.2 assuming that X := {1, 2, 3, 4, . . . } . We further assume the discrete topology for X . The discrete solution space case should offer the reader some intuitive insight into the workings of the CWO method. In the next section, we will present analogous results on Polish spaces. We denote the set of optimal measures on (X ,B (X )) by P?X := {P ? PX |P (X ?) = 1} , where X ? is the set of all optimal solutions, i.e., X ? := {i? ? X |H(i) ? H(i?), ?i ? X} , and PX is the set of all possible probability measures over B (X ). It should not surprise the reader that if we can find an element of P?X , then we have found a solution to the global optimization problem stated in Eq. 3.2.2. We assume X ? has only a finite number of elements. Assumption 6. The objective function, H : X ? R, has a finite number of optimal 99 solutions, i.e., the set X ? := {i? ? X |H(i) ? H(i?), ?i ? X} has a finite number of elements. Proposition 3. P?X and PX are both non-empty sets. Proof. Using Assumption 4 we know that a Dirac measure concentrated at any i? ? X ? is in P?X , which is a subset of PX . Our objective is to restrict the temporal evolution of a probability measure such that it will eventually concentrate its probability density at one of the optimal solu- tions. This evolution can be defined on a measurable space, (X ? R+,B (X ? R+)), where X is the given solution space and R+ represents time. If the evolution hap- pens in discrete time or iteration steps, then R+ can be replaced by {0, 1, 2, 3, . . . } . To solve Eq. 3.2.1, we want to find a probability measure P and a t? ? R+ such that P ({(t, i)|i ? X ?}) = P ({(t, i)|i ? X}) , ?t ? t?. (3.4.1) In other words, P at some finite time t? is a member of P?X . We denote the resulting probability space by 2 ( X ? R+,B ( X ? R+ ) ,P ) . 2For simplicity, we choose to work with (X ? R+,B (X ? R+) ,P) instead of( X R+ ,B ( X R+ ) , ?P ) , which is the common practice for defining a stochastic process. It should not be hard for the reader to see that ( X R+ ,B ( X R+ ) , ?P ) can induce a measure P on the measurable space (X ? R+,B (X ? R+)) . 100 At each time t, P induces a probability measure on the measurable space (X ,B (X )), Pt (BX ) := P ({(t, i)|i ? BX }) , ?BX ? B (X ) , (3.4.2) resulting in a probability space (X ,B (X ) ,Pt). Conversely, if we know Pt at all times, then we can construct a P that satisfies Eq. 3.4.23. The coordinate random variable is denoted by X (i.e., X (i) = i, i ? X ). Similarly, the outcome random variable is denoted by Y, where Y = ? (H (X)) . We denote the set of all possible outcomes from evaluating ? (H (?)) over X by Y := {y ? R+|?i ? X s.t. y = ? (H (i))} . A sensible next step is to write down the dynamics of Pt with respect to time (i.e., ?Pt), and interestingly we will be able to prove that Et [? (H (X))] := ? X ? (H (i)) dPt := ? Y ydP?(H(Xt))t (3.4.3) is a strictly increasing function with increasing t. In the equation above, P?(H(Xt))t is the probability measure of the random variable ? (H (Xt)) induced by Xt. Of course, probability weighting functions from Section 3.3 play a key role in the equations for ?Pt4. Using Eq. 3.4.3 along with Lyapunov stability analysis, we will conclude the convergence of Pt to optimal solutions (i.e., elements of P?X ). The following examples 3We can construct such a measure by using the definition: P (BX ?BT ) =? BX ?BT Pt (dx) dt, BX ?BT ? B (X ? R+).4We opted for the notation Pt instead of Pwt for simplicity, but the reader should be mindful of Pt?s dependence on w. 101 illustrate our approach. Example 11. (Distinct Outcomes) We are given a finite solution space X = [1, 2, 3], and its corresponding out- come space Y = [? (H (1)) , ? (H (2)) , ? (H (3))] ? R+. In addition, we assume the outcomes are distinct (i.e., ? (H(3)) > ? (H(2)) > ? (H(1)) ? 0). We denote the vector of probabilities for non-intersecting outcome events, yi(t) = Pt (? (H (i)) ? ? (H (X)) < ? (H(i+ 1)) , i ? X with ? (H (4)) = ?, by [y1(t), y2(t), y3(t)]. Furthermore, we denote the probabilities on elements of the solution space by [x1(t), x2(t), x3(t)], respectively. To avoid confusion, we will refer to [y1(t), y2(t), y3(t)] as the outcome probability vector, and [x1(t), x2 (t) , x3 (t)] as the solution probability vector. Note, Pt in the previous discussion is equivalent conceptually to the solution probability vector. Our goal is to write down the dynamic equation for the solution probability vector. However, in order to do that, we need to write down the dynamic equation for the outcome probability vector. Using an optimal-seeking weighting function, w, the dynamics of the outcome probability vector can be written as: dy1 dt = (w(1) ? w(y2 + y3)) ? y1 dy2 dt = (w(y2 + y3) ? w(y3)) ? y2 dy3 dt = w(y3) ? y3. 102 We define the matrix G, which in the nonlinear Markov processes literature is called a generator (see [61, 62]), as G(y1, y2, y3) :=? ??????? ?w(y2 + y3) w(y2 + y3) ? w(y3) w(y3) 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) ? 1 w(y3) 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) w(y3) ? 1 ? ??????? . Using the matrix G, we can write down the outcome and solution probability vector equations as: ? ? ??????? y1 y2 y3 ? ??????? = [ y1 y2 y3 ] G(y1, y2, y3) ? ? ??????? x1 x2 x3 ? ??????? = [ x1 x2 x3 ] G(x1, x2, x3). From Proposition 2 we see that w(y) > y ?y ? (0, 1), which implies dy3 dt > 0, dx3 dt > 0 ?x3, y3 ? (0, 1) and dy3 dt = 0, dx3 dt = 0 x3, y3 = {0, 1} . Since the best outcome, x = 3, will monotonically increase in its probabil- ity weight, and the increase in probability weight has to come from the non-best 103 solutions, the non-best solutions will eventually die out, i.e., ? ??????? x1 (t) x2 (t) x3 (t) ? ???????? ? ??????? 0 0 1 ? ??????? as t ? ?. We will prove our convergence assertion more rigorously later in this section. What would happen if two or more members in the solution space might map to the same outcome? Example 12. (Non-Distinct Outcomes) Consider the case when there are more than one solution mapping to the same outcome. We are given an optimal-seeking probability weighting function, w, and a solution space X = [1, 2, 3, 4]. Assume we know that ? (H(4)) = ? (H(3)) > ? (H(2)) > ? (H(1)) ? 0. Now, the outcome space, Y = [? (H(3)) , ? (H(2)) , ? (H(1))], has fewer elements than the solution space. Following the similar line of logic as in the previous example, the outcome probability vector equation is written as : dy1 dt = (w(1) ? w(y2 + y3)) ? y1 dy2 dt = (w(y2 + y3) ? w(y3)) ? y2 dy3 dt = w(y3) ? y3, 104 where yi is defined as y1 (t) = Pt (? (H(1) ? ? (H (X)) < ? (H(2)) y2 (t) = Pt (? (H (2)) ? ? (H (X)) < ? (H(3)) y3 (t) = Pt (? (H (3)) ? ? (H (X)) < ?) . We define the matrix Gy as: Gy(y1, y2, y3) :=? ??????? ?w(y2 + y3) w(y2 + y3) ? w(y3) w(y3) 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) ? 1 w(y3) 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) w(y3) ? 1 ? ??????? . As in the distinct outcome case, the solution probability vector is written as: dx1 dt = (w(1) ? w(x2 + x3 + x4)) ? x1 dx2 dt = (w(x2 + x3 + x4) ? w(x3 + x4)) ? x2 dx3 dt = ?w(x3 + x4) ? x3 dx4 dt = (1 ? ?)w (x3 + x4) ? x4, ? ? [0, 1] , 105 and its corresponding generator is Gx,?(x1, x2, x3, x4) :=? ??????????? ?w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ?w(x3 + x4) (1 ? ?)w(x3 + x4) 1 ? w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ? 1 ?w(x3 + x4) (1 ? ?)w(x3 + x4) 1 ? w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ?w(x3 + x4) ? 1 (1 ? ?)w(x3 + x4) 1 ? w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ?w(x3 + x4) (1 ? ?)w(x3 + x4) ? 1 ? ??????????? . We can write the dynamic equation for the outcome and solution probability vectors more compactly as: ? ? ??????? y1 y2 y3 ? ??????? = [ y1 y2 y3 ] Gy(y1, y2, y3) ? ? ??????? x1 x2 x3 ? ??????? = [ x1 x2 x3 ] Gx,?(x1, x2, x3). From Proposition 2 we see that w(y) > y, ?y ? (0, 1), which implies dy3 dt > 0, dx3 dt + dx4 dt > 0, ?x3, x4, y3 ? (0, 1) and dy3 dt = 0, dx3 dt = 0, dx4 dt = 0, x3, x4, y3 = {0, 1} . Ultimately, we have the best solutions for the problem, x = 3 and x = 4, 106 monotonically increasing in their probability weights. Since the increase in proba- bility weights has to come from the non-best solutions, they will eventually die out, i.e., ? ??????????? x1 (t) x2 (t) x3 (t) x4 (t) ? ??????????? ? ? ??????????? 0 0 ?3 ?4 ? ??????????? as t ? ?, ?3, ?4 ? 0, ?3 + ?4 = 1. Remark 13. A key feature of the non-distinct outcomes case is the appearance of the ?-function (later, this will be called a distribution rule). For instance, in the example above, the ?-function could be: Pt (X = i|Y = ? (H(i))) = xi (t)? j:?(H(j))=?(H(i)) xj (t) . An interesting observation from the examples above is that the solution and outcome probability vector equations fall into a category of equations called the nonlinear Fokker-Planck equation (see [61, 42]). In addition, the examples suggest that by propagating the solution and outcome probability vectors appropriately, we can concentrate the probability weights on optimal solutions. The key step forward is writing down the general equations for the solution and outcome probability vectors. The generalized solution probability vector equation has the form: 107 dxi (t) dt = ?i (t) ??w ?? ? j:?(H(j))??(H(i)) xj (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) xj (t) ???? ? ? ? ? xi (t) ?i ? X (3.4.4) ? ?(H(i))=y ? (i, y, t) = 1 ?y ? Y ?t ? R+, (3.4.5) where xi : R+ ? [0, 1] is the probability measure assigned to an element i ? X , and ?i (t) := ? (i, ? (H (i)) , t) is a distribution rule defined below. In Eq. 3.4.4, the difference between the first w distorted term and the second w distorted term is the event ? (H (j)) = ? (H (i)). Wang et al. (see [88]) have an alternative set of evolu- tion equations, also nonlinear Fokker-Planck equations, motivated by evolutionary game theory. As the reader will see later, we reach the same convergence results as Wang et al. in [88] with a modified approach. Definition 20. A distribution rule with respect to a given objective function, ? (H (?)), is a mapping ? : X ? Y ? R+ ? [0, 1] such that ? ?(H(i))=y ? (i, y, t) dx = 1 ?y ? Y ?t ? R+. Connecting this equation with the discussion at the beginning of this section, the reader should note that Pt (X = i) = xi(t) ?i ? X . 108 The generalized outcome probability vector equation has the form: dyz (t) dt = w (? j:j?z yj (t) ) ? w (? j:j>z yj (t) ) ? yz (t) ?z ? Y . We pay special attention to the best outcome equation: dy? (t) dt = w (? j:j?? yj (t) ) ? y? (t) , where ? := ? (H (i?)) i? ? X ?. In the rest of this section, we want to study the convergence properties of Eq. 3.4.4. Furthermore, we want to understand the stability properties, in the Lyapunov sense, of its limit points. The first step in understanding Eq. 3.4.4 is to understand the existence and uniqueness of its solutions. The outline of our proof for the next theorem follows [65] and [51]. Theorem 9. For each x(0) ? PX , the ordinary differential equation 3.4.4 has a unique solution for t ? R+. Here, ? : X ? Y ? R+ ? [0, 1] is a distribution rule5. Proof. We use the total variation norm, ??? , on a ?-finite signed measure space over (X ,B (X )) : ?x(t)? = sup A?B(X ) ? i?A |xi(t)| . Since x (t) ? PX is a probability measure ?t, and |?i| ? 1 , we know the following 5We will provide the definition for distribution rules in more general spaces in the next section. For the proof of this theorem, we are only using the fact that it is a bounded function. In the future, ? could depend on both i ? X and x (t) ? PX . 109 inequalities hold: sup A?B(X ) ? i?A ?????? ?i (t) ??w ?? ? j:?(H(j))??(H(i)) xj (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) xj (t) ???? ? xi (t) ?????? ? sup A?B(X ) ? i?A ?????? ?i (t) ??w ?? ? j:?(H(j))??(H(i)) xj (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) xj (t) ???? ?????? + sup A?B(X ) ? i?A |xi (t)| ? 2. Hence, we conclude the right hand side of Eq. 3.4.4 is bounded by 2. Next, we need 110 to prove that the right hand side of Eq. 3.4.4 is Lipschitz continuous. sup A?B(X ) ? i?A ?????? ???i (t) ??w ?? ? j:?(H(j))??(H(i)) x1j (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) x1j (t) ???? ? x1i (t) ?? ? ???i (t) ??w ?? ? j:?(H(j))??(H(i)) x2j (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) x2j (t) ???? ? x2i (t) ?? ?????? ? sup A?B(X ) ? i?A ?????? ??w ?? ? j:?(H(j))??(H(i)) x1j (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) x1j (t) ???? ? x1i (t) ? ????w ?? ? j:?(H(j))??(H(i)) x2j (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) x2j (t) ???? ? x2i (t) ?? ?????? ? sup A?B(X ) ? i?A ?????? w ?? ? j:?(H(j))??(H(i)) x1j (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) x1j (t) ?? ? ??w ?? ? j:?(H(j))??(H(i)) x2j (t) ?? ? w ?? ? j:?(H(j))>?(H(i)) x2j (t) ???? ?????? + ??x1i (t) ? x2i (t) ?? ? K sup A?B(X ) ? i?A ?????? ? j:?(H(j))=?(H(i)) x1j (t) ? ? j:?(H(j))=?(H(i)) x2j (t) ?????? + sup A?B(X ) ? i?A ??x1i (t) ? x2i (t) ?? ? K sup A?B(X ) ? i?A ?????? ? j:?(H(j))=?(H(i)) x1j (t) ? x2j (t) ?????? + sup A?B(X ) ? i?A ??x1i (t) ? x2i (t) ?? ? K sup A?B(X ) ? i?A ??x1i (t) ? x2i (t) ??+ ??x1(t) ? x2(t)?? ? K ??x1(t) ? x2(t)??+ ??x1(t) ? x2(t)?? ? (K + 1) ??x1(t) ? x2(t)?? . Hence, the right hand side of Eq. 3.4.4 is Lipschitz continuous in x with the constant K+1, where K is the Lipschitz constant for w (w is defined to be Lipschitz 111 continuous; see Definition 18). Using [90, Corollary 3.9], we conclude that Eq. 3.4.4 with an initial measure x(0) ? PX has a unique solution x(t) ?t ? R+. The next Lemma is needed in Theorem 12, which shows the monotonically increasing nature of Et [? (H (X))] . Lemma 6. Given an optimal-seeking weighting function, w, there exists a ?? such that ? ??Y ? ( w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) ? y? (t) ) (3.4.6) can be decomposed into the sum of its non-negative and negative parts: ? ???? ? ( w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) ? y? (t) ) ? ?? ? non?negative? ?? yj (t) ) ? y? (t) ) ? ?? ? negative . In other words, we can write Eq. 3.4.6 as the sum of its non-negative and negative parts. Proof. Since w is a monotonically increasing function, it satisfies w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) y? (t) ? 0. 112 Furthermore, since w is an optimal-seeking function we have w (? j:Y??1 yj (t) ) ? w (? j:Y >?1 yj (t) ) y?1 (t) > w (? j:Y??2 yj (t) ) ? w (? j:Y >?2 yj (t) ) y?2 (t) ??1 ? ?2 ? Y . (3.4.7) In addition, since w (0) = 0 and w (1) = 1, we know that w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) y? (t) > 1 for some ? ? Y . From Eq. 3.4.7, we know that if ?2 satisfies the above inequality, then so does ?1 ? ?2 ? Y . Hence, we conclude that ?? is the smallest such ?. At the beginning of this section, we stated implicitly that if we can find an element of P?X , then we have found a solution to the global optimization problem stated in Eq. 3.2.2. The theorems below present a blueprint, through the use of Eq. 3.4.4, to obtain an element of P?X . In Theorem 9, an initial point can be any element of PX . As we have discovered, PX is too large a set to initialize Eq. 3.4.4 to guarantee as t ? ? the solution probability vector, x (t), will be an element of P?X . Hence, we need to constrain our initial points to a smaller set. Definition 21. We denote the set of all x (0) for which there exists an optimal solution, i? ? X ?, such that xi?(0) > 0 by O. In other words, O contains all initial probability vectors with nonzero weights on at least one optimal solution. The next theorem proves the total probability 113 measure on the optimal solution set will converge to 1 as t ? ?. On the other hand, the total probability measure on the non-optimal solution set will converge to 0 as t ? ?. Theorem 10. If x (t) is a solution for Eq. 3.4.4, then it satisfies the following with initial points in O (i.e., x (0) ? O): 1) The total probability weight on the optimal solutions, ?i?X ? xi (t), is a monotonically increasing function of t. In fact, it converges to 1 as t ? ?; 2) The probability of any non-optimal solution, xi (t) : R+ ? [0, 1], i /? X ?, approaches zero as t? ?. Proof. We know that ? i?X ? xi = y?, hence we only need to prove y? is a monotonically increasing function of t. Writing down the equation for y?, dy? (t) dt = w (? j:j?? yj (t) ) ? y? (t) , and from Proposition 2: w (? j:j?? yj (t) ) > y? (t) , we conclude dy?(t) dt > 0 ?y?(t) 6= 1, and dy?(t) dt = 0 when y?(t) = 1. 114 Since x (0) ? O implies y? (0) > 0 , the first claim is proved. The second claim follows from the first claim. Since y? (?) = 1, and x is a solution probability vector (i.e., sum of xis is 1), we can conclude the following: lim t?? ? i?X xi (t) = lim t?? ? i?X ? xi (t) + ? i/?X ? xi (t) = 1 + lim t?? ? i/?X ? xi (t) = 1 =? lim t?? ? i/?X ? xi (t) = 0 =? lim t?? xi (t) = 0 ?i /? X ?. We are interested in finding the limit points of Eq. 3.4.4. Ideally, these limit points should be elements in P?X . This is accomplished by picking the initial point set more carefully. Definition 22. We define the limit set of a differential equation starting from an element x (0) ? I as EI := { x? ? PX |x? = lim t?? x (t) , x (0) ? I } . Remark 14. The limit set is invariant with respect to Eq. 3.4.4. More specifically, once x enters the set, it will not exit the set under Eq. 3.4.4. The author is not the first to introduce the concept of an initial point dependent limit set (cf. [87]). We characterize the limit set of Eq. 3.4.4 when x (0) ? O in the following theorem. 115 Theorem 11. The limit set of Eq. 3.4.4 started in O is P?X , i.e., EO = P?X := { x ? PX | ? i??X ? xi? = 1 } . Proof. To prove the first claim, we will first prove EO ? P?X , then we will prove EO ? P?X . The first case, EO ? P?X , can be trivially proved by taking an element x ? P?X , we notice that x ? O, and by definition of EO (i.e., the limit set of Eq. 3.4.4 starting from O), we conclude x ? EO . Now we proceed to prove EO ? P?X . We prove by contradiction. Assume there is an element e ? EO, but not in P?X such that: e?i (t) = ?i (t) ( w (? j:?(H(j))??(H(i)) ej (t) ) ? w (? j:?(H(j))>?(H(i)) ej (t) )) ?ei (t) ei (0) ? 0, limt?? ei (t) > 0 i /? X ?. This contradicts the second claim of Theorem 10, where ei (?) = 0. The next theorem shows the monotonically increasing nature of Et [? (H (X))], which will be useful later in proving some stability properties of Eq. 3.4.4. Theorem 12. Let x (t) be a solution of the dynamics represented by Eq. 3.4.4 with an initial point in O. Then the following statements hold: 1) The expected outcome, i.e., Et [? (H (X))] := ?i?X ? (H (i)) xi (t), is mono- tonically increasing with t; 2) If x(t) /? EO for any t ? R+, then Et [? (H (X))] is strictly increasing with t. 116 Proof. We start our proof by differentiating the average outcome function: d dtEt [? (H (X))] = d dtEt [Y ] = ? ??Y ? dy?(t)dt = ? ??Y ? ( w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) ? y? (t) ) = ? ???? ? ( w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) ? y? (t) ) + ? ?? yj (t) ) ? y? (t) ) (Lemma. 6) ? ?? ? ??Y ( w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) ? y? (t) ) = ?? ? 0 = 0. Here, the last equality holds because ? ??Y w (? j:Y?? yj (t) ) ? w (? j:Y >? yj (t) ) = 1, and ? ??Y y? (t)=1. The first claim is proved. The second claim is proved by contradiction. We assume that x(t) is not in the limit set, and d dtEt [? (H (X))] = 0. Along with Theorem 10, the equality above implies that ? (H (X)) is equal to a constant C = supi?X ? (H (i)), which means x (t) has all its probability mass on the optimal solutions. From Theorem 11, we know a limit point has its probability mass 117 on the optimal solutions. However, we assumed x(t) is not a limit point, hence we reach a contradiction. We will now proceed to prove some stability properties of Eq. 3.4.4, but first we need to introduce our definitions of stability given a metric d. Definition 23. Let L be a subset of PX . For a point x(t) ? PX , we define the distance between x(t) and L as d (x(t),L) := inf {d (x(t), q) , ?q ? L} . L is called Lyapunov stable if for all  > 0, there exists a ? > 0 such that d (x (0) ,L) < ? ? d (x (t) ,L) < , ?t > 0. Lyapunov was also interested in other stronger types of stability. Definition 24. Let L be a subset of PX . L is called asymptotically stable if L is Lyapunov stable, and there exists a ? > 0 such that d (x (0) ,L) < ? ? d (x (t) ,L) ? 0 as t ? ?. The next theorem is the main result of this section. It states EO is compact and asymptotically stable. Theorem 13. EO is a compact set and it is asymptotically stable. 118 Proof. We need to first prove that EO is a compact set. Since from our Theorem 11, we have EO = P?X and can instead prove P?X := { x ? PX | ? i??X ? xi? = 1 } is compact. It is easy to see that P?X is tight (see Appendix, Definition 31) due to Assumption 6. Furthermore, we can prove it is a closed set by contradiction. Assume there exists a sequence {xn} ? P?X such that xn ? x? /? P?X . This implies ?N such that ?n > N we have?i??X ? xni? < 1, and?i/?X ? xni > 0, which contradicts the second claim of Theorem 10. Hence, P?X = EO is a compact set. Consider the Lyapunov function V (xt) = E [? (H (X?))] ? Et [? (H (X))] , where x? ? P?O and X? is the corresponding random variable. Note that V (xt) is positive for all xt ? PX \P?X , and V (xt) = 0 for xt ? P?X = EO. From Theorem 12 we have ?V (xt) < 0 for all t > 0 and xt /? P?O. Furthermore, we know EO is a compact set. Applying a generalized version of Lyapunov?s theorem (see [12, Chapter 5]), the desired conclusion is reached. Remark 15. Chapter V of [12] presented a generalized version of Lyapunov?s theo- rem on a general metric space. In the proof of Theorem 13, we applied this gen- eralized Lyapunov?s theorem on the Banach Space of ?-finite signed measures over 119 (X ,B (X )), where we used the total variation distance d ( x1, x2 ) = sup A?B(X ) ? j?A ??x1j(t) ? x2j (t) ?? . Conclusion 1. We have proven so far that if we start Eq. 3.4.4 in O, then the possible limit points are elements of EO = P?X . Furthermore, the set EO is asymptomatically stable. The use of the Lyapunov function in proving the stability of the limit set can also be found in Wang [87], which applied the generalized version of Lyapunov?s theorem to an infinite dimensional space. In the next section, we will apply a similar approach to prove the stability of the limit set when the solution space is a Polish space. 3.5 Polish Space In this section, we try to find a solution for Eq. 3.2.2 given that X is a Pol- ish space with the Prohorov topology (see Appendix C.1). Polish spaces include finite-dimensional real spaces (i.e., Rn), which are important in many engineering applications. The Polish space of probability measures on (X ,B (X )) is denoted by PX , which also has the Prohorov topology. We will alter our notations in this section from the discrete space case. The symbol x in this section is an element of X (i.e., x ? X ). We will use the notation PX,t ({x}) to represent the probability measure of x at time t. This is different 120 from the discrete space case, where Pt (i) = xi(t) ?i ? X 6. We write X for the solution random variable (i.e., X (x) = x), and denote the outcome random variable by Y = ? (H (X)). On top of the assumptions we made for the discrete space case, we will make the following assumption. Assumption 7. w is differentiable, and has a bounded first derivative which is denoted by w?. The initial probability space, (X ,B (X ) ,PX,0), with an initial distribution PX,0 induces a probability measure PY,0 on the measurable space (Y ,B (Y)), where Y := {y ? R+|y = ? (H (x)) ?x ? X}? R+ PY,0 (B) = PX,0 ({x|? (H (x)) = y ?y ? B}) ?B ? B (Y) , (3.5.1) and B (Y) denotes the Borel ?-algebra for Y . The space of probability measures on (Y ,B (Y)) is denoted by PY . From examples 11 and 12, the generalized (i.e., when X is a Polish space) outcome probability measure, on the measurable space (Y ,B (Y)), satisfies:7 ?PY,t (B) = ? B w? (PY,t ({Y ? y})) dPY,t ? PY,t (B) ?B ? B (Y) . (3.5.2) The initial condition for PY,t is given by Eq. 3.5.1. We pay special attention to the 6In the discrete space case, we used xi (t) to denote the probability of obtaining the i-th solution at time t 7PY,t({y}) is equal to yi(t) in the discrete space case. 121 best outcome equation: ?PY,t (y?) = w (PY,t ({Y ? y?})) ? PY,t (y?) , where y? = maxx?X ? (H (x)) . Next, we write down the the Polish space counterpart to Eq. 3.4.4 PX,t (A) = ? A?Y dPX|Y,tdPY,t ?A ? B (X ) ?t ? R+\ {0} , (3.5.3) where PX|Y,t, the probability of X conditioned on Y , is the generalized ? function. In other words, given a fixed PX|Y,t , PX,t is determined by PY,t, which is a solution of Eq. 3.5.2. PY,t can be determined, without knowing PX|Y,t, from PX,t at any t ? R+ by the following equation: PY,t (B) = PX,t ({x|? (H (x)) = y ?y ? B}) ?B ? B (Y) . (3.5.4) Assumption 8. PX|Y,t is a given fixed conditional probability measure. We denote the set of optimal measures on (X ,B (X )) by P?X := {P ? PX |P (X ?) = 1} , where X ? is the set of all optimal solutions, i.e., X ? := {x? ? X |H(x) ? H(x?) ?x ? X} . 122 Similarly, we denote the set of optimal measures on (Y ,B (Y)) by P?Y := {PY ? PY |P (y?) = 1} , where y? = maxx?X ? (H (x)) . Obtaining an element P of P?X is equivalent to solving the optimization problem stated in Eq. 3.2.2. We will now prove the existence and uniqueness of a solution for Eq. 3.5.2. This is important for building a solid theoretical foundation for our approach. Theorem 14. For each PY,0 ? PY , the ordinary differential equation (3.5.2) has a unique solution for t ? R+. Proof. The outline of our proof follows [65] and [51]. We use the total variation norm, ??? , on a ?-finite signed measure space over (Y ,B (Y)) at the time t: ?Pt? = sup g ???? ? Y g(y)dPt ???? , where the sup is taken over all measurable functions g : Y ? R and sup y?Y |g (y)| ? 1. We simplify our notations by introducing the following definition: C (PY,t) (B) := ? B w? (PY,t ({Y ? y})) dPY,t ? PY,t (B) , , 123 where B ? B (Y). Since PY,t ? PY is a probability measure, we note that ?C (PY,t)? = sup g ???? ? Y g (y) C (PY,t) ???? ? sup g ???? ? Y g (y)w? (PY,t ({Y ? y})) dPY,t ????+ sup g ???? ? Y g (y) dPY,t ???? ? ? Y w? (PY,t ({Y ? y})) dPY,t + ? Y dPY,t ? K ? Y dPY,t + ? Y dPY,t ? (K + 1) , which proves the boundedness of C (PY,t), with K being the Lipschitz constant for w. Next, we need to prove that the right hand side of Eq. 3.5.2 is Lipschitz continuous. We know that ?C (PY,t) ? C (QY,t)? ? sup g ???? ? Y g (y) d (C (PY,t) ? C (QY,t)) ???? . (3.5.5) 124 Furthermore, we know that |C (PY,t) ? C (QY,t)| (B) = ???? (? B w? (PY,t ({Y ? y})) dPY,t ? PY,t (B) ) ? (? B w? (QY,t ({Y ? y})) dQY,t ? QY,t (B) )???? ? ???? ? B w? (PY,t ({Y ? y})) dPY,t ? ? B w? (QY,t ({Y ? y})) dQY,t ???? + |PY,t (B) ? QY,t (B)| ? K ???? ? B d (PY,t ? QY,t) ????+ ???? ? B d (PY,t ? QY,t) ???? ? (K + 1) ???? ? B d (PY,t ? QY,t) ???? . (3.5.6) Substituting Eq. 3.5.6 into Eq. 3.5.5, we have ?C (PY,t) ? C (QY,t)? ? sup g (K + 1) ???? ? Y g(y)d (PY,t ? QY,t) ???? = (K + 1) ?PY,t ? QY,t? . Hence, the right hand side of Eq. 3.5.2 is Lipschitz continuous in PY,t with the constant K + 1, where K is the Lipschitz constant for w (w is assumed to be Lipschitz continuous). Using Corollary 3.9 of [90], we conclude that Eq. 3.5.2 with an initial measure PY,0 ? PY has a unique solution PY,t. It should not be surprising that PY,t is a probability measure for all t. 125 Lemma 7. Given that PY,0 is a probability measure, then a solution PY,t of Eq. 3.5.2 at each time t > 0 is a probability measure, i.e., PY,t (B) ? 0 ?B ? B (Y) PY,t (Y) = 1 ?t ? R+ PY,t (?Bi) = ? PY,t (Bi) , where {Bi} is any countable collection of pairwise disjoint elements of B (Y). Proof. If we can prove that ?PY,t (Y) = 0, and ?PY,t (?Bi) = ? ?PY,t (Bi) then we have obtained our desired result. From Eq. 3.5.2 we know ?PY,t (Y) = ? Y w? (PY,t ({Y ? y})) dPY,t ? PY,t (Y) = ? Y (w? (PY,t ({Y ? y})) ? 1) dPY,t. Furthermore, that fact that ? 10 w? (s) ds = 1 (i.e., we assumed w(1) = 1) implies ? Y (w? (PY,t ({Y ? y})) ? 1) dPY,t = 0. 126 Next, we need to prove that ?PY,t (?Bi) = ? ?PY,t (Bi) , because ? ?PY,t (?Bi) dt = ? ? ?PY,t (Bi) dt? ?PY,t (?Bi) dt = ? ? ?PY,t (Bi) dt =? PY,t (?Bi) = ? PY,t (Bi) . Using the fact that w? is bounded, along with dominated convergence theorem we conclude that ?PY,t (?Bi) = ? ?Bi w? (PY,t ({Y ? y})) dPY,t ? PY,t (?Bi) = ? ? Bi w? (PY,t ({Y ? y})) dPY,t ? PY,t (Bi) = ? ?PY,t (Bi) . The next Lemma is needed in Theorem 18, which shows the monotonically increasing nature of Et [Y ] . Lemma 8. Given an optimal-seeking weighting function, w, there exists a y? such 127 that ? Y y (w? (PY,t ({Y ? y}))) dPY,t ? ? Y ydPY,t (3.5.7) can be decomposed into the sum of its non-negative and negative parts: ? Y?y? y (w? (PY,t ({Y ? y}))) dPY,t ? ? Y?y? ydPY,t ? ?? ? non?negative(? Y w? (PY,t ({Y ? y2})) ?y1 ? y2 ? Y . (3.5.8) In addition, since w (0) = 0 and w (1) = 1, we know that w? (PY,t ({Y ? y})) > 1 for some y ? Y . From Eq. 3.5.8 we know if y2 satisfies the above inequality, then 128 so does y1 ? y2 ? Y . Hence, we can conclude that y? is the smallest of such y. At the beginning of this section, we stated implicitly that if we can find an elements of P?X , then we have found a solution to the global optimization problem stated in Eq. 3.2.2. The theorems below present a blueprint, through the use of Eq. 3.5.3, to obtain an element of P?X . In Theorem 14, an initial condition can be any element of PY , which implies there is no restriction on the initial condition PX,0 ? PX . As we will discover later, PX is too large of a set to start Eq. 3.5.3 to guarantee as t ? ? the solution probability measure, PX,t, will be an element of P?X . Hence, we need to constrain our potential initial points to a smaller set. Definition 25. The set of all optimal initial solution probability measures is denoted by: OX := {PX,0 ? PX |PX,0 (X ?) > 0} . Furthermore, the set of all optimal initial outcome probability measures is denoted by: OY := {PY,0 ? PY |PY,0 (B) = PX,0 ({x|? (H (x)) = y ?y ? B}) ?PX,0 ? OX , ?B ? B (Y)} . Proposition 4. OY = {PY,0 ? PY |PY,0 (y?) > 0} and P?X ? OX . Proof. The first claim is a direct result of the definition above and Eq. 3.5.1. The second claim holds because how P?X is defined. The next theorem proves the probability measure on the optimal outcome set 129 will converge to 1 as t ? ?. On the other hand, the total probability measure on the non-optimal outcome set will converge to 0 as t ? ?. Theorem 15. If PY,t is a solution of Eq. 3.5.2 with initial points in OY (i.e., PY,0 ? OY ), then the following statements hold: 1) PY,t ({y?}) is a monotonically increasing function of t. In fact, it converges to 1 as t ? ?; 2) PY,t (Y\ {y?}) approaches zero as t? ?. Proof. We first write down the equation for PY,t (?): ?PY,t (?) = w (PY,t ({Y ? ?})) ? PY,t (?) . From Proposition 2, we conclude that w (PY,t ({Y ? ?})) > PY,t (?) . From the two equations above, we conclude that ?PY,t (?) > 0 ?PY,t (?) ? (0, 1) , and ?PY,t (?) = 0 when PY,t (?) = 1. Since PY,0 > 0 , the first claim is proved. The second claim follows from the first claim. Since lim t?? PY,t ({y?}) = 1 we 130 can conclude the following: lim t?? PY,t (Y) = lim t?? PY,t ({y?}) + PY,t (Y\{y?}?) = 1 + lim t?? PY,t (Y\{y?}) = 1 =? lim t?? PY,t (Y\{y?}) = 0. The second claim is proved. The next theorem connects the properties of PY,t with those of PX,t as t ? ?. This is an important step for understanding the evolution of Eq. 3.5.3. Theorem 16. Assuming PX,0 ? OX and PY,0 (B) = PX,0 ({x|? (H (x)) = y ?y ? B}) ?B ? B (Y) , the following statements hold: 1) lim t?? PY,t ({y?}) = lim t?? PX,t (X ?)=1; 2) lim t?? PY,t (Y\ {y?}) = lim t?? P (X\X ?)=0. Proof. Since we know PX,t (A) = ? A?Y dPX|Y,tdPY,t ?A ? B (X ) ?t ? R+\ {0} , 131 we prove the first claim by writing down the following equation: lim t?? PX,t (X ?) = lim t?? ? X ??Y dPX|Y,tdPY,t = ? X ??{y?} dPX|Y,t lim t?? dPY,t = 1. For claim 2, we can write down a similar equation: lim t?? PX,t (X\X ?) = lim t?? ? X\X ??Y dPX|Y,tdPY,t = ? X\X ??{y?} dPX|Y,t lim t?? dPY,t = 0. We are interested in finding the limit points of Eq. 3.5.2. Ideally, these limit points should be elements in P?X . In order to guarantee this, we need to restrict the potential initial points to OX . Definition 26. We define a limit set starting from an element P0 ? I as EI := { PX,? ? PX |PX,? (A) = lim t?? PX,t (A) ?PX,0 ? I, ?A ? B (X ) } We are particularly interested in the limit set EOX . The next theorem describes the elements in this set. 132 Theorem 17. The limit set of Eq. 3.5.3 started in OX is P?X , i.e., EOX = P?X := {P ? PX |P (X ?) = 1} . Proof. We will first prove EOX ? P?X , then we will prove EOX ? P?X . In the first case, EOX ? P?X , can be proved by taking an element PX ? P?X ? OX . By the definition of EOX (i.e., the limit set of Eq. 3.5.3 starting from OX), equations 3.5.3 and 3.5.2, we conclude that PX ? EOX . The second claim, EOX ? P?X , can be proven by contradiction. We assume there is an element QX ? EOX but not in P?X , which implies QX (A) = lim t?? QX,t (A) = ? A?Y dPX|Y,t lim t?? dPY,t ?A ? B (X ) s.t QX,0 (X ?) > 0, QX (X\X ?) > 0. The first line is due to the fact that QX ? EOX . The second line in the equation above, along with Eq. 3.5.4, imply lim t?? PY,t (Y\ {y?}) > 0 , which contradicts the second claim of Theorem 15, where lim t?? PY,t (Y\ {y?}) = 0. The following theorem shows the monotonically increasing nature of Et [Y ] , which will be useful later in proving some stability properties for Eq. 3.5.3. Theorem 18. Let PY,t be a solution for Eq. 3.5.2 with its initial point in OY . Then the following statements are true: 1) The expected outcome, i.e., Et [Y ] is monotonically increasing with t; 2) If PY,?t /? EOY for any ?t ? R+, then E?t [Y ] is strictly increasing with ?t. 133 Proof. We start our proof by differentiating the average outcome function: d dtEt [Y ] = ? Y y ?PY,t (dy) = ? Y y (w? (PY,t ({Y ? y}))) dPY,t ? ? Y ydPY,t = ? Y?y? y (w? (PY,t ({Y ? y}))) dPY,t ? ? Y?y? ydPY,t + (? Y 0, there exists a ? > 0 such that d (Pt,L) < ? ? d (Pt,L) < , ?t > 0. L is called asymptotically stable, with respect to a sequence of measures {Pt}, if L is Lyapunov stable, and there exists a ? > 0 such that d (P0,L) < ? ? d (Pt,L) ? 0 as t ? ?. The next theorem is the main result of this section. It tells us that if we start Eq. 3.5.3 in the set OX , then EOX will coincide with P?X . Furthermore, EOX is asymptotically stable. Theorem 19. EOX is a compact set and it is asymptotically stable. Proof. We need to first prove that EOX is a compact set. Since from Theorem 17, we have EOX = P?X , we can instead prove P?X := {P ? PX |P (X ?) = 1} 135 is compact. It is easy to see that P?X is tight (see Appendix, Definition 31) due to Assumption 6. Furthermore, we can prove it is a closed set by contradiction. Assume there exists a sequence {Pn} ? P?X such that Pn ? ?P /? P?X . This implies ?N such that ?n > N we have Pn (X ?) < 1, and Pn (X\X ?) > 0, which implies lim n?? PnY (Y\ {y?}) > 0. This contradicts the second claim of Theorem 15. Hence, P?X = EO is a compact set. Using the Lyapunov function V (PX,t) = y ? ?Et [? (H (X))] = V (PY,t) = y ? ?Et [Y ] , note that V (PX,t) > 0 for all PX,t ? PX \P?X , and V (PX,t) = 0 for PX,t ? P?X = EOX . From Theorem 18 we have ?V (PX,t) = ?V (PY,t) < 0 for all t > 0 if PX,t /? P?X . Furthermore, we know ?V (PX,t) = ?V (PY,t) = 0 ?t > 0 if PX,t ? P?X . Using V (PX,t) as the Lyapunov function, and the fact that EOX is a compact set, we can appeal to a generalized version of Lyapunov?s theorem (see [12, Chapter 5]). The desired conclusion is reached. The use of the Lyapunov function for proving the asymptotic stability of the limit set can be found previously in Wang?s dissertation [87]. 136 3.6 Numerical Algorithms In this section, we present a few numerical algorithms based on the CWO (Cu- mulative Weighting Optimization) method we presented in the previous sections. These algorithms attempt to find an optimal solution iteratively. Each iteration consists of 5 stages: generation, quantile-update, parameter-update, weight-update, and projection. Generation, quantile-update and projection stages remain the same for all variations of the generic algorithm (i.e., Algorithm 1). We propose several approaches for constructing the weight-update stage. These algorithms build on the theoretical results using the same types of modifications as are found in the CE and MRAS (see [75, 53, 88] ). 137 Algorithm 1 Generic CWO Algorithm 1. Initialization: Select a number N0 as the total initial number of candidate solutions generated at each iteration and an initial g?0 (a parameterized prob- ability density distribution) defined on X. Pick an initial quantile ?0 ? (0, 1),  ? 0, ? > 0, ? ? (0, 1). 2. Generation: Generate Nk i.i.d candidate solutions {xik}Ni=1 from g??k = (1 ? ?)((1 ? ?)g?k?1 + ?g?k) + ?U, where U is the uniform distribution. 3. Quantile-Update: Calculate the (1 ? ?k)-quantile, ??k+1 (?k, Nk) := ? (H)(d(1??k)Nke), where dae isthe smallest integer greater than a and H(i) is the i-th highest value for the sequence {? (H (xik))}Nki=1 . 4. Parameter-Update: If k=0 or ??k+1 (?k, Nk) ? ??k + 2 , then ? 4(a). Set ??k+1 = ??k+1 (?k, Nk) , ?k+1 = ?k, Nk+1 = Nk; else ? Find the largest ?? ? (0, ?k) such that ??k+1 (??, Nk) ? ??k + 2 ; ? If such a ?? exists and ?? > ?min, then ?? 4(b). ??k+1 = ??k+1 (??, Nk) , ?k+1 = ??, Nk+1 = Nk; ? else ?? 4(c). ??k+1 = ??k, ?k+1 = ?k, Nk+1 = d?Nke ; 5. Weight-Update: Update the weights of the generated samples {xik}Ni=1 ac- cording to weight update methods based on Eq. 3.4.4, producing the p.m.f pX,k+1 = ?N i=1w i k+1?(x? xik). wik+1 is the updated weight for xik. 6. Density Projection: Construct g?k+1 by projecting the density pX,k+1 =?N i=1w i k+1?(x? xik) onto g? by solving ?k+1 = arg max ??? N? i=1 wik+1 ln g? ( xik ) ; 7. Stop if some stopping criterion is satisfied; otherwise go to step 2 and k = k+1. Since discrete-time, discrete-state equations are more suitable for the computa- tions in the weight-update stage, we write down the probability density counterparts 138 of Eq. 3.5.2 and 3.5.3: pY,k+1 (y) = w ?? ? s:?(H(s))?y pX,k (s) ?? ? w ?? ? s:?(H(s))>y pX,k (s) ?? pX,k+1 (x) = pY,k+1 (H (x)) pX,k (x)? s:?(H(s))=y pX,k (s) , (3.6.1) where w : [0, 1] ? [0, 1] is the probability weighting function. If we assume w (?) is differentiable, then Eq. 3.6.1 can be written more compactly as: pX,k+1 (x) = w? (1 ? FY,k (? (H (x)))) pY,k (? (H (x))) pX|Y,k (x,H (? (H (x)))) = w? (1 ? FY,k (? (H (x)))) pY,k (? (H (x))) pX,k (x)? {s:?(H(s))=?(H(x))} pX,k (s) = ??w (? s:?(H(s))??(H(x)) pX,k (s) ) ? w (? s:?(H(s))>?(H(x)) pX,k (s) ) ? {s:?(H(s))=?(H(x))} pX,k (s) ?? pX,k (x) = w? (1 ? FY,k (? (H (x)))) pX,k (x) , (3.6.2) where the second equality holds because the conditional density is taken to be pX|Y,t (x,H (? (H (x)))) = pX,t (x)? {s:?(H(s))=?(H(x))} pX,t (s) and FY,k (?) is the cumulative distribution function for the outcome values. Other choices for pX|Y,t are also allowed; in particular, a uniform conditional density. The second to last equality in Eq. 3.6.2 can be related to Eq. 3.6.1 by the 139 following equation: pY,k+1 (y) = w ?? ? s:?(H(s))?y pX,k (s) ?? ? w ?? ? s:?(H(s))>y pX,k (s) ?? = w? (1 ? FY,k (? (H (x)))) pY,k (? (H (x))) . Algorithm 1 is the generic CWO algorithm using Eq. 3.6.1 as the weight update equation. Later, two algorithms with different ways of updating the density function are described. Although both weight-update methods will use Eq. 3.6.1 with a chosen distribution rule, they differ in their assignment of the sample weights. The performance of the algorithms is measured using asymmetric traveling salesman problems, which we will introduce in the section below. 3.6.1 Numerical Examples: Asymmetric Traveling Salesman Prob- lems (ATSPs) We apply variations of Algorithm 1 to several asymmetric traveling salesman prob- lems (ATSPs). They are taken from the website http://www.iwr.uni-heidelberg. de/groups/comopt/software/TSPLIB95. We follow a similar approach as in Hu [53], which is outlined below. The reader is reminded here that Algorithm 1 is designed for maximization problems, whereas we are searching for the minimum distances of ATSPs. The goal in each ATSP problem is to find the minimum length of a tour that connects Ncities cities with the same starting and ending cities. For an ATSP, we are given an Ncities-by-Ncities distance matrix D, whose (i,j)-th element 140 Di,j represents the distance from city i to city j. The problem can be mathematically stated as: min x?X {Ncities?1? i=1 Dxi,xi+1 +DxNcities ,x1 } , where x := { x1, x2, . . . , xNcities , x1 } is an admissible tour and X is the set of all admissible tours. We use the same approach suggested by Rubinstein [75], and De Boer et al. [26] for solving these problems. Each distance matrix D is given an initial state probability transition matrix ?P0, whose (i,j)-th element specifies the probability of transitioning from city i to city j. At each iteration of the algorithm, there are two important steps: 1) generate random admissible tours according to the probability transition matrix and evaluate the performance of each sampled tour; 2) update the probability transition matrix based on the tours generated from step 1. We denote the set of tours generated at the k-th iteration by {xik} , where i ? {1, . . . , Nk}. Without loss of generality, we will assume the samples are sorted according to their values (i.e., ? (H (xik)) < ? ( H ( x j k )) if and only if i < j). A detailed discussion of the admissible tour generation process can be found in De Boer et al. [26]. The CWO algorithm differs from other algorithms in how it updates its transition matrix. At the k-th iteration of CWO, the probability density function, pk (?, ?k), parametrized by the transition matrix ?k is given by the equation below: pk (x, ?k) = Ncities? l=i Ncities? i,j ?k (i, j) I{x?Xi,j(l)}, where Xi,j (l) is the set of all tours in X such that the l-th transition is from city i 141 to city j. We can show that the new transition matrix is updated (i.e., stage 6 in Algorithm 1) as ?k+1 (i, j) = Nk? l=1 ( pwk+1 ( xik )) I{xik?Xi,j}, where we denote the updated density by pwk+1 (?) and {xik+1} is generated from pk(?, ?k) (i.e., a density function that is parameterized by ?k). The superscript w is used to emphasize the dependence of the updated probability mass function on the probability weighting function w. The construction of pwk+1 (?) depends on the specific weight-update method. 3.6.1.1 Weight-Update Methods In this section, we present several different methods of obtaining pwk+1 (?) from a collection of samples {xik} at the k-th step. The first method we introduce is called tilted weight update. Tilted weight update (CWO_T): The tilted weight-update method is described in Algorithm 2. The key idea behind this variation is that we assign the initial weights of the samples according to their outcome values: the smaller the value, the more initial weight it gets (see stage 2 in Algorithm 2). 142 Algorithm 2 Tilted Weight Update 1. Remove all the non-elite samples, i.e., {x?ik} := {xik : H (xik) ? ??k} , where {x?ik} is the set of remaining elite samples; 2. Assign a weight to each element in Y according to the equation: pY,k (y) = maxY y ? y? Y maxY y ? y , where Y := {H (x) |x ? {x?ik}} ; 3. Assign the updated outcome weights to samples according the following equa- tion: wik+1 = 1 ?Nk,x?ik ? ??w ? ?? ? y:y??(H(x?ik)) pY,k (y) ? ??? w ? ?? ? y:y ?}l p CE X,k (x) ? 1{? (H (s)) > ?}pCEX,k (x) , (3.6.4) where an indicator function is used to select the elite samples. In fact, the cross- entropy equation is just the limiting case8 of the CWO_U algorithm. As we increase the optimal-seeking factor, the derivative of Eq. 3.6.3 will approach a step function (i.e., Eq. 3.6.4) with its discontinuity occurring at ? = 0.1 (see Fig. 3.6.1). Table 3.6.2 contains the results from running 20 trials of CWO_U and CE algorithms with the parameters ? = 0.01, ?0 = 0.1, ?min = 0.001, N0 = 1000,  = 0, ? = 1, ? = 0.01, and ? = 0.7. Here, N0 is the initial sample size. ATSP Ncities NTotal (Std. err.) Hbest H? H? ?? ?? ? (Std. err.) ft53 53 90,450(6.0e3) 6,905 7,679 7,037 0.112 0.0191 0.060(0.0244) ce_ft53 53 65,100(5.7e3) 6,905 7,676 7,088 0.111 0.0265 0.075(0.0276) Table 3.6.2: CWO_U and CE performance Results We plot the sorted minimum tour distances obtained from the 20 trials of CE and CWO_U algorithms in Fig. 3.6.2. We observe from Fig. 3.6.2 that compared with the standard cross-entropy method, our approach does better in every per- centile. For example, the 1920th percentile would contain the lowest optimal solution obtained among the 20 trials. The 1820th percentile would contain the second lowest optimal solution obtained among the 20 trials. 8as ? ? ? 146 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 5 10 15 20 Trials 7100 7200 7300 7400 7500 7600 Optimal Value ? CE ? CWO_U Figure 3.6.2: CE vs. CWO_U Sorted Trial Runs ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? ?? ???? ? ?? ?? ?? ??? ? ??? ??????? ?? ?????? ??????? ??? ????????????????????????????????????? 20 40 60 80 Iterations 10 000 15 000 20 000 Values ? CWO_U ? CE Figure 3.6.3: One trial of CE vs. CWO_U In Figure 3.6.3, which displays a typical convergence of a single run of CE vs. that of CWO_U, we observe that although CE converges faster at the beginning, CWO_U is able to eventually overtake CE and finishes at a lower value. 147 3.7 Conclusion In the first part of this chapter, we proved the convergence of CWO-based algo- rithms. The proofs provided a rigorous mathematical foundation for the two practi- cal algorithms we proposed in the numerical examples section. These two algorithms are variations of the generic CWO algorithm described in Algorithm 1. The two algorithm variations, CWO_T and CWO_U, differ by how they update their proba- bility density functions over the solution space for each iteration. The first approach, CWO_T, weights the samples according to their outcome values. On the other hand, CWO_U, uniformly weights the samples. We benchmarked the performance of the CWO_T algorithm and summarized the results in Table 3.6.1. Although the numeric values are quite satisfactory, we wanted to see if we could improve these results. This effort led us to the development of the second approach, CWO_U, which we consider as the preferred implementation of the CWO-base algorithm. Perhaps the most surprising fact is that by not taking into account the outcome values of the samples, we are able to achieve better performance results. Even more interesting is the fact that the standard cross-entropy approach is just a limiting case of the CWO_U approach. Comparing the numerical results of CWO_U with those of CE, we believe our algorithm is better at obtaining an optimal solution (see Fig. 3.6.2). Of course, the improvement in performance is at the cost of increasing computational costs. 148 Chapter 4 Contributions and Future Work 4.1 Contributions A new family of performance criteria has been proposed in the first part of this the- sis. These performance criteria are inspired by cumulative prospect theory, which has substantial empirical support. They include the performance criteria used in classical risk-sensitive control problems (e.g., expected utility). We proved the class of non-convex risk-sensitive control problems is still solvable via dynamic program- ming. We investigated both finite-horizon and infinite-horizon cases, and offered numerical examples to demonstrate the applications of our approach. The second part of this thesis presented a novel approach for solving stochastic global optimization problems. This new approach, cumulative weighting optimiza- tion, is also inspired by cumulative prospect theory. We proved the convergence to an optimal solution of the cumulative weighting optimization algorithms given a mild assumption on the initial condition. In addition, the algorithms in CWO include the well-known cross-entropy optimization algorithm. Since we have proved the convergence for all CWO algorithms, we naturally have obtained a convergence proof for the cross-entropy algorithm, which to the best of our knowledge, has not been done before. In addition, we presented the numerical analysis of our algo- rithms, where we compared the performance of two weight updating schemes. We 149 also compared the performance of our algorithms against that of the cross-entropy. 4.2 Future Work In the future, we would like to develop a new game theory framework which cap- tures risk-sensitivity. The effects of incorporating the CPT-based distortions could provide a novel perspective into the well-established field of game theory. Take, for example, the classic prisoner?s dilemma game. The payoff matrix for the players is given in Table 4.2.1. John von Neumann and Oskar Morgenstern [84] showed that Player 2 Decision Cooperate Defect Player 1 Cooperate R=3 R=3 S=0 T=5Defect T=5 S=0 P=1 P=1 R:Reward S:Sucker T:Temptation P:Penalty Table 4.2.1: Classic Prisoner?s Dilemma Problem a mixed strategy Nash equilibrium exists for any zero-sum game with a finite set of actions. Although the prisoner?s dilemma game is not a zero-sum game, analyzing how it reacts to mixed strategies is still important. The analysis of the effects of mixed strategies on the prisoner?s dilemma starts with generating the ?risk-neutral? (i.e., non-distorted) discretized mixed strategy reward table for player 1. In the matrix below, each element is the expected reward value given the probabilities of player 1 and player 2 cooperate. The i-th row and j-th column entry is calculated as the expected reward with the probability i?15 that player 1 cooperating and the probability j?15 that player 2 cooperating, where i, j ? {1, 2, . . . , 6}. 150 ? ??????????????????? 1. 1.8 2.6 3.4 4.2 5. 0.8 1.56 2.32 3.08 3.84 4.6 0.6 1.32 2.04 2.76 3.48 4.2 0.4 1.08 1.76 2.44 3.12 3.8 0.2 0.84 1.48 2.12 2.76 3.4 0. 0.6 1.2 1.8 2.4 3. ? ??????????????????? By introducing probability weighting distortion into the prisoner?s dilemma game, we alter the analogous risk-sensitive expected reward matrix: ? ??????????????????? 1. 1.00021 1.35889 3.28597 4.72747 5. 0.931868 0.962743 1.19336 2.63784 4.20448 4.86374 0.571493 0.758082 1.06917 2.19005 3.35859 4.14299 0.0897219 0.354482 0.850323 2.03271 2.87765 3.17944 0.0000523055 0.0423175 0.533821 1.90096 2.82637 3.0001 0. 0.000156917 0.269166 1.71448 2.7956 3. ? ??????????????????? . For the matrix displayed above, the weighting function used is w (x) := exp[?3.0 (? log (x))2.5], (4.2.1) and the risk adjusted expected reward is calculated as: E (p1, p2) := 5 (w ((1 ? p1) p2)) + 3 (w ((1 ? p1) p2 + p1p2) ? w ((1 ? p1) p2)) +1 (w (p1p2 + (1 ? p1) p2 + (1 ? p1) (1 ? p2)) ? w (p1p2 + (1 ? p1) p2)) , 151 where p1 and p2 are the probabilities that player 1 and player 2 cooperate, respec- tively. The rate that the expected reward decreases down each column for the risk- neutral matrix is constant, whereas the same rate for the risk-sensitive matrix slows down as we traverse toward the bottom of each column. The particular function (Eq. 4.2.1) used represents risk-aversion, hence we see from this example how risk- sensitivity changes player 1?s expected payoff. Aside from having Nash equilibria, the prisoner?s dilemma game could also have -equilibria. An -equilibrium is formally defined below. Definition 28. Let G = ( N,A = A1 ? ? ? ? ? AN , r : A ? RN ) be a N-player game with action sets Ai for each player i and a reward function r. The space of probability distributions over Ai is denoted by P (Ai). Let ri (pi) denote the payoff to player i when the policy pi = {pi1 ? ? ? ? ? piN} is played, where pii ? P (Ai). A policy pi is an -Nash Equilibrium for the game G if ri (pi) ? ri (pi?i, pi?i) ? , ?pi?i ? P (Ai) , i ? {1, . . . , N} , where pi ?i denotes all the mixed strategies except the i-th policy. When  = 0, an -equilibrium is exactly the well-known Nash equilibrium. By picking an  > 0, two additional equilibria are found in the matrix above. More specifically, if player 1 is risk-averse, with an appropriate , the player could always cooperate, which is not a Nash equilibrium. This example is only one of the many ways in which, by introducing risk-sensitivity into the game, we alter the standard 152 conclusion. It is interesting to note that the same effect cannot be achieved by using a utility function. In addition to game theory, we will apply the CPT-based risk-sensitive mea- sures to classic control problems and study the structure of the optimal policies obtained. Systems that are human-centric might find it beneficial to be controlled by an optimal controller derived using our risk-sensitive performance measure. In the future, we will investigate the effects of CPT-based risk-sensitive measures for iterated games. 153 Appendix A Prospect Theory A.1 St. Petersburg Paradox The example below is from [86]. Example 13. [St. Petersburg paradox]. Consider the following game. A fair coin will be flipped until the first heads shows up. If heads shows up at the k-th flip, then you receive $2k. Thus, immediate heads gives only $2, and after each tails the amount doubles. After 19 tails you are sure to be a millionaire. Think for yourself how much it would be worth to you to play this game. The expected value of the game is 1 2 ? 2 + 1 4 ? 4 + 1 8 ? 8 + 1 16 ? 16 + ? ? ? = 1 + 1 + 1 + 1 + ? ? ? = ?. Thus if you maximize expected value, then this game is worth more to you than any amount of money. In reality, people pay considerably less to participate in the game, something like $5. A.2 Axiomatization of Expected Utility: We need a little notation before we can write down the axioms. Consider two outcomes L1 and L2 with known probabilities. We use the notation L1 < L2 to 154 mean that the decision maker prefers L1 over L2 or considers them to be equally preferred. We assume that the utility here is continuous and strictly increasing. Axiom 1. 1. Completeness: For any two outcomes L1 and L2, either L1 < L2 or L2 < L2 or both. 2. Transitivity: For any three outcomes L1, L2, and L3, if L1 < L2, and L2 < L3, then L1 < L3. 3. Continuity: For any three outcomes L1 < L2 < L3, there exist ?, ? ? (0, 1) such that ?L1 + (1 ? ?)L3 < L2, and L2 < ?L1 + (1 ? ?)L3. 4. Substitution (Independence) Savage[80]: For any L1, L2 and L3, and any ? ? (0, 1) , L1 < L2 if and only if ?L1 + (1 ? ?)L3 < ?L2 + (1 ? ?)L3. Furthermore, von Neumann and Morgenstern [85] proved the following: L1 < L2 if and only if n? i=1 p1iu ( x1i ) ? n? i=1 p2ju ( x2j ) , where p1i is the probability for the i-th outcome of L1, x1i is the value for the i-th outcome of L1 and u is the utility function. 155 Appendix B Multifunctions and Selectors Our main source of reference for this section is [48]. Let X and A be (nonempty) Borel spaces. Definition 29. A multifunction (also known as a correspondence or set-valued mapping) ? from X to A is a function such that ? (x) is a nonempty subset of A for all x ? X. (A single-valued mapping ? : X ? A is of course an example of a multifunction.) The graph of the multifunction ? is the subset of X ?A defined as Gr (?) := {(x, a) |x ? X, a ? ? (x)} . A multifunction could have one of the properties described below. For every subset B of A, let ??1 [B] := {x ? X|? (x) ? B 6= ?} . Definition 30. A multifunction ? from X to A is said to be (a) Borel measurable if ??1 [G] is a Borel subset of X for every open set G ? A; (b) upper semi-continuous (u.s.c) if ??1 [F ] is closed in X for every closed set F ? A; (c) lower semi-continuous (l.s.c)if ??1 [G] is open in X for every open set G ? A; (d) continuous if it is both u.s.c and l.s.c. 156 A multifunction ? is said to be closed-valued (resp. compact-valued) if ? (x) is a closed (resp. compact) set for all x ? X. The multifunction is said to be closed if its graph is closed. Proposition 5. Let ? be a compact-valued multifunction from X to A. Then the following statements are equivalent: (a) ? is Borel-measurable; (b) ??1 [F ] is a Borel subset of X for every closed set F ? A; (c) Gr(?) is a Borel subset of X ? A; (d) ? is a measurable function from X to the space of nonempty compact subsets of A topologized by the Hausdorff metric. Proof. See [50] and [81]. Throughout the remainder of this appendix, ? is a given Borel-measurable multifunction from X to A, and F denotes the set of (single-valued) measurable multifunction from X to A, and F denotes the set of (single-valued) measurable functions f : X ? A with f (x) ? ? (x) for all x ? X. A function f ? F is called a selector (or measurable selector or choice or decision function) for the multifunction ?. Moreover, v : Gr (?) ? R is a given measurable function and v? (x) := inf ?(x) v (x, a) , x ? X. If v (x, ?) attains its minimum at some point in ? (x) , we write ?min? instead of ?inf.? 157 Proposition 6. Suppose that ? is compact-valued. (a) If v (x, ?) is l.s.c. on ? (x) for every x ? X, then exists a selector f ? ? F such that v (x, f ? (x)) = v? (x) = min ?(x) v (x, a) , ?x ? X and v?is measurable. Similarly, if v (x, ?) is u.s.c. on ? (x) for every x ? X, then exists a selector f ? ? F such that v (x, f ? (x)) = v? (x) = max ?(x) v (x, a) , ?x ? X and v? is measurable. (b) If ? is u.s.c and v is l.s.c and bounded below on Gr(?) , then there exists f ? ? F for which v (x, f ? (x)) = v? (x) = min ?(x) v (x, a) , ?x ? X holds, and v? is l.s.c. and bounded below on X. 158 Appendix C Spaces of Probability Measures The content of this section can be found in the appendices of [15] and [13]. C.1 Polish Spaces A Polish space, X , is a topological space which is separable and admits a complete metrization. Examples of such spaces are: separable Banach spaces, compact metric spaces, the space D [0, 1] of cadlag path from [0, 1] to R with Skorohod topology. Let X be a Polish space and d (?, ?) a complete metric on it. Definition 31. A probability measure P on X is said to be tight if for each  > 0, there exists a compact set K ? X with P (K) ? 1 ? . Analogously, a family P?, ? ? I, (I being an index set) is said to be tight if the above holds for all P? uniformly in ?, i.e., the set K above can be chosen to be the same for all ?. C.2 The Prohorov Topology Let Cb (X ) , PX denote respectively the space of bounded continuous real valued functions on X , and the space of probability measures on X . Endow Cb (X ) with the supremum norm ??? . PX will be given the topology for with a local base at 159 P ? PX is given by the sets of the type { Q ? PX | ???? ? fidQ ? ? fidP ???? < i, 1 ? i ? k } for some k ? 1, i > 0 and fi ? Cb (X ) for 1 ? i ? k. It is easily seen that this topology is Hausdorff and is coarser than the one induced by the total variation norm. It is called the Prohorov topology or the topology of weak convergence. Some other possible choices for the local basis at P ? PX are given below. {Q ? PX |Q (Fi) < P (Fi) + i, 1 ? i ? k} , Fi ? X closed {Q ? PX |Q (Gi) > P (Gi) ? i, 1 ? i ? k} , Gi ? S open {Q ? PX | |Q (Ai) ? P (Ai)| < i, 1 ? i ? k} , Ai ? S satisfy P (?Ai) = 0 where ?Ai is the boundary of Ai, { Q ? PX | ???? ? fidP ? ? fidQ ???? < i, 1 ? i ? k } , fi are bounded and uniformly continuous with respect to the metric d. Here,  > 0, k ? 1. C.3 Compactness in PX Theorem 20. A subset L ? PX is relatively compact if and only if it is tight. 160 C.4 Metrics on PX Definition 32. For any  > 0 and any Borel setA ? X , letA = {x ? X |d (x,A) < } . For ?, ? ? PX , we define the metric d (?, ?) = inf  { > 0|? (A) ? ? (A) + , ? (A) ? ? (A) +  for all Borel subset A of S} . Theorem 21. d(?, ?) defines a metric on PX consistent with the Prohorov topology. 161 References [1] O. Alagoz, L. M. Maillart, A. J. Schaefer, and M. S. Roberts, The optimal timing of living-donor liver transplantation, Management Science, 50 (2004), pp. 1420?1430. [2] M. Allais, Le comportement de lH?omme rationnel devant le risque: Cri- tique des postulats et axiomes de l?cole americaine, Econometrica, 21 (1953), pp. 503?546. ArticleType: research-article / Full publication date: Oct., 1953 / Copyright ? 1953 The Econometric Society. [3] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, Coherent measures of risk, Mathematical Finance, 9 (1999), pp. 203?228. [4] P. Artzner, F. Delbaen, J.-M. Eber, D. Heath, and H. Ku, Coherent multiperiod risk adjusted values and Bellman?s principle, Annals of Operations Research, 152 (2007), pp. 5?22. [5] R. Bellman, On the theory of dynamic programming, Proceedings of the National Academy of Sciences of the United States of America, 38 (1952), pp. 716?719. PMID: 16589166 PMCID: PMC1063639. [6] , Applied Dynamic Programming, Princeton University Press, Princeton, 1957. 162 [7] D. Bernoulli, Exposition of a new theory on the measurement of risk, Econo- metrica, 22 (1954), pp. 23?36. ArticleType: research-article / Full publication date: Jan., 1954 / Copyright ? 1954 The Econometric Society. [8] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Athena Scientific, 2nd ed., nov 2000. [9] D. P. Bertsekas, Dynamic Programming and Optimal Control: Approximate dynamic programming, Athena Scientific, 2012. [10] D. P. Bertsekas and S. E. Shreve, Stochastic Optimal Control: The Discrete Time Case, Academic Press, Nov. 1978. [11] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996. [12] N. P. Bhatia and G. P. Szeg?, Stability Theory of Dynamical Systems, Springer, 1970. [13] P. Billingsley, Convergence of Probability Measures, Wiley-Interscience, 2 ed., July 1999. [14] F. Black and M. Scholes, The pricing of options and corporate liabilities, Journal of Political Economy, 81 (1973), pp. 637?654. ArticleType: research- article / Full publication date: May - Jun., 1973 / Copyright ?1973 The Uni- versity of Chicago Press. [15] , Topics in Controlled Markov Chains, CRC Press, Apr. 1991. 163 [16] , Probability Theory: An Advanced Course, Springer, Oct. 1995. [17] M. Bouakiz and M. J. Sobel, Inventory control with an exponential utility criterion, Operations Research, 40 (1992), pp. 603?608. ArticleType: research- article / Full publication date: May - Jun., 1992 / Copyright ?1992 INFORMS. [18] O. ?avu? and A. Ruszczy?ski, Risk-averse control of undiscounted transient markov models, arXiv preprint arXiv:1203.5437, (2012). [19] A. Chateauneuf and P. Wakker, An axiomatization of cumulative prospect theory for decision under risk, Journal of Risk and Uncertainty, 18 (1999), pp. 137?145. [20] P. Cheridito, F. Delbaen, and M. Kupper, Coherent and convex mone- tary risk measures for bounded c?dl?g processes, Stochastic Processes and their Applications, 112 (2004), pp. 1?22. [21] , Coherent and convex monetary risk measures for unbounded c?dl?g pro- cesses, Finance and Stochastics, 10 (2006), pp. 427?448. [22] P. Cheridito, F. Delbaen, and M. Kupper, Dynamic monetary risk measures for bounded discrete-time processes, Electronic Journal of Probability, 11 (2006), pp. 57?106. [23] K.-J. Chung and M. J. Sobel, Discounted MDP?s: distribution functions and exponential utility maximization, SIAM journal on control and optimiza- tion, 25 (1987), pp. 49?62. 164 [24] S. P. Coraluppi and S. I. Marcus, Risk-sensitive and minimax control of discrete-time, finite-state markov decision processes, Automatica, 33 (1999), pp. 301?309. [25] , Mixed risk-neutral/minimax control of discrete-time, finite-state markov decision processes, IEEE Transactions on Automatic Control, 45 (2000), pp. 528 ?532. [26] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, A tutorial on the cross-entropy method, Annals of Operations Research, 134 (2005), pp. 19?67. [27] B. de Finetti, Logical foundations and measurement of subjective probability, Acta Psychologica, 34 (1970), pp. 129?145. [28] F. Delbaen and E. T. Hochschule, Coherent risk measures on general probability spaces, in In essays in honour of Dieter Sondermann, Springer, 2002, pp. 1?37. [29] E. Diecidue, U. Schmidt, and H. Zank, Parametric weighting functions, Journal of Economic Theory, 144 (2009), pp. 1102?1118. [30] P. Dupuis, M. R. James, and I. Petersen, Robust properties of risk- sensitive control, Mathematics of Control, Signals, and Systems (MCSS), 13 (2000), pp. 318?332. [31] A. Eichhorn and W. R?misch, Polyhedral risk measures in stochastic pro- gramming, SIAM Journal of Optimization, 16 (2005), pp. 69?95. 165 [32] E. A. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications, Springer, 2002. [33] E. Fern?ndez-Gaucherand and S. I. Marcus, Risk-sensitive optimal control of hidden markov models: structural results, IEEE Transactions on Automatic Control, 42 (1997), pp. 1418 ?1422. [34] P. C. Fishburn, Utility theory for decision making, tech. rep., DTIC Docu- ment, June 1970. [35] , The axioms of subjective probability, Statistical Science, 1 (1986), pp. 335?345. Mathematical Reviews number (MathSciNet): MR858514. [36] W. H. Fleming, Optimal long term growth rate of expected utility of wealth, The Annals of Applied Probability, 9 (1999), pp. 871?903. Mathematical Reviews number (MathSciNet): MR1722286; Zentralblatt MATH identifier: 0962.91036. [37] W. H. Fleming and S. J. Sheu, Risk-sensitive control and an optimal in- vestment model, Mathematical Finance, 10 (2000), pp. 197?213. [38] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Vis- cosity Solutions, Springer, 2nd ed., Nov. 2005. [39] H. F?llmer and I. Penner, Convex risk measures and the dynamics of their penalty functions, Statistics & Decisions, 24 (2006), pp. 61?96. 166 [40] H. F?llmer and A. Schied, Convex measures of risk and trading constraints, Finance and Stochastics, 6 (2002), pp. 429?447. [41] H. F?llmer, A. Schied, and T. Lyons, Stochastic finance. an introduction in discrete time, The Mathematical Intelligencer, 26 (2004), pp. 67?68. [42] T. D. Frank, Nonlinear Fokker-Planck Equations - Fundamentals and Appli- cations, Springer-Verlag, 2005. [43] M. Frittelli and G. Scandolo, Risk measures and capital requirments for processes, Mathematical Finance, 16 (2006), pp. 589?612. [44] U. G. Haussmann, Some examples of optimal stochastic controls or: The stochastic maximum principle at work, SIAM Review, 23 (1981), pp. 292?307. ArticleType: research-article / Full publication date: Jul., 1981 / Copyright ? 1981 Society for Industrial and Applied Mathematics. [45] X. D. He and X. Y. Zhou, Portfolio choice via quantiles, Mathematical Finance, 21 (2011), pp. 203?231. [46] D. Hern?ndez-Hern?ndez and S. I. Marcus, Risk sensitive control of markov processes in countable state space, Systems & Control Letters, 29 (1996), pp. 147?155. [47] , Existence of risk-sensitive optimal stationary policies for con- trolled markov processes, Applied Mathematics & Optimization, 40 (1999), pp. 273?285. 167 [48] O. Hern?ndez-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, Springer, 1 ed., Dec. 1996. [49] O. Hern?ndez-Lerma and J.-B. Lasserre, Further topics on discrete-time Markov control processes, Springer, June 1999. [50] C. J. Himmelberg, T. Parthasarathy, and F. S. VanVleck, Optimal plans for dynamic programming problems, Mathematics of Operations Research, 1 (1976), pp. 390?394. [51] J. Hofbauer, J. Oechssler, and F. Riedel, Brown-von Neumann-Nash dynamics: The continuous strategy case, Games and Economic Behavior, 65 (2009), pp. 406?429. [52] R. A. Howard and J. E. Matheson, Risk-sensitive markov decision pro- cesses, Management Science, 18 (1972), pp. 356?369. ArticleType: research- article / Issue Title: Theory Series / Full publication date: Mar., 1972 / Copy- right ?1972 INFORMS. [53] J. Hu, M. C. Fu, and S. I. Marcus, A model reference adaptive search method for global optimization, Operations Research, 55 (2007), pp. 549?568. [54] J. Hu, Y. Wang, E. Zhou, M. C. Fu, and S. I. Marcus, A survey of some model-based methods for global optimization, in Optimization, Control, and Applications of Stochastic Systems, D. Hern?ndez-Hern?ndez and J. A. Minj?rez-Sosa, eds., Birkh?user Boston, Boston, 2012, pp. 157?179. 168 [55] H. W. James and E. J. Collins, An analysis of transient markov decision processes, Journal of Applied Probability, 43 (2006), pp. 603?621. Mathematical Reviews number (MathSciNet): MR2274787; Zentralblatt MATH identifier: 1145.90099. [56] S. C. Jaquette, Markov decision processes with a new optimality criterion: Discrete time, The Annals of Statistics, 1 (1973), pp. 496?505. Mathemati- cal Reviews number (MathSciNet): MR378839; Zentralblatt MATH identifier: 0259.90054. [57] S. C. Jaquette, A utility criterion for markov decision processes, Manage- ment Science, 23 (1976), pp. 43?49. ArticleType: research-article / Full publi- cation date: Sep., 1976 / Copyright ?1976 INFORMS. [58] H. Jasiulewicz, Application of mixture models to approximation of age- at-death distribution, Insurance: Mathematics and Economics, 19 (1997), pp. 237?241. [59] D. Kahneman and A. Tversky, Prospect theory: an analysis of decision under risk, National Emergency Training Center, 1979. [60] S. Kl?ppel and M. Schweizer, Dynamic indifference valuation via convex risk measures, Mathematical Finance, 17 (2007), pp. 599?627. [61] V. N. Kolokoltsov, Nonlinear Markov Processes and Kinetic Equations, Cambridge University Press, July 2010. 169 [62] , Markov Processes, Semigroups and Generators, Walter de Gruyter, Mar. 2011. [63] H. J. Kushner, Introduction to Stochastic Control, Holt, Rinehart and Win- ston, 1971. [64] J. Leitner, A short note on second-order stochastic dominance preserving coherent risk measures, Mathematical Finance, 15 (2005), pp. 649?651. [65] J. Oechssler and F. Riedel, Evolutionary dynamics on infinite strategy spaces, Economic Theory, 17 (2001), pp. 141?162. [66] S. Onay and A. ?nc?ler, Intertemporal choice under timing risk: An ex- perimental approach, Journal of Risk and Uncertainty, 34 (2007), pp. 99?121. [67] S. Peng, A general stochastic maximum principle for optimal control problems, SIAM Journal on Control and Optimization, 28 (1990), p. 966?979. ACM ID: 84050. [68] H. Pham, Continuous-time Stochastic Control and Optimization with Financial Applications, Springer, 1 ed., July 2009. [69] S. R. Pliska, Dynamic Programming and Its Applications, Academic Press, 1978, ch. On the transient case for Markov decision chains with general state spaces, pp. 335?349. [70] S. R. Pliska, A stochastic calculus model of continuous trading: Optimal portfolios, Mathematics of Operations Research, 11 (1986), pp. 371?382. Ar- 170 ticleType: research-article / Full publication date: May, 1986 / Copyright ? 1986 INFORMS. [71] T. Post, M. J. van den Assem, G. Baltussen, and R. H. Thaler, Deal or no deal? decision making under risk in a large-payoff game show, American Economic Review, 98 (2008), pp. 38?71. [72] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley & Sons, Sept. 2007. [73] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, NY, USA, 1st ed., 1994. [74] F. Riedel, Dynamic coherent risk measures, Stochastic Processes and their Applications, 112 (2004), pp. 185?200. [75] R. Y. Rubinstein, Combinatorial optimization, cross-entropy, ants and rare events, in Stochastic Optimization: Algorithms and Applications, P. M. P. S. Uryasev, ed., Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001, pp. 304?358. [76] A. Ruszczy?ski, Risk-averse dynamic programming for markov decision pro- cesses, Mathematical Programming, 125 (2010), pp. 235?261. [77] A. Ruszczy?ski and A. Shapiro, Conditional risk mappings, Mathematics of Operations Research, 31 (2006), pp. 544?561. ArticleType: research-article / Full publication date: Aug., 2006 / Copyright ?2006 INFORMS. 171 [78] , Optimization of convex risk functions, Mathematics of Operations Re- search, 31 (2006), pp. 433?452. ArticleType: research-article / Full publication date: Aug., 2006 / Copyright ?2006 INFORMS. [79] , Optimization of risk measures, in Probabilistic and Randomized Methods for Design under Uncertainty, G. Calafiore and F. Dabbene, eds., Springer London, 2006, pp. 119?157. [80] L. J. Savage, The Foundation of Statistics, Courier Dover Publications, 1972. [81] M. Sch?l, Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal, Probability Theory and Related Fields, 32 (1975), pp. 179?196. [82] C. Starmer, Developments in non-expected utility theory: The hunt for a descriptive theory of choice under risk, Journal of Economic Literature, 38 (2000), pp. 332?382. ArticleType: research-article / Full publication date: Jun., 2000 / Copyright ? 2000 American Economic Association. [83] A. Tversky and D. Kahneman, Advances in prospect theory: Cumula- tive representation of uncertainty, Journal of Risk and Uncertainty, 5 (1992), pp. 297?323. [84] J. von Neumann and O. Morgenstern, The Theory of Games and Eco- nomic Behavior, The Theory of Games and Economic Behavior, Princeton University Press, 1947. 172 [85] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior (Commemorative Edition), Princeton University Press, Mar. 2007. [86] P. P. Wakker, Prospect Theory: For Risk and Ambiguity, Cambridge Uni- versity Press, July 2010. [87] Y. Wang, Simulation-Based Methods for Stochastic Control and Global Opti- mization, PhD thesis, University of Maryland - College Park, 2011. [88] Y. Wang, M. C. Fu, and S. I. Marcus, Model-based evolutionary optimiza- tion, in Proceedings of the Winter Simulation Conference, WSC ?10, Winter Simulation Conference, 2010, pp. 1199??1210. [89] J. Yong and X. Y. Zhou, Stochastic Controls: Hamiltonian Systems and HJB Equations, Springer, 1 ed., June 1999. [90] E. Zeidler, Nonlinear Functional Analysis and Its Applications: Part 2 B: Nonlinear Monotone Operators, Springer, Dec. 1989. 173