ABSTRACT
 Title of dissertation: STOCHASTIC SYSTEMS WITH
 CUMULATIVE PROSPECT THEORY
 Kun Lin, Doctor of Philosophy, 2013
 Dissertation directed by: Professor Steven I. Marcus
 Department of Electrical & Computer Engineering
 Stochastic control problems arise in many fields. Traditionally, the most
 widely used class of performance criteria in stochastic control problems is risk-
 neutral. More recent attempts at introducing risk-sensitivity into stochastic control
 problems include the application of utility functions. The decision theory com-
 munity has long debated the merits of using expected utility for modeling human
 behaviors, as exemplified by the Allais paradox. Substantiated by strong experimen-
 tal evidence, Cumulative Prospect Theory (CPT) based performance measures have
 been proposed as alternatives to expected utility based performance measures for
 evaluating human-centric systems. Our goal is to study stochastic control problems
 using performance measures derived from the cumulative prospect theory.
 The first part of this thesis solves the problem of evaluating Markov decision
 processes (MDPs) using CPT-based performance measures. A well-known method
 of solving MDPs is dynamic programming, which has traditionally been applied
 with an expected utility criterion. When the performance measure is CPT-inspired,
several complications arise. Firstly, when solving a problem via dynamic program-
 ming, it is important that the performance criterion has a recursive structure, which
 is not true for all CPT-based criteria. Secondly, we need to prove the traditional op-
 timality criteria for the updated problems (i.e., MDPs with CPT-based performance
 criteria). The theorems stated in this part of the thesis answer the question: what
 are the conditions required on a CPT-inspired criterion such that the corresponding
 MDP is solvable via dynamic programming?
 The second part of this thesis deals with stochastic global optimization prob-
 lems. Using ideas from the cumulative prospect theory, we are able to introduce
 a novel model-based randomized optimization algorithm: Cumulative Weighting
 Optimization (CWO). The key contributions of our research are: 1) proving the
 convergence of the algorithm to an optimal solution given a mild assumption on
 the initial condition; 2) showing that the well-known cross-entropy optimization al-
 gorithm is a special case of CWO-based algorithms. To the best knowledge of the
 author, there is no previous convergence proof for the cross-entropy method. In
 practice, numerical experiments have demonstrated that a CWO-based algorithm
 can find a better solution than the cross-entropy method.
 Finally, in the future, we would like to apply some of the ideas from cumulative
 prospect theory to games. In this thesis, we present a numerical example where
 cumulative prospect theory has an unexpected effect on the equilibrium points of
 the classic prisoner?s dilemma game.
STOCHASTIC SYSTEMS WITH CUMULATIVE PROSPECT
 THEORY
 by
 Kun Lin
 Dissertation submitted to the Faculty of the Graduate School of the
 University of Maryland, College Park in partial fulfillment
 of the requirements for the degree of
 Doctor of Philosophy
 2013
 Advisory Committee:
 Professor Steven I. Marcus, Chair/Advisor
 Professor Michael C. Fu
 Professor P.S. Krishnaprasad
 Professor Nuno Martins
 Professor Gang Qu
? Copyright by
 Kun Lin
 2013
Dedication
 To my family.
 ii
Acknowledgments
 Most importantly, I would like to express my most sincere gratitude to Pro-
 fessor Steven I. Marcus. Without his support and guidance, this dissertation would
 not have been possible. His integrity, dedication and patience have made a lasting
 impression on me. A graduate experience, riddled with set-backs and struggles, was
 made much more pleasant by his willingness to provide sound advice that led me
 back on the right track.
 I would like to thank my committee members for taking the time out of their
 busy schedules to read this manuscript and attend my defense. From Professor
 Michael C. Fu, I have gained many insights through our frequent research meetings.
 Professor P.S Krishnaprasad, Professor Nuno Martins, and Professor Gang Qu have
 all inspired me through their interesting lectures in their respective fields.
 In addition, I would like to express many thanks to my colleagues and office-
 mates, Enlu Zhou, Yongqiang Wang, James Ferlez, and Bhaskar Ramasubramanian,
 for the many interesting discussions on topics both technical and non-technical.
 For the many people not mentioned here, but who have definitely contributed
 to the completion of this thesis, I would like to express my deepest gratitude.
 iii
Contents
 List of Tables vi
 List of Figures vi
 1 Overview 1
 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
 1.1.1 Stochastic Optimal Control Problems . . . . . . . . . . . . . 2
 1.1.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . 4
 1.1.3 Performance Criteria: From Expected Value to Prospect Theory 5
 1.1.4 Stochastic Optimization with Probability Weighting Functions 10
 1.1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
 2 Dynamic Programming with Non-Convex Risk-Sensitive Measures 12
 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
 2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
 2.2.1 Discrete-Time Markov Control Model . . . . . . . . . . . . . . 15
 2.2.2 Cumulative Prospect Theory (CPT) . . . . . . . . . . . . . . 18
 2.2.3 Reward Transition Mappings . . . . . . . . . . . . . . . . . . 22
 2.2.4 Generalized Markov Dynamic Reward Measures . . . . . . . 25
 2.2.4.1 Markov Conditional Reward Measures . . . . . . . 29
 2.3 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 31
 2.3.1 Finite-Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . 31
 2.3.1.1 Application: Cumulative Prospect Theory Measures
 35
 2.3.2 Discounted Infinite-Horizon . . . . . . . . . . . . . . . . . . . 45
 2.3.3 Transient Markov Control Model . . . . . . . . . . . . . . . . 55
 2.3.4 The Organ Transplant Example: A Comparative Analysis . . 75
 2.4 Reward Measures and Optimal Policies . . . . . . . . . . . . . . . . . 82
 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
 3 Cumulative Weighting Optimization 92
 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
 3.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
 3.3 Probability Weighting Functions . . . . . . . . . . . . . . . . . . . . . 95
 3.4 Discrete Solution Space . . . . . . . . . . . . . . . . . . . . . . . . . 99
 3.5 Polish Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
 3.6 Numerical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 137
 3.6.1 Numerical Examples: Asymmetric Traveling Salesman Prob-
 lems (ATSPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
 3.6.1.1 Weight-Update Methods . . . . . . . . . . . . . . . 142
 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
 iv
4 Contributions and Future Work 149
 4.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
 A Prospect Theory 154
 A.1 St. Petersburg Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 154
 A.2 Axiomatization of Expected Utility: . . . . . . . . . . . . . . . . . . 154
 B Multifunctions and Selectors 156
 C Spaces of Probability Measures 159
 C.1 Polish Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
 C.2 The Prohorov Topology . . . . . . . . . . . . . . . . . . . . . . . . . 159
 C.3 Compactness in PX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
 C.4 Metrics on PX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
 Bibliography 162
 v
List of Tables
 1.1.1 Example: Data-Equivalence . . . . . . . . . . . . . . . . . . . . . . . 8
 2.3.1 Transition probability matrix for becoming an entrepreneur . . . . . 41
 2.3.2 Transition probability matrix for taking a normal job . . . . . . . . . 42
 2.3.3 An optimal solution for Ex. 5 (a value function and an optimal
 policy) at time 0 and 1. . . . . . . . . . . . . . . . . . . . . . . . . . 44
 2.3.4 An optimal solution for Ex. 6 (a value function and an optimal
 policy) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
 2.3.5 Transition Probabilities From State S . . . . . . . . . . . . . . . . . . 79
 2.3.6 Organ transplant example: parameters for F (x). . . . . . . . . . . . 79
 2.3.7 Organ Transplant Optimal Value and Policy Comparison . . . . . . . 81
 3.6.1 Performance of CWO_T on various ATSP problems based on 30
 independent replications . . . . . . . . . . . . . . . . . . . . . . . . . 144
 3.6.2 CWO_U and CE performance Results . . . . . . . . . . . . . . . . . 146
 4.2.1 Classic Prisoner?s Dilemma Problem . . . . . . . . . . . . . . . . . . 150
 List of Figures
 1.1.1 Data Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
 2.3.1 Organ Transplant State Transitions & Rewards . . . . . . . . . . . . 76
 2.3.2 Optimal Policy Comparison of the Organ Transplant Example . . . . 76
 3.6.1 Derivatives of Eq. 3.6.3 as ? ? ? . . . . . . . . . . . . . . . . . . . 145
 3.6.2 CE vs. CWO_U Sorted Trial Runs . . . . . . . . . . . . . . . . . . . 147
 3.6.3 One trial of CE vs. CWO_U . . . . . . . . . . . . . . . . . . . . . . 147
 List of Algorithms
 1 Generic CWO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 138
 2 Tilted Weight Update . . . . . . . . . . . . . . . . . . . . . . . . . . 143
 3 CWO_U Weight Update Algorithm . . . . . . . . . . . . . . . . . . . 145
 vi
Nomenclature
 ATSP Asymmetric Traveling Salesman Problems
 ATSP Asymmetric Traveling Salesman Problems
 CDF Cumulative Density Function
 CPT Cumulative Propsect Theory
 CWO Cumulative Weighting Optimization
 MDP Markov Decision Problem
 PDE Partial Differential Equation
 vii
Chapter 1
 Overview
 1.1 Motivation
 Many relevant real-life problems can be modeled as stochastic systems (e.g., weather,
 traffic patterns, financial markets, communication systems). A system could be
 stochastic for many reasons. For one, the randomness could be introduced by inac-
 curate sensors (i.e., measurement error). Sometimes, we lack sufficient information
 about the system, and model our ignorance by intentionally incorporating random-
 ness in the model (i.e., model error). Quantum mechanics support the idea that
 uncertainty is part of the natural order of the universe. For whatever the reason
 might be, studying stochastic systems is important for solving many real-life prob-
 lems from various fields. This thesis will try to tackle a few problems in stochastic
 systems that are inspired by recent advances in decision theory. More specifically,
 we use some of the latest performance measures suggested by the decision theory
 community to evaluate the performances of stochastic systems. These novel prob-
 lems are particularly suited for studying human-centric systems (e.g., war games,
 consumer behaviors, medical decisions).
 1
1.1.1 Stochastic Optimal Control Problems
 ?The concept of control can be described as the process of influencing the
 behavior of a dynamical system to achieve a desired goal. If the goal is to
 optimize some payoff function (or cost function) which depends on the
 control inputs to the system, then the problem is one of optimal control.?
 - Wendell H. Fleming and H. Mete Soner [38]
 If a stochastic system has a control input along with a performance criterion, then
 the resulting problem is a stochastic optimal control problem. Stochastic optimal
 control problems have many applications in engineering. The evidence of their
 successful applications can be found in a wide-range of fields (i.e., robotics, route
 planning, space exploration). In finance, the seminal paper by Black & Scholes in
 1973 [14] provides insight into the management of risks, which leads to an equation
 for valuing options.
 There are three approaches for solving stochastic optimal control problems,
 namely dynamic programming (Hamilton-Jacobi-Bellman equation), the maximum
 principle, and the martingale and convex duality approach [68]. Dynamic pro-
 gramming, most popular in the analysis of controlled Markov processes, provides
 sufficient conditions for optimality in its verification theorem, stating that if there
 exists a policy satisfying the Hamilton-Jacobi-Bellman PDE, then it is an optimal
 policy. The essence of dynamic programming is Bellman?s optimality principle,
 which roughly says if one knows an optimal policy for an entire period, then start-
 ing from any time in that period and at a state along an optimal trajectory, the
 2
same policy is still optimal. This insight leads to the realization that decomposing
 the optimal value function into two parts (i.e., immediate value and value-to-go) is
 the key to obtaining an optimal policy. Intuitively, one can obtain the Hamilton-
 Jacobi-Bellman PDE by first considering the discrete time Bellman equation with
 time step h, dividing both sides by h, and then taking h to zero. One key feature
 of dynamic programming is that a solution of the Hamilton-Jacobi-Bellman PDE
 is a function of the state. In practice, this means the computational complexity
 of dynamic programming grows exponentially as the number of states increases,
 leading Bellman to coin the term ?curse of dimensionality?. On the other hand,
 a state-feedback policy can be found easily once a solution is obtained, which can
 be implemented simply as a lookup-table. Fleming & Soner [38] and Bertsekas [8]
 are excellent textbooks for a comprehensive review of dynamic programming in the
 controlled Markov process setting. Alternatively, the maximum principle provides
 necessary conditions for optimality. The stochastic maximum principle is similar in
 spirit to its deterministic counterpart. For the interested reader, Haussmann [44],
 Peng [67], Yong and Zhou [89] are excellent sources for a further investigation into
 this topic. The deterministic maximum principle can be intuitively described as
 perturbing an optimal control over an interval . By taking the first order Taylor
 approximation of the corresponding value function with respect to  and sending
  to zero, we obtain a variational inequality. Combining the variational inequality
 with the co-state equations, the deterministic maximum principle is complete. The
 stochastic maximum principle differs from its deterministic counterpart in its usage
 of forward-backward stochastic differential equations to describe the dynamics of
 3
its state and adjoint variables. The martingale approach, which was originated by
 Pliska [70] and gained popularity in the mathematical finance community, divides
 the problem into two subproblems: 1) Find the optimizer for the problem at a fixed
 terminal time T. If the cost function is convex, then the problem of finding the
 optimizer for the problem can be reduced from an infinite-dimensional problem to
 a finite-dimensional problem by using convex duality. 2) Use the martingale Repre-
 sentation Theorem to extract the corresponding optimal control. In Pliska?s original
 paper, convexity of the cost function is an important assumption in proving the ex-
 istence of the solution to the dual static problem. Pham [68] has a recent discussion
 on this approach.
 1.1.2 Dynamic Programming
 Of the three approaches we mention above, dynamic programming has proven to be
 the most popular method for solving dynamic stochastic optimization problems with
 controlled Markov processes. Numerically, dynamic programming can be applied
 using either value iteration or policy iteration. In value iteration, the algorithm could
 start from an infeasible1 value function and converges to a feasible one. On the other
 hand, applying policy iteration results in value functions that are feasible. Perhaps
 the most important reason for dynamic programming?s popularity is its production
 of a feedback policy, which is advantageous for storage, execution (i.e., a table
 lookup) and robustness. As shown in Bertsekas [8], the breadth of problems that
 can be solved using dynamic programming includes inventory control, deterministic
 1A value function is feasible if it is yielded by a feasible policy.
 4
scheduling problems, machine repair, reachability of ellipsoidal tubes, and pursuit-
 evasion games. Problems presented in his book all have the following flavor: 1) an
 underlying discrete time dynamic system; 2) a cost function that is additive over
 time. The natural question to ask for an inquisitive mind is: why do we want to
 evaluate performances using expectation? If our goal is to predict, in particular,
 what a human would do in many situations, then expected value contradicts much
 empirical evidence. Hence, in applying dynamic programming to human decision
 making processes, we need to have a model that agrees with empirical data. The
 discussion of the merits of various classes of performance criteria is the focus of
 the next section. This thesis deviates from the standard dynamic programming
 approach by updating the dynamic programming framework with a more general
 class of performance criteria. A host of issues arise from doing this: does a dynamic
 programming equation even exist?
 1.1.3 Performance Criteria: From Expected Value to Prospect The-
 ory
 Using expected value as a performance criterion has been a long tradition in many
 engineering and scientific fields. Why is that? Is it only out of its mathematical
 convenience (i.e., linearity)? In this section, we will trace the development of the
 expected utility theory and highlight some of its deficiencies. For a more in-depth
 analysis of the development in the area of prospect theory, the interested reader can
 refer to [86], which the discussion below draws many facts and examples from.
 5
When given a dynamic control problem, we would like to use an appropriate
 performance measure for each situation. Hence, understanding its implications is
 of paramount importance. Using the wrong performance measure might yield an
 ineffective policy. Each performance criterion encapsulates our preference order-
 ing of the potential outcomes2 due to our action/decision. In other words, if we
 know we prefer outcome L1 to L23 , then we must use a performance criterion that
 reflects this preference (i.e., ? (L1) > ? (L2)). On the other hand, we would also
 like the implications of such a preference ordering to be sensible for the problem at
 hand. For example, in the expected value case, de Finetti [27] (also see a survey
 by Fishburn [35] for the axiomatization of expected value) shows that the existence
 of subjective probabilities is equivalent to transitivity, monotonicity and additiv-
 ity. In addition, the existence of a certainty equivalent (i.e., deterministic) value
 for each possible outcome guarantees the no-arbitrage condition for the preference
 system. In other words, expected value performance criteria have properties that
 we deem rational (i.e., transitivity, monotonicity, and additivity), and lack some
 undesirable attributes (e.g., arbitrage). However, expected value does have some
 limitations. One particular limitations that spurred the search for its alternative,
 expected utility, is best demonstrated in the St. Petersburg paradox4. The paradox
 is an example of a game having an infinite certainty equivalent value (i.e., the price
 one is willing to pay) under expected value; However, in practice, people often are
 only willing to pay a finite amount for the game. The paradox was resolved by
 2An element of the probability space is usually called an outcome.
 3L1 is a short-hand notation for lottery 1, not to be confused with the function space.
 4see appendix on prospect theory
 6
Bernoulli in 1738 [7] by suggesting that people do not evaluate outcomes by their
 objective values, but rather by their utility. This realization started the new field
 of expected utility theory.
 Expected utility as a performance criterion still remains popular. Perhaps
 such success can be attributed to its axiomatization by von Neumann & Morgen-
 stern [85]. von Neumann provides the necessary and sufficient conditions for using
 maximization of expected utility as a function for ordering preferences. Fishburn
 [34] made updates to von Neumann?s work in the 1970s. These conditions include
 completeness, transitivity, continuity, and substitution. Despite many justifications
 for using expected utility, it suffers from a well known contradiction with empirical
 observations demonstrated by the Allais paradox [2], where most people violate the
 substitution axiom implied by expected utility theory. Lesser known but more re-
 cent discussions on the empirical violations of expected utility can be found in [82]
 and [71].
 Cumulative prospect theory resolves all of the paradoxes mentioned above,
 and has stronger empirical support compared with expected utility theory. There is
 also a strong behavioral foundation found in its axiomatization [19]. A key feature
 of cumulative prospect theory is probabilistic sensitivity. This is different from
 the traditional approach of outcome sensitivity in the expected utility theory. We
 will first demonstrate, via an example, that risk-aversion can have an equivalent
 representation outside of expected utility theory.
 Example 1. This example is from [86], which demonstrates that risk-sensitivity
 7
a b c d e
 % outcome is 100 0.10 0.30 0.50 0.70 0.90
 % outcome is 0 0.90 0.70 0.50 0.30 0.10
 Certainty Equivalent Value 1 9 25 49 81
 Table 1.1.1: Example: Data-Equivalence
 can be represented either as outcome sensitivity or probability sensitivity. We are
 given five certainty equivalent (i.e., indifference) values and probability pairs.
 From the table above, we see that the value a person places on an outcome
 (e.g., a,b,c,d or e) might not be a linear evaluation. In other words, in prospect b,
 0.30 ? 100 + 0.70 ? 0 = 30 6= 9.
 We can of course find a utility function U such that
 ?
 U (x) p (x)
 agrees with the values in the table above, where p is a probability mass function
 and U (?) is a utility function. This operation can be equivalently achieved by using
 a probability weighting function w,
 ?
 w (p) x,
 where instead of transforming the outcomes, we are now transforming the probability
 weights (see Figure 1.1.1).
 Remark 1. In the example above, we are only trying to demonstrate that risk-
 8
?
 ?
 ?
 ?
 ?
 H a L
 H b L
 H d L
 H c L
 H e L
 20 40 60 80
 $
 0.2
 0.4
 0.6
 0.8
 1.0
 p
 (a) Normalized EV vs. Certainty Equiv-
 alent
 ?
 ?
 ?
 ?
 ?
 (a)
 (c)
 (b )
 (d )
 (e)
 0.2 0.4 0.6 0.8 1.0
 p
 20
 40
 60
 80
 $
 (b) Certainty Equivalent vs. Normalized
 EV
 Figure 1.1.1: Data Equivalence
 sensitivity can be expressed equivalently either as outcome sensitivity or probability
 sensitivity. We want to emphasize the fact that w(p) : x ? [0, 1], transforms
 the probabilities based on the entire probability mass function. In other words,
 w : P ? P , is a mapping from P , the space of probability mass functions, to P .
 The example above should convince the reader that every utility function has
 an equivalent probability weighting function. This, of course, is not the full story.
 If probability weighting is only equivalent to expected utility, then we would not
 be so interested in it. After all, if probability weighting functions can predict only
 as well as expected utility, we will not need to advance risk-sensitive performance
 measures beyond utility theory. Several sources have demonstrated that in many
 cases probability weighting gives different predictions than those given by outcome
 weighting (i.e., expected utility). Onay and ?nc?ler [66] demonstrate the predic-
 tions offered by outcome based risk-aversion are different from that of probability
 weighting. More importantly, the predictive power of the probability weighting
 approach is confirmed by their experiments.
 9
1.1.4 Stochastic Optimization with Probability Weighting Functions
 Stochastic optimization is another field where a novel approach is inspired by cu-
 mulative prospect theory (CPT). In CPT, the probability weighting function has
 the effect of weighting rare good-news events more than other events. In stochastic
 optimization, if we can apply this shift in weights iteratively to sampled distribu-
 tions, we can intuitively understand how we might converge to an optimal value.
 There are a few desirable properties for stochastic optimization algorithms: 1) we
 would like the algorithms to increase, in expected value, monotonically; 2) once an
 optimal solution is obtained, we would like it to be robust again perturbations. As
 the reader will see in Chapter 3, our method, cumulative weighting optimization
 (CWO) exhibits both of these properties. We also will develop CWO-based numeri-
 cal algorithms and present their simulated results in the same chapter. Interestingly,
 the well-known cross-entropy method is a special case of CWO-based algorithms.
 In fact, we are able to improve the performance of the cross-entropy method by
 viewing it as such.
 1.1.5 Outline
 The thesis is organized by chapters. Chapter 2 will prove the suitability of dynamic
 programming equations for non-convex performance measures, which include CPT-
 based criteria. We will present both the finite horizon and infinite horizon cases.
 In addition, we will also analyze the structure of optimal policies yield by CPT-
 based criteria and compare them with other risk-sensitive performance criteria. In
 10
Chapter 3, we will provide convergence proofs for cumulative weighting optimization
 methods. Numerical examples will be provided to demonstrate the performance of
 our algorithms. In the last chapter, we will discuss our contributions and future
 work.
 11
Chapter 2
 Dynamic Programming with Non-Convex Risk-Sensitive Measures
 Dynamic programming with risk-sensitive performance measures has applications in
 many fields (e.g., operations research, finance, control systems). Historically, risk-
 sensitive performance measures are represented using expected utility functions.
 More recently, literature on dynamic performance (i.e., risk or reward) measures
 has inspired an alternative approach to risk-sensitive performance evaluation. The
 dynamic performance measure framework is a generalization of the classical work
 using expected value. One limitation of this approach is that it has only been devel-
 oped for coherent performance measures, which exclude a large class of important
 non-convex performance measures (e.g., cumulative prospect theory (CPT) based
 performance measures). We remedy this limitation by proving the optimality of the
 dynamic programming equation for non-convex performance measures.
 2.1 Introduction
 Dynamic programming, introduced by Bellman [5], is a dynamic optimization method.
 It has been the subject of intense research in the past five decades; see for ex-
 ample [6, 10, 15, 32, 63, 48, 73]. Dynamic optimization problems modeled by
 controlled Markov processes and solved via dynamic programming are commonly
 referred to as Markov decision processes (MDPs). Researchers have developed tech-
 12
niques to lift MDPs? curse-of-dimensionality (e.g., approximate dynamic program-
 ming [11, 72, 9]), which enable the application of dynamic programming in many
 fields (e.g., operations research, finance, control systems).
 In many applications, risk-sensitive measures are more appropriate than risk-
 neutral measures [52, 56, 57]. In standard MDPs, the performance measures are
 frequently expressed as expected utility functions that are risk-sensitive [23, 17, 36,
 37, 33, 46, 47, 24, 25]. For example, many problems evaluate their outcomes by
 using the performance measure E [u(X)], where u is a risk-sensitive utility function
 (e.g., exponential), and X is a random variable representing the reward1. A notable
 feature of optimal policies, induced by the risk-sensitive performance measures, is
 their robustness with respect to modeling errors [30].
 An important class of risk-sensitive performance measures is coherent risk
 measures, of which E [u (X)], where u (?) is a convex function, is a special case
 [3, 28, 40, 41, 64, 79, 78]. Other well known examples include mean-semideviation
 and conditional value-at-risk. An important property of coherent risk measures is
 convexity. Recently, their dynamic counterparts have received great interests in the
 literature [74, 20, 31, 39, 43, 22, 21, 4, 60]. In many problems, convex performance
 measures are not the best option for measuring the desirability of outcomes. A
 well-known example of a non-convex performance measure is suggested by Tversky
 and Kahneman in the cumulative prospect theory (CPT) [83]. Although CPT had
 its beginning in the 1990s, its incorporation into dynamic systems is still nascent.
 Recently, He and Zhou have studied [45] a portfolio choice problem with a non-
 1Reward is often the sum of per-stage rewards.
 13
convex performance measure. The problem maximizes the terminal wealth of a self-
 financing portfolio2 driven by a financial market3 that is uncontrollable from the
 perspective of the investor (see [45], Eq. 3). Often, the financial market is assumed
 to be a Markov semimartingale and has a nonempty set of equivalent martingale
 measures. Under these assumptions, one can apply the martingale approach (see
 [68], Chapter 7) to arrive at the desired analytical results. These results become
 more difficult, if not impossible, to obtain if these assumptions are eliminated. This
 chapter will study both convex and non-convex performance measures (e.g., CPT-
 inspired reward measures) when the underlying model is a discrete-time controlled
 Markov process.
 The goal of this chapter is to address, when the underlying system is mod-
 eled as a controlled Markov process, the question: How can we generalize dynamic
 programming to both convex and non-convex performance measures? An approach,
 suggested by Ruszczy?ski [76], is based on dynamic risk measures and risk transi-
 tion mappings (see [77, Definition 5]). Assuming a sequence of time-consistent4 risk
 measures is given (see [76, Theorem 1] ), he concludes that if the corresponding one-
 step dynamic risk measures satisfy the four assumptions of coherent performance
 measures, namely convexity, monotonicity, translation equivalence, and positive
 homogeneity, and an equivalent Markov risk transition mapping exists for each one-
 step dynamic risk measure, then a dynamic programming equation exists for the
 dynamic optimization problem. Unfortunately, since CPT-inspired measures have
 2This is just a constraint on the action space of the MDP.
 3A special case of a controlled semi-martingale Markov process.
 4Time-consistency is key for rewriting the risk measures into their nested forms, which can be
 easily optimized via dynamic programming.
 14
nonlinear weighting functions, they do not satisfy some of these assumptions.
 We derive the dynamic programming equation for a class of non-convex reward
 measures (e.g., CPT-inspired reward measures). Our work has many parallels with
 that of Ruszczy?ski; however our goal is to generalize his approach to non-convex
 reward measures. Before we proceed, we will review some background material
 in controlled Markov processes, CPT, reward transition mappings, and dynamic
 reward measures.
 2.2 Background
 In the following sections, we use the following notations:
 ? (?)+ := max(0, ?); (?)? := ?min(0, ?);
 ? P(?): the set of probability measures defined on ?.
 2.2.1 Discrete-Time Markov Control Model
 We are interested in the case when the underlying system dynamics can be modeled
 as a discrete-time controlled Markov process. Let us first review the necessary
 technical background for our discussion. We restate the definition from [48] for
 the reader?s convenience. A Markov control model is a five-tuple, (X,A, {A(x)|x ?
 X}, Q, r), consisting of:
 ? a Polish space X, called the state space and whose elements are referred to as
 states;
 15
? a Polish space A, called the control or action set;
 ? a family {A(x) ? A|x ? X} of nonempty measurable subsets A(x) of A, where
 A(x) denotes the set of feasible controls or actions when the system is in state
 x ? X, and with the property that the set
 K := {(x, a)|x ? X, a ? A(x)} (2.2.1)
 of feasible state-action pairs is a measurable subset with respect to the product
 ?-algebra of X ? A (i.e., ?(X ? A));
 ? a stochastic kernel5 Q(?|x, a) on X, where (x, a) ? K;
 ? a measurable function r:K ? X ? R called the per-stage reward function.
 Remark 2. We can make A(x) and r time-varying, denoted by At(x) and rt, by
 considering the state space X? := X ? [0, . . . , T ].
 Polish spaces include finite-dimensional real spaces, which are important for many
 real-life applications (e.g. dynamic pricing). The following definition is useful for
 describing the set of feasible deterministic and randomized policies.
 Definition 1. We denote by F the set of all measurable functions f : X ? A
 such that f(x) ? A(x) for all x ? X. In addition, we let ? denote the set of all
 5A stochastic kernel on X given Y is a function P (?|?) such that
 1. P (?|y) is a probability measure on X for each fixed y ? Y ;
 2. P (B|?) is a measurable function on Y for each fixed B ? B(X).
 16
stochastic kernels ? in P(A|X), the set of probability measures on A given X, such
 that ?(A(x)|x) = 1 for every x ? X.
 We track a system?s history by doing the following: for each t = 0, 1, . . . ,
 define the space Ht of admissible histories up to time t as H0 := X, and
 Ht := Kt ? X = K ?Ht?1, ?t = 1, 2, . . . .
 The most general policies we investigate are randomized policies, which are defined
 below.
 Definition 2. A randomized policy is a sequence pi = {pit, t = 0, 1, . . . } of stochastic
 kernels pit ? P (A|Ht) satisfying the constraint
 pit(A(xt)|ht) = 1, ?xt ? X, ht ? Ht, t = 0, 1, . . . .
 The set of all randomized policies is denoted by ?.
 A special class of randomized polices is the class of randomized Markov poli-
 cies.
 Definition 3. A randomized policy, pi ? ?, is a randomized Markov policy if there
 exists a sequence of stochastic kernels ?t ? ? such that
 pit(?|ht) = ?t(?|xt) = 1
 ?ht ? Ht, t = 0, 1, . . . .
 17
The policy pi is a randomized stationary policy if there is a ? ? ? such that
 pit (?|ht) = ? (?|xt)
 ?ht ? Ht, t = 0, 1, . . . .
 We denote the sets of randomized Markov policies and randomized stationary
 polices by ?RM and ?RS, respectively.
 Furthermore, if there exists a sequence ft ? F such that ?t(?|xt) is the Dirac
 measure concentrated at f(xt) for all t = 0, 1, . . . , then pi is a deterministic Markov
 policy, and pit := ft ? F. We denote the sets of all deterministic Markov policies
 and deterministic stationary policies by ?DM and ?DS, respectively.
 By fixing a Markov control model, an initial probability distribution v (e.g.,
 a known initial state x0), and a randomized policy pi, we obtain the probability
 distribution evolution of a discrete-time Markov process. We denote the resulting
 discrete-time Markov process and action sequence by {xpit } and {apit } (i.e., apit is a
 random variable with probability distribution pit(?| {x0, . . . , xt})) respectively. For
 ease of notation, we drop the process?s dependence on its initial condition, as it is
 fixed unless stated otherwise. For the rest of the discussion, we are given a fixed
 Markov control model.
 2.2.2 Cumulative Prospect Theory (CPT)
 Before introducing cumulative prospect theory, we will first introduce a useful defi-
 nition.
 18
Definition 4. A good-news distribution, ?F , of a random variable is defined as
 ?F (x) := 1 ? F (x) ,
 where F (x) is the cumulative distribution function (CDF). Other names for this
 distribution are: survival distribution, complementary CDF and reliability distribu-
 tion.
 Remark 3. The above definition should be altered if we are given a minimization
 problem. In that case, a good-news distribution function should be the cumulative
 distribution function itself, because smaller values are more favorable.
 Another important element of CPT is probability weighting functions, which
 are defined below.
 Definition 5. A probability weighting function, w, is a continuous function from
 [0, 1] to [0, 1].
 Prospect theory was suggested in the 1970s by Kahneman and Tversky [59].
 They were unsatisfied with the theory and suggested its improved version, cumula-
 tive prospect theory (CPT), in the 1990s [83]. CPT asserts that the human decision
 making process can be modeled by a utility function with the following characteris-
 tics:
 ? The utility function has a reference point against which gains and losses are
 measured;
 19
? The utility function is concave on gains and convex on losses (i.e., horizontal
 S-shape);
 ? A probability weighting function that transforms the probability measure such
 that a small probability is inflated and a large probability is deflated. For
 example, a typical weighting function w : [0, 1] ? [0, 1] is
 w(y) : = y
 ?
 (y? + (1 ? y)?) 1? ,
 where ? ? (0, 1) and y is usually the good-news distribution. This function
 was originally presented in [83].
 Definition 6. Given a probability space (?,F ,P), and random variables R and B
 defined on it. A CPT performance measure has the following form:
 ?(R) : =
 ? ?
 0
 w+
 (
 P
 (
 u+
 ((R?B)+)> s))ds
 ?
 ? ?
 0
 w?
 (
 P
 (
 u
 ?
 ((R? B)
 ?
 )
 > s
 ))
 ds, (2.2.2)
 where w+ : [0, 1] ? [0, 1] and w? : [0, 1] ? [0, 1] are two continuous non-decreasing
 functions. u+ : R+ ? R+ and u? : R+ ? R+ are two utility functions. The random
 variable B represents the benchmark we measure the performance against.
 The weighting functions used in a CPT performance measure are required to be
 non-decreasing, which is not necessarily true for a probability weighting function.
 We apply a CPT-inspired measure to evaluate the expected outcome of a game of
 20
dice in the example below.
 Example 2. Cumulative Prospect Theory - Finite State Case with no Control:
 Consider a game of dice. You roll a die with six possible outcomes {1, 2, 3, 4, 5, 6}.
 If the outcome is even, you win an amount equal to the outcome; on the other hand,
 if the outcome is odd, then you lose an amount equal to the outcome. Thus, the
 payoffs are {-5, -3, -1, 2, 4, 6}. Furthermore, the payoffs are organized into gains and
 losses. The probability of gains is derived by assuming the die is fair and written
 as {0 : 12 ; 2 : 16 ; 4 : 16 ; 6 : 16}, which is read as the probability of winning 0 is 12 , the
 probability of winning 2 is 16 and so on. On the down side, a similar calculation
 leads to {0 : 12 ; ?1 : 16 ; ?3 : 16 ; ?5 : 16}. Since the initial state of the die does not
 matter in this case, the CPT expected value calculation is as follows:
 ?V = u+
 ((2)+)(w+(12) ? w+(13))+ u+((4)+)(w+(13) ? w+(16))
 + u+
 ((6)+)(w+(16) ? w+(0))? u?((?5)?)(w?(16) ? w?(0))
 ? u
 ?
 ((?3)
 ?
 )(
 w
 ?
 (13) ? w?(
 1
 6)
 )
 ? u
 ?
 ((?1)
 ?
 )(
 w
 ?
 (1
 2
 ) ? w
 ?
 (13)
 )
 = u+ (2)
 (
 w+(12) ? w+(
 1
 3)
 )
 + u+ (4)
 (
 w+(13) ? w+(
 1
 6)
 )
 + u+ (6)
 (
 w+(16) ? w+(0)
 )
 ? u
 ?
 (5)
 (
 w
 ?
 (16) ? w?(0)
 )
 ? u
 ?
 (3)
 (
 w
 ?
 (13) ? w?(
 1
 6)
 )
 ? u
 ?
 (1)
 (
 w
 ?
 (1
 2
 ) ? w
 ?
 (13)
 )
 .
 Here, we use different probability weighting functions for gains and losses,
 namely w+ : [0, 1] ? [0, 1] and w? : [0, 1] ? [0, 1]. The two functions, u+ : R+ ? R+
 and u
 ?
 : R+ ? R+, are two utility functions. In this example, the reference point
 21
is assumed to be zero.
 Remark 4. We presented in Eq. 2.2.2 the most general form of a CPT reward
 measure. In the sequel, we will study various special cases of this reward measure.
 For example, we are interested in the case when the rewards are strictly positive.
 2.2.3 Reward Transition Mappings
 In this section, we consider the Markov control model
 (X,A, {A(x)|x ? X} , Q, r) .
 The discrete-time Markov process resulting from applying the policy pi and the cor-
 responding action sequence are denoted by {xpit } and {apit }, respectively. A standard
 finite-horizon dynamic control problem has the following performance measure:
 max
 pi??
 E
 [
 T?1?
 t=0
 r(xpit , apit , xpit+1) + rT (xpiT )|x0
 ]
 , (2.2.3)
 where rT is a measurable terminal reward function. We would like to solve the
 optimization problem
 V ?T (x) := max
 pi??
 VT (x, pi).
 From [48], the corresponding dynamic programming equation is
 vt(x) = max
 ??P(A(x))
 ?
 X?A
 (r(x, a, y) + vt+1(y))Q(dy|x, a)?(da). (2.2.4)
 22
Remark 5. Under some assumptions (i.e., the existence of a measurable deterministic
 selector), ? ? P(A(x)) in Eq. 2.2.4 can be replaced by ? ? A(x) to reflect the fact
 that a deterministic optimal policy exists.
 The right-hand side of Eq. 2.2.4 is a function of the current state x, the reward
 function parameterized by the current state x (i.e., gx(a, y) := r(x, ?, ?) + vt+1(?) :
 A ? X ? R), the transition probability Q, and the randomized control ?. Taking
 one step further, we can define a function
 ?t (r(x, ?, ?) + vt+1(?), x, ? ?Q(?|x, ?)) :=
 ?
 X?A
 (r(x, a, y) + vt+1(y))Q(dy|x, a)?(da),
 and rewrite Eq. 2.2.4 as
 vt(x) = max
 ??P(A(x))
 ?t (r(x, ?, ?) + vt+1(?), x, ? ?Q(?|x, ?)) .
 The sequence, {?t, t = 0, . . . , T ? 1}, is called the reward transition mappings
 for Eq. 2.2.4. Before we can provide the definition for reward transition mappings,
 we need to define the term ? ?Q(?|x, ?) in the equation above.
 Definition 7. Given a fixed current state x ? X and a randomized action ? ? P(A),
 we denote the one-step state-action measure (see [18]) with respect to the Markov
 control model by:
 [? ?Qx](Ba ?By) :=
 ?
 Ba
 Q(By|x, a)?(da) By ? B(X) Ba ? B(A), (2.2.5)
 23
where Qx(a) := Q(?|x, a) : A ? P(X) is the stochastic kernel parameterized by
 x ? X.
 Remark 6. The one-step state-action measure is a measure over X ? A, which rep-
 resents the uncertainty over the next state and the current action (i.e., we are
 interested in randomized policies).
 We need to define the space that contains ? ?Qx, which was also mentioned in
 ?avu? & Ruszczy?ski [18]. Given a probability space (X ? A,B(X ? A), P0), where
 P0 is some reference probability measure, the space of p-integrable random variables
 is denoted by V := Lp(X ? A,B(X ? A), P0), p ? [1,?). Its dual space, V ?, is the
 space of signed measures on (X ? A,B(X ? A)), which are absolutely continuous
 with respect to P0 with their densities in Lq(X ? A,B(X ? A), P0), where q satisfies
 the equation 1p + 1q = 1. The reference measure, P0, should be chosen such that all
 possible measures of the form ? ? Qx are in V ?. In the special case of a finite state
 and control space, P0 can always be chosen to be uniform. We denote the set of all
 probability measures in V ? by:
 M := {m ? V ?|m(X ? A) = 1, m ? 0} .
 Remark 7. The measure defined by Eq. 2.2.5 is an element of M.
 The space V ? (and thus M) is endowed with the Prokhorov topology (weak
 convergence). For p ? [1,?) we will endow V with the strong (i.e., norm) topology.
 24
However, if p = ?, we will endow V with the topology induced by the form:
 ??,m? =
 ?
 X?A
 ?(x, a)m(dx, da), ? ? V , m ? V ?.
 Definition 8. A mapping ? : V ? X ? M ? R is a reward transition mapping if
 for every x ? X and every m ? M fixed (denote ?(?) := ?(?, x,m)), the following
 conditions are true:
 1) if ? ? ? then ?(?) ? ?(?), ??, ? ? V ;
 2) ?(??) = ??(?), ?? ? V , ? ? 0.
 This definition is more general than Definition 3.1 in [18]. Since CPT inspired
 performance measures are distorted by nonlinear probability weighting functions,
 they generally do not satisfy the convexity and translation invariant requirements
 satisfied by convex risk measures. By removing these two requirements, we are able
 to work with non-convex CPT inspired performance measures.
 2.2.4 Generalized Markov Dynamic Reward Measures
 The definition of dynamic reward measures varies based on the objective of the
 analysis. The dynamic risk measure community, for example [76], defines dynamic
 performance measures based on coherent risk measures. Since our goal is to define
 a class of dynamic performance measures that contains non-convex performance
 measures (e.g., CPT inspired reward measures), we need to modify the definitions
 used by the dynamic risk measure community. However, because we are maximizing
 rewards rather than minimizing risks, our definitions are defined by switching the
 25
direction of the analogous inequalities.
 Given a filtered probability space (?,F , {Ft}, P ) with F0 = {?, ?}, we define
 the spaces Lt = Lp(?,Ft, P ), p ? [1,?], t = 0, 1, . . . , T, and Lt,T = Lt ? ? ? ? ? LT .
 Remark 8. Given a Markov control model, ? can be thought of as HT and FT as
 ? (HT ) .
 Definition 9. A mapping ?t,T :Lt,T ? Lt, where 1 ? t ? T , is called a conditional
 reward measure, if it has the following monotonicity property:
 Z ? W implies ?t,T (Z) ? ?t,T (W ), ?Z,W ? Lt,T .
 Definition 9 was first presented by Ruszczy?ski in [76]. The inequality above
 is meant to be component-wise almost surely. Intuitively, the definition above says a
 conditional reward measure preserves the order of the rewards. Furthermore, taking
 R ? Lt,T to be a sequence of future rewards, ?t,T (R) gives the price, at time t, that
 one is willing to pay to obtain the payoff sequence R.
 Definition 10. A dynamic reward measure is a sequence of conditional reward
 measures {?t,T , t = 1, . . . , T}.
 In other words, a dynamic reward measure is a time-varying mapping that
 reflects the present value of a sequence of future rewards. It can be utilized as a
 performance measure in many real-life scenarios. One important concept in dynamic
 reward measures is time-consistency, which is defined below.
 26
Definition 11. A dynamic reward measure {?t,T}Tt=1 is called time-consistent if for
 all 1 ? ? < ? ? T and all sequences Z,W ? L?,T the conditions
 Zk = Wk, k = ?, . . . , ? ? 1 and ??,T (Z?, . . . , ZT ) ? ??,T (W?, . . . ,WT )
 imply that
 ??,T (Z? , . . . , ZT ) ? ??,T (W? , . . . ,WT ).
 In applications, a time-consistent dynamic reward measure can be more con-
 veniently represented by its corresponding sequence of one-step conditional reward
 measures, whose definition is given below.
 Definition 12. A mapping ?t : Lt+1 ? Lt is called a one-step conditional reward
 measure if
 ?t(Z) = ?t,t+1(0, Z), Z ? Lt+1.
 For this thesis, we are only interested in one-step conditional reward measures
 that satisfy the assumption below.
 Assumption 1. A one-step conditional reward measure satisfies the following con-
 ditions:
 1. If Z ? W then ?t(Z) ? ?t(W ), ?Z,W ? Lt+1;
 2. ?t(?Z) = ??t(Z), ?Z ? Lt+1, ? ? 0.
 Below are several one-step conditional rewards that satisfy Assumption 1.
 27
Example 3. The following reward measures are both convex reward measures.
 Mean-semideviation model:
 ?t(Zt+1) = E [Zt+1|Ft] + ?E[((Zt+1 ? E [Zt+1|Ft])+)r |Ft]1r .
 Here, r ? [1, p] and ? ? [0, 1] may be any Ft-measurable random variables.
 Another interesting example is the Average Conditional Value at Risk:
 ?t(Zt+1) = inf
 U?Lt
 {
 U + 1
 ?
 E
 [(Zt+1 ? U)+ |Ft]} ,
 where the infimum is point-wise, and ? is any Ft-measurable function with values
 in an interval [?min, ?max] ? (0, 1).
 The next example is an example of a non-convex reward measure in Zt+1.
 Example 4. Cumulative Prospect Theory:
 ?t(Zt+1) =
 ? ?
 0
 w+t
 (
 P
 (
 u+t ((Zt+1) +) > s
 ))
 ds?
 ? ?
 0
 w?t
 (
 P
 (
 u?t
 ((Zt+1)
 ?
 )
 > s
 ))
 ds,
 (2.2.6)
 where w+t , w+t , u?t , and u?t are Ft-measurable functions with values in the function
 spaces [0, 1] ? [0, 1], [0, 1] ? [0, 1], R+ ? R+ and R ? R+, respectively (see Eq.
 2.2.2), and P is an appropriate probability measure. Here, the benchmark random
 variable B in Eq. 2.2.2 is zero.
 Remark 9. The performance measures in Example 3 satisfy the convexity and trans-
 28
lation invariance assumptions in Ruszczy?ski?s work, whereas the performance mea-
 sure (i.e., Eq. 2.2.6) in Example 4 does not. Eq. 2.2.6 is the main motivation for
 us to generalize Ruszczy?ski?s approach.
 Applying a one-step conditional reward measure to a controlled Markov pro-
 cess, we ideally would like to obtain an optimal Markov policy. However, we cannot
 expect this to be true in general, because the one-step reward measure could depend
 on the past history of the underlying Markov process (i.e., ht). In order to over-
 come this difficulty, we follow Ruszczy?ski?s ([76]) definition of the one-step Markov
 conditional reward measure.
 2.2.4.1 Markov Conditional Reward Measures
 As we mentioned in the previous section, one-step conditional reward measures
 might not be Markov. However, if a one-step conditional reward has a corresponding
 reward transition mapping, then it only depends on the current state of the system,
 hence it is Markov. The following condition is important for the integrability of
 Markov conditional reward measures.
 Definition 13. A function g is said to be b-bounded if ?C > 0 and b : X ?
 [1,?), b ? V and
 |g(x, a, y)| ? C (b(x) + b(y)) , ?x ? X, a ? A(x), y ? X.
 We denote the function g(x, a, y) : X ? A ? X with the x argument parame-
 terized by gx : A ? X ? R (i.e., gx(a, y) := g(x, a, y)). In addition, the notation pit,x
 29
denotes the measure pit(?|x) ? P(A). We remind the reader that Qt,x, the transition
 probability at time t, is a mapping a ? Qt(?|x, a).
 We consider the filtered probability space (HT , ?(HT ),Ft,Ppi) , where Ft is
 the ?-field generated by the state-action trajectory (i.e., {xpi0 , api0 , . . . , xpit } ) of the
 controlled Markov process {xpit }. The space Lt in the definition below is defined
 with respect to the filtered probability space (HT , ?(HT ),Ft,Ppi) . More specifically,
 elements of Lt are functions of {xpi0 , api0 , . . . , xpit } .
 Definition 14. A one-step conditional reward measure ?t : Lt+1 ? Lt is a Markov
 reward measure with respect to a controlled Markov process {xpit } and its controls
 {apit }, if there exists a reward transition mapping ?t : V ?X?M ? R, such that for
 any b-bounded measurable functions g : X ? A ? X ? R, there is a feasible control
 pit : X ? P(A(x)) such that the following equation holds
 ?t
 (
 g(xpit , apit , xpit+1)
 )
 = ?t(gxpit , xpit , pit,xpit ?Qt,xt), a.s. (2.2.7)
 Remark 10. In the sequel, we use the term one-step Markov reward measure for
 both ?t and its corresponding ?t. Furthermore, the right-hand side of Eq. 2.2.7 can
 be thought of as a function parameterized by the current state xpit .
 Definition 15. A one-step conditional reward measure ?t is Markov, if ?t is a
 Markov reward measure with respect to all feasible controlled Markov processes
 and controls {{xpit } , {apit } |pi ? ?} and ?t is the same for all pi ? ?. Furthermore, a
 dynamic reward measure {?t} is Markov, if each of the one-step conditional reward
 measure ?t is Markov.
 30
In other words, if a conditional reward measure ?t is Markov, then we can
 replace it with its Markov counterpart ?t when calculating the reward at time t.
 2.3 Dynamic Programming
 2.3.1 Finite-Horizon
 Given a time-consistent dynamic reward measure {?t,T}T?1t=0 and its corresponding
 one-step dynamic reward measure {?t}T?1t=0 , we can write the corresponding value
 function starting at x0 with a control policy pi ? ? and the resulting state-action
 trajectory {xpi0 , api0 , . . . , xpiT} as:
 VT (x0, pi) = ?0 (r(xpi0 , api0 , xpi1 ) + ?1 (r(xpi1 , api1, xpi2 )
 +?2 (r(xpi2 , api2, xpi3 ) + ? ? ? +
 ?T?1
 (
 r(xpiT?1, apiT?1, xpiT ) + rT (xpiT )
 )
 ? ? ?
 )))
 .
 The equation above is obtained by applying Definition 3 and Theorem 1 in [76] 6.
 We are interested in optimization problems of the form:
 V ?T (x0) := max
 pi??
 VT (x0, pi). (2.3.1)
 In the rest of this section, we prove the optimality of the dynamic programming
 equation that solves this optimization problem. The state space X is extended with
 time variable (i.e., X ? [0, . . . T ]) to model the time-varying nature of the reward
 6The policies considered in [76] are deterministic, but we are considering randomized policies.
 31
functions rt, and action space constraint At(x). The extended state space is denoted
 by X?.
 Theorem 1. Assume the following conditions hold:
 1) ?x ? X, the stochastic kernels Qt,x : a ? Qt(?|x, a) are continuous;
 2) The one-step dynamic reward measure {?t}T?1t=0 is Markov (see Definition
 15), and there exists a sequence of corresponding reward transition mappings ?t :
 m ? ?(?, x,m), t = 0, . . . , T ? 1 that are upper semi-continuous;
 3) The functions {rt(?, ?, ?)}T?1t=0 are b-bounded, measurable, and a ? rt(?, a, ?)
 is upper semi-continuous;
 4) For every x ? X and t ? [0, . . . , T ? 1] the set At(x) is compact;
 5) The function rT (?) is b-bounded and measurable;
 Then a maximizer for the dynamic programing equation:
 vt(x) = max
 ??P(A(x))
 ?t (rt(x, ?, ?) + vt+1(?), x, ? ?Qt,x)
 vT (x) = rT (x) x ? X, t = 1, . . . , T ? 1, (2.3.2)
 exists. Furthermore, an optimal policy, pi? :=
 {
 pi?0, pi
 ?
 1, ? ? ? pi
 ?
 T?1
 }
 exists and each
 pi?t (x) is a maximizer for the right-hand side of Eq. 2.3.2 at time t for all x ? X; In
 addition, every measurable solution of Eq. 2.3.2 at time 0, v0, is an optimal solution
 for Eq. 2.3.1.
 Proof. Let pi denote an arbitrary randomized policy and {xpi0 , api0 , . . . , xpiT}, the re-
 sulting state-action trajectory of the controlled Markov process. We denote the
 32
reward-to-go function by :
 Rt(x, pi) : = ?t(r(xpit , apit , xpit+1)+
 ?t+1
 (
 r(xpit+1, apit+1, xpit+2) + ? ? ? +
 ?T?1
 (
 r(xpiT?1, apiT?1, xpiT ) + rT (xpiT )
 )
 ? ? ?
 ))
 RT (x, pi) : = rT (x).
 This is the total reward from time t onwards when the policy pi is applied at the
 initial state x. In particular, we know
 VT (x, pi) = R0(x, pi).
 We first prove that a solution to Eq. 2.3.2 exists. By assumption 1, Qt,x is a
 continuous stochastic kernel, which implies ? ? Qt,x : P (A) ? M is continuous in
 ?. Here, Qt,x : A ? P (X) is the stochastic kernel parameterized by x at time t (see
 Definition 7). By assumption 2, we know that ? ? ?(?, x, ? ? Qt,x) is upper semi-
 continuous in ?. Assumptions 3, 4, 5 imply that the set P(A(x)) is weakly-compact,
 hence a maximizer exists for Eq. 2.3.2.
 We denote an optimal policy by
 pi? =
 {
 pi?0, . . . , pi
 ?
 T?1
 }
 ,
 where pi?t (x) is a maximizer for Eq. 2.3.2 for all x ? X. We need to show that for
 33
t = 0, . . . T ,
 Rt(x, pi) ? vt(x), (2.3.3)
 and with equality if pi = pi?, i.e.,
 Rt(x, pi?) = vt(x). (2.3.4)
 In particular, if Eq. 2.3.3 is true, we have VT (x, pi) = R0(x, pi) ? v0(x) and
 VT (x, pi?) = R0(x, pi?) = v0(x), which prove the statement regarding v0(x) being
 the solution for the optimization problem stated in Eq. 2.3.1.
 We show Eq. 2.3.3 to be true by backward induction. We first note the fact
 that
 RT (x, pi) = vT (x) = rT (x).
 Assuming the induction hypothesis that for some t = T ? 1, . . . , 0,
 Rt+1(x, pi) ? vt+1(x), x ? X,
 the reward-to-go equation at time t satisfies the following inequalities:
 Rt(x, pi) = ?t(r(x, apit , xpit+1) + Rt+1(xpit+1, pi))
 ? ?t
 (
 r(x, apit , xpit+1) + vt+1(xpit+1)
 )
 ? max
 ??P(A(x))
 ?t (rt(x, ?, ?) + vt+1(?), x, ? ?Qx)
 := vt(x).
 34
The second line in the equation above is due to part 1 of Assumption 1 (i.e.,
 monotonicity) and the induction hypothesis. The third line in the equation above
 is true by the virtue of ?t being Markov (the second assumption of this theorem).
 This proves Eq. 2.3.3. If we assume Rt+1(x, pi?) ? vt+1(x), x ? X, then we conclude
 Rt(x, pi?) ? vt(x) using a similar induction argument as above, which proves Eq.
 2.3.4. It should be easy to see that RT?1(x, pi?) = vT?1(x), since pi? is in ? by
 definition. Repeating the same steps as above for T ? 2, T ? 3, . . . , 0, we obtain the
 desired result
 V ?T (x) = VT (x, pi?) = R0(x, pi?) = v0(x).
 2.3.1.1 Application: Cumulative Prospect Theory Measures
 We assume that we are given a one-step dynamic reward measure of the form in Eq.
 2.2.6, where u+(x) = x and u?(x) = x. We would like to evaluate the performance
 of the random variable ?x at each time t, assuming the dynamic reward measure is
 Markov, and the following reward transition mapping satisfies Eq. 2.2.7:
 ?t(?x, x,m) =
 ? ?
 0
 w+
 (
 m
 ((?x)+ > s))ds
 ?
 ? ?
 0
 w?
 (
 m
 ((?x)
 ?
 > s
 ))
 ds, (2.3.5)
 where ? is a B(K ? X)-measurable random variable (e.g., ? = r + v), and m ? M.
 We denote the function ?(x, ?, ?) by ?x ? Lt+1, which is a B(X ? A)-measurable
 35
random variable. w+ : [0, 1] ? [0, 1] and w? : [0, 1] ? [0, 1] are two continuous
 monotonically non-decreasing functions.
 We would like to apply Theorem 1 to Eq. 2.2.6, given the assumption that it
 is Markov, by proving that Eq. 2.3.5 is a reward transition mapping.
 Theorem 2. ?t defined by equation 2.3.5 is a reward transition mapping. Further-
 more, it is continuous in m.
 Proof. First we need to show that Eq. 2.3.5 satisfies the two properties in Definition
 14.
 1) prove: if ?x ? ?x then ?(?x) ? ?(?x), ??x, ?x ? V ;
 We need to break ?t into two parts,
 ? ?
 0 w
 +(m((?x)+ > s))ds and
 ? ?
 0 w
 ?
 (
 m
 ((?x)
 ?
 > s
 ))
 ds.
 We first look at ? ?0 w+
 (
 m
 ((?x)+ > s))ds. Since ?x ? ?x, we have
 m
 ((?x)+ > s)? m((?x)+ > s). Using the fact that w+ is a monotonically
 non-decreasing function,
 we have w+
 (
 m
 ((?x)+ > s))? w+(m((?x)+ > s)), which implies
 ? ?
 0
 w+
 (
 m
 ((?x)+ > s))ds ? ? ?
 0
 w+
 (
 m
 ((?x)+ > s))ds.
 Similarly, we have m
 ((?x)
 ?
 > s
 )
 ? m
 ((?x)
 ?
 > s
 )
 , using the fact the w? is a mono-
 tonically non-decreasing function, we have
 w?
 (
 m
 ((?x)
 ?
 > s
 ))
 ? w?
 (
 m
 ((?x)
 ?
 > s
 ))
 ,
 36
which implies
 ? ?
 0
 w?
 (
 m
 ((?x)
 ?
 > s
 ))
 ds ?
 ? ?
 0
 w?
 (
 m
 ((?x)
 ?
 > s
 ))
 ds.
 Conclusion 1 follows from the previous inequality.
 2) prove: ?(??x) = ??(?x), ??x ? V , ? ? 0;
 Since
 ? ?
 0
 w+
 (
 m
 ((??x)+ > s))ds? ? ?
 0
 w?
 (
 m
 ((??x)
 ?
 > s
 ))
 ds
 =? ?
 0
 w+
 (
 m
 (
 (?x)+ > s?
 ))
 ds?
 ? ?
 0
 w?
 (
 m
 (
 (?x)
 ?
 >
 s
 ?
 ))
 ds,
 we do a change of variable with z = s? . Rewriting the equation in terms of z we have
 ?
 (? ?
 0
 w+
 (
 m
 ((?x)+ > z))dz ? ? ?
 0
 w?
 (
 m
 ((?x)
 ?
 > z
 ))
 dz
 )
 .
 To prove the continuity of ?t, we explicitly prove that
 ? ?
 0
 w+
 (
 m
 ((??x)+ > s))ds
 is continuous in m, because the proof for the second part of the equation will follow
 similarly. We prove the continuity of ?t by appealing to the fact that the sum of
 two continuous functions is continuous.
 37
We denote the Prokhorov metric (see [16, Section 2.1]) by:
 d(?, ?) := inf {|?(A) ? ?(A) + , ?(A) ? ?(A) + 
 ?A ? B(X ? A)} .
 For the purpose of readability, we define the function
 f ?,??x (s) =
 ??w+
 (
 ?
 ((?x)+ > s))? w+(?((?x)+ > s))?? ,
 and its associated sets
 B?1 =
 {
 s : s ? [0,M ], f ?,??x (s) ? ?1
 }
 , and
 ?B?1 =
 {
 s : s ? [0,M ], f ?,??x (s) > ?1
 }
 .
 Since the total reward is the sum of a finite number (i.e., finite-horizon) of
 per-stage rewards, it is bounded by M ? R.
 Given an arbitrary  > 0, we need to find a ?1 such that it satisfies the following
 38
equations:
 ????
 ? M
 0
 w+
 (
 ?
 ((?x)+ > s))ds? ? M
 0
 w+
 (
 ?
 ((?x)+ > s))ds????
 ?
 ? M
 0
 ??w+
 (
 ?
 ((?x)+ > s))? w+(?((?x)+ > s))?? ds ?? M
 0
 1B?1?1ds+
 ? M
 0
 1
 ?B?1
 ds =
 ?1 ?M ?
 ? M
 0
 1B?1
 M
 ds+M ?
 ? M
 0
 1
 ?B?1
 M
 ds =
 ?1 ?M ?
 ? M
 0
 1B?1
 M
 ds+M ?
 (
 1 ?
 ? M
 0
 1B?1
 M
 ds
 )
 ?
 2 ? ?1 ?M = .
 By letting 1 > ?1 = 2M > 0, the last line in the equation above holds.
 Our goal is to prove that given an arbitrary ?1, there always exists a ? ? ?1
 such that all ? in the ?-neighborhood of ? satisfy the following
 ? M
 0
 1B?1
 M
 ds ? 1 ? ?1,
 which implies
 ?1 ?M ?
 ? M
 0
 1B?1
 M
 ds+M ?
 (
 1 ?
 ? M
 0
 1B?1
 M
 ds
 )
 ? .
 39
Since w+ is continuous, for any ?1 > 0 there exists a ?2 such that
 ???
 ((?x)+ > s)? ?((?x)+ > s)?? ? ?2 =?
 ??w+
 (
 ?
 ((?x)+ > s))? w+(?((?x)+ > s))?? ? ?1.
 Hence, for any ?1 > 0 we can always find a ?2 such that
 ? M
 0
 1|?((?x)+>s)??((?x)+>s)|??2
 M
 ds ? 1 ? ?1 =?? M
 0
 1B?1
 M
 ds ? 1 ? ?1. (2.3.6)
 From the Markov inequality, we have
 ? M
 0
 1|?((?x)+>s)??((?x)+>s)|??2
 1
 M
 ds ?
 1 ?
 ?M
 0
 ???
 ((?x)+ > s)? ?((?x)+ > s)?? 1M ds
 ?2
 . (2.3.7)
 Next, we need to find a ? such that the following equations hold:
 ? M
 0
 ???
 ((?x)+ > s)? ?((?x)+ > s)?? 1Mds ?? M
 0
 d(v, ?)2 1
 M
 ds ? ?2 = ?1 ? ?2.
 Finally, letting ? := +??1 ? ?2, it is true that for any v in the ?-neighborhood of ?,
 40
the following equations hold:
 ? M
 0
 ???
 ((?x)+ > s)? ?((?x)+ > s)?? 1Mds ? ?1 ? ?2
 =? 1 ?
 ?M
 0
 ???
 ((?x)+ > s)? ?((?x)+ > s)?? 1M ds
 ?2
 ?
 1 ?
 ?1 ? ?2
 ?2
 = 1 ? ?1
 =?
 ? M
 0
 1|?((?x)+>s)??((?x)+>s)|??2
 1
 M
 ds ? 1 ? ?1
 =?
 ? M
 0
 1B?1
 M
 ds ? 1 ? ?1 (Eq. 2.3.6).
 The second implication is due to Eq. 2.3.7 and the third implication is due to Eq.
 2.3.6. The second assertion of the theorem is proved.
 Below is an example where we use a Markov CPT dynamic reward mea-
 sure.
 Example 5. The following example attempts to explain why people become en-
 trepreneurs. We assume a person could be in several states {poor, middle, upper-
 middle, super-rich}. If one decides to become an entrepreneur, one has the following
 transition probability matrix.
 poor middle upper-middle super-rich
 poor .999 0 0 .001
 middle .999 0 0 .001
 upper-middle .999 0 0 .001
 super-rich .001 0 0 .999
 Table 2.3.1: Transition probability matrix for becoming an entrepreneur
 One could also choose to pursuit a normal job with the following transition
 41
probabilities.
 poor middle upper-middle super-rich
 poor 0 1 0 0
 middle 0 0 1 0
 upper-middle 0 0 1 0
 supper-rich 0 0 0 1
 Table 2.3.2: Transition probability matrix for taking a normal job
 The action space is {entrepreneur (E), normal (N)}. We define the random
 variable xt to represent the current state of the controlled Markov process:
 xt(?) =
 ?
 ??????????????
 ??????????????
 1 ? = poor
 2 ? = middle
 3 ? = upper-middle
 4 ? = super-rich
 .
 In this example, the per-stage reward function is given as:
 r(x, a) :=
 ?
 ??????????????
 ??????????????
 x? 1/x x ? 3 and a = E
 x x ? 3 and a = N
 100 ? 1/x x > 3 and a = E
 100 x > 3 and a = N
 ,
 42
and the terminal reward function is:
 r2(x) :=
 ?
 ????
 ????
 x x ? 3
 100 x > 3
 .
 We want to solve the optimization problem stated in Eq. 2.3.1 given a dynamic
 reward measure of the form in Eq. 2.2.67:
 ?t(Zt+1) =
 ? ?
 0
 w+(P ((Zt+1|xt) > s))ds,
 with the nonlinear weighting function w+(F ):=e??(?ln(F ))? , where 0 < ? < 1 and
 ? > 0. For the purpose of this example, we take ? to be 0.9 and ? to be 0.5. Since
 we are only dealing with positive rewards in this example, w? and u
 ?
 need not be
 given. Furthermore, we note that the dynamic reward measure is Markov and has
 a sequence of transition mappings ?t of the form:
 ?t(r + vt+1, xt, ? ?Qxt) =? ?
 0
 w+ (? ?Qxt (r + vt+1(xt+1) > s)) ds.
 The table below shows the value function at times 0 and 1 by applying the
 7For simplicity, we dropped u+ and u?.
 43
dynamic programming equation
 vt(x) = max
 pE?[0,1]
 {? ?
 0
 w+ (Qx,E (r(x,E) + vt+1(xt+1) > s) pE
 +Qx,N (r(x,N) + vt+1(xt+1) > s) (1 ? pE)) ds} ,
 where v2 = r2 and the variable pE is the probability of becoming an entrepreneur.
 Time=1
 x1 pE v1(x1)
 poor 0.850471 7.1229
 middle 0.787238 8.81843
 upper-middle 0.808817 9.91617
 super-rich 0 200
 Time = 0
 x0 pE v0(x0)
 poor 0.919031 18.6786
 middle 0.886583 20.3504
 upper-middle 0.896001 21.4663
 super-rich 0 300
 Table 2.3.3: An optimal solution for Ex. 5 (a value function and an optimal policy)
 at time 0 and 1.
 Since Table 2.3.3 above shows the likelihood of becoming an entrepreneur
 is higher if one is younger, it agrees with our intuition that one should pursue
 entrepreneurship while still young. For example, an individual, starting out poor,
 should be entrepreneurial almost 92% of the time; On the other hand, if the same
 individual is a year older, he or she should only be entrepreneurial 85% of the time.
 This result also agrees with our tendency to become more risk-averse as we grow
 older.
 Our approach yields an optimal randomized policy, which is different from
 44
the standard approach (see Eq. 2.2.3), where an optimal solution is deterministic.
 Non-convex reward measures are useful for modeling many real-life problems. More
 specifically, CPT-inspired reward measures are derived from experimental data and
 have been proven to model several key characteristics of human behavior well. In
 this section, we proved the optimality of dynamic programming equations for the
 optimization problem described by Eq. 2.3.1. In addition, we provided a numerical
 example demonstrating the intuitiveness of the optimal policies obtained. In the
 next section, we will apply dynamic programming to infinite-horizon MDPs with
 non-convex reward measures.
 2.3.2 Discounted Infinite-Horizon
 As in the finite-horizon case, we assume that we are given a time-consistent dynamic
 reward measure {?t,?}?t=0 and its corresponding time-invariant one-step dynamic
 reward measure {?}. Here, ? does not depend on t anymore. From Definition 3 and
 Theorem 1 in [76], we write the corresponding value function starting at x0 with a
 control policy pi and the resulting state-action trajectory {xpi0 , api0 , . . . , xpiT} as:
 V (x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 )
 +?? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 ?? (r(xpi?, api?, xpi?)) ? ? ? ))) .
 45
We would like to consider the following optimization problem with ? ? (0, 1):
 V ? (x0) := max
 pi??
 V (x0, pi). (2.3.8)
 In this section, we are interested in the case when r : K ? X ? R is bounded, i.e.,
 ?r? ? R+ such that |r| ? r?. We assume r to be a non-positive valued function.
 The non-negative case can be argued by symmetry. We denote the t-stage-reward
 function resulting from applying a policy pi by:
 Jt(x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 )
 +?? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 ??(r(xpit?1, apit?1, xpit ))? ? ?))). (2.3.9)
 Since r ? 0 and Vt ? Vt?1 ? 0 , by the Monotone Convergence Theorem, we may
 write
 V (x0, pi) = lim
 t??
 Jt (x0, pi) , ?pi ? ?.
 From Section 2.3.1, we know the solution to Eq. 2.3.9 can be obtained by iterating
 the following equation:
 vt (x) := max
 ??P(A(x))
 ? (r(x, ?, ?) + ?vt?1(?), x, ? ?Qx) .
 In other words,
 vt (x) = max
 ?
 Jt (pi, x) , ?x ? X.
 46
The equation above is the backward form of the finite-horizon dynamic pro-
 gramming equation; this is different from the forward equation we used in the pre-
 vious section. In the finite-horizon case, we are able to prove the optimality of the
 dynamic programming equation (i.e., Eq. 2.3.2) by backward induction; this ap-
 proach will not suffice in the infinite-horizon case. In the infinite-horizon case, we
 need to appeal to the Banach fixed-point theorem to prove the existence of a mea-
 surable function v? such that v? = Tv?. Lastly, we need to prove that the solution
 v? is indeed equal to
 V ? (x) := max
 ?
 V (x, pi) , ?x ? X.
 We used [48, 49] as the main technical references for the proofs below.
 Definition 16. Let M (X)? denote the cone of non-positive measurable functions
 on X. For every v ? M (X)?, Tv is defined as a mapping from X, i.e.,
 Tv (x) := max
 ??P(A(x))
 ? (rx + ?v, x, ? ?Qx) , ?x ? X,
 where rx is the reward function with x held fixed, i.e., rx := r (x, ?, ?) .
 The existence of a measurable selector8 is important in proving the optimality
 of the dynamic programming equation.
 Lemma 1. Assuming the function
 ? (rx + ?v, x, ? ?Qx)
 8See appendix for the definition of a measurable selector.
 47
is upper semi-continuous (u.s.c) in ?, r is a non-positive valued function, and
 P (A (x)) is compact-valued, then T maps M (X)?into itself, i.e., for every v in
 M (X)?, Tv is also in M (X)?, and moreover, there exists a measurable selector
 ? : X ? P (A) with ? (x) ? P (A (x)) such that
 ? (rx + ?v, x, ? (x) ?Qx) = max
 ??P(A(x))
 ? (rx + ?v, x, ? ?Qx) , ?x ? X.
 Proof. This follows from Proposition 6 in the appendix.
 The lemma above is important for ensuring the value function is measurable.
 The reader might notice that ? can be used to construct a stationary policy pi =
 {?, ?, . . . } , which will be used to prove the optimality of stationary Markov polices.
 The following Lemma is used in the upcoming theorem.
 Lemma 2. If u ? M (X)? is such that u ? Tu, and r is a non-positive valued
 function, then u ? V ?.
 Proof. Assuming u ? Tu and using Lemma 1, we write the following inequality:
 u (x) ? ? (r (x, ?, ?) + ?u, x, ? (x) ?Qx) .
 48
Iterating this inequality, we obtain
 u (x) ? ? (r(x, api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 )
 + ?? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 ??(r(xpin?1, apin?1, xpin) + ?u (xpin))? ? ?))), ?n ? 1, x ? X, (2.3.10)
 where pi = {?, ?, . . . }. In the inequality above, we used the short-hand notation
 ? (r + ?u) := ? (r + ?u, x, ? (x) ?Qx) ,
 and {xpit } is the resulting process from applying the policy pi. Applying the fact that
 ?u (xpin) ? 0
 to Eq. 2.3.10, we conclude the following inequality:
 u (x) ? ? (r(x, api0 , xpi1 ) + ?? (r(xpi1 , api1 , xpi2 )
 + ?? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 ??(r(xpin?1, apin?1, xpin))? ? ?))), ?n ? 1, x ? X.
 By letting n ? ?, the inequality above yields
 u (x) ? V (x, pi) ? V ? (x) ?x ? X.
 49
Theorem 3. Assume the following conditions hold:
 1) The stochastic kernels Qx : a ? Q(?|x, a) are continuous ?x ? X;
 2) The one-step dynamic reward measure ? is Markov (see Definition 15), and
 a sequence of corresponding reward transition mappings ? : m ? ?(?, x,m) is upper
 semi-continuous;
 3) The function r(?, ?, ?) is bounded, measurable, and a ? r(?, a, ?) is upper
 semi-continuous in a;
 4) For every x ? X the set A(x) is compact;
 5) ? ? (0, 1) .
 Then a maximizer for the dynamic programing equation
 v(x) = max
 ??P(A(x))
 ? (r(x, ?, ?) + ?v(?), x, ? ?Qx) ?x ? X, (2.3.11)
 exists. Furthermore, an optimal policy, pi? := {??, ?? ? ? ? } exists and each ?? is
 a maximizer for the right-hand side of Eq. 2.3.11. In addition, every bounded
 measurable solution of Eq. 2.3.11 is an optimal solution for Eq. 2.3.8.
 Proof. Since A (x) is compact for every x, we know P (A (x)) is also compact-valued.
 We want to show that the operator
 Tv := max
 ??P(A(x))
 ? (r + ?v, x, ? ?Qx)
 is a contraction on the space of bounded measurable functions endowed with the
 50
supremum norm
 ?v? := sup
 X
 |v| .
 We first prove that a solution to Eq. 2.3.11 exists. By assumption 1, Qx is
 a continuous stochastic kernel, which implies ? ? Qx : P (A) ? M is continuous
 in ?. Here, Qx : A ? P (X) is the stochastic kernel parameterized by x at time
 t (see Definition 7). By assumption 2, we know that ? ? ?(?, x, ? ? Qx) is upper
 semi-continuous in ?. Assumptions 3 and 4 imply that the set P(A(x)) is weakly-
 compact; hence a maximizer exists for Eq. 2.3.11. This also proves the existence of
 pi? := {??, ??, ? ? ? }.
 Next, the mapping T : M (X)? ? M (X)? satisfies:
 Tv : = max
 ??P(A(x))
 ? (r + ?v, x, ? ?Qx)
 = ? (r + ?v, x, ? (x) ?Qx)
 = ? (r + ? (v? + (v ? v?)) , x, ? (x) ?Qx)
 ? ? (r + ?v?, x, ? (x) ?Qx) + ? sup
 X
 |v ? v?|
 ? max
 ??P(A(x))
 ? (r + ?v?, x, ? ?Qx) + ? sup
 X
 |v ? v?|
 = Tv? + ? sup
 X
 |v ? v?| =?
 Tv ? Tv? ? ? sup
 X
 |v ? v?| ?x ? X =?
 sup
 X
 |Tv ? Tv?| ? ? sup
 X
 |v ? v?| ,
 where ? is an optimal measurable selector and its existence is ensured by Lemma 1.
 51
Hence, by appealing to Banach?s Fixed-Point Theorem for contraction map-
 pings and assumptions 3 and 5 of this theorem, we conclude there exists a unique
 function v? ? V such that Tv? = v?.
 Finally, we need to prove that v? is a measurable solution to Eq. 2.3.8. We
 need to prove this fact in two steps: v? ? V ? and v? ? V ?. From the fact that
 Tv? = v? and Lemma 2 we conclude that v? ? V ?. To prove the inequality in the
 other direction, we know from the finite-horizon case with the reward-to-go function
 denoted by Rn, we have
 vn ? Rn (x, pi) ? V (x, pi) , ?n ? {0, 1, 2, . . . } , ?n, ?pi ? ?, ?x ? X,
 which implies
 vn ? Rn (x, pi?) ? V ? (x) ?n, ?x ? X,
 where pi? = {??, ??, . . . } is constructed using the maximizer, ??, of Eq. 2.3.11. The
 operator T is monotone, i.e., if u and u? are functions in M (X)? and u ? u?, then
 Tu ? Tu?. Since v0 := 0 and vn := Tvn?1 for n ? 1, vn form a non-increasing
 sequence in M (X)? converging to some function v? ? M (X)? . Since vn ? v? due to
 the monotone convergence theorem, and vn ? V ?, we conclude that v? ? V ?. The
 desired conclusion is reached given the fact the policy pi? = {??, ??, . . . } ? ?.
 If the per-stage reward is both positive and negative, we defer it to the later
 transient case; as the discounted infinite-horizon problem can be rewritten into a
 transient problem by adding an absorbing state. The requirement that the reward
 52
function, r, be a bounded measurable function can be relaxed in the previous the-
 orem. Of course, the proof for the relaxed case will be different. Due to space
 limitations, we do not explore alternatives with the relaxed assumption on the re-
 ward function here. For the interested reader, the proof for the existence of an
 optimal policy with the relaxed assumption on the reward function can be found in
 [48] for the standard expected value measure, which can be adapted for our case.
 We present a numerical example, similar to the finite-horizon case, which will
 demonstrate the type of policies expected from the CPT-based reward measures.
 In the example below, we also compare an optimal solution of the expected value
 (standard reward measures) with that of the CPT-based reward measures.
 Example 6. As in the finite-horizon case of Example 5, its infinite discounted coun-
 terpart tries to explain why people become entrepreneurs. The major difference here
 is we are no longer given a terminal reward function. The transition probabilities
 and the per-stage reward function used for this numerical example can be found
 from Example 5.
 We calculate the discounted infinite-horizon counterpart with the nonlinear
 weighting function w+(F ):=e??(?ln(F ))? , where 0 < ? < 1 and ? > 0. For the
 purpose of this example, we take ? to be 0.9 and ? to be 0.5. Furthermore, we
 assume the discount factor, ?, to be 0.5.
 The tables below summarize our numerical results. We notice that the value
 function given by the CPT measures is higher than that of the expected value. In
 addition, using CPT measures will yield a more risk-seeking optimal policy.
 53
x0 P(entrepreneur) v(x0)
 poor 0 3.49999
 middle 0 4.99999
 upper-middle 0 5.99999
 super-rich 0 200
 (a) Expected Value
 x0 P(entrepreneur) v(x0)
 poor 0.868488 11.5553
 middle 0.861585 13.0742
 upper-middle 0.881884 14.1858
 super-rich 0 200
 (b) CPT Expected Value
 Table 2.3.4: An optimal solution for Ex. 6 (a value function and an optimal policy)
 .
 Table 2.3.4 is similar to Table 2.3.3 in the finite-horizon case in the sense that
 they both produce randomized polices.
 This example suggests that a middle-class person should be the least likely to
 pursue entrepreneurship. On the other hand, an upper-middle class person is mostly
 likely to start his/her own business. However, the difference in the probability of
 entrepreneurship for the states poor, middle and upper-middle is very small, which
 suggests in the long run we should all be entrepreneurial regardless of our current
 state unless one is already super-rich. Of course, in practice we need to calibrate
 the underlying Markov model with empirical data.
 In the next section, we will examine the suitability of the dynamic program-
 ming method for the transient Markov control model case.
 54
2.3.3 Transient Markov Control Model
 In this section, we prove the optimality of the dynamic programming equation for
 transient Markov control models. Our main technical references are [18], [55] and
 [49]. Transient Markov control models require further specification in addition to
 the definitions provided in Section 2.2.1. Before we introduce the definition of a
 transient Markov control model, we need to define a few notations first. Given a
 norm weight function w : X ? [1,?), the w weighted-norm is denoted by ???w . It
 is calculated for a substochastic kernel A as:
 ?A?w = sup
 X
 1
 w (x)
 ?
 X
 w (y)A (dy|x) .
 Similarly, a w-norm can be defined for a measurable function v : X ? R as:
 ?v?w = sup
 x?X
 |v (x)|
 w (x) .
 It is the standard operator norm in the space Bw (X,B (X)), of measurable functions
 v such that ?v?w < ?. The reader can refer to [48] for a more complete discussion
 on weighted norms. At this point, the reader may be confused by the three functions
 w, w+, and w?. The first function is used in defining a weighted norm, and the
 latter two functions are used in CPT-based measures. The function used should be
 clear from the context.
 Assumption 2. The function w ? V (i.e., the integrable function space) is fixed
 55
with respect to the given Markov control model such that
 ?Q??w < ?, ?? ? ?.
 Furthermore, the per-stage reward function r(?, ?, ?) is measurable, w-bounded,
 i.e., there is a constant r? ? 0 such that
 sup
 A(x)
 |r (x, a, x?)| ? r?w (x?) , ?x, x? ? X,
 and r : a ? r(?, a, ?) is upper semi-continuous in a.
 The assumption above is assumed to hold throughout this section. A transient
 Markov model has some absorbing state xA ? X, such that Q ({xA} |xA, a) = 1 and
 r (xA, a, xA) = 0 for all a ? A (x). In other words, once an absorbing state is reached,
 no further rewards will be given. In addition, a transient Markov model reaches its
 absorbing state in finite amount of time, i.e.,
 sup
 ?,X
 E [? pi0 |x] < ?, where ? pi0 := inf {t ? |xpit = xA} .
 Without loss of generality, we assume the model only has one absorbing state,
 because the case of multiple absorbing states can be easily reduced to the single
 absorbing state case. We introduce some additional notations for clarity. We denote
 the effective state space by ?X = X \ {xA}, and the effective controlled substochastic
 kernel by ?Q. The substochastic kernel ?Q restricts its arguments to only allow the
 56
effective states (i.e., ?Q (B|x, a) = Q (B|x, a) , ?B ? B
 (
 ?X
 )
 , ?x ? ?X, ?a ? A (x)).
 We introduce the definition of a transient Markov control model below.
 Definition 17. A randomized Markov policy pi ? ?RM is transient with respect to
 a Markov control model, if there exists a constant k and a weight function w : X ?
 [1,?) such that
 ?????
 ??
 t=0
 ?Qtpi
 ?????
 w
 ? k, (2.3.12)
 where Qtpi := Q0Q1 . . . Qt?1 and Q0pi (?|x) := ?x (?) . If the inequality above is uniform
 for all Markov policies, then the model is called uniformly transient (i.e., Eq. 2.3.12
 is true for all pi ? ?RM ).
 Since we are working with stationary transition probabilities (i.e., Q1 = Q2),
 Eq. 2.3.12 implies that Q
 (
 ?X|x, a
 )
 ? 1 for all x ? X and a ? A (x). Eq. 2.3.12 can
 also be written as:
 ?????
 ??
 t=0
 ?Qtpi
 ?????
 w
 = sup
 X
 w (x)?1
 ??
 t=0
 ?
 X
 w (y) ?Qtpi (dy|x)
 = sup
 X
 w (x)?1
 ??
 t=0
 E [w (xpit ) |x] .
 Hence, we can infer from Eq. 2.3.12 that E [w (xpit ) |x] ? 0 as t ? ?. Eq. 2.3.12
 is also known as the Pliska condition [69]. One major contribution of ?avu? and
 Ruszczy?ski [18] is to suggest a generalized version of the Pliska condition for co-
 herent risk measures (i.e., convexity). Since we are interested in non-convex perfor-
 mance measures, we take a different approach to prove the optimality of dynamic
 programming equations.
 57
We are interested in solving a more general version of the standard expected
 total reward problem, which searches for an optimal policy that maximizes the
 following expected value:
 E
 [
 ??
 t=0
 r (xt, at, xt+1)
 ]
 .
 We know from [48] that we can apply dynamic programming to this problem and
 obtain an optimal stationary deterministic policy. In this section, we would like to
 explore the risk-sensitive version of this problem, especially when the conditional
 reward function ? is not convex. Our goal is to prove that dynamic programming
 can still be applied to the non-convex risk-sensitive version of the expected total
 reward problem.
 We are interested in finding the maximum of a total reward function of the
 form:
 V (x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ? (r(xpi1 , api1 , xpi2 )
 +? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 ? (r(xpi?, api?, xpi?)) ? ? ? ))) .
 The corresponding optimization problem can be written as:
 V ? (x0) := max
 pi??
 V (x0, pi). (2.3.13)
 Without loss of generality, we restrict ourselves to randomized Markov policies (see
 58
[49], Theorem 9.4.5), i.e.,
 V ? (x0) = max
 pi??RM
 V (x0, pi) . (2.3.14)
 To solve Eq. 2.3.14, we start by finding an optimal solution for the simpler case of
 randomized stationary policies, i.e.,
 V ? (x0) := max
 pi??RS
 V (x0, pi). (2.3.15)
 Later, the sufficiency of randomized stationary Markov polices is proven, i.e., the
 left-hand sides of Equ 2.3.14 and Eq. 2.3.15 are equivalent. We denote the reward-
 to-go function at time t by:
 Rt(x0, pi) = ?(r(xpit , apit , xpit+1) + ?(r(xpit+1, apit+1, xpit+2)
 +?
 (
 r(xpit+2, apit+2, xpit+3) + ? ? ? +
 ? (r(xpi?, api?, xpi?)) ? ? ? ))) ,
 and the t-stage total reward function by:
 Jt(x0, pi) = ? (r(xpi0 , api0 , xpi1 ) + ? (r(xpi1 , api1 , xpi2 )
 +? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 ?
 (
 r(xpit , apit , xpit+1)
 )
 ? ? ?
 )))
 .
 59
We denote the optimal t-stage-reward value function by
 J?t (x) := max
 pi??
 Jt (x, pi) ,
 and define the operator T (?) on Bw (X,B (X)) as
 T (?) v (x) := ? (rx + v, x, ? (x) ?Qx) .
 In addition, the dynamic programming operator T is denoted by
 Tv := max
 ??P(A(x))
 ? (rx + v, x, ? ?Qx) .
 Since ? is monotone, the operator T (?) is also monotone, i.e.,
 v ? v? =? T (?) v ? T (?) v?, ?v, v? ? Bw (X,B (X)) , ? ? ?.
 Given a Markov policy pi = {?t} ? ? and T (pi)0 = I, then for k = 1, 2, . . . , the
 iterated operator T k (pi) on Bw (X,B (X)) is defined by
 T k (pi) := T (?0)T (?1) ? ? ?T (?k?1) .
 We denote the total-reward function with respect to a policy pi as T ? ? by
 V (x0, pi) = lim
 T??
 JT (x0, pi) ?x ? ?X.
 60
For the rest of this section, we make the following assumption.
 Assumption 3. The following conditions hold.
 1. There exists a k ? 1 such that T k (pi) is a contraction mapping for all
 transient stationary polices pi; i.e.,
 ?? < 1 s.t.
 ??T k (pi) v ? T k (pi) v???
 w
 ? ? ?v ? v??w
 ?v1, v2 ? Bw (X,B (X)) , ? transient pi ? ?RS;
 2. The reward transition mapping ? : ? ? ? (?, x,m) is continuous;
 3. V ? ? Bw (X,B (X)) ;
 4. The Markov control model is uniformly transient.
 Condition 4 in Assumption 3 can be relaxed (see [55]) at the expense of addi-
 tional assumptions. From condition 3 in Assumption 3, we can trivially deduce the
 following lemma.
 Lemma 3. If Assumption 3(3) holds, then Jt (pi, x) and J?t (x) are both w-bounded
 for all x ? X, pi ? ?, t = 1, 2, . . . .
 Proof. Proof by contradiction: If Jt (pi, x) and J?t are not w-bounded, then V ? /?
 Bw (X,B (X)). This contradicts Assumption 3(3).
 The lemma above justifies for writing Jt (pi, ?) and J?t (?) as arguments of ? (?) .
 Assumption 3(1) ensures the convergence of the operator T k , which is stated in the
 following lemma.
 61
Lemma 4. For any transient stationary policy pi ? ?RS, if Assumption 3(1) holds,
 then for any v ? Bw (X,B (X))
 lim
 k??
 T k (pi) v (x) = V (x, pi) = lim
 k??
 Jk (x, pi) ?x ? X.
 Proof. Using Assumption 3(1) and the Banach fixed-point theorem, noting the fact
 that
 V (x, pi) = lim
 k??
 T k (pi) 0 < ?,
 the proof follows.
 The following theorem proves the optimality criteria for Eq. 2.3.15.
 Theorem 4. Let Assumptions 2 and 3 hold. For a transient Markov controlled
 model, a Markov reward transition mapping ? (?, ?, ?), and a randomized stationary
 Markov policy pi = {?, ?, . . . }, a bounded measurable function v : ?X ? R (i.e.,
 ?v?w < ?) satisfies the equation
 v (x) = ? (rx + v, x, ? (x) ?Qx) , x ? ?X
 v (xA) = 0, (2.3.16)
 if and only if v (x) = V (pi, x) for all x ? X.
 Proof. Let v (?) be a bounded measurable solution of Eq. 2.3.16. Since ?v?w < ?
 and w ? Bw (X,B (X)), we know that v ? Bw (X,B (X)). By assumption, r (?, ?, ?) is
 w-bounded, and thus rx ? Bw (X,B (X)). Consequently, the right-hand side of Eq.
 62
2.3.16 is well defined and can be iterated, which results in the following equation
 v(x0) = ? (r(xpi0 , api0 , xpi1 ) + ? (r(xpi1 , api1 , xpi2 )
 +? (r(xpi2 , api2 , xpi3 ) + ? ? ? +
 v (xT+1) ? ? ? ))) ?x0 ? ?X.
 Since v (?) is a w-bounded function, we conclude by evoking Lemma 4 and taking k
 to infinity that
 v (x) = lim
 k??
 T k (pi) v (x) = V (pi, x) ?x ? ?X
 v (xA) = 0 = V (pi, xA) .
 The converse is proved by writing down the equation:
 JT (pi, x0) = ? (r (x0, ? (x0) , x1) + JT?1 (pi, x0)) .
 Taking the limit as T ? ? on both sides, we arrive at the following equation:
 lim
 T??
 JT (pi, x0) = lim
 T??
 ? (r (x0, ? (x0) , x1) + JT?1 (pi, x0)) .
 Since ? (?) is continuous by assumption, we conclude that
 lim
 T??
 JT (pi, x0) = ?
 (
 r (x0, ? (x0) , x1) + lim
 T??
 JT?1 (pi, x0)
 )
 .
 63
Using the fact that
 lim
 T??
 JT (pi, x0) = V (pi, x0) = v (x0) , ?x0 ? ?X
 as T ? ?, we rewrite the previous equation as:
 v (x0) = ? (r (x0, ? (x0) , xpi1 ) + v (xpi1 )) = ? (rx0 + v, x0, ? (x0) ?Qx0) , ?x0 ? ?X,
 which is the same as Eq. 2.3.16. Furthermore, V (pi, xA) = 0 = v (xA) by definition.
 Theorem 5. Assume the following conditions hold for a uniformly transient Markov
 control model:
 1) The stochastic kernels Qx : a ? Q(?|x, a) are continuous ?x ? X ;
 2) The one-step dynamic reward measure ? is Markov (see Definition 15), and
 a sequence of corresponding reward transition mappings ? : m ? ?(?, x,m) is upper
 semi-continuous;
 3) The assumptions in Theorem 4 are satisfied;
 4) For every x ? X the set A(x) is compact;
 Then a maximizer for the dynamic programing equation
 v(x) = max
 ??P(A(x))
 ? (r(x, ?, ?) + v(?), x, ? ?Qx) ?x ? ?X
 v (xA) = 0, (2.3.17)
 64
exists. Furthermore, an optimal randomized stationary policy, pi? := {??, ??, ? ? ? }
 exists and each ?? is a maximizer for the right-hand side of Eq. 2.3.17; In addition,
 a bounded measurable function v, (i.e., ?v?w < ?) is a solution of Eq. 2.3.17 if and
 only if it equals V ? in Eq. 2.3.15, i.e., v (x) = V ? (x) , ?x ? X.
 Proof. Since the set of all policy sequences of the form pi = {?, pi, pi, . . . } contains
 ?RM , we write down the inequality
 V ? (x0) ? sup
 ??P(A(x0)),pi??RM
 ? (r (x0, a0, x1) + V (pi, x1)) ,
 where V ? is defined by Eq. 2.3.14. Because ? is monotone, we move the supremum
 operator inside:
 V ? (x0) ? sup
 ??P(A(x0))
 ?
 (
 r (x0, a0x1) + sup
 pi??RM
 V (pi, x1)
 )
 ? sup
 ??P(A(x0))
 ?
 (
 r (x0, a0, x1) + V ? (x1)).
 By Assumption 3(3), i.e.,
 ??V ?
 ??
 w
 < ?, the right-hand side is well defined. Thus V ?
 satisfies the inequality
 V ? (x) ? sup
 ??P(A(x))
 ?
 (
 rx + V ?, x, ? ?Qx
 )
 , x ? X. (2.3.18)
 Since the existence of a solution for Eq. 2.3.17 is assured by the semi-continuity
 of the mapping ? : ? ? ? (rx + v, x, ? ?Qx) and the weak compactness of the set
 65
P (A (x)), we conclude that
 V ? (x) ? ?(rx + V ?, x, ?? (x) ?Qx), x ? X.
 Here, ?? is a solution to the optimization problem represented by the right-hand
 side of Eq. 2.3.18. By iterating the inequality above, appealing to the monotonicity
 property of ?, and applying the policy pi? = {??, ??, . . . } , we obtain the fact that
 V ? (x0) ? V (pi?, x0) .
 Since by assumption V ? (?) is the optimal value function, V ? (?) ? V (pi?, ?), which
 along with the previous inequality, imply V ? (?) = V (pi?, ?) . Using Theorem 4, we
 conclude V ? (?) satisfies the dynamic programming equation.
 To prove the converse, we first suppose v (?) satisfies Eq. 2.3.17, and ?v?w < ?.
 Since the mapping ? : ? ? ? (rx + v, x, ? ?Qx) is continuous and the set P (A (x))
 is weakly compact, an optimal control function, ??, exists. Furthermore, ?? is the
 maximizer for the right-hand size of the dynamic programing equation. This enables
 us to write
 v (x) = ?
 (
 rx + v, x, ?? (x) ?Qx
 )
 , x ? X. (2.3.19)
 Using Theorem 4, we conclude that
 v (x) = V (pi?, x) ? V ? (x) , x ? X, (2.3.20)
 66
where pi? =
 {
 ??, ??, . . .
 }
 . On the other hand, it follows from Eq. 2.3.17 that the
 control function ?? satisfies
 v (x) ? ?
 (
 rx + v, x, ?? (x) ?Qx
 )
 , x ? ?X.
 Using the monotonicity property of ?, we iterate the above inequality and arrive at
 v (x) ? ?0,T (0, Z1, . . . , ZT + v (xT )) ,
 where Zt is the reward sequence resulting from applying the policy pi?. By taking
 T ? ? for the equation above, we conclude that
 v (x) ? V (pi?, x) = V ? (x) , x ? ?X.
 The last inequality, together with Eq. 2.3.20 and the fact that v (?) = V ? (?) im-
 ply the stationary policy ?? that satisfies Eq. 2.3.19 is optimal, i.e., V (pi?, x) =
 V (pi?, x) = V ? (x) , ?x ? X. In addition, we know v (xA) = V ? (xA) = 0 from the
 definition of transition Markov model.
 In the theorem above, we provide the optimality criteria for the case of ran-
 domized stationary Markov policies. Next, we prove the sufficiency of randomized
 stationary Markov policies as optimal policies.
 Theorem 6. Assume the assumptions in the Theorem 5 are satisfied. Then a w-
 bounded measurable function v : X ? R, with ?v?w < ?, satisfies Eq. 2.3.17 if and
 67
only if v (x) = V ? (x) for all x ? X. Moreover, a maximizer ?? exists for Eq. 2.3.17
 and defines an optimal randomized stationary Markov policy pi? = {??, ??, ??, . . . } .
 Proof. We denote a Markov policy by pi1 = {?1, ?2, . . . } . Given the monotonicity
 and continuity of ? (?), we have
 V ? (x0) = sup
 ?0,?1,...
 lim sup
 T??
 ?
 (
 r (x0, a0, x1) + JT?1(pi1, x1))
 ? sup
 ?0,?1,...
 lim sup
 T??
 ?
 (
 r (x0, a0, x1) + sup
 ??T?1
 J?
 (
 pi1, x1
 ))
 = sup
 ?0,?1,...
 lim
 T??
 ?
 (
 r (x0, a0, x1) + sup
 ??T?1
 J?
 (
 pi1, x1
 ))
 = sup
 ?0,?1,...
 ?
 (
 r (x0, a0, x1) + lim sup
 T??
 JT?1
 (
 pi1, x1
 ))
 = sup
 ?0,?1,...
 ?
 (
 r (x0, a0, x1) + V (pi1, x1)).
 By appealing to the monotonicity property of ? (?), we can move the supremum
 inside the argument?
 V ? (x0) ? sup
 ?0
 ?
 (
 r (x0, a0, x1) + sup
 ?1
 V
 (
 pi1, x1
 ))
 = sup
 ?1
 ? (r (x0, a0, x1) + V ? (x2)) .
 Thus V ? (?) satisfies the inequality
 V ? (x) ? sup
 ??P(A(x))
 ? (rx + V ?, x, ? ?Qx) , x ? X. (2.3.21)
 Appealing to the monotonicity property of ?, iterating the above inequality and
 68
letting ?? be the maximizer from the equation above we conclude that
 V ? (x) ? V (pi?, x) , x ? X,
 where pi? = {??, ??, . . . } is a stationary Markov policy that maximizes Eq. 2.3.21.
 Therefore, optimization with respect to stationary Markov policies is sufficient, and
 the result follows from Theorem 5.
 We need to ensure the first three conditions in Assumption 3 are satisfied by
 all CPT-inspired reward measures, which have the form:
 ?t(rx + v, x, ? (x) ?Qx) =
 ? ?
 0
 w+
 (? (x) ?Qx((rx + v)+ > s))ds
 ?
 ? ?
 0
 w?
 (? (x) ?Qx((rx + v)
 ?
 > s
 ))
 ds. (2.3.22)
 With this specific form, we can write down the operator T with respect to a
 transient stationary policy pi = {?, ?, . . . } as:
 T (pi) v (x) : =
 ? ?
 0
 w+
 (
 ? (x) ? ?Qx((rx + v)+ > s))ds
 ?
 ? ?
 0
 w?
 (
 ? (x) ? ?Qx((rx + v)
 ?
 > s
 ))
 ds
 =
 ? ?
 0
 (rx + v)+ d
 ((
 ? (x) ? ?Qx
 )w+,rx+v)
 ?
 ? ?
 0
 (rx + v)
 ?
 d
 ((
 ? (x) ? ?Qx
 )w?,rx+v)
 .
 The second equality is due to the fact that the transformed measures
 (
 ? (x) ? ?Qx
 )w+,rx+v
 69
and
 (
 ? (x) ? ?Qx
 )w?,rx+v are absolutely continuous with respect to ? (x) ? ?Qx; hence
 a Radon-Nikodym derivative exists. The lemma below will be used in Theorem 7.
 Lemma 5. If a uniformly transient Markov model is given, i.e.,
 lim
 k??
 ????
 ?(
 pik (x) ? ?Qx
 )k????
 w
 ? ?1, ?pi = {pi1, pi2, . . . } ? ?RM ,
 then there exists a ?k > 0 such that
 ?????
 ((
 pik (x) ? ?Qx
 )w?,v)k?????
 w
 < 1, ?k ? ?k, ?pi = {pi1, pi2, . . . } ? ?RM , v ? V .
 Proof. Note that if
 lim
 k??
 ????
 ?(
 pik (x) ? ?Qx
 )k????
 w
 is finite, then there exists a ?k such that for all k ? ?k
 ?????
 ((
 pik (x) ? ?Qx
 )w?,v)k?????
 w
 < 1.
 Since w? (p) equal to 1 if and only if p = 1, the assertion follows.
 A similar statement (Lemma 5) can be made about
 ?????
 ((
 pik (x) ? ?Qx
 )w+,v)k?????
 w
 .
 Theorem 7. Suppose the Markov control model is uniformly transient, and the
 following assumptions are satisfied:
 70
1) w+ and w? are continuous non-decreasing functions;
 2) There is a constant r? such that
 sup
 A(x)
 |r (x, a, x?)| ? r?w (x?) , ?x, x? ? X;
 Then Eq. 2.3.22 satisfies Assumption 3.
 Proof. 1) Letting pi = {?, ?, . . . }, we know the following
 ???T (pi)k v1 ? T (pi)k v2
 ???
 w
 ?
 ?????
 ? ?
 ??
 (v1 ? v2) d
 (((
 ? (x) ? ?Qx
 )w+,v1)k
 +
 ((
 ? (x) ? ?Qx
 )w?,v1)k)?????
 w
 +k ?v1 ? v2?w + ?k ?v1 ? v2?w
 ?
 (?????((? (x) ? ?Qx)w+,v1)k?????
 w
 +
 ?????
 ((
 ? (x) ? ?Qx
 )w?,v1)k?????
 w
 + k + ?k
 )
 ? ? ? ?v1 ? v2?w , (2.3.23)
 where
 (?????((? (x) ? ?Qx)w+,v1)k?????
 w
 +
 ?????
 ((
 ? (x) ? ?Qx
 )w?,v1)k?????
 w
 + k + ?k
 )
 can be
 chosen to be less than 1. The appropriate k is found by appealing to Lemma 5.
 Since the Markov control model is uniformly transient (i.e., the probability weight
 on the non-absorbing states decreases as k increases), k and ?k can be made ar-
 bitrarily small as k ? ?. Here, ?k captures the difference between the distorted
 measures, of the the previous finite number of per-stage rewards, induced by v1
 and v2. As k increases, the measure distortions induced by v1 and v2 on the initial
 steps disappear. The first inequality in Eq. 2.3.23 is due to the fact that the trans-
 formed measures
 ((
 ? (x) ? ?Qx
 )w+,v1)k and ((? (x) ? ?Qx)w?,v1)k are absolutely
 continuous with respect to
 (
 ? (x) ? ?Qx
 )k
 ; hence a Radon-Nikodym derivative ex-
 71
ists. Furthermore, we used the fact that
 ?
 f1 (x1) dP (x1) ?
 ?
 f2 (x1) dQ (x1)
 =
 ?
 (f1 ? f2) (x1)dP (x1) +
 ?
 f2(x1) (dP (x1) ? dQ (x1)) ,
 where f1 and f2 are two w-bounded measurable functions, and P and Q are two
 ?-finite measures. In Eq. 2.3.23, k represents the difference between the v1 and
 v2 distorted measures (i.e., k = ?v2?w?v1?v2?w ?P ?Q?w, where P is distorted by v1 and
 Q is distorted by v2). In other words, k captures the difference in the distorted
 measures at the k-th stage induced by v1 and v2.
 2) To prove condition 2, we appeal to the continuity property of w+ and w?,
 which implies the following inequalities:
 ??t(z1, x,m) ? ?t(z2, x,m)?w
 =
 ????
 ? ?
 0
 w+
 (
 m
 ((z1)+ > s))? w+(m((z2)+ > s))ds
 ?
 (? ?
 0
 w?
 (
 m
 ((z1)
 ?
 > s
 ))
 ? w?
 (
 m
 ((z2)
 ?
 > s
 ))
 ds
 )????
 w
 ? 
 2
 +
 
 2
 = ,
 where for brevity the measure ? (x) ? ?Qx is denoted by m.
 From the continuity property of the functions w+ and w?, we know that there
 72
exist ?1 and ?2 such that
 ??m
 ((z1)+ > s)?m((z2)+ > s)??w ? ?1
 =?
 ??w+
 (
 m
 ((z1)+ > s))? w+(m((z2)+ > s))??w ? 2
 and
 ??m
 ((z1)
 ?
 > s
 )
 ?m
 ((z2)
 ?
 > s
 )??
 w
 ? ?2
 =?
 ??w+
 (
 m
 ((z1)
 ?
 > s
 ))
 ? w+
 (
 m
 ((z2)
 ?
 > s
 ))??
 w
 ? 
 2
 .
 Since
 ?z1 ? z2?w ? 0,
 implies
 ??m
 ((z1)+ > s)?m((z2)+ > s)??w ? 0
 and
 ??m
 ((z1)
 ?
 > s
 )
 ?m
 ((z2)
 ?
 > s
 )??
 w
 ? 0,
 we conclude that there exists a ?3 such that
 ?z1 ? z2?w < ?3
 ??m
 ((z1)+ > s)?m((z2)+ > s)??w ? ?1
 ??m
 ((z1)
 ?
 > s
 )
 ?m
 ((z2)
 ?
 > s
 )??
 w
 ? ?2.
 73
3) Now, we prove condition 3 of Assumption 3. Since
 sup
 A(x)
 |r (x, a, x?)| ? r?w (x?) , ?x, x? ? X,
 and the Markov control model is assumed to be uniformly transient, we can show, by
 induction, that V (pi, x) is also w-bounded for any transient policy pi = {?, ?, . . . } ?
 ?RM . We start by writing down the one-stage reward function:
 ?
 (
 |rx| , x, ? ? ?Qx
 )
 =
 ? ?
 0
 |rx| d
 ((
 ? ? ?Qx
 )w+,|rx|)
 ? r?
 ? ?
 0
 wd
 ((
 ? ? ?Qx
 )w+,|rx|)
 .
 The two-stage reward function is written as
 ?
 (
 |rx| + ?
 (
 |rx1 | , x1, ? ? ?Qx1
 )
 , x, ? ? ?Qx
 )
 ? r?
 (? ?
 0
 wd
 ((
 ? ? ?Qx
 )w+,|rx|)
 +
 ? ?
 0
 wd
 ((
 ? ? ?Qx
 )w+,|rx|+|rx|)2)
 .
 By iterating the inequality above and appealing to Lemma 5, we arrive at the
 conclusion that
 V (pi, ?) ? r?kw (?) ,
 where the k is found in Eq. 2.3.12.
 In the next section, we present a numerical example to explore the structure
 of optimal policies for the transient Markov case.
 74
2.3.4 The Organ Transplant Example: A Comparative Analysis
 We will compare numerically the type of polices obtained from CPT-based measures
 against some of the other risk-sensitive approaches.
 Example 7. The following example is from [18], which is a simplified version of
 the organ transplant problem discussed in [1]. The problem considers the discrete-
 time absorbing Markov chain depicted in Fig. 2.3.1a. The initial state S (i.e., sick)
 represents a patient waiting on an organ transplant due to sickness. The state L
 (i.e., live) represents the state where the patient lives after a successful transplant.
 The state D, an absorbing state, represents death. There are two possible actions to
 take in state S: 1) one can wait (W), in which case the next state could either be D
 or S probabilistically; 2) one can choose to transplant (T), which concentrates the
 transition probability on states L and D (i.e., states L and D are the only two possible
 next states). The probability of death is lower for W than for T, but a successful
 transplant may result in a longer life. In other two states, only the action continue
 is allowed. The reward collected at each time step is months of life. In state S, a
 reward equal to 1 is collected if the control is W; otherwise, the immediate reward
 is 0. In state L, the reward r(L) is collected representing the certainty equivalent of
 the random length of life after the transplant. In state D the reward is 0.
 The states where there is only one possible action allowed have a deterministic
 reward function (i.e., L and D). In particular, the equivalent length of life at the
 state L is r (L) . However, this value is generated by taking on certain assumptions,
 which are the focus of the following discussion. The state L is in fact an aggregation
 75
(a) Organ Transplant Transitions (b) The Survival Model
 Figure 2.3.1: Organ Transplant State Transitions & Rewards
 Optimum
 20 40 60 80 100
 p wait
 - 400
 - 350
 - 300
 - 250
 Value
 (a) Semideviation [18]
 Optimum
 20 40 60 80 100
 p wait
 - 150
 - 100
 - 50
 50
 100
 Value
 (b) CPT-based
 Figure 2.3.2: Optimal Policy Comparison of the Organ Transplant Example
 of n states in a survival model representing months of life after the transplant, as
 depicted in Fig. 2.3.1b. At the state i, i = 1, . . . , n, the patient dies with probability
 pi and survives with probability 1 ? pi. The patient will die for sure in the state
 n (i.e., pn = 1). The reward collected at each state i is equal to 1. In ?avu? and
 Ruszczy?ski [18], the problem is stated as a minimization problem. However, we
 desire a maximization problem, thus we compare our results to that of ?avu? and
 Ruszczy?ski?s [18] by negating the rewards.
 In [18], r (L) is calculated from the survival model using the transition mapping
 76
of the form:
 ? (?, i,m) = Em [?]
 ? ?? ?
 expected value
 +?E
 [(?? Em [?])+]
 ? ?? ?
 semideviation
 . (2.3.24)
 In Eq. 2.3.24, the measure m is the transition kernel at the current state i,
 and the function ? (?) is the reward collected at the current state and action plus the
 value function at the next state (i.e., cost-to-go). At each state i = 1, . . . , n?1, two
 transitions are possible: 1) transition to the state D with probability pi and ? = ?1;
 2) transition to the state i + 1 with probability 1 ? pi and ? = ?1 + vi+1 (i+ 1) .
 At the state i = n, the transition to D occurs with probability 1, and ? = ?1.
 Therefore, vn (n) = ?1.
 The survival problem is now a finite-horizon problem, which can be expressed
 as in Eq. 2.3.1. Since there is only one action allowed, the minimization operation
 is eliminated. The equation has the form:
 vi (i) = ? (?, i, Qi) , i = 1, . . . , n? 1,
 with ? and Qi being the reward and the transition probability respectively. The
 values of ? and Qi are explained in the previous paragraph. By induction, vi (i) ? 0,
 for i = n? 1, n? 2, . . . , 1.
 The mean and semideviation components of Eq. 2.3.24 at the states i =
 77
1, . . . , n? 1 can be calculated as follows:
 EQi [?] = ?pi + (1 ? pi) (?1 + vi+1 (i+ 1)) = ?1 + (1 ? pi) vi+1 (i+ 1) ,
 EQi
 [(?? EQi [?])+]= EQi [(?+ 1 ? (1 ? pi) vi+1 (i+ 1))]
 = pi (?1 + 1 ? (1 ? pi) vi+1 (i+ 1))+
 + (1 ? pi) (?1 + vi+1 (i+ 1) + 1 ? (1 ? pi) vi+1 (i+ 1))+
 = pi (? (1 ? pi) vi+1 (i+ 1))+ + (1 ? pi) (pivi+1 (i+ 1))+
 = ?pi (1 ? pi) vi+1 (i+ 1) ,
 where the last equality in the equation above is implied by the fact that vi+1 (i+ 1) ?
 0.
 For i = 1, . . . , n? 1, the dynamic programming equation for the optimization
 problem stated in Eq. 2.3.1 takes the form:
 vi (i) = ?1 + (1 ? pi) vi+1 (i+ 1)
 ? ?? ?
 expected value
 ?? pi (1 ? pi) vi+1 (i+ 1)
 ? ?? ?
 semideviation
 , i = n? 1, n? 2, . . . , 1.
 The value v (1) is the negative of the risk-adjusted length of life with the new
 transplanted organ. For ? = 0, the above formula gives the negative of the expected
 length of life with the new organ. In the calculations below, we use the transition
 data from Table 2.3.5. They have been chosen for illustrative purposes only, and do
 not correspond to any real-life medical data. For the survival model, the distribution
 function, F (x), of lifetime of the American population is suggested by Jasiulewicz
 78
Action S L D
 W 0.99882 0 0.00118
 T 0 0.90782 0.09218
 Table 2.3.5: Transition Probabilities From State S
 Distribution Parameters Weights
 Weibull ? = 0.297, ? = 0.225 w1 = 0.0170
 Lognormal m = 3.11, ? = 0.218 w2 = 0.0092
 Gompertz b = 0.0000812, ? = 0.0844 w3 = 0.9737
 Table 2.3.6: Organ transplant example: parameters for F (x).
 [58]. It is a mixture of Weibull, lognormal, and Gompertz distribution:
 F (x) := w1
 (
 1 ? e?(x? )?
 )
 + w2?
 (logx?m
 ?
 )
 + w3
 (
 1 ? e?
 b
 a
 (e?x?1)
 )
 , x ? 0.
 The values of the parameters and weights, provided by Jasiulewicz [58], are given
 in Table 2.3.6.
 Using the information provided above, the probability of dying in the k-th
 month can be calculated using the equation:
 pk =
 F
 (k
 12 +
 1
 24
 )
 ? F
 (k
 12 ?
 1
 24
 )
 1 ? F
 (k
 12 ?
 1
 24
 ) , k = 1, 2, . . . .
 The maximum lifetime of the patient is assumed to be 1200 months, and the
 post-transplant survival probabilities for the patient starts from k = 300. Hence, a
 total of 900 steps, n=900, is used in the survival model to calculate r (L) .
 If we let ? = (?W , ?T ) be a randomized policy in the state S and let ? =
 {? ? R2 : ?W + ?T = 1, ? ? 0}, then the dynamic programming equation at the
 79
state S has the form:
 v (S) = min
 ???
 {?W [qS,S (W ) (v (S) ? 1) + qS,D (W ) (v (D) ? 1)]
 +?T [qS,L (T ) v (L) + qS,D (T ) v (D)]
 +?
 (
 ?W
 [
 qS,S (W ) (v (S) ? 1 ? ?)+ + qS,D (W ) (v (D) ? 1 ? ?)+
 ]
 +?T
 [
 qS,L (T ) (v (L) ? ?)+ + qS,D (T ) (v (D) ? ?)+
 ])}
 .
 Here, ? = 1. If ? is held fixed in the equation above, then we can solve for
 v (S) . By varying ? ? (0, 1), we obtain Fig. 2.3.2a. We can compute the value
 function of the CPT-based reward measure as follows:
 v (S) = max
 ???
 ? ?
 0
 w+
 (
 ?W
 (
 qS,S (W ) 1{(v (S) + 1 ? ?)+ > s}
 +qS,D (W ) 1{(v (D) + 1 ? ?)+ > s})
 +?T
 (
 qS,L (T ) 1{(v (L) ? ?)+ > s}
 +qS,D (T ) 1{(v (D) ? ?)+ > s}))ds
 ?
 ? ?
 0
 w
 ?
 (
 ?W
 (
 qS,S (W ) 1{(v (S) + 1 ? ?)
 ?
 > s
 }
 +qS,D (W ) 1{(v (D) + 1 ? ?)
 ?
 > s
 })
 +?T
 (
 qS,L (T ) 1{(v (L) ? ?)
 ?
 > s
 }
 +qS,D (T ) 1{(v (D) ? ?)
 ?
 > s
 }))
 ds.
 In the equation above, w+ (?) = w? (?) = exp (?0.5 (? ln (?))) , and ? is the
 expected value without probability weighting. The numerical results of the three
 80
Method r(L) Optimum Value Optimal ?W
 Expected Value 610.46 846.611 1.000000
 Semideviation 515.33 426.139 0.987236
 CPT 702.32 104.438 0.868232
 Table 2.3.7: Organ Transplant Optimal Value and Policy Comparison
 solution methods (i.e., expected value, semideviation, and CPT) 9 are listed in
 Table 2.3.7. Furthermore, the value functions of the semideviation and the CPT
 performance measures are plotted in Fig. 2.3.2.
 We calculated r (L) for the CPT method using the following equation:
 ? (?, i,m) =
 ? ?
 0
 w+(m (?+ > s))ds.
 As is evident from Table 2.3.7, the CPT performance measure produces a more
 randomized optimal policy than the other two approaches. The ?W value of 0.99 is
 very close to the deterministic policy of W (i.e., to wait). In fact, obtaining a very
 randomized policy is difficult using semideviation. The ease with which the CPT
 performance measure is able to obtain an optimal randomized policy can be ex-
 plained by the fact that the probability weighting function is applied to the control.
 Intuitively, the need for randomized policies stems from the nonlinear transforma-
 tion of the uncertainty in the system, which renders deterministic optimal policies
 insufficient.
 9In the table, some values is negated to be positive for the purpose of comparison.
 81
2.4 Reward Measures and Optimal Policies
 From the previous sections, we have learned that we can solve finite-horizon non-
 convex optimization problems with reward functions of the form:
 Vt(x0, pi) = ?(rt(xpit , apit , xpit+1) + ?(rt+1(xpit+1, apit+1, xpit+2)
 + ?
 (
 rt+2(xpit+2, apit+2, xpit+3) + ? ? ? +
 ?
 (
 rT?1(xpiT?1, apiT?1, xpiT ) + rT (xpiT )
 )
 ? ? ?
 )))
 ,
 where Vt is known as the reward-to-go function. The equation is analogous to the
 expanded form of the standard expected value measure:
 ?Vt (xt, pi) = E[rt(xt, apit , xpit+1)+ E[rt+1(xpit+1, apit+1, xpit+2)
 + E
 [
 rt+2
 (
 xpit+2, a
 pi
 t+2, x
 pi
 t+3
 )
 + ? ? ? +
 E
 [
 rT?1
 (
 xpiT?1, a
 pi
 T?1, x
 pi
 T
 )
 + rT (xpiT ) |xpiT?1
 ]
 ? ? ? |xpit+2
 ]|xpit+1]xt].
 One of the advantages of the standard expected value measure is that it can be
 written more compactly as
 ?Vt (x0, pi) = E
 [
 T?1?
 i=t
 ri
 (
 xpii , a
 pi
 i , x
 pi
 i+1
 )
 + rT (xpiT )
 ]
 .
 We would like to write down a similar simplified counterpart in the CPT
 Markov conditional reward measure case for Vt. In other words, we would like to
 82
study the reward measures:
 Vt (xt, pi) =
 ? ?
 0
 wt (pit (xt) ?Qxt,t (rt (xt, apit , xt+1) + Vt+1 (xt+1, pi)) > s) dst,
 where VT (xT , pi) = rT (xT ) . (2.4.1)
 For simplicity, we only consider bounded non-negative rewards (i.e., rt ? 0
 and rt ? M, M > 0 ?t ? 0). The inclusion of negative rewards is a straightforward
 exercise. We can see that Eq. 2.4.1 is a complicated sequence of nested integrals.
 We would like to simplify this expression by introducing some new notations.
 We note that the transformed measure on the measurable space (X ? A,B (X ? A))
 Pw,? (? > s) := w (P (? > s)) (2.4.2)
 is absolutely continuous with respect to P (? > s) . By the Radon-Nikodym theorem,
 there exists a measurable function such that
 Pw,? (B) :=
 ?
 B
 dPw,?
 dP dP,
 where dPw,?dP is a Radon-Nikodym derivative.
 Using Radon-Nikodym derivatives of the form dPw,?dP , we can rewrite Eq.2.4.1
 83
as
 Vt(xt, pi) :=
 ?
 X?A
 rt (xt, apit , xt+1) +
 ?
 X?A
 rt+1
 (
 xpit+1, a
 pi
 t+1, x
 pi
 t+1
 )
 ? ? ?
 d
 (
 pi (xt+1, t+ 1) ?Qxt+1,t+1
 )
 wt+1,rt+1+vt+2
 dpi (xt+1, t+ 1) ?Qxt+1,t+1 dpi (xt+1, t+ 1) ?Qxt+1,t+1
 d (pi (xt, t) ?Qxt,t)wt,rt+vt+1
 dpi (xt, t) ?Qxt,t dpi (xt, t) ?Qxt,t,
 where vt is recursively defined as
 vt (x) =
 ? ?
 0
 wt (pit (x) ?Qx,t (rt + vt+1 > s)) ds
 vT (x) = rT (x) .
 In the equation above, the mapping rt+vt+1 : X?A?X ? R is used to transform the
 probability measure pit (x) ?Qx,t, which is a probability measure on the probability
 space (X ? A,B (X ? A)). According to Eq. 2.4.2, the function used to transform
 the measure pit (x) ? Qx,t should be a B (X ? A)-measurable function. It is obvious
 that rt + vt+1 is a B (X ? A)-measurable function if x is held fixed (i.e., rt + vt+1 is
 treated as rt (x, ?) + v (?) in the equation above). In the sequel, whether or not x is
 held fixed for the reward function rt + vt+1 should be obvious from the context.
 Using the Radon-Nikodym derivative notation, Vt can be written more com-
 pactly as:
 84
Vt (xt, pi) = Epit
 [(
 T?1?
 i=t
 rt
 (
 xpii , a
 pi
 i , x
 pi
 i+1
 )
 + rT (xpiT )
 )
 T?1?
 i=t
 d (pi (xi, i) ?Qxi,i) wi,ri+vi+1
 dpi (xi, i) ?Qxi,i
 ]
 vt (x) =
 ? ?
 0
 wt (pit (x) ?Qx,t (rt + vt+1 > s)) ds
 vT (x) = rT (x) . (2.4.3)
 Now, we can easily see that the difficulty of solving the optimization problem stated
 in Eq. 2.3.1 is due to the appearance of the value functions {vi}T?1t in the calculation
 of Vt. The following proposition aggregates several fundamental properties of Eq.
 2.4.3.
 Proposition 1. The value function, Vt, in Eq. 2.4.3 has the following properties:
 1) As sup
 x
 |wt (x) ? x| ? 0 ?t ? [0, T ], a solution pi?? for the standard (i.e.,
 risk-neutral) optimization problem:
 max
 pi
 ?Vt (xt, pi) ,
 also solves the optimization problem:
 max
 pi
 Vt (xt, pi) ;
 2) If wt is such that it puts all weights on the highest possible reward value, then an
 85
optimal policy for the optimization problem:
 max
 pi
 Vt (xt, pi)
 is obtained by considering only deterministic policies:
 pi?t (x) = arg max
 a?A(x)
 rt(x, a) + vt+1 (xt+1) .
 Proof. The proof for (1) is trivial since ?Vt (xt, pi) ? Vt (xt, pi) point-wise when
 sup
 x
 |wt (x) ? x| ? 0 ?t ? [0, T ].
 The proof for (2) is also straightforward. By placing all probability weights on the
 highest reward value, it is always optimal to pick the deterministic action with the
 highest reward value.
 Remark 11. When there exists an action a? such that
 Qx,t (?x,a? > s|x, a?) ? Qx,t (?x,a > s|x, a)
 ?s ? [0,?), ?a ? A (x) , ?x ? X, ?t ? [0, T ],
 where ?x,a denotes the reward function with x and a fixed, then there exists a deter-
 ministic optimal policy (i.e., the policy that takes action a? for state x at time t). De-
 terministic optimal policies is obtained trivially when the complement CDF of an ac-
 tion dominates all other complement CDFs (i.e., If 1?Fa1 (x) ? 1?Fa2 (x) ?a1 6= a2,
 86
then a1 is an optimal deterministic action). The next example demonstrates some
 of the difficulties in analyzing even the simplest probability weighting function. We
 prove the uniqueness of the optimal policy for some special cases afterwards.
 Example 8. For this example, we will use the probability weighting function:
 w (x) = 1 ? (1 ? x)b b > 1.
 For simplicity, we deal with a two-state-two-action problem. We write the
 probability transition matrix as
 Q1 :=
 ?
 ??? Q [1, 1, 1] Q [1, 1, 2]
 Q [2, 1, 1] Q [2, 1, 2]
 ?
 ???
 and similarly we denote the transition probability matrix of taking action 2 (i.e.,
 a2) as:
 Q2 :=
 ?
 ??? Q [1, 2, 1] Q [1, 2, 2]
 Q [2, 2, 1] Q [2, 2, 2]
 ?
 ??? .
 In other words, Q [1, 2, 1] is the probability of starting and arriving at state 1
 by taking action 2. Using the reward function ? (x, a, x) := x+ a+ x, we study the
 optimization problem:
 max
 p1
 ? ?
 0
 1 ? (1 ? (P (? > s|x, a1) ? p1 + P (? > s|x, a1) ? (1 ? p1)))2 ds. (2.4.4)
 87
By differentiating Eq. 2.4.4 with respect to p1, we obtain
 ?Q[1, 2, 2]w? (?Q[1, 2, 2] (?1 + p1))
 +(Q[1, 1, 2] ?Q[1, 2, 1] ?Q[1, 2, 2])
 ?w? (?(Q[1, 2, 1] +Q[1, 2, 2]) (?1 + p1) +Q[1, 1, 2]p1)
 +3(Q[1, 1, 1] +Q[1, 1, 2] ?Q[1, 2, 1] ?Q[1, 2, 2])
 ?w? (?(Q[1, 2, 1] +Q[1, 2, 2]) (?1 + p1) + (Q[1, 1, 1] +Q[1, 1, 2])p1) .
 Substituting w? (p) = 2 (1 ? p), we obtain an affine equation in p1:
 g (p1) = ? 2 (?3Q[1, 1, 1] ? 4Q[1, 1, 2] + 4Q[1, 2, 1]
 + 3Q[1, 1, 1]Q[1, 2, 1] + 4Q[1, 1, 2]Q[1, 2, 1]
 ? 4Q[1, 2, 1]2 + 5Q[1, 2, 2] + 3Q[1, 1, 1]Q[1, 2, 2]
 +4Q[1, 1, 2]Q[1, 2, 2] ? 8Q[1, 2, 1]Q[1, 2, 2] ? 5Q[1, 2, 2]2)
 ?2
 (
 3Q[1, 1, 1]2 + 6Q[1, 1, 1]Q[1, 1, 2] + 4Q[1, 1, 2]2
 ? 6Q[1, 1, 1]Q[1, 2, 1] ? 8Q[1, 1, 2]Q[1, 2, 1] + 4Q[1, 2, 1]2
 ? 6Q[1, 1, 1]Q[1, 2, 2] ? 8Q[1, 1, 2]Q[1, 2, 2]
 +8Q[1, 2, 1]Q[1, 2, 2] + 5Q[1, 2, 2]2)p1.
 Although the equation above is affine in p1, its coefficients depend on the
 transition probabilities (Q1 and Q2) and the reward function ?. This dependence
 makes any generalized results on the structure of the optimal policies difficult.
 88
We observe from the example above that any optimal solution of the problem
 stated in Eq. 2.4.4 must satisfy g (p1) = 0, which has a unique solution p?1. In the
 next theorem, we will prove that this is true in general.
 Theorem 8. Assume wt is a strictly concave function, then the function
 ? ?
 0
 wt (? ?Qx,t (rt + vt+1 > s)) ds
 is strictly concave in ?. Furthermore, there is a unique maximizer for the optimization
 problem:
 max
 ??P(A(x))
 ? ?
 0
 wt (? ?Qx,t (rt + vt+1 > s)) ds.
 Proof. We prove the concavity of the function of interest:
 ? ?
 0
 wt ((??1 + (1 ? ?) ?2) ?Qx,t (rt + vt+1 > s)) ds
 =
 ? ?
 0
 wt
 (?
 Qx,t (rt + vt+1 > s) d (??1 + (1 ? ?) ?2)
 )
 ds
 > ?
 (? ?
 0
 wt (?1 ?Qx,t (rt + vt+1 > s))
 )
 ds
 + (1 ? ?)
 (? ?
 0
 wt (?x ?Qx,t (rt + vt+1 > s))
 )
 ds.
 Since the right hand side of the second equation in the theorem is strictly concave,
 the uniqueness of maximizer follows.
 We are only able to prove the uniqueness of the optimal policy in the positive
 reward case. For the more general case of both positive and negative rewards, where
 89
we have a convex-concave measure, we were unable to prove the uniqueness of the
 optimal policy.
 2.5 Conclusion
 Non-convex reward measures are useful for modeling many real-life problems. More
 specifically, CPT reward measures are derived from experimental data and have
 been proven to model several key characteristics of human behaviors well. This
 inspired us to start building a rigorous theoretical foundation for their application
 to dynamic problems. Our effort has resulted in proving the applicability of dynamic
 programming equations for non-convex reward measures.
 In relation to ?avu? and Ruszczy?ski [18], we relaxed their assumptions on
 the performance measures to monotonicity and positive homogeneity. In the finite-
 horizon case, the monotonicity assumption is important in the proof for the suit-
 ability of the dynamic programming method for the non-convex case. One of our
 contributions in the finite-horizon case is the assumptions on the weighting functions
 such that the monotonicity assumption of the performance measures is satisfied. In
 the discounted infinite-horizon case, the suitability of the dynamic programming
 method for non-convex performance measures is proven using the monotonicity as-
 sumption and the fact that they are contractions. In the transient case, the proofs
 are more difficult and require the utilization of k-step contractions.
 In this chapter, we have established a rigorous mathematical foundation for
 using dynamic programming to solve Markov Decision Problems with CPT-based
 90
reward measures. In our new framework, CPT-based reward measures, unlike the
 existing work where CPT-based reward measures are only applied statically or to
 special cases, can be applied to a larger class of problems. The development of
 these new MDPs is especially useful for modeling dynamic human decision making
 processes. Through numerical examples, we demonstrated the properties of optimal
 policies obtained by solving these problems. The optimal polices obtained from
 these problems are different from the standard polices; they are randomized rather
 than deterministic. This finding, perhaps not too surprising due to previous work
 done by Ruszczy?ski, suggests that deterministic optimal policies are insufficient for
 obtaining the optimal value function when humans are involved. It is our hope that
 these new proposed models will be utilized to model human risk-sensitive behaviors
 in dynamic settings.
 91
Chapter 3
 Cumulative Weighting Optimization
 Global optimization problems are relevant in many fields (e.g., control systems,
 operations research, economics). There are many approaches to solving these prob-
 lems. One particular approach is model-based methods, which are a class of random
 search methods. A model-based method iteratively updates its probability density
 function. At each step, additional weight is given to solution subspaces that are
 more likely to yield an optimal objective value. Model-based methods can be an-
 alyzed by writing down a corresponding system of differential equations similar to
 the well known Fokker-Planck equation, which models the evolution of probability
 density functions for diffusions. We propose an innovative model-based method,
 Cumulative Weighting Optimization (CWO), which can be proven to converge to
 an optimal solution. Using this rigorous theoretical foundation, we design a class
 of CWO-based numerical algorithms for solving global optimization problems. In-
 terestingly, the well-known cross-entropy method is a special case of CWO-based
 numerical algorithms.
 3.1 Introduction
 Many problems in engineering and science can be formulated as global optimiza-
 tion problems. These problems are challenging when their objective functions are
 92
nonlinear (e.g., non-convex, multi-modal, or badly scaled). If we are only interested
 in finding their local extrema and they are differentiable, then the standard local
 optimization method (i.e., first derivative being zero) would suffice. If there are
 only a few local extrema, then we can easily find a global optimal solution by eval-
 uating all of them. However, this approach does not work on objective functions
 with absence of structural information (e.g., non-differentiable), or in the presence
 of many local extrema. Approaches developed to solve these problems can be di-
 vided into two categories: deterministic and random search algorithms. Random
 search algorithms can be further divided into instance-based (e.g., simulated anneal-
 ing, genetic algorithm, tabu search, nested partitions, generalized hill climbing, and
 evolutionary programming) and model-based algorithms (e.g., annealing-adaptive
 search, cross-entropy (CE), model reference adaptive search (MRAS), and estima-
 tion of distribution algorithms (EDAs)). A more recent addition to model-based
 algorithms is model-based evolutionary optimization [88]. For the interested reader,
 Hu et al. have a recent survey paper on model-based methods [54], which also
 contains references to instance-based methods mentioned in this paragraph
 We propose a new addition, inspired by Cumulative Prospect Theory (CPT),
 to the class of model-based methods. The new CWO-based algorithms have an
 intuitive connection with the risk-sensitive nature of the human decision making
 process.
 In the rest of this chapter, we will proceed in the following sequence. In Section
 3.2, we present the problem statement. In Section 3.3, we introduce the concept of
 93
probability weighting functions. In Section 3.4, we will work with the case when
 X = {1, 2, 3, 4, . . . }
 and provide the reader some insight into the construction of our probability updating
 equation. Later in the same section, we will prove the convergence properties for
 the equation. In Section 3.5, we will generalize the results of Section 3.4 to Polish
 spaces. Finally, in Section 3.6, we will outline our numerical algorithms and present
 some simulation results.
 3.2 Problem
 In many engineering applications, we are looking for a ?best? solution based on some
 criterion. For example, in the well known traveling salesman problem (TSP), we are
 looking for the cheapest route that visits all cities and terminates at the starting
 point. Problems of this nature can be formulated as the following optimization
 problem:
 x? ? arg max
 x?X
 H(x), (3.2.1)
 where x? is an optimal solution to the problem and X is a non-empty, often compact,
 solution space (in many applications X ? Rn). H : X ? R, the objective function,
 is a bounded deterministic measurable function with many local extrema. In the
 rest of this chapter we assume the following.
 Assumption 4. There exists a global optimal solution to Eq. 3.2.1, i.e., ?x? ? X
 94
such that H(x) ? H(x?) ?x 6= x?, ?x ? X .
 In practice, this assumption is true for many optimization problems. For
 example, the assumption holds trivially when X is a finite discrete solution space.
 Generally, we do not assume any other structural information about the objective
 function (i.e., convexity, differentiability).
 We can introduce a measurable strictly increasing fitness function, ? : R ?
 R+, and reformulate Eq. 3.2.1 as:
 x? ? arg max
 x?X
 ? (H(x)) . (3.2.2)
 Since the reformulated problem guarantees the range of the new fitness-objective
 function (i.e., ? (H (?))) will always be non-negative, and it is equivalent to the
 original problem, we will solve Eq. 3.2.2 in place of Eq. 3.2.1. A similar problem
 statement can be found in Hu et al. [54].
 3.3 Probability Weighting Functions
 Probability weighting functions have many applications in science and engineering.
 In this thesis, we are most concerned with using them to re-weight the probabili-
 ties of outcomes. Weighting is suggested by Cumulative Prospect Theory (CPT) as
 an important part of the human decision making process. Prospect Theory (PT),
 the predecessor to CPT, was suggested in the 1970s by Kahneman and Tversky
 [59]. They were unsatisfied with PT due to its violation of second order stochastic
 95
dominance, which was remedied by CPT in the 1990s [83]. CPT improves PT by
 re-weighting the outcome cumulative probability function instead of the outcome
 probability density function. This new approach can also be useful for global op-
 timization problems. The purpose of this section is to familiarize the reader with
 probability weighting functions, which will be used later for iteratively updating
 probability measures. We first introduce several definitions to assist us in our dis-
 cussion.
 Definition 18. A weighting function, w : [0, 1] ? [0, 1], is a monotonically increas-
 ing and Lipschitz continuous function with w(0) = 0 and w(1) = 1.
 We are interested in weighting functions with the additional property of optimal-
 seeking.
 Definition 19. A weighting function, w : [0, 1] ? [0, 1], is optimal-seeking if
 w (?x+ (1 ? ?) y) > ?w (x) + (1 ? ?)w(y), ?? ? (0, 1), x 6= y ? [0, 1].
 Optimal-seeking is called risk-seeking in fields modeling risk-sensitivity. From the
 definitions above, we can prove the following proposition.
 Proposition 2. An optimal-seeking weighting function satisfies the inequality
 w(z) > z, ?z ? (0, 1) .
 Proof. Let x = 1, y = 0, and ? = z in the definition for optimal-seeking weighting
 functions. The proof follows trivially.
 96
Next, we will be more concrete and provide several examples of risk-seeking
 probability weighting functions.
 Assumption 5. w is an optimal-seeking weighting function.
 Example 9. A simple polynomial weighting function has the form:
 w (p) = 1 ? (1 ? p)b , b > 1. (3.3.1)
 Another more complicated weighting function involving exponentials has the form:
 w(p) =
 ecp ? 1
 ec ? 1
 , (3.3.2)
 where c < 0. There are other parametric weighting functions, which can be found
 in [29].
 An optimal-seeking weighting function tends to place more weight on highly
 unlikely, yet highly rewarding outcomes. In particular, it is used to overweight the
 probabilities of unlikely events (i.e., events with small probabilities) and underweight
 the probabilities of highly likely events. More specifically, we can apply an optimal-
 seeking weighting function to the cumulative distribution function of the outcomes,
 as in the example below.
 Example 10. We are given a die and asked to roll it once. We are told that we
 will be given a payoff that is equivalent to the outcome of the roll. For example, if
 we rolled a 1, then we would be given a $1 reward. We assume the die is fair, and
 97
calculate the expected payoff for both the risk-neutral and optimal-seeking cases.
 We use R to denote the random variable associated with the outcomes of this game.
 The risk-neutral expected payoff is calculated as:
 E [R] =
 6?
 n=1
 1 ? F (n) =
 6?
 n=1
 n
 6 =
 21
 6 ? 3.5,
 where F (n) is the cumulative distribution function at outcome n. Using the poly-
 nomial weighting function in Eq. 3.3.1 with b = 2, we have our optimal-seeking
 re-weighted expected payoff:
 Ew [R] =
 6?
 n=1
 w (1 ? F (n)) =
 6?
 n=1
 w(n6 ) =
 6?
 n=1
 1 ? (1 ? n6 )
 2
 =
 161
 36 ? 4.47222.
 Remark 12. The reader should note that the re-weighting is applied to the good-
 news function1 (i.e., the probability of the outcome exceeding a threshold). In
 other words, the unlikely events are events whose payoff exceeded some threshold.
 It should be noted that the optimal-seeking re-weighted expected payoff is greater
 than that of the risk-neutral. This will be an important feature in proving the
 convergence of the CWO method.
 1In other fields, the good-news function is also known as the survival function or the reliability
 function.
 98
3.4 Discrete Solution Space
 We are trying to find an solution for Eq. 3.2.2 assuming that
 X := {1, 2, 3, 4, . . . } .
 We further assume the discrete topology for X . The discrete solution space case
 should offer the reader some intuitive insight into the workings of the CWO method.
 In the next section, we will present analogous results on Polish spaces.
 We denote the set of optimal measures on (X ,B (X )) by
 P?X := {P ? PX |P (X ?) = 1} ,
 where X ? is the set of all optimal solutions, i.e.,
 X ? := {i? ? X |H(i) ? H(i?), ?i ? X} ,
 and PX is the set of all possible probability measures over B (X ). It should not
 surprise the reader that if we can find an element of P?X , then we have found a
 solution to the global optimization problem stated in Eq. 3.2.2. We assume X ? has
 only a finite number of elements.
 Assumption 6. The objective function, H : X ? R, has a finite number of optimal
 99
solutions, i.e., the set
 X ? := {i? ? X |H(i) ? H(i?), ?i ? X}
 has a finite number of elements.
 Proposition 3. P?X and PX are both non-empty sets.
 Proof. Using Assumption 4 we know that a Dirac measure concentrated at any
 i? ? X ? is in P?X , which is a subset of PX .
 Our objective is to restrict the temporal evolution of a probability measure such
 that it will eventually concentrate its probability density at one of the optimal solu-
 tions. This evolution can be defined on a measurable space, (X ? R+,B (X ? R+)),
 where X is the given solution space and R+ represents time. If the evolution hap-
 pens in discrete time or iteration steps, then R+ can be replaced by {0, 1, 2, 3, . . . } .
 To solve Eq. 3.2.1, we want to find a probability measure P and a t? ? R+ such
 that
 P ({(t, i)|i ? X ?}) = P ({(t, i)|i ? X}) , ?t ? t?. (3.4.1)
 In other words, P at some finite time t? is a member of P?X . We denote the resulting
 probability space by 2 (
 X ? R+,B
 (
 X ? R+
 )
 ,P
 )
 .
 2For simplicity, we choose to work with (X ? R+,B (X ? R+) ,P) instead of(
 X R+ ,B
 (
 X R+
 )
 , ?P
 )
 , which is the common practice for defining a stochastic process. It
 should not be hard for the reader to see that
 (
 X R+ ,B
 (
 X R+
 )
 , ?P
 )
 can induce a measure P on
 the measurable space (X ? R+,B (X ? R+)) .
 100
At each time t, P induces a probability measure on the measurable space (X ,B (X )),
 Pt (BX ) := P ({(t, i)|i ? BX }) , ?BX ? B (X ) , (3.4.2)
 resulting in a probability space (X ,B (X ) ,Pt). Conversely, if we know Pt at all
 times, then we can construct a P that satisfies Eq. 3.4.23. The coordinate random
 variable is denoted by X (i.e., X (i) = i, i ? X ). Similarly, the outcome random
 variable is denoted by Y, where Y = ? (H (X)) . We denote the set of all possible
 outcomes from evaluating ? (H (?)) over X by
 Y := {y ? R+|?i ? X s.t. y = ? (H (i))} .
 A sensible next step is to write down the dynamics of Pt with respect to time
 (i.e., ?Pt), and interestingly we will be able to prove that
 Et [? (H (X))] :=
 ?
 X
 ? (H (i)) dPt :=
 ?
 Y
 ydP?(H(Xt))t (3.4.3)
 is a strictly increasing function with increasing t. In the equation above, P?(H(Xt))t is
 the probability measure of the random variable ? (H (Xt)) induced by Xt. Of course,
 probability weighting functions from Section 3.3 play a key role in the equations for
 ?Pt4. Using Eq. 3.4.3 along with Lyapunov stability analysis, we will conclude the
 convergence of Pt to optimal solutions (i.e., elements of P?X ). The following examples
 3We can construct such a measure by using the definition: P (BX ?BT ) =?
 BX ?BT Pt (dx) dt, BX ?BT ? B (X ? R+).4We opted for the notation Pt instead of Pwt for simplicity, but the reader should be mindful of
 Pt?s dependence on w.
 101
illustrate our approach.
 Example 11. (Distinct Outcomes)
 We are given a finite solution space X = [1, 2, 3], and its corresponding out-
 come space Y = [? (H (1)) , ? (H (2)) , ? (H (3))] ? R+. In addition, we assume the
 outcomes are distinct (i.e., ? (H(3)) > ? (H(2)) > ? (H(1)) ? 0). We denote the
 vector of probabilities for non-intersecting outcome events,
 yi(t) = Pt (? (H (i)) ? ? (H (X)) < ? (H(i+ 1)) , i ? X with ? (H (4)) = ?,
 by [y1(t), y2(t), y3(t)]. Furthermore, we denote the probabilities on elements of the
 solution space by [x1(t), x2(t), x3(t)], respectively.
 To avoid confusion, we will refer to [y1(t), y2(t), y3(t)] as the outcome probability
 vector, and [x1(t), x2 (t) , x3 (t)] as the solution probability vector. Note, Pt in the
 previous discussion is equivalent conceptually to the solution probability vector.
 Our goal is to write down the dynamic equation for the solution probability
 vector. However, in order to do that, we need to write down the dynamic equation
 for the outcome probability vector. Using an optimal-seeking weighting function,
 w, the dynamics of the outcome probability vector can be written as:
 dy1
 dt = (w(1) ? w(y2 + y3)) ? y1
 dy2
 dt = (w(y2 + y3) ? w(y3)) ? y2
 dy3
 dt = w(y3) ? y3.
 102
We define the matrix G, which in the nonlinear Markov processes literature is called
 a generator (see [61, 62]), as
 G(y1, y2, y3) :=?
 ???????
 ?w(y2 + y3) w(y2 + y3) ? w(y3) w(y3)
 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) ? 1 w(y3)
 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) w(y3) ? 1
 ?
 ??????? .
 Using the matrix G, we can write down the outcome and solution probability vector
 equations as:
 ?
 ?
 ???????
 y1
 y2
 y3
 ?
 ??????? =
 [
 y1 y2 y3
 ]
 G(y1, y2, y3)
 ?
 ?
 ???????
 x1
 x2
 x3
 ?
 ??????? =
 [
 x1 x2 x3
 ]
 G(x1, x2, x3).
 From Proposition 2 we see that w(y) > y ?y ? (0, 1), which implies
 dy3
 dt > 0,
 dx3
 dt > 0 ?x3, y3 ? (0, 1) and
 dy3
 dt = 0,
 dx3
 dt = 0 x3, y3 = {0, 1} .
 Since the best outcome, x = 3, will monotonically increase in its probabil-
 ity weight, and the increase in probability weight has to come from the non-best
 103
solutions, the non-best solutions will eventually die out, i.e.,
 ?
 ???????
 x1 (t)
 x2 (t)
 x3 (t)
 ?
 ????????
 ?
 ???????
 0
 0
 1
 ?
 ??????? as t ? ?.
 We will prove our convergence assertion more rigorously later in this section.
 What would happen if two or more members in the solution space might map
 to the same outcome?
 Example 12. (Non-Distinct Outcomes)
 Consider the case when there are more than one solution mapping to the
 same outcome. We are given an optimal-seeking probability weighting function, w,
 and a solution space X = [1, 2, 3, 4]. Assume we know that ? (H(4)) = ? (H(3)) >
 ? (H(2)) > ? (H(1)) ? 0. Now, the outcome space, Y = [? (H(3)) , ? (H(2)) , ? (H(1))],
 has fewer elements than the solution space. Following the similar line of logic as in
 the previous example, the outcome probability vector equation is written as :
 dy1
 dt = (w(1) ? w(y2 + y3)) ? y1
 dy2
 dt = (w(y2 + y3) ? w(y3)) ? y2
 dy3
 dt = w(y3) ? y3,
 104
where yi is defined as
 y1 (t) = Pt (? (H(1) ? ? (H (X)) < ? (H(2))
 y2 (t) = Pt (? (H (2)) ? ? (H (X)) < ? (H(3))
 y3 (t) = Pt (? (H (3)) ? ? (H (X)) < ?) .
 We define the matrix Gy as:
 Gy(y1, y2, y3) :=?
 ???????
 ?w(y2 + y3) w(y2 + y3) ? w(y3) w(y3)
 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) ? 1 w(y3)
 1 ? w(y2 + y3) w(y2 + y3) ? w(y3) w(y3) ? 1
 ?
 ??????? .
 As in the distinct outcome case, the solution probability vector is written as:
 dx1
 dt = (w(1) ? w(x2 + x3 + x4)) ? x1
 dx2
 dt = (w(x2 + x3 + x4) ? w(x3 + x4)) ? x2
 dx3
 dt = ?w(x3 + x4) ? x3
 dx4
 dt = (1 ? ?)w (x3 + x4) ? x4, ? ? [0, 1] ,
 105
and its corresponding generator is
 Gx,?(x1, x2, x3, x4) :=?
 ???????????
 ?w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ?w(x3 + x4) (1 ? ?)w(x3 + x4)
 1 ? w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ? 1 ?w(x3 + x4) (1 ? ?)w(x3 + x4)
 1 ? w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ?w(x3 + x4) ? 1 (1 ? ?)w(x3 + x4)
 1 ? w(x2 + x3 + x4) w(x2 + x3 + x4) ? w(x3 + x4) ?w(x3 + x4) (1 ? ?)w(x3 + x4) ? 1
 ?
 ???????????
 .
 We can write the dynamic equation for the outcome and solution probability
 vectors more compactly as:
 ?
 ?
 ???????
 y1
 y2
 y3
 ?
 ??????? =
 [
 y1 y2 y3
 ]
 Gy(y1, y2, y3)
 ?
 ?
 ???????
 x1
 x2
 x3
 ?
 ??????? =
 [
 x1 x2 x3
 ]
 Gx,?(x1, x2, x3).
 From Proposition 2 we see that w(y) > y, ?y ? (0, 1), which implies
 dy3
 dt > 0,
 dx3
 dt +
 dx4
 dt > 0, ?x3, x4, y3 ? (0, 1)
 and
 dy3
 dt = 0,
 dx3
 dt = 0,
 dx4
 dt = 0, x3, x4, y3 = {0, 1} .
 Ultimately, we have the best solutions for the problem, x = 3 and x = 4,
 106
monotonically increasing in their probability weights. Since the increase in proba-
 bility weights has to come from the non-best solutions, they will eventually die out,
 i.e., ?
 ???????????
 x1 (t)
 x2 (t)
 x3 (t)
 x4 (t)
 ?
 ???????????
 ?
 ?
 ???????????
 0
 0
 ?3
 ?4
 ?
 ???????????
 as t ? ?, ?3, ?4 ? 0, ?3 + ?4 = 1.
 Remark 13. A key feature of the non-distinct outcomes case is the appearance of
 the ?-function (later, this will be called a distribution rule). For instance, in the
 example above, the ?-function could be:
 Pt (X = i|Y = ? (H(i))) = xi (t)?
 j:?(H(j))=?(H(i)) xj (t) .
 An interesting observation from the examples above is that the solution and
 outcome probability vector equations fall into a category of equations called the
 nonlinear Fokker-Planck equation (see [61, 42]). In addition, the examples suggest
 that by propagating the solution and outcome probability vectors appropriately, we
 can concentrate the probability weights on optimal solutions. The key step forward
 is writing down the general equations for the solution and outcome probability
 vectors.
 The generalized solution probability vector equation has the form:
 107
dxi (t)
 dt = ?i (t)
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 xj (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 xj (t)
 ????
 ? ? ? ? xi (t) ?i ? X (3.4.4)
 ?
 ?(H(i))=y
 ? (i, y, t) = 1 ?y ? Y ?t ? R+, (3.4.5)
 where xi : R+ ? [0, 1] is the probability measure assigned to an element i ? X ,
 and ?i (t) := ? (i, ? (H (i)) , t) is a distribution rule defined below. In Eq. 3.4.4, the
 difference between the first w distorted term and the second w distorted term is the
 event ? (H (j)) = ? (H (i)). Wang et al. (see [88]) have an alternative set of evolu-
 tion equations, also nonlinear Fokker-Planck equations, motivated by evolutionary
 game theory. As the reader will see later, we reach the same convergence results as
 Wang et al. in [88] with a modified approach.
 Definition 20. A distribution rule with respect to a given objective function,
 ? (H (?)), is a mapping ? : X ? Y ? R+ ? [0, 1] such that
 ?
 ?(H(i))=y
 ? (i, y, t) dx = 1 ?y ? Y ?t ? R+.
 Connecting this equation with the discussion at the beginning of this section,
 the reader should note that
 Pt (X = i) = xi(t) ?i ? X .
 108
The generalized outcome probability vector equation has the form:
 dyz (t)
 dt = w
 (?
 j:j?z
 yj (t)
 )
 ? w
 (?
 j:j>z
 yj (t)
 )
 ? yz (t) ?z ? Y .
 We pay special attention to the best outcome equation:
 dy? (t)
 dt = w
 (?
 j:j??
 yj (t)
 )
 ? y? (t) ,
 where ? := ? (H (i?)) i? ? X ?.
 In the rest of this section, we want to study the convergence properties of Eq.
 3.4.4. Furthermore, we want to understand the stability properties, in the Lyapunov
 sense, of its limit points. The first step in understanding Eq. 3.4.4 is to understand
 the existence and uniqueness of its solutions. The outline of our proof for the next
 theorem follows [65] and [51].
 Theorem 9. For each x(0) ? PX , the ordinary differential equation 3.4.4 has a
 unique solution for t ? R+. Here, ? : X ? Y ? R+ ? [0, 1] is a distribution rule5.
 Proof. We use the total variation norm, ??? , on a ?-finite signed measure space over
 (X ,B (X )) :
 ?x(t)? = sup
 A?B(X )
 ?
 i?A
 |xi(t)| .
 Since x (t) ? PX is a probability measure ?t, and |?i| ? 1 , we know the following
 5We will provide the definition for distribution rules in more general spaces in the next section.
 For the proof of this theorem, we are only using the fact that it is a bounded function. In the
 future, ? could depend on both i ? X and x (t) ? PX .
 109
inequalities hold:
 sup
 A?B(X )
 ?
 i?A
 ??????
 ?i (t)
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 xj (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 xj (t)
 ????
 ? xi (t)
 ??????
 ? sup
 A?B(X )
 ?
 i?A
 ??????
 ?i (t)
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 xj (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 xj (t)
 ????
 ??????
 + sup
 A?B(X )
 ?
 i?A
 |xi (t)| ? 2.
 Hence, we conclude the right hand side of Eq. 3.4.4 is bounded by 2. Next, we need
 110
to prove that the right hand side of Eq. 3.4.4 is Lipschitz continuous.
 sup
 A?B(X )
 ?
 i?A
 ??????
 ???i (t)
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 x1j (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 x1j (t)
 ????
 ? x1i (t)
 ??
 ?
 ???i (t)
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 x2j (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 x2j (t)
 ????
 ? x2i (t)
 ??
 ??????
 ? sup
 A?B(X )
 ?
 i?A
 ??????
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 x1j (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 x1j (t)
 ????
 ? x1i (t)
 ?
 ????w
 ?? ?
 j:?(H(j))??(H(i))
 x2j (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 x2j (t)
 ????
 ? x2i (t)
 ??
 ??????
 ? sup
 A?B(X )
 ?
 i?A
 ??????
 w
 ?? ?
 j:?(H(j))??(H(i))
 x1j (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 x1j (t)
 ??
 ?
 ??w
 ?? ?
 j:?(H(j))??(H(i))
 x2j (t)
 ??
 ? w
 ?? ?
 j:?(H(j))>?(H(i))
 x2j (t)
 ????
 ??????
 +
 ??x1i (t) ? x2i (t)
 ??
 ? K sup
 A?B(X )
 ?
 i?A
 ??????
 ?
 j:?(H(j))=?(H(i))
 x1j (t) ?
 ?
 j:?(H(j))=?(H(i))
 x2j (t)
 ??????
 + sup
 A?B(X )
 ?
 i?A
 ??x1i (t) ? x2i (t)
 ??
 ? K sup
 A?B(X )
 ?
 i?A
 ??????
 ?
 j:?(H(j))=?(H(i))
 x1j (t) ? x2j (t)
 ??????
 + sup
 A?B(X )
 ?
 i?A
 ??x1i (t) ? x2i (t)
 ??
 ? K sup
 A?B(X )
 ?
 i?A
 ??x1i (t) ? x2i (t)
 ??+
 ??x1(t) ? x2(t)??
 ? K
 ??x1(t) ? x2(t)??+ ??x1(t) ? x2(t)?? ? (K + 1) ??x1(t) ? x2(t)?? .
 Hence, the right hand side of Eq. 3.4.4 is Lipschitz continuous in x with the
 constant K+1, where K is the Lipschitz constant for w (w is defined to be Lipschitz
 111
continuous; see Definition 18). Using [90, Corollary 3.9], we conclude that Eq. 3.4.4
 with an initial measure x(0) ? PX has a unique solution x(t) ?t ? R+.
 The next Lemma is needed in Theorem 12, which shows the monotonically
 increasing nature of Et [? (H (X))] .
 Lemma 6. Given an optimal-seeking weighting function, w, there exists a ?? such
 that
 ?
 ??Y
 ?
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 (3.4.6)
 can be decomposed into the sum of its non-negative and negative parts:
 ?
 ????
 ?
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 ? ?? ?
 non?negative?
 ?<??
 ?
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 ? ?? ?
 negative
 .
 In other words, we can write Eq. 3.4.6 as the sum of its non-negative and
 negative parts.
 Proof. Since w is a monotonically increasing function, it satisfies
 w
 (?
 j:Y?? yj (t)
 )
 ? w
 (?
 j:Y >? yj (t)
 )
 y? (t) ? 0.
 112
Furthermore, since w is an optimal-seeking function we have
 w
 (?
 j:Y??1 yj (t)
 )
 ? w
 (?
 j:Y >?1 yj (t)
 )
 y?1 (t)
 >
 w
 (?
 j:Y??2 yj (t)
 )
 ? w
 (?
 j:Y >?2 yj (t)
 )
 y?2 (t) ??1 ? ?2 ? Y . (3.4.7)
 In addition, since w (0) = 0 and w (1) = 1, we know that
 w
 (?
 j:Y?? yj (t)
 )
 ? w
 (?
 j:Y >? yj (t)
 )
 y? (t) > 1
 for some ? ? Y . From Eq. 3.4.7, we know that if ?2 satisfies the above inequality,
 then so does ?1 ? ?2 ? Y . Hence, we conclude that ?? is the smallest such ?.
 At the beginning of this section, we stated implicitly that if we can find an
 element of P?X , then we have found a solution to the global optimization problem
 stated in Eq. 3.2.2. The theorems below present a blueprint, through the use of
 Eq. 3.4.4, to obtain an element of P?X . In Theorem 9, an initial point can be any
 element of PX . As we have discovered, PX is too large a set to initialize Eq. 3.4.4
 to guarantee as t ? ? the solution probability vector, x (t), will be an element of
 P?X . Hence, we need to constrain our initial points to a smaller set.
 Definition 21. We denote the set of all x (0) for which there exists an optimal
 solution, i? ? X ?, such that xi?(0) > 0 by O.
 In other words, O contains all initial probability vectors with nonzero weights
 on at least one optimal solution. The next theorem proves the total probability
 113
measure on the optimal solution set will converge to 1 as t ? ?. On the other
 hand, the total probability measure on the non-optimal solution set will converge
 to 0 as t ? ?.
 Theorem 10. If x (t) is a solution for Eq. 3.4.4, then it satisfies the following with
 initial points in O (i.e., x (0) ? O):
 1) The total probability weight on the optimal solutions, ?i?X ? xi (t), is a
 monotonically increasing function of t. In fact, it converges to 1 as t ? ?;
 2) The probability of any non-optimal solution, xi (t) : R+ ? [0, 1], i /? X ?,
 approaches zero as t? ?.
 Proof. We know that
 ?
 i?X ?
 xi = y?,
 hence we only need to prove y? is a monotonically increasing function of t. Writing
 down the equation for y?,
 dy? (t)
 dt = w
 (?
 j:j??
 yj (t)
 )
 ? y? (t) ,
 and from Proposition 2:
 w
 (?
 j:j??
 yj (t)
 )
 > y? (t) ,
 we conclude
 dy?(t)
 dt > 0 ?y?(t) 6= 1, and
 dy?(t)
 dt = 0 when y?(t) = 1.
 114
Since x (0) ? O implies y? (0) > 0 , the first claim is proved.
 The second claim follows from the first claim. Since y? (?) = 1, and x is a
 solution probability vector (i.e., sum of xis is 1), we can conclude the following:
 lim
 t??
 ?
 i?X
 xi (t) = lim
 t??
 ?
 i?X ?
 xi (t) +
 ?
 i/?X ?
 xi (t) = 1 + lim
 t??
 ?
 i/?X ?
 xi (t) = 1
 =? lim
 t??
 ?
 i/?X ?
 xi (t) = 0 =? lim
 t??
 xi (t) = 0 ?i /? X ?.
 We are interested in finding the limit points of Eq. 3.4.4. Ideally, these limit
 points should be elements in P?X . This is accomplished by picking the initial point
 set more carefully.
 Definition 22. We define the limit set of a differential equation starting from an
 element x (0) ? I as
 EI :=
 {
 x? ? PX |x? = lim
 t??
 x (t) , x (0) ? I
 }
 .
 Remark 14. The limit set is invariant with respect to Eq. 3.4.4. More specifically,
 once x enters the set, it will not exit the set under Eq. 3.4.4. The author is not the
 first to introduce the concept of an initial point dependent limit set (cf. [87]).
 We characterize the limit set of Eq. 3.4.4 when x (0) ? O in the following
 theorem.
 115
Theorem 11. The limit set of Eq. 3.4.4 started in O is P?X , i.e.,
 EO = P?X :=
 {
 x ? PX |
 ?
 i??X ?
 xi? = 1
 }
 .
 Proof. To prove the first claim, we will first prove EO ? P?X , then we will prove
 EO ? P?X . The first case, EO ? P?X , can be trivially proved by taking an element
 x ? P?X , we notice that x ? O, and by definition of EO (i.e., the limit set of Eq.
 3.4.4 starting from O), we conclude x ? EO .
 Now we proceed to prove EO ? P?X . We prove by contradiction. Assume there
 is an element e ? EO, but not in P?X such that:
 e?i (t) = ?i (t)
 (
 w
 (?
 j:?(H(j))??(H(i)) ej (t)
 )
 ? w
 (?
 j:?(H(j))>?(H(i)) ej (t)
 ))
 ?ei (t)
 ei (0) ? 0, limt?? ei (t) > 0 i /? X ?.
 This contradicts the second claim of Theorem 10, where ei (?) = 0.
 The next theorem shows the monotonically increasing nature of Et [? (H (X))],
 which will be useful later in proving some stability properties of Eq. 3.4.4.
 Theorem 12. Let x (t) be a solution of the dynamics represented by Eq. 3.4.4 with
 an initial point in O. Then the following statements hold:
 1) The expected outcome, i.e., Et [? (H (X))] := ?i?X ? (H (i)) xi (t), is mono-
 tonically increasing with t;
 2) If x(t) /? EO for any t ? R+, then Et [? (H (X))] is strictly increasing with
 t.
 116
Proof. We start our proof by differentiating the average outcome function:
 d
 dtEt [? (H (X))] =
 d
 dtEt [Y ]
 =
 ?
 ??Y
 ? dy?(t)dt
 =
 ?
 ??Y
 ?
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 =
 ?
 ????
 ?
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 +
 ?
 ?<??
 ?
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 (Lemma. 6)
 ? ??
 ?
 ??Y
 (
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 ? y? (t)
 )
 =
 ?? ? 0 = 0.
 Here, the last equality holds because
 ?
 ??Y
 w
 (?
 j:Y??
 yj (t)
 )
 ? w
 (?
 j:Y >?
 yj (t)
 )
 = 1, and
 ?
 ??Y
 y? (t)=1.
 The first claim is proved.
 The second claim is proved by contradiction. We assume that x(t) is not in
 the limit set, and
 d
 dtEt [? (H (X))] = 0.
 Along with Theorem 10, the equality above implies that ? (H (X)) is equal to a
 constant C = supi?X ? (H (i)), which means x (t) has all its probability mass on the
 optimal solutions. From Theorem 11, we know a limit point has its probability mass
 117
on the optimal solutions. However, we assumed x(t) is not a limit point, hence we
 reach a contradiction.
 We will now proceed to prove some stability properties of Eq. 3.4.4, but first
 we need to introduce our definitions of stability given a metric d.
 Definition 23. Let L be a subset of PX . For a point x(t) ? PX , we define the
 distance between x(t) and L as
 d (x(t),L) := inf {d (x(t), q) , ?q ? L} .
 L is called Lyapunov stable if for all  > 0, there exists a ? > 0 such that
 d (x (0) ,L) < ? ? d (x (t) ,L) < , ?t > 0.
 Lyapunov was also interested in other stronger types of stability.
 Definition 24. Let L be a subset of PX . L is called asymptotically stable if L is
 Lyapunov stable, and there exists a ? > 0 such that
 d (x (0) ,L) < ? ? d (x (t) ,L) ? 0
 as t ? ?.
 The next theorem is the main result of this section. It states EO is compact
 and asymptotically stable.
 Theorem 13. EO is a compact set and it is asymptotically stable.
 118
Proof. We need to first prove that EO is a compact set. Since from our Theorem 11,
 we have EO = P?X and can instead prove
 P?X :=
 {
 x ? PX |
 ?
 i??X ?
 xi? = 1
 }
 is compact. It is easy to see that P?X is tight (see Appendix, Definition 31) due
 to Assumption 6. Furthermore, we can prove it is a closed set by contradiction.
 Assume there exists a sequence {xn} ? P?X such that xn ? x? /? P?X . This implies
 ?N such that ?n > N we have?i??X ? xni? < 1, and?i/?X ? xni > 0, which contradicts
 the second claim of Theorem 10. Hence, P?X = EO is a compact set.
 Consider the Lyapunov function
 V (xt) = E [? (H (X?))] ? Et [? (H (X))] ,
 where x? ? P?O and X? is the corresponding random variable. Note that V (xt) is
 positive for all xt ? PX \P?X , and V (xt) = 0 for xt ? P?X = EO. From Theorem 12 we
 have ?V (xt) < 0 for all t > 0 and xt /? P?O. Furthermore, we know EO is a compact
 set. Applying a generalized version of Lyapunov?s theorem (see [12, Chapter 5]),
 the desired conclusion is reached.
 Remark 15. Chapter V of [12] presented a generalized version of Lyapunov?s theo-
 rem on a general metric space. In the proof of Theorem 13, we applied this gen-
 eralized Lyapunov?s theorem on the Banach Space of ?-finite signed measures over
 119
(X ,B (X )), where we used the total variation distance
 d
 (
 x1, x2
 )
 = sup
 A?B(X )
 ?
 j?A
 ??x1j(t) ? x2j (t)
 ??
 .
 Conclusion 1. We have proven so far that if we start Eq. 3.4.4 in O, then the possible
 limit points are elements of EO = P?X . Furthermore, the set EO is asymptomatically
 stable.
 The use of the Lyapunov function in proving the stability of the limit set
 can also be found in Wang [87], which applied the generalized version of Lyapunov?s
 theorem to an infinite dimensional space. In the next section, we will apply a similar
 approach to prove the stability of the limit set when the solution space is a Polish
 space.
 3.5 Polish Space
 In this section, we try to find a solution for Eq. 3.2.2 given that X is a Pol-
 ish space with the Prohorov topology (see Appendix C.1). Polish spaces include
 finite-dimensional real spaces (i.e., Rn), which are important in many engineering
 applications. The Polish space of probability measures on (X ,B (X )) is denoted by
 PX , which also has the Prohorov topology.
 We will alter our notations in this section from the discrete space case. The
 symbol x in this section is an element of X (i.e., x ? X ). We will use the notation
 PX,t ({x}) to represent the probability measure of x at time t. This is different
 120
from the discrete space case, where Pt (i) = xi(t) ?i ? X 6. We write X for the
 solution random variable (i.e., X (x) = x), and denote the outcome random variable
 by Y = ? (H (X)). On top of the assumptions we made for the discrete space case,
 we will make the following assumption.
 Assumption 7. w is differentiable, and has a bounded first derivative which is
 denoted by w?.
 The initial probability space, (X ,B (X ) ,PX,0), with an initial distribution
 PX,0 induces a probability measure PY,0 on the measurable space (Y ,B (Y)), where
 Y := {y ? R+|y = ? (H (x)) ?x ? X}? R+
 PY,0 (B) = PX,0 ({x|? (H (x)) = y ?y ? B}) ?B ? B (Y) , (3.5.1)
 and B (Y) denotes the Borel ?-algebra for Y . The space of probability measures
 on (Y ,B (Y)) is denoted by PY . From examples 11 and 12, the generalized (i.e.,
 when X is a Polish space) outcome probability measure, on the measurable space
 (Y ,B (Y)), satisfies:7
 ?PY,t (B) =
 ?
 B
 w? (PY,t ({Y ? y})) dPY,t ? PY,t (B) ?B ? B (Y) . (3.5.2)
 The initial condition for PY,t is given by Eq. 3.5.1. We pay special attention to the
 6In the discrete space case, we used xi (t) to denote the probability of obtaining the i-th solution
 at time t
 7PY,t({y}) is equal to yi(t) in the discrete space case.
 121
best outcome equation:
 ?PY,t (y?) = w (PY,t ({Y ? y?})) ? PY,t (y?) ,
 where y? = maxx?X ? (H (x)) .
 Next, we write down the the Polish space counterpart to Eq. 3.4.4
 PX,t (A) =
 ?
 A?Y
 dPX|Y,tdPY,t ?A ? B (X ) ?t ? R+\ {0} , (3.5.3)
 where PX|Y,t, the probability of X conditioned on Y , is the generalized ? function.
 In other words, given a fixed PX|Y,t , PX,t is determined by PY,t, which is a solution of
 Eq. 3.5.2. PY,t can be determined, without knowing PX|Y,t, from PX,t at any t ? R+
 by the following equation:
 PY,t (B) = PX,t ({x|? (H (x)) = y ?y ? B}) ?B ? B (Y) . (3.5.4)
 Assumption 8. PX|Y,t is a given fixed conditional probability measure.
 We denote the set of optimal measures on (X ,B (X )) by
 P?X := {P ? PX |P (X ?) = 1} ,
 where X ? is the set of all optimal solutions, i.e.,
 X ? := {x? ? X |H(x) ? H(x?) ?x ? X} .
 122
Similarly, we denote the set of optimal measures on (Y ,B (Y)) by
 P?Y := {PY ? PY |P (y?) = 1} ,
 where y? = maxx?X ? (H (x)) . Obtaining an element P of P?X is equivalent to solving
 the optimization problem stated in Eq. 3.2.2.
 We will now prove the existence and uniqueness of a solution for Eq. 3.5.2.
 This is important for building a solid theoretical foundation for our approach.
 Theorem 14. For each PY,0 ? PY , the ordinary differential equation (3.5.2) has a
 unique solution for t ? R+.
 Proof. The outline of our proof follows [65] and [51]. We use the total variation
 norm, ??? , on a ?-finite signed measure space over (Y ,B (Y)) at the time t:
 ?Pt? = sup
 g
 ????
 ?
 Y
 g(y)dPt
 ???? ,
 where the sup is taken over all measurable functions g : Y ? R and
 sup
 y?Y
 |g (y)| ? 1.
 We simplify our notations by introducing the following definition:
 C (PY,t) (B) :=
 ?
 B
 w? (PY,t ({Y ? y})) dPY,t ? PY,t (B) ,
 ,
 123
where B ? B (Y). Since PY,t ? PY is a probability measure, we note that
 ?C (PY,t)? = sup
 g
 ????
 ?
 Y
 g (y) C (PY,t)
 ????
 ? sup
 g
 ????
 ?
 Y
 g (y)w? (PY,t ({Y ? y})) dPY,t
 ????+ sup
 g
 ????
 ?
 Y
 g (y) dPY,t
 ????
 ?
 ?
 Y
 w? (PY,t ({Y ? y})) dPY,t +
 ?
 Y
 dPY,t
 ? K
 ?
 Y
 dPY,t +
 ?
 Y
 dPY,t
 ? (K + 1) ,
 which proves the boundedness of C (PY,t), with K being the Lipschitz constant for
 w.
 Next, we need to prove that the right hand side of Eq. 3.5.2 is Lipschitz
 continuous. We know that
 ?C (PY,t) ? C (QY,t)?
 ? sup
 g
 ????
 ?
 Y
 g (y) d (C (PY,t) ? C (QY,t))
 ???? . (3.5.5)
 124
Furthermore, we know that
 |C (PY,t) ? C (QY,t)| (B)
 =
 ????
 (?
 B
 w? (PY,t ({Y ? y})) dPY,t ? PY,t (B)
 )
 ?
 (?
 B
 w? (QY,t ({Y ? y})) dQY,t ? QY,t (B)
 )????
 ?
 ????
 ?
 B
 w? (PY,t ({Y ? y})) dPY,t ?
 ?
 B
 w? (QY,t ({Y ? y})) dQY,t
 ????
 + |PY,t (B) ? QY,t (B)|
 ? K
 ????
 ?
 B
 d (PY,t ? QY,t)
 ????+
 ????
 ?
 B
 d (PY,t ? QY,t)
 ????
 ? (K + 1)
 ????
 ?
 B
 d (PY,t ? QY,t)
 ???? . (3.5.6)
 Substituting Eq. 3.5.6 into Eq. 3.5.5, we have
 ?C (PY,t) ? C (QY,t)?
 ? sup
 g
 (K + 1)
 ????
 ?
 Y
 g(y)d (PY,t ? QY,t)
 ????
 = (K + 1) ?PY,t ? QY,t? .
 Hence, the right hand side of Eq. 3.5.2 is Lipschitz continuous in PY,t with
 the constant K + 1, where K is the Lipschitz constant for w (w is assumed to be
 Lipschitz continuous). Using Corollary 3.9 of [90], we conclude that Eq. 3.5.2 with
 an initial measure PY,0 ? PY has a unique solution PY,t.
 It should not be surprising that PY,t is a probability measure for all t.
 125
Lemma 7. Given that PY,0 is a probability measure, then a solution PY,t of Eq.
 3.5.2 at each time t > 0 is a probability measure, i.e.,
 PY,t (B) ? 0 ?B ? B (Y)
 PY,t (Y) = 1 ?t ? R+
 PY,t (?Bi) =
 ?
 PY,t (Bi) ,
 where {Bi} is any countable collection of pairwise disjoint elements of B (Y).
 Proof. If we can prove that
 ?PY,t (Y) = 0,
 and
 ?PY,t (?Bi) =
 ?
 ?PY,t (Bi)
 then we have obtained our desired result.
 From Eq. 3.5.2 we know
 ?PY,t (Y) =
 ?
 Y
 w? (PY,t ({Y ? y})) dPY,t ? PY,t (Y)
 =
 ?
 Y
 (w? (PY,t ({Y ? y})) ? 1) dPY,t.
 Furthermore, that fact that ? 10 w? (s) ds = 1 (i.e., we assumed w(1) = 1) implies
 ?
 Y
 (w? (PY,t ({Y ? y})) ? 1) dPY,t = 0.
 126
Next, we need to prove that
 ?PY,t (?Bi) =
 ?
 ?PY,t (Bi) ,
 because
 ?
 ?PY,t (?Bi) dt =
 ? ?
 ?PY,t (Bi) dt?
 ?PY,t (?Bi) dt =
 ? ?
 ?PY,t (Bi) dt
 =? PY,t (?Bi) =
 ?
 PY,t (Bi) .
 Using the fact that w? is bounded, along with dominated convergence theorem we
 conclude that
 ?PY,t (?Bi) =
 ?
 ?Bi
 w? (PY,t ({Y ? y})) dPY,t ? PY,t (?Bi)
 =
 ? ?
 Bi
 w? (PY,t ({Y ? y})) dPY,t ? PY,t (Bi)
 =
 ?
 ?PY,t (Bi) .
 The next Lemma is needed in Theorem 18, which shows the monotonically
 increasing nature of Et [Y ] .
 Lemma 8. Given an optimal-seeking weighting function, w, there exists a y? such
 127
that ?
 Y
 y (w? (PY,t ({Y ? y}))) dPY,t ?
 ?
 Y
 ydPY,t (3.5.7)
 can be decomposed into the sum of its non-negative and negative parts:
 ?
 Y?y?
 y (w? (PY,t ({Y ? y}))) dPY,t ?
 ?
 Y?y?
 ydPY,t
 ? ?? ?
 non?negative(?
 Y <y?
 y (w? (PY,t ({Y ? y}))) dPY,t ?
 ?
 Y <y?
 ydPY,t
 )
 ? ?? ?
 negative
 .
 In other words, we can write Eq. 3.5.7 as the sum of its non-negative and
 negative parts.
 Proof. Since w is a monotonically increasing function, it satisfies
 w? (PY,t ({Y ? y})) ? 0 ?y ? Y .
 Furthermore, since w is an optimal-seeking function we have
 w? (PY,t ({Y ? y1})) > w? (PY,t ({Y ? y2})) ?y1 ? y2 ? Y . (3.5.8)
 In addition, since w (0) = 0 and w (1) = 1, we know that
 w? (PY,t ({Y ? y})) > 1
 for some y ? Y . From Eq. 3.5.8 we know if y2 satisfies the above inequality, then
 128
so does y1 ? y2 ? Y . Hence, we can conclude that y? is the smallest of such y.
 At the beginning of this section, we stated implicitly that if we can find an
 elements of P?X , then we have found a solution to the global optimization problem
 stated in Eq. 3.2.2. The theorems below present a blueprint, through the use of
 Eq. 3.5.3, to obtain an element of P?X . In Theorem 14, an initial condition can
 be any element of PY , which implies there is no restriction on the initial condition
 PX,0 ? PX . As we will discover later, PX is too large of a set to start Eq. 3.5.3 to
 guarantee as t ? ? the solution probability measure, PX,t, will be an element of
 P?X . Hence, we need to constrain our potential initial points to a smaller set.
 Definition 25. The set of all optimal initial solution probability measures is denoted
 by:
 OX := {PX,0 ? PX |PX,0 (X ?) > 0} .
 Furthermore, the set of all optimal initial outcome probability measures is denoted
 by:
 OY := {PY,0 ? PY |PY,0 (B) = PX,0 ({x|? (H (x)) = y ?y ? B})
 ?PX,0 ? OX , ?B ? B (Y)} .
 Proposition 4. OY = {PY,0 ? PY |PY,0 (y?) > 0} and P?X ? OX .
 Proof. The first claim is a direct result of the definition above and Eq. 3.5.1. The
 second claim holds because how P?X is defined.
 The next theorem proves the probability measure on the optimal outcome set
 129
will converge to 1 as t ? ?. On the other hand, the total probability measure on
 the non-optimal outcome set will converge to 0 as t ? ?.
 Theorem 15. If PY,t is a solution of Eq. 3.5.2 with initial points in OY (i.e.,
 PY,0 ? OY ), then the following statements hold:
 1) PY,t ({y?}) is a monotonically increasing function of t. In fact, it converges
 to 1 as t ? ?;
 2) PY,t (Y\ {y?}) approaches zero as t? ?.
 Proof. We first write down the equation for PY,t (?):
 ?PY,t (?) = w (PY,t ({Y ? ?})) ? PY,t (?) .
 From Proposition 2, we conclude that
 w (PY,t ({Y ? ?})) > PY,t (?) .
 From the two equations above, we conclude that
 ?PY,t (?) > 0 ?PY,t (?) ? (0, 1) , and ?PY,t (?) = 0 when PY,t (?) = 1.
 Since PY,0 > 0 , the first claim is proved.
 The second claim follows from the first claim. Since lim
 t??
 PY,t ({y?}) = 1 we
 130
can conclude the following:
 lim
 t??
 PY,t (Y) = lim
 t??
 PY,t ({y?}) + PY,t (Y\{y?}?)
 = 1 + lim
 t??
 PY,t (Y\{y?}) = 1 =? lim
 t??
 PY,t (Y\{y?}) = 0.
 The second claim is proved.
 The next theorem connects the properties of PY,t with those of PX,t as t ? ?.
 This is an important step for understanding the evolution of Eq. 3.5.3.
 Theorem 16. Assuming PX,0 ? OX and
 PY,0 (B) = PX,0 ({x|? (H (x)) = y ?y ? B}) ?B ? B (Y) ,
 the following statements hold:
 1) lim
 t??
 PY,t ({y?}) = lim
 t??
 PX,t (X ?)=1;
 2) lim
 t??
 PY,t (Y\ {y?}) = lim
 t??
 P (X\X ?)=0.
 Proof. Since we know
 PX,t (A) =
 ?
 A?Y
 dPX|Y,tdPY,t ?A ? B (X ) ?t ? R+\ {0} ,
 131
we prove the first claim by writing down the following equation:
 lim
 t??
 PX,t (X ?) = lim
 t??
 ?
 X ??Y
 dPX|Y,tdPY,t
 =
 ?
 X ??{y?}
 dPX|Y,t lim
 t??
 dPY,t
 = 1.
 For claim 2, we can write down a similar equation:
 lim
 t??
 PX,t (X\X ?) = lim
 t??
 ?
 X\X ??Y
 dPX|Y,tdPY,t
 =
 ?
 X\X ??{y?}
 dPX|Y,t lim
 t??
 dPY,t
 = 0.
 We are interested in finding the limit points of Eq. 3.5.2. Ideally, these limit
 points should be elements in P?X . In order to guarantee this, we need to restrict
 the potential initial points to OX .
 Definition 26. We define a limit set starting from an element P0 ? I as
 EI :=
 {
 PX,? ? PX |PX,? (A) = lim
 t??
 PX,t (A) ?PX,0 ? I, ?A ? B (X )
 }
 We are particularly interested in the limit set EOX . The next theorem describes
 the elements in this set.
 132
Theorem 17. The limit set of Eq. 3.5.3 started in OX is P?X , i.e.,
 EOX = P?X := {P ? PX |P (X ?) = 1} .
 Proof. We will first prove EOX ? P?X , then we will prove EOX ? P?X . In the first
 case, EOX ? P?X , can be proved by taking an element PX ? P?X ? OX . By the
 definition of EOX (i.e., the limit set of Eq. 3.5.3 starting from OX), equations 3.5.3
 and 3.5.2, we conclude that PX ? EOX .
 The second claim, EOX ? P?X , can be proven by contradiction. We assume
 there is an element QX ? EOX but not in P?X , which implies
 QX (A) = lim
 t??
 QX,t (A) =
 ?
 A?Y
 dPX|Y,t lim
 t??
 dPY,t ?A ? B (X )
 s.t QX,0 (X ?) > 0, QX (X\X ?) > 0.
 The first line is due to the fact that QX ? EOX . The second line in the equation
 above, along with Eq. 3.5.4, imply lim
 t??
 PY,t (Y\ {y?}) > 0 , which contradicts the
 second claim of Theorem 15, where lim
 t??
 PY,t (Y\ {y?}) = 0.
 The following theorem shows the monotonically increasing nature of Et [Y ] ,
 which will be useful later in proving some stability properties for Eq. 3.5.3.
 Theorem 18. Let PY,t be a solution for Eq. 3.5.2 with its initial point in OY . Then
 the following statements are true:
 1) The expected outcome, i.e., Et [Y ] is monotonically increasing with t;
 2) If PY,?t /? EOY for any ?t ? R+, then E?t [Y ] is strictly increasing with ?t.
 133
Proof. We start our proof by differentiating the average outcome function:
 d
 dtEt [Y ] =
 ?
 Y
 y ?PY,t (dy)
 =
 ?
 Y
 y (w? (PY,t ({Y ? y}))) dPY,t ?
 ?
 Y
 ydPY,t
 =
 ?
 Y?y?
 y (w? (PY,t ({Y ? y}))) dPY,t ?
 ?
 Y?y?
 ydPY,t
 +
 (?
 Y <y?
 y (w? (PY,t ({Y ? y}))) dPY,t ?
 ?
 Y <y?
 ydPY,t
 )
 ? y?
 ?
 Y
 d ?PY,t = y? ? 0 = 0.
 The y? variable is used to decompose the expected outcome function into non-negative
 and negative parts (see Lemma 8). The last line of the inequality is true because
 PY,t ? PY ?t ? R+ (see Lemma 7). The first claim is proved.
 The second claim is proved by contradiction. If there exists a PY,?t, not a limit
 point, with
 d
 dtE?t [Y ] = 0.
 Along with Theorem 15, the equality above implies that Y is equal to a constant
 C = sup
 x?X
 ? (H (x)). This implies PY,?t is a Dirac measure concentrated at C, which
 is a limit point (see Theorem 17).
 The metric function d in the following definitions is defined in Appendix C
 (see Definition 32).
 Definition 27. Let L be a subset of PX . For a point P ? PX , we define the distance
 134
between P and L as
 d (P,L) := inf {d (P,Q) , ?Q ? L} .
 L is called Lyapunov stable, with respect to a sequence of measures {Pt}, if for all
  > 0, there exists a ? > 0 such that
 d (Pt,L) < ? ? d (Pt,L) < , ?t > 0.
 L is called asymptotically stable, with respect to a sequence of measures {Pt}, if L
 is Lyapunov stable, and there exists a ? > 0 such that
 d (P0,L) < ? ? d (Pt,L) ? 0
 as t ? ?.
 The next theorem is the main result of this section. It tells us that if we start
 Eq. 3.5.3 in the set OX , then EOX will coincide with P?X . Furthermore, EOX is
 asymptotically stable.
 Theorem 19. EOX is a compact set and it is asymptotically stable.
 Proof. We need to first prove that EOX is a compact set. Since from Theorem 17,
 we have EOX = P?X , we can instead prove
 P?X := {P ? PX |P (X ?) = 1}
 135
is compact. It is easy to see that P?X is tight (see Appendix, Definition 31) due
 to Assumption 6. Furthermore, we can prove it is a closed set by contradiction.
 Assume there exists a sequence {Pn} ? P?X such that Pn ? ?P /? P?X . This implies
 ?N such that ?n > N we have Pn (X ?) < 1, and Pn (X\X ?) > 0, which implies
 lim
 n??
 PnY (Y\ {y?}) > 0.
 This contradicts the second claim of Theorem 15. Hence, P?X = EO is a compact
 set.
 Using the Lyapunov function
 V (PX,t) = y ? ?Et [? (H (X))] = V (PY,t) = y ? ?Et [Y ] ,
 note that V (PX,t) > 0 for all PX,t ? PX \P?X , and V (PX,t) = 0 for PX,t ? P?X = EOX .
 From Theorem 18 we have ?V (PX,t) = ?V (PY,t) < 0 for all t > 0 if PX,t /? P?X .
 Furthermore, we know ?V (PX,t) = ?V (PY,t) = 0 ?t > 0 if PX,t ? P?X . Using V (PX,t)
 as the Lyapunov function, and the fact that EOX is a compact set, we can appeal
 to a generalized version of Lyapunov?s theorem (see [12, Chapter 5]). The desired
 conclusion is reached.
 The use of the Lyapunov function for proving the asymptotic stability of the
 limit set can be found previously in Wang?s dissertation [87].
 136
3.6 Numerical Algorithms
 In this section, we present a few numerical algorithms based on the CWO (Cu-
 mulative Weighting Optimization) method we presented in the previous sections.
 These algorithms attempt to find an optimal solution iteratively. Each iteration
 consists of 5 stages: generation, quantile-update, parameter-update, weight-update,
 and projection. Generation, quantile-update and projection stages remain the same
 for all variations of the generic algorithm (i.e., Algorithm 1). We propose several
 approaches for constructing the weight-update stage. These algorithms build on the
 theoretical results using the same types of modifications as are found in the CE and
 MRAS (see [75, 53, 88] ).
 137
Algorithm 1 Generic CWO Algorithm
 1. Initialization: Select a number N0 as the total initial number of candidate
 solutions generated at each iteration and an initial g?0 (a parameterized prob-
 ability density distribution) defined on X. Pick an initial quantile ?0 ? (0, 1),
  ? 0, ? > 0, ? ? (0, 1).
 2. Generation: Generate Nk i.i.d candidate solutions {xik}Ni=1 from
 g??k = (1 ? ?)((1 ? ?)g?k?1 + ?g?k) + ?U,
 where U is the uniform distribution.
 3. Quantile-Update:
 Calculate the (1 ? ?k)-quantile, ??k+1 (?k, Nk) := ? (H)(d(1??k)Nke), where dae isthe smallest integer greater than a and H(i) is the i-th highest value for the
 sequence {? (H (xik))}Nki=1 .
 4. Parameter-Update:
 If k=0 or ??k+1 (?k, Nk) ? ??k + 2 , then
 ? 4(a). Set ??k+1 = ??k+1 (?k, Nk) , ?k+1 = ?k, Nk+1 = Nk;
 else
 ? Find the largest ?? ? (0, ?k) such that ??k+1 (??, Nk) ? ??k + 2 ;
 ? If such a ?? exists and ?? > ?min, then
 ?? 4(b). ??k+1 = ??k+1 (??, Nk) , ?k+1 = ??, Nk+1 = Nk;
 ? else
 ?? 4(c). ??k+1 = ??k, ?k+1 = ?k, Nk+1 = d?Nke ;
 5. Weight-Update: Update the weights of the generated samples {xik}Ni=1 ac-
 cording to weight update methods based on Eq. 3.4.4, producing the p.m.f
 pX,k+1 =
 ?N
 i=1w
 i
 k+1?(x? xik). wik+1 is the updated weight for xik.
 6. Density Projection: Construct g?k+1 by projecting the density pX,k+1 =?N
 i=1w
 i
 k+1?(x? xik) onto g? by solving
 ?k+1 = arg max
 ???
 N?
 i=1
 wik+1 ln g?
 (
 xik
 )
 ;
 7. Stop if some stopping criterion is satisfied; otherwise go to step 2 and k = k+1.
 Since discrete-time, discrete-state equations are more suitable for the computa-
 tions in the weight-update stage, we write down the probability density counterparts
 138
of Eq. 3.5.2 and 3.5.3:
 pY,k+1 (y) = w
 ?? ?
 s:?(H(s))?y
 pX,k (s)
 ??
 ? w
 ?? ?
 s:?(H(s))>y
 pX,k (s)
 ??
 pX,k+1 (x) = pY,k+1 (H (x)) pX,k (x)?
 s:?(H(s))=y pX,k (s) , (3.6.1)
 where w : [0, 1] ? [0, 1] is the probability weighting function. If we assume w (?) is
 differentiable, then Eq. 3.6.1 can be written more compactly as:
 pX,k+1 (x) = w? (1 ? FY,k (? (H (x)))) pY,k (? (H (x))) pX|Y,k (x,H (? (H (x))))
 = w? (1 ? FY,k (? (H (x)))) pY,k (? (H (x))) pX,k (x)?
 {s:?(H(s))=?(H(x))} pX,k (s)
 =
 ??w
 (?
 s:?(H(s))??(H(x)) pX,k (s)
 )
 ? w
 (?
 s:?(H(s))>?(H(x)) pX,k (s)
 )
 ?
 {s:?(H(s))=?(H(x))} pX,k (s)
 ?? pX,k (x)
 = w? (1 ? FY,k (? (H (x)))) pX,k (x) , (3.6.2)
 where the second equality holds because the conditional density is taken to be
 pX|Y,t (x,H (? (H (x)))) = pX,t (x)?
 {s:?(H(s))=?(H(x))} pX,t (s)
 and FY,k (?) is the cumulative distribution function for the outcome values. Other
 choices for pX|Y,t are also allowed; in particular, a uniform conditional density.
 The second to last equality in Eq. 3.6.2 can be related to Eq. 3.6.1 by the
 139
following equation:
 pY,k+1 (y) = w
 ?? ?
 s:?(H(s))?y
 pX,k (s)
 ??
 ? w
 ?? ?
 s:?(H(s))>y
 pX,k (s)
 ??
 = w? (1 ? FY,k (? (H (x)))) pY,k (? (H (x))) .
 Algorithm 1 is the generic CWO algorithm using Eq. 3.6.1 as the weight
 update equation. Later, two algorithms with different ways of updating the density
 function are described. Although both weight-update methods will use Eq. 3.6.1
 with a chosen distribution rule, they differ in their assignment of the sample weights.
 The performance of the algorithms is measured using asymmetric traveling
 salesman problems, which we will introduce in the section below.
 3.6.1 Numerical Examples: Asymmetric Traveling Salesman Prob-
 lems (ATSPs)
 We apply variations of Algorithm 1 to several asymmetric traveling salesman prob-
 lems (ATSPs). They are taken from the website http://www.iwr.uni-heidelberg.
 de/groups/comopt/software/TSPLIB95. We follow a similar approach as in Hu
 [53], which is outlined below. The reader is reminded here that Algorithm 1 is
 designed for maximization problems, whereas we are searching for the minimum
 distances of ATSPs. The goal in each ATSP problem is to find the minimum length
 of a tour that connects Ncities cities with the same starting and ending cities. For an
 ATSP, we are given an Ncities-by-Ncities distance matrix D, whose (i,j)-th element
 140
Di,j represents the distance from city i to city j. The problem can be mathematically
 stated as:
 min
 x?X
 {Ncities?1?
 i=1
 Dxi,xi+1 +DxNcities ,x1
 }
 ,
 where x :=
 {
 x1, x2, . . . , xNcities , x1
 }
 is an admissible tour and X is the set of all
 admissible tours.
 We use the same approach suggested by Rubinstein [75], and De Boer et al.
 [26] for solving these problems. Each distance matrix D is given an initial state
 probability transition matrix ?P0, whose (i,j)-th element specifies the probability of
 transitioning from city i to city j. At each iteration of the algorithm, there are two
 important steps: 1) generate random admissible tours according to the probability
 transition matrix and evaluate the performance of each sampled tour; 2) update the
 probability transition matrix based on the tours generated from step 1. We denote
 the set of tours generated at the k-th iteration by {xik} , where i ? {1, . . . , Nk}.
 Without loss of generality, we will assume the samples are sorted according to their
 values (i.e., ? (H (xik)) < ?
 (
 H
 (
 x
 j
 k
 ))
 if and only if i < j).
 A detailed discussion of the admissible tour generation process can be found
 in De Boer et al. [26]. The CWO algorithm differs from other algorithms in how it
 updates its transition matrix. At the k-th iteration of CWO, the probability density
 function, pk (?, ?k), parametrized by the transition matrix ?k is given by the equation
 below:
 pk (x, ?k) =
 Ncities?
 l=i
 Ncities?
 i,j
 ?k (i, j) I{x?Xi,j(l)},
 where Xi,j (l) is the set of all tours in X such that the l-th transition is from city i
 141
to city j. We can show that the new transition matrix is updated (i.e., stage 6 in
 Algorithm 1) as
 ?k+1 (i, j) =
 Nk?
 l=1
 (
 pwk+1
 (
 xik
 ))
 I{xik?Xi,j},
 where we denote the updated density by pwk+1 (?) and {xik+1} is generated from
 pk(?, ?k) (i.e., a density function that is parameterized by ?k). The superscript w
 is used to emphasize the dependence of the updated probability mass function on
 the probability weighting function w. The construction of pwk+1 (?) depends on the
 specific weight-update method.
 3.6.1.1 Weight-Update Methods
 In this section, we present several different methods of obtaining pwk+1 (?) from a
 collection of samples {xik} at the k-th step. The first method we introduce is called
 tilted weight update.
 Tilted weight update (CWO_T): The tilted weight-update method is described in
 Algorithm 2. The key idea behind this variation is that we assign the initial weights
 of the samples according to their outcome values: the smaller the value, the more
 initial weight it gets (see stage 2 in Algorithm 2).
 142
Algorithm 2 Tilted Weight Update
 1. Remove all the non-elite samples, i.e., {x?ik} := {xik : H (xik) ? ??k} , where {x?ik}
 is the set of remaining elite samples;
 2. Assign a weight to each element in Y according to the equation:
 pY,k (y) = maxY y ? y?
 Y maxY y ? y
 ,
 where Y := {H (x) |x ? {x?ik}} ;
 3. Assign the updated outcome weights to samples according the following equa-
 tion:
 wik+1 =
 1
 ?Nk,x?ik
 ?
 ??w
 ?
 ?? ?
 y:y??(H(x?ik))
 pY,k (y)
 ?
 ??? w
 ?
 ?? ?
 y:y<?(H(x?ik))
 pY,k (y)
 ?
 ??
 ?
 ??
 ?i ? {1, . . . , ?Nk},
 where ?Nk is the number of elite samples, and ?Nk,x?ik is the number of elements
 in {x?ik} having the same outcome value as x?ik. (We remind the reader that
 w : [0, 1] ? [0, 1] is a probability weighting function.)
 We ran 30 independent experiments for seven ATSPs. In those experiments,
 we used the probability weighting function:
 w (p) := 1 ? (1 ? p)2 .
 The trials are done using Algorithm 1 with the parameters ?0 = ?min = 0.6, N0 =
 1000,  = 1, ? = 2, ? = 0.02, ? = 0.7 and the weight-update scheme in Algorithm
 2. The results are summarized in Table 3.6.1. Ncities is the the number of cities
 for each problem; NTotal is the average number of total samples until the solutions
 stop changing; Hbest is the best known solution; H? is the worst algorithm solution
 from the repeated runs; H? is the best algorithm solution from the repeated runs;
 143
?? and ?? are the percentage deviation of the worst and best algorithm solutions
 from the best known solution, respectively; ? is the average percentage deviation
 of the algorithm solutions from the best known solution. The important thing to
 note about the algorithm is its dependence on the actual outcome values. In the
 next weight-update method, this dependence is eliminated; instead we weight the
 samples uniformly.
 ATSP Ncities NTotal (Std. err.) Hbest H? H? ?? ?? ? (Std. err.)
 ftv33 34 6.59e4 (1.81e4) 1,286 1,379 1,286 0.0723 0.0000 0.0396(0.0279)
 ftv35 36 6.79e4 (1.63e4) 1,473 1581 1473 0.0733 0.0000 0.0195(0.0172)
 ftv38 39 8.81e4 (3.26e4) 1,530 1651 1536 0.0791 0.0039 0.0243(0.0190)
 p43 43 2.80e5 (1.04e5) 5,620 5,636 5,622 0.0028 0.0004 0.0011(0.0007)
 ry48p 48 4.65e5 (2.30e5) 14,422 18,725 14,618 0.2984 0.0136 0.0744(0.0676)
 ft53 53 3.24e5 (1.23e5) 6,905 7844 7059 0.1360 0.0223 0.0590(0.0247)
 ft70 70 7.02e5 (3.32e5) 38,673 39,738 38,760 0.0275 0.00225 0.0130(0.0050)
 Table 3.6.1: Performance of CWO_T on various ATSP problems based on 30 inde-
 pendent replications
 Uniform Weight Update(CWO_U): Tilting assigns the initial weights of the sam-
 ples {xik} using their values. Uniform weighting updating differs from tilting by
 assuming uniform distribution over the samples. Another major difference from the
 above approach is that we no longer only consider elites samples. Instead, we use a
 carefully chosen probability weighting function that smoothly re-weights the sam-
 ples. More specifically in stage 5, we assume a uniform initial density and use the
 weighting function
 w (p) := 10p? + ln (1 + e
 ??) ? ln(1 + e(?1+10p)?)
 10? + ln (1 + e??) ? ln (1 + e9?) , (3.6.3)
 144
where ? is the optimal-seeking factor and ? is the quantile threshold. Eq. 3.6.3
 is chosen as the weighting function due to its connection with the cross-entropy
 algorithm.
 0.0 0.2 0.4 0.6 0.8 1.0
 Prob .
 2
 4
 6
 8
 10
 Weight
 Figure 3.6.1: Derivatives of Eq. 3.6.3 as ? ? ?
 Using equations 3.6.2 and 3.6.3, we modify the generic CWO algorithm by
 altering the way the sample weights are updated. The algorithm has a strong
 connection with the traditional cross-entropy method, which is explained below.
 Algorithm 3 CWO_U Weight Update Algorithm
 1. Calculate the outcome cumulative distribution function (CDF),
 FY,k (? (H (x))), for {xik}, assuming a uniform density
 (
 i.e., pX,k (x) = 1Nk
 )
 ;
 2. Assign the updated weights according the following equation:
 wik+1 =
 1
 Nk,xik
 ?
 ??w
 ?
 ?? ?
 s:?(H(s))??(H(xik))
 pX,k (s)
 ?
 ??? w
 ?
 ?? ?
 s:?(H(s))<?(H(xik))
 pX,k (s)
 ?
 ??
 ?
 ??
 ?i ? {1, . . . , Nk},
 where Nk is the number of samples, and Nk,xik , is the number of elements in{xik} having the same value as xik. (We remind the reader that w : [0, 1] ? [0, 1]
 is a probability weighting function.)
 145
We remind the reader that the density update equation for cross entropy is
 pCEX,k+1 (x) = 1{? (H (s)) > ?}l p
 CE
 X,k (x)
 ? 1{? (H (s)) > ?}pCEX,k (x) , (3.6.4)
 where an indicator function is used to select the elite samples. In fact, the cross-
 entropy equation is just the limiting case8 of the CWO_U algorithm. As we increase
 the optimal-seeking factor, the derivative of Eq. 3.6.3 will approach a step function
 (i.e., Eq. 3.6.4) with its discontinuity occurring at ? = 0.1 (see Fig. 3.6.1).
 Table 3.6.2 contains the results from running 20 trials of CWO_U and CE
 algorithms with the parameters ? = 0.01, ?0 = 0.1, ?min = 0.001, N0 = 1000,  = 0,
 ? = 1, ? = 0.01, and ? = 0.7. Here, N0 is the initial sample size.
 ATSP Ncities NTotal (Std. err.) Hbest H? H? ?? ?? ? (Std. err.)
 ft53 53 90,450(6.0e3) 6,905 7,679 7,037 0.112 0.0191 0.060(0.0244)
 ce_ft53 53 65,100(5.7e3) 6,905 7,676 7,088 0.111 0.0265 0.075(0.0276)
 Table 3.6.2: CWO_U and CE performance Results
 We plot the sorted minimum tour distances obtained from the 20 trials of CE
 and CWO_U algorithms in Fig. 3.6.2. We observe from Fig. 3.6.2 that compared
 with the standard cross-entropy method, our approach does better in every per-
 centile. For example, the 1920th percentile would contain the lowest optimal solution
 obtained among the 20 trials. The 1820th percentile would contain the second lowest
 optimal solution obtained among the 20 trials.
 8as ? ? ?
 146
?
 ? ?
 ?
 ?
 ?
 ? ?
 ?
 ?
 ?
 ? ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ? ?
 ? ?
 0 5 10 15 20
 Trials
 7100
 7200
 7300
 7400
 7500
 7600
 Optimal Value
 ? CE
 ? CWO_U
 Figure 3.6.2: CE vs. CWO_U Sorted Trial Runs
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ? ?
 ? ? ?
 ? ? ?
 ? ?
 ? ? ?
 ? ?
 ? ? ? ? ? ? ? ?
 ?
 ?
 ? ? ?
 ?
 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
 ? ? ?
 ? ? ?
 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
 ?
 ??
 ??
 ????
 ?
 ??
 ??
 ??
 ???
 ?
 ???
 ???????
 ??
 ??????
 ???????
 ???
 ?????????????????????????????????????
 20 40 60 80
 Iterations
 10 000
 15 000
 20 000
 Values
 ? CWO_U
 ? CE
 Figure 3.6.3: One trial of CE vs. CWO_U
 In Figure 3.6.3, which displays a typical convergence of a single run of CE vs.
 that of CWO_U, we observe that although CE converges faster at the beginning,
 CWO_U is able to eventually overtake CE and finishes at a lower value.
 147
3.7 Conclusion
 In the first part of this chapter, we proved the convergence of CWO-based algo-
 rithms. The proofs provided a rigorous mathematical foundation for the two practi-
 cal algorithms we proposed in the numerical examples section. These two algorithms
 are variations of the generic CWO algorithm described in Algorithm 1. The two
 algorithm variations, CWO_T and CWO_U, differ by how they update their proba-
 bility density functions over the solution space for each iteration. The first approach,
 CWO_T, weights the samples according to their outcome values. On the other
 hand, CWO_U, uniformly weights the samples. We benchmarked the performance
 of the CWO_T algorithm and summarized the results in Table 3.6.1. Although the
 numeric values are quite satisfactory, we wanted to see if we could improve these
 results. This effort led us to the development of the second approach, CWO_U,
 which we consider as the preferred implementation of the CWO-base algorithm.
 Perhaps the most surprising fact is that by not taking into account the outcome
 values of the samples, we are able to achieve better performance results. Even more
 interesting is the fact that the standard cross-entropy approach is just a limiting
 case of the CWO_U approach. Comparing the numerical results of CWO_U with
 those of CE, we believe our algorithm is better at obtaining an optimal solution (see
 Fig. 3.6.2). Of course, the improvement in performance is at the cost of increasing
 computational costs.
 148
Chapter 4
 Contributions and Future Work
 4.1 Contributions
 A new family of performance criteria has been proposed in the first part of this the-
 sis. These performance criteria are inspired by cumulative prospect theory, which
 has substantial empirical support. They include the performance criteria used in
 classical risk-sensitive control problems (e.g., expected utility). We proved the class
 of non-convex risk-sensitive control problems is still solvable via dynamic program-
 ming. We investigated both finite-horizon and infinite-horizon cases, and offered
 numerical examples to demonstrate the applications of our approach.
 The second part of this thesis presented a novel approach for solving stochastic
 global optimization problems. This new approach, cumulative weighting optimiza-
 tion, is also inspired by cumulative prospect theory. We proved the convergence
 to an optimal solution of the cumulative weighting optimization algorithms given
 a mild assumption on the initial condition. In addition, the algorithms in CWO
 include the well-known cross-entropy optimization algorithm. Since we have proved
 the convergence for all CWO algorithms, we naturally have obtained a convergence
 proof for the cross-entropy algorithm, which to the best of our knowledge, has not
 been done before. In addition, we presented the numerical analysis of our algo-
 rithms, where we compared the performance of two weight updating schemes. We
 149
also compared the performance of our algorithms against that of the cross-entropy.
 4.2 Future Work
 In the future, we would like to develop a new game theory framework which cap-
 tures risk-sensitivity. The effects of incorporating the CPT-based distortions could
 provide a novel perspective into the well-established field of game theory. Take, for
 example, the classic prisoner?s dilemma game. The payoff matrix for the players is
 given in Table 4.2.1. John von Neumann and Oskar Morgenstern [84] showed that
 Player 2
 Decision Cooperate Defect
 Player 1 Cooperate R=3 R=3 S=0 T=5Defect T=5 S=0 P=1 P=1
 R:Reward S:Sucker T:Temptation P:Penalty
 Table 4.2.1: Classic Prisoner?s Dilemma Problem
 a mixed strategy Nash equilibrium exists for any zero-sum game with a finite set of
 actions. Although the prisoner?s dilemma game is not a zero-sum game, analyzing
 how it reacts to mixed strategies is still important. The analysis of the effects of
 mixed strategies on the prisoner?s dilemma starts with generating the ?risk-neutral?
 (i.e., non-distorted) discretized mixed strategy reward table for player 1. In the
 matrix below, each element is the expected reward value given the probabilities of
 player 1 and player 2 cooperate. The i-th row and j-th column entry is calculated
 as the expected reward with the probability i?15 that player 1 cooperating and the
 probability j?15 that player 2 cooperating, where i, j ? {1, 2, . . . , 6}.
 150
?
 ???????????????????
 1. 1.8 2.6 3.4 4.2 5.
 0.8 1.56 2.32 3.08 3.84 4.6
 0.6 1.32 2.04 2.76 3.48 4.2
 0.4 1.08 1.76 2.44 3.12 3.8
 0.2 0.84 1.48 2.12 2.76 3.4
 0. 0.6 1.2 1.8 2.4 3.
 ?
 ???????????????????
 By introducing probability weighting distortion into the prisoner?s dilemma
 game, we alter the analogous risk-sensitive expected reward matrix:
 ?
 ???????????????????
 1. 1.00021 1.35889 3.28597 4.72747 5.
 0.931868 0.962743 1.19336 2.63784 4.20448 4.86374
 0.571493 0.758082 1.06917 2.19005 3.35859 4.14299
 0.0897219 0.354482 0.850323 2.03271 2.87765 3.17944
 0.0000523055 0.0423175 0.533821 1.90096 2.82637 3.0001
 0. 0.000156917 0.269166 1.71448 2.7956 3.
 ?
 ???????????????????
 .
 For the matrix displayed above, the weighting function used is
 w (x) := exp[?3.0 (? log (x))2.5], (4.2.1)
 and the risk adjusted expected reward is calculated as:
 E (p1, p2) := 5 (w ((1 ? p1) p2)) + 3 (w ((1 ? p1) p2 + p1p2) ? w ((1 ? p1) p2))
 +1 (w (p1p2 + (1 ? p1) p2 + (1 ? p1) (1 ? p2)) ? w (p1p2 + (1 ? p1) p2)) ,
 151
where p1 and p2 are the probabilities that player 1 and player 2 cooperate, respec-
 tively.
 The rate that the expected reward decreases down each column for the risk-
 neutral matrix is constant, whereas the same rate for the risk-sensitive matrix slows
 down as we traverse toward the bottom of each column. The particular function
 (Eq. 4.2.1) used represents risk-aversion, hence we see from this example how risk-
 sensitivity changes player 1?s expected payoff.
 Aside from having Nash equilibria, the prisoner?s dilemma game could also
 have -equilibria. An -equilibrium is formally defined below.
 Definition 28. Let G =
 (
 N,A = A1 ? ? ? ? ? AN , r : A ? RN
 )
 be a N-player game
 with action sets Ai for each player i and a reward function r. The space of probability
 distributions over Ai is denoted by P (Ai). Let ri (pi) denote the payoff to player i
 when the policy pi = {pi1 ? ? ? ? ? piN} is played, where pii ? P (Ai). A policy pi is an
 -Nash Equilibrium for the game G if
 ri (pi) ? ri (pi?i, pi?i) ? , ?pi?i ? P (Ai) , i ? {1, . . . , N} ,
 where pi
 ?i denotes all the mixed strategies except the i-th policy.
 When  = 0, an -equilibrium is exactly the well-known Nash equilibrium. By
 picking an  > 0, two additional equilibria are found in the matrix above. More
 specifically, if player 1 is risk-averse, with an appropriate , the player could always
 cooperate, which is not a Nash equilibrium. This example is only one of the many
 ways in which, by introducing risk-sensitivity into the game, we alter the standard
 152
conclusion. It is interesting to note that the same effect cannot be achieved by using
 a utility function.
 In addition to game theory, we will apply the CPT-based risk-sensitive mea-
 sures to classic control problems and study the structure of the optimal policies
 obtained. Systems that are human-centric might find it beneficial to be controlled
 by an optimal controller derived using our risk-sensitive performance measure. In
 the future, we will investigate the effects of CPT-based risk-sensitive measures for
 iterated games.
 153
Appendix A
 Prospect Theory
 A.1 St. Petersburg Paradox
 The example below is from [86].
 Example 13. [St. Petersburg paradox]. Consider the following game. A fair coin
 will be flipped until the first heads shows up. If heads shows up at the k-th flip,
 then you receive $2k. Thus, immediate heads gives only $2, and after each tails the
 amount doubles. After 19 tails you are sure to be a millionaire. Think for yourself
 how much it would be worth to you to play this game. The expected value of the
 game is
 1
 2
 ? 2 +
 1
 4
 ? 4 +
 1
 8 ? 8 +
 1
 16 ? 16 + ? ? ? = 1 + 1 + 1 + 1 + ? ? ? = ?.
 Thus if you maximize expected value, then this game is worth more to you
 than any amount of money. In reality, people pay considerably less to participate
 in the game, something like $5.
 A.2 Axiomatization of Expected Utility:
 We need a little notation before we can write down the axioms. Consider two
 outcomes L1 and L2 with known probabilities. We use the notation L1 < L2 to
 154
mean that the decision maker prefers L1 over L2 or considers them to be equally
 preferred. We assume that the utility here is continuous and strictly increasing.
 Axiom 1. 1. Completeness: For any two outcomes L1 and L2, either L1 < L2 or L2 < L2
 or both.
 2. Transitivity: For any three outcomes L1, L2, and L3, if L1 < L2, and L2 < L3,
 then L1 < L3.
 3. Continuity: For any three outcomes L1 < L2 < L3, there exist ?, ? ? (0, 1)
 such that ?L1 + (1 ? ?)L3 < L2, and L2 < ?L1 + (1 ? ?)L3.
 4. Substitution (Independence) Savage[80]: For any L1, L2 and L3, and any ? ?
 (0, 1) , L1 < L2 if and only if ?L1 + (1 ? ?)L3 < ?L2 + (1 ? ?)L3.
 Furthermore, von Neumann and Morgenstern [85] proved the following:
 L1 < L2 if and only if
 n?
 i=1
 p1iu
 (
 x1i
 )
 ?
 n?
 i=1
 p2ju
 (
 x2j
 )
 ,
 where p1i is the probability for the i-th outcome of L1, x1i is the value for the i-th
 outcome of L1 and u is the utility function.
 155
Appendix B
 Multifunctions and Selectors
 Our main source of reference for this section is [48].
 Let X and A be (nonempty) Borel spaces.
 Definition 29. A multifunction (also known as a correspondence or set-valued
 mapping) ? from X to A is a function such that ? (x) is a nonempty subset of A
 for all x ? X. (A single-valued mapping ? : X ? A is of course an example of a
 multifunction.) The graph of the multifunction ? is the subset of X ?A defined as
 Gr (?) := {(x, a) |x ? X, a ? ? (x)} .
 A multifunction could have one of the properties described below. For every
 subset B of A, let ??1 [B] := {x ? X|? (x) ? B 6= ?} .
 Definition 30. A multifunction ? from X to A is said to be
 (a) Borel measurable if ??1 [G] is a Borel subset of X for every open set G ? A;
 (b) upper semi-continuous (u.s.c) if ??1 [F ] is closed in X for every closed set
 F ? A;
 (c) lower semi-continuous (l.s.c)if ??1 [G] is open in X for every open set
 G ? A;
 (d) continuous if it is both u.s.c and l.s.c.
 156
A multifunction ? is said to be closed-valued (resp. compact-valued) if ? (x)
 is a closed (resp. compact) set for all x ? X. The multifunction is said to be closed
 if its graph is closed.
 Proposition 5. Let ? be a compact-valued multifunction from X to A. Then the
 following statements are equivalent:
 (a) ? is Borel-measurable;
 (b) ??1 [F ] is a Borel subset of X for every closed set F ? A;
 (c) Gr(?) is a Borel subset of X ? A;
 (d) ? is a measurable function from X to the space of nonempty compact
 subsets of A topologized by the Hausdorff metric.
 Proof. See [50] and [81].
 Throughout the remainder of this appendix, ? is a given Borel-measurable
 multifunction from X to A, and F denotes the set of (single-valued) measurable
 multifunction from X to A, and F denotes the set of (single-valued) measurable
 functions f : X ? A with f (x) ? ? (x) for all x ? X. A function f ? F is called a
 selector (or measurable selector or choice or decision function) for the multifunction
 ?. Moreover, v : Gr (?) ? R is a given measurable function and
 v? (x) := inf
 ?(x) v (x, a) , x ? X.
 If v (x, ?) attains its minimum at some point in ? (x) , we write ?min? instead of
 ?inf.?
 157
Proposition 6. Suppose that ? is compact-valued.
 (a) If v (x, ?) is l.s.c. on ? (x) for every x ? X, then exists a selector f ? ? F
 such that
 v (x, f ? (x)) = v? (x) = min
 ?(x)
 v (x, a) , ?x ? X
 and v?is measurable. Similarly, if v (x, ?) is u.s.c. on ? (x) for every x ? X, then
 exists a selector f ? ? F such that
 v (x, f ? (x)) = v? (x) = max
 ?(x)
 v (x, a) , ?x ? X
 and v? is measurable.
 (b) If ? is u.s.c and v is l.s.c and bounded below on Gr(?) , then there exists
 f ? ? F for which
 v (x, f ? (x)) = v? (x) = min
 ?(x) v (x, a) , ?x ? X
 holds, and v? is l.s.c. and bounded below on X.
 158
Appendix C
 Spaces of Probability Measures
 The content of this section can be found in the appendices of [15] and [13].
 C.1 Polish Spaces
 A Polish space, X , is a topological space which is separable and admits a complete
 metrization. Examples of such spaces are: separable Banach spaces, compact metric
 spaces, the space D [0, 1] of cadlag path from [0, 1] to R with Skorohod topology.
 Let X be a Polish space and d (?, ?) a complete metric on it.
 Definition 31. A probability measure P on X is said to be tight if for each  > 0,
 there exists a compact set K ? X with P (K) ? 1 ? . Analogously, a family
 P?, ? ? I, (I being an index set) is said to be tight if the above holds for all P?
 uniformly in ?, i.e., the set K above can be chosen to be the same for all ?.
 C.2 The Prohorov Topology
 Let Cb (X ) , PX denote respectively the space of bounded continuous real valued
 functions on X , and the space of probability measures on X . Endow Cb (X ) with
 the supremum norm ??? . PX will be given the topology for with a local base at
 159
P ? PX is given by the sets of the type
 {
 Q ? PX |
 ????
 ?
 fidQ ?
 ?
 fidP
 ???? < i, 1 ? i ? k
 }
 for some k ? 1, i > 0 and fi ? Cb (X ) for 1 ? i ? k.
 It is easily seen that this topology is Hausdorff and is coarser than the one
 induced by the total variation norm. It is called the Prohorov topology or the
 topology of weak convergence. Some other possible choices for the local basis at
 P ? PX are given below.
 {Q ? PX |Q (Fi) < P (Fi) + i, 1 ? i ? k} , Fi ? X closed
 {Q ? PX |Q (Gi) > P (Gi) ? i, 1 ? i ? k} , Gi ? S open
 {Q ? PX | |Q (Ai) ? P (Ai)| < i, 1 ? i ? k} , Ai ? S
 satisfy P (?Ai) = 0 where ?Ai is the boundary of Ai,
 {
 Q ? PX |
 ????
 ?
 fidP ?
 ?
 fidQ
 ???? < i, 1 ? i ? k
 }
 ,
 fi are bounded and uniformly continuous with respect to the metric d. Here,  >
 0, k ? 1.
 C.3 Compactness in PX
 Theorem 20. A subset L ? PX is relatively compact if and only if it is tight.
 160
C.4 Metrics on PX
 Definition 32. For any  > 0 and any Borel setA ? X , letA = {x ? X |d (x,A) < } .
 For ?, ? ? PX , we define the metric
 d (?, ?) = inf
 
 { > 0|? (A) ? ? (A) + , ? (A) ? ? (A) +  for all Borel subset A of S} .
 Theorem 21. d(?, ?) defines a metric on PX consistent with the Prohorov topology.
 161
References
 [1] O. Alagoz, L. M. Maillart, A. J. Schaefer, and M. S. Roberts, The
 optimal timing of living-donor liver transplantation, Management Science, 50
 (2004), pp. 1420?1430.
 [2] M. Allais, Le comportement de lH?omme rationnel devant le risque: Cri-
 tique des postulats et axiomes de l?cole americaine, Econometrica, 21 (1953),
 pp. 503?546. ArticleType: research-article / Full publication date: Oct., 1953
 / Copyright ? 1953 The Econometric Society.
 [3] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, Coherent measures
 of risk, Mathematical Finance, 9 (1999), pp. 203?228.
 [4] P. Artzner, F. Delbaen, J.-M. Eber, D. Heath, and H. Ku, Coherent
 multiperiod risk adjusted values and Bellman?s principle, Annals of Operations
 Research, 152 (2007), pp. 5?22.
 [5] R. Bellman, On the theory of dynamic programming, Proceedings of the
 National Academy of Sciences of the United States of America, 38 (1952),
 pp. 716?719. PMID: 16589166 PMCID: PMC1063639.
 [6] , Applied Dynamic Programming, Princeton University Press, Princeton,
 1957.
 162
[7] D. Bernoulli, Exposition of a new theory on the measurement of risk, Econo-
 metrica, 22 (1954), pp. 23?36. ArticleType: research-article / Full publication
 date: Jan., 1954 / Copyright ? 1954 The Econometric Society.
 [8] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Athena
 Scientific, 2nd ed., nov 2000.
 [9] D. P. Bertsekas, Dynamic Programming and Optimal Control: Approximate
 dynamic programming, Athena Scientific, 2012.
 [10] D. P. Bertsekas and S. E. Shreve, Stochastic Optimal Control: The
 Discrete Time Case, Academic Press, Nov. 1978.
 [11] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming,
 Athena Scientific, 1996.
 [12] N. P. Bhatia and G. P. Szeg?, Stability Theory of Dynamical Systems,
 Springer, 1970.
 [13] P. Billingsley, Convergence of Probability Measures, Wiley-Interscience,
 2 ed., July 1999.
 [14] F. Black and M. Scholes, The pricing of options and corporate liabilities,
 Journal of Political Economy, 81 (1973), pp. 637?654. ArticleType: research-
 article / Full publication date: May - Jun., 1973 / Copyright ?1973 The Uni-
 versity of Chicago Press.
 [15] , Topics in Controlled Markov Chains, CRC Press, Apr. 1991.
 163
[16] , Probability Theory: An Advanced Course, Springer, Oct. 1995.
 [17] M. Bouakiz and M. J. Sobel, Inventory control with an exponential utility
 criterion, Operations Research, 40 (1992), pp. 603?608. ArticleType: research-
 article / Full publication date: May - Jun., 1992 / Copyright ?1992 INFORMS.
 [18] O. ?avu? and A. Ruszczy?ski, Risk-averse control of undiscounted transient
 markov models, arXiv preprint arXiv:1203.5437, (2012).
 [19] A. Chateauneuf and P. Wakker, An axiomatization of cumulative prospect
 theory for decision under risk, Journal of Risk and Uncertainty, 18 (1999),
 pp. 137?145.
 [20] P. Cheridito, F. Delbaen, and M. Kupper, Coherent and convex mone-
 tary risk measures for bounded c?dl?g processes, Stochastic Processes and their
 Applications, 112 (2004), pp. 1?22.
 [21] , Coherent and convex monetary risk measures for unbounded c?dl?g pro-
 cesses, Finance and Stochastics, 10 (2006), pp. 427?448.
 [22] P. Cheridito, F. Delbaen, and M. Kupper, Dynamic monetary risk
 measures for bounded discrete-time processes, Electronic Journal of Probability,
 11 (2006), pp. 57?106.
 [23] K.-J. Chung and M. J. Sobel, Discounted MDP?s: distribution functions
 and exponential utility maximization, SIAM journal on control and optimiza-
 tion, 25 (1987), pp. 49?62.
 164
[24] S. P. Coraluppi and S. I. Marcus, Risk-sensitive and minimax control
 of discrete-time, finite-state markov decision processes, Automatica, 33 (1999),
 pp. 301?309.
 [25] , Mixed risk-neutral/minimax control of discrete-time, finite-state markov
 decision processes, IEEE Transactions on Automatic Control, 45 (2000), pp. 528
 ?532.
 [26] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein,
 A tutorial on the cross-entropy method, Annals of Operations Research, 134
 (2005), pp. 19?67.
 [27] B. de Finetti, Logical foundations and measurement of subjective probability,
 Acta Psychologica, 34 (1970), pp. 129?145.
 [28] F. Delbaen and E. T. Hochschule, Coherent risk measures on general
 probability spaces, in In essays in honour of Dieter Sondermann, Springer, 2002,
 pp. 1?37.
 [29] E. Diecidue, U. Schmidt, and H. Zank, Parametric weighting functions,
 Journal of Economic Theory, 144 (2009), pp. 1102?1118.
 [30] P. Dupuis, M. R. James, and I. Petersen, Robust properties of risk-
 sensitive control, Mathematics of Control, Signals, and Systems (MCSS), 13
 (2000), pp. 318?332.
 [31] A. Eichhorn and W. R?misch, Polyhedral risk measures in stochastic pro-
 gramming, SIAM Journal of Optimization, 16 (2005), pp. 69?95.
 165
[32] E. A. Feinberg and A. Shwartz, Handbook of Markov Decision Processes:
 Methods and Applications, Springer, 2002.
 [33] E. Fern?ndez-Gaucherand and S. I. Marcus, Risk-sensitive optimal
 control of hidden markov models: structural results, IEEE Transactions on
 Automatic Control, 42 (1997), pp. 1418 ?1422.
 [34] P. C. Fishburn, Utility theory for decision making, tech. rep., DTIC Docu-
 ment, June 1970.
 [35] , The axioms of subjective probability, Statistical Science, 1 (1986),
 pp. 335?345. Mathematical Reviews number (MathSciNet): MR858514.
 [36] W. H. Fleming, Optimal long term growth rate of expected utility of wealth,
 The Annals of Applied Probability, 9 (1999), pp. 871?903. Mathematical
 Reviews number (MathSciNet): MR1722286; Zentralblatt MATH identifier:
 0962.91036.
 [37] W. H. Fleming and S. J. Sheu, Risk-sensitive control and an optimal in-
 vestment model, Mathematical Finance, 10 (2000), pp. 197?213.
 [38] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Vis-
 cosity Solutions, Springer, 2nd ed., Nov. 2005.
 [39] H. F?llmer and I. Penner, Convex risk measures and the dynamics of their
 penalty functions, Statistics & Decisions, 24 (2006), pp. 61?96.
 166
[40] H. F?llmer and A. Schied, Convex measures of risk and trading constraints,
 Finance and Stochastics, 6 (2002), pp. 429?447.
 [41] H. F?llmer, A. Schied, and T. Lyons, Stochastic finance. an introduction
 in discrete time, The Mathematical Intelligencer, 26 (2004), pp. 67?68.
 [42] T. D. Frank, Nonlinear Fokker-Planck Equations - Fundamentals and Appli-
 cations, Springer-Verlag, 2005.
 [43] M. Frittelli and G. Scandolo, Risk measures and capital requirments for
 processes, Mathematical Finance, 16 (2006), pp. 589?612.
 [44] U. G. Haussmann, Some examples of optimal stochastic controls or: The
 stochastic maximum principle at work, SIAM Review, 23 (1981), pp. 292?307.
 ArticleType: research-article / Full publication date: Jul., 1981 / Copyright ?
 1981 Society for Industrial and Applied Mathematics.
 [45] X. D. He and X. Y. Zhou, Portfolio choice via quantiles, Mathematical
 Finance, 21 (2011), pp. 203?231.
 [46] D. Hern?ndez-Hern?ndez and S. I. Marcus, Risk sensitive control of
 markov processes in countable state space, Systems & Control Letters, 29 (1996),
 pp. 147?155.
 [47] , Existence of risk-sensitive optimal stationary policies for con-
 trolled markov processes, Applied Mathematics & Optimization, 40 (1999),
 pp. 273?285.
 167
[48] O. Hern?ndez-Lerma and J. B. Lasserre, Discrete-Time Markov Control
 Processes: Basic Optimality Criteria, Springer, 1 ed., Dec. 1996.
 [49] O. Hern?ndez-Lerma and J.-B. Lasserre, Further topics on discrete-time
 Markov control processes, Springer, June 1999.
 [50] C. J. Himmelberg, T. Parthasarathy, and F. S. VanVleck, Optimal
 plans for dynamic programming problems, Mathematics of Operations Research,
 1 (1976), pp. 390?394.
 [51] J. Hofbauer, J. Oechssler, and F. Riedel, Brown-von Neumann-Nash
 dynamics: The continuous strategy case, Games and Economic Behavior, 65
 (2009), pp. 406?429.
 [52] R. A. Howard and J. E. Matheson, Risk-sensitive markov decision pro-
 cesses, Management Science, 18 (1972), pp. 356?369. ArticleType: research-
 article / Issue Title: Theory Series / Full publication date: Mar., 1972 / Copy-
 right ?1972 INFORMS.
 [53] J. Hu, M. C. Fu, and S. I. Marcus, A model reference adaptive search
 method for global optimization, Operations Research, 55 (2007), pp. 549?568.
 [54] J. Hu, Y. Wang, E. Zhou, M. C. Fu, and S. I. Marcus, A survey of
 some model-based methods for global optimization, in Optimization, Control,
 and Applications of Stochastic Systems, D. Hern?ndez-Hern?ndez and J. A.
 Minj?rez-Sosa, eds., Birkh?user Boston, Boston, 2012, pp. 157?179.
 168
[55] H. W. James and E. J. Collins, An analysis of transient markov decision
 processes, Journal of Applied Probability, 43 (2006), pp. 603?621. Mathematical
 Reviews number (MathSciNet): MR2274787; Zentralblatt MATH identifier:
 1145.90099.
 [56] S. C. Jaquette, Markov decision processes with a new optimality criterion:
 Discrete time, The Annals of Statistics, 1 (1973), pp. 496?505. Mathemati-
 cal Reviews number (MathSciNet): MR378839; Zentralblatt MATH identifier:
 0259.90054.
 [57] S. C. Jaquette, A utility criterion for markov decision processes, Manage-
 ment Science, 23 (1976), pp. 43?49. ArticleType: research-article / Full publi-
 cation date: Sep., 1976 / Copyright ?1976 INFORMS.
 [58] H. Jasiulewicz, Application of mixture models to approximation of age-
 at-death distribution, Insurance: Mathematics and Economics, 19 (1997),
 pp. 237?241.
 [59] D. Kahneman and A. Tversky, Prospect theory: an analysis of decision
 under risk, National Emergency Training Center, 1979.
 [60] S. Kl?ppel and M. Schweizer, Dynamic indifference valuation via convex
 risk measures, Mathematical Finance, 17 (2007), pp. 599?627.
 [61] V. N. Kolokoltsov, Nonlinear Markov Processes and Kinetic Equations,
 Cambridge University Press, July 2010.
 169
[62] , Markov Processes, Semigroups and Generators, Walter de Gruyter, Mar.
 2011.
 [63] H. J. Kushner, Introduction to Stochastic Control, Holt, Rinehart and Win-
 ston, 1971.
 [64] J. Leitner, A short note on second-order stochastic dominance preserving
 coherent risk measures, Mathematical Finance, 15 (2005), pp. 649?651.
 [65] J. Oechssler and F. Riedel, Evolutionary dynamics on infinite strategy
 spaces, Economic Theory, 17 (2001), pp. 141?162.
 [66] S. Onay and A. ?nc?ler, Intertemporal choice under timing risk: An ex-
 perimental approach, Journal of Risk and Uncertainty, 34 (2007), pp. 99?121.
 [67] S. Peng, A general stochastic maximum principle for optimal control problems,
 SIAM Journal on Control and Optimization, 28 (1990), p. 966?979. ACM ID:
 84050.
 [68] H. Pham, Continuous-time Stochastic Control and Optimization with Financial
 Applications, Springer, 1 ed., July 2009.
 [69] S. R. Pliska, Dynamic Programming and Its Applications, Academic Press,
 1978, ch. On the transient case for Markov decision chains with general state
 spaces, pp. 335?349.
 [70] S. R. Pliska, A stochastic calculus model of continuous trading: Optimal
 portfolios, Mathematics of Operations Research, 11 (1986), pp. 371?382. Ar-
 170
ticleType: research-article / Full publication date: May, 1986 / Copyright ?
 1986 INFORMS.
 [71] T. Post, M. J. van den Assem, G. Baltussen, and R. H. Thaler, Deal
 or no deal? decision making under risk in a large-payoff game show, American
 Economic Review, 98 (2008), pp. 38?71.
 [72] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of
 Dimensionality, John Wiley & Sons, Sept. 2007.
 [73] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic
 Programming, John Wiley & Sons, Inc., New York, NY, USA, 1st ed., 1994.
 [74] F. Riedel, Dynamic coherent risk measures, Stochastic Processes and their
 Applications, 112 (2004), pp. 185?200.
 [75] R. Y. Rubinstein, Combinatorial optimization, cross-entropy, ants and rare
 events, in Stochastic Optimization: Algorithms and Applications, P. M. P.
 S. Uryasev, ed., Kluwer Academic Publishers, Dordrecht, The Netherlands,
 2001, pp. 304?358.
 [76] A. Ruszczy?ski, Risk-averse dynamic programming for markov decision pro-
 cesses, Mathematical Programming, 125 (2010), pp. 235?261.
 [77] A. Ruszczy?ski and A. Shapiro, Conditional risk mappings, Mathematics
 of Operations Research, 31 (2006), pp. 544?561. ArticleType: research-article
 / Full publication date: Aug., 2006 / Copyright ?2006 INFORMS.
 171
[78] , Optimization of convex risk functions, Mathematics of Operations Re-
 search, 31 (2006), pp. 433?452. ArticleType: research-article / Full publication
 date: Aug., 2006 / Copyright ?2006 INFORMS.
 [79] , Optimization of risk measures, in Probabilistic and Randomized Methods
 for Design under Uncertainty, G. Calafiore and F. Dabbene, eds., Springer
 London, 2006, pp. 119?157.
 [80] L. J. Savage, The Foundation of Statistics, Courier Dover Publications, 1972.
 [81] M. Sch?l, Conditions for optimality in dynamic programming and for the
 limit of n-stage optimal policies to be optimal, Probability Theory and Related
 Fields, 32 (1975), pp. 179?196.
 [82] C. Starmer, Developments in non-expected utility theory: The hunt for a
 descriptive theory of choice under risk, Journal of Economic Literature, 38
 (2000), pp. 332?382. ArticleType: research-article / Full publication date:
 Jun., 2000 / Copyright ? 2000 American Economic Association.
 [83] A. Tversky and D. Kahneman, Advances in prospect theory: Cumula-
 tive representation of uncertainty, Journal of Risk and Uncertainty, 5 (1992),
 pp. 297?323.
 [84] J. von Neumann and O. Morgenstern, The Theory of Games and Eco-
 nomic Behavior, The Theory of Games and Economic Behavior, Princeton
 University Press, 1947.
 172
[85] J. von Neumann and O. Morgenstern, Theory of Games and Economic
 Behavior (Commemorative Edition), Princeton University Press, Mar. 2007.
 [86] P. P. Wakker, Prospect Theory: For Risk and Ambiguity, Cambridge Uni-
 versity Press, July 2010.
 [87] Y. Wang, Simulation-Based Methods for Stochastic Control and Global Opti-
 mization, PhD thesis, University of Maryland - College Park, 2011.
 [88] Y. Wang, M. C. Fu, and S. I. Marcus, Model-based evolutionary optimiza-
 tion, in Proceedings of the Winter Simulation Conference, WSC ?10, Winter
 Simulation Conference, 2010, pp. 1199??1210.
 [89] J. Yong and X. Y. Zhou, Stochastic Controls: Hamiltonian Systems and
 HJB Equations, Springer, 1 ed., June 1999.
 [90] E. Zeidler, Nonlinear Functional Analysis and Its Applications: Part 2 B:
 Nonlinear Monotone Operators, Springer, Dec. 1989.
 173