ABSTRACT

Title of dissertation: STOCHASTIC SIMULATION: NEW STOCHASTIC APPROXIMATION METHODS AND SENSITIVITY ANALYSES

Marie Chau, Doctor of Philosophy, 2015

Dissertation directed by: Prof. Michael C. Fu, R.H. Smith School of Business & Institute of Systems Research

In this dissertation, we propose two new types of stochastic approximation (SA) methods and study the sensitivity of SA and of a stochastic gradient method to various input parameters. First, we summarize the most common stochastic gradient estimation techniques, both direct and indirect, as well as the two classical SA algorithms, Robbins-Monro (RM) and Kiefer-Wolfowitz (KW), followed by some well-known modifications to the step size, output, gradient, and projection operator.

Second, we introduce two new stochastic gradient methods in SA for univariate and multivariate stochastic optimization problems. Under a setting where both direct and indirect gradients are available, our new SA algorithms estimate the gradient using a hybrid estimator, which is a convex combination of a symmetric finite difference-type gradient estimate and an average of two associated direct gradient estimates. We derive variance-minimizing weights that lead to desirable theoretical properties and prove convergence of the SA algorithms.

Next, we study the finite-time performance of the KW algorithm and its sensitivity to the step size parameter, along with two of its adaptive variants, namely Kesten's rule and scaled-and-shifted KW (SSKW). We conduct a sensitivity analysis of KW and explore the tightness of a mean-squared error (MSE) bound for quadratic functions, a relevant issue for determining how long to run an SA algorithm. Then, we propose two new adaptive step size sequences inspired by both Kesten's rule and SSKW, which address some of their weaknesses. Instead of using one step size sequence, our adaptive step size is based on two deterministic sequences, and the step size used in the current iteration depends on the perceived proximity of the current iterate to the optimum. In addition, we introduce a method to adaptively adjust the two deterministic sequences.

Lastly, we investigate the performance of a modified pathwise gradient estimation method that is applied to financial options with discontinuous payoffs and, in particular, used to estimate the Greeks, which measure the rate of change of (financial) derivative prices with respect to underlying market parameters and are central to financial risk management. The newly proposed kernel estimator relies on a smoothing bandwidth parameter. We explore the accuracy of the Greeks with varying bandwidths and investigate the sensitivity of a proposed iterative scheme that generates an estimate of the optimal bandwidth.

STOCHASTIC SIMULATION: NEW STOCHASTIC APPROXIMATION METHODS AND SENSITIVITY ANALYSES

by Marie Chau

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2015

Advisory Committee:
Prof. Michael C. Fu, Chair/Advisor
Prof. Steven I. Marcus
Prof. Kasso A. Okoudjou
Prof. Ilya O. Ryzhov
Prof. Paul J. Smith

© Copyright by Marie Chau, 2015

To my loving and supportive parents.

Acknowledgments

First, I would like to express my sincerest gratitude and deepest appreciation to my advisor, Prof. Michael C. Fu, for the tremendous amount of support, guidance, feedback, and faith throughout this entire journey.
Without him, this dissertation would not be possible. Prof. Fu is not only a highly respected scholar and an excellent teacher who genuinely cares about his students, but also an amazing mentor, always with the best intentions. I admire his breadth and depth of knowledge, patience, humility, authenticity, honesty, compassion, optimism, and innate good-hearted nature. For this experience and the opportunity to work with him, I will be forever grateful.

Next, I would like to thank the other committee members, Professors Steven I. Marcus, Kasso A. Okoudjou, Ilya O. Ryzhov, and Paul J. Smith, for taking the time to read this dissertation and attend my defense. Special thanks to Professors Patrick M. Fitzpatrick, Eric V. Slud, and David H. Hamilton for their encouragement and willingness to help in their classes as well as in my job search.

Last, but certainly not least, I would like to thank the people I've met at UMD who have since become some of my nearest and dearest friends. A very special thanks to Patrick Sodré Carlos, whom I met by chance in differential equations during summer school in 2007 and who is now one of my closest friends and biggest supporters. He is always willing to help not only me but others as well, and he is one of the main reasons why I'm still here. Another special thanks to Karamatou Yacoubou Djima for her loyal friendship from day one of boot camp. Graduate school would not have been the same without her, from our countless food outings, endless hours of studying for quals, interesting conversations, fun times, and many laughs. Also a very special thanks to Huashuai Qu, Xuan Liu, Jong Jun Lee, Zhixin Lu, Ran Ji, and Anusha Dandapani for being amazing! Each of you has made a dent in my life in your own special way. Thanks to the rest of my 1305 officemates. Many thanks to Rhyneta Gumbs, who has a very kind heart and is always looking out for the students. Thanks to the other girls' night math ladies, Jennifer Clarkson, Clare Wickman, and Hana Ueda, for your emotional support and regular dinner outings. Also, thanks to the rest of the boot camp class of 2008 for unforgettable memories. Thanks to Hisham Talukder for the many laughs, Temba and Joe for our fun outings, Lucia Simonelli for our salsa nights, Yimei Fan for fun times, Dana Botesteanu for good conversations over dinner/drinks, and Changhui Tan and Wenqing Hu for their patience and invaluable math discussions. Also, thanks to Alverda McCoy for always being so pleasant and flexible, saving me from many administrative disasters.

Most of all, I thank my parents. Words can't begin to explain how grateful I am for all they've done throughout my entire life. Their morals, values, determination, patience, and supportive nature have shaped me into the person I am today.

Table of Contents

List of Figures
List of Notations
1 Introduction
  1.1 Motivating Examples
    1.1.1 Stochastic Approximation
    1.1.2 Sensitivity Analysis
  1.2 Contributions
  1.3 Outline
2 Background
  2.1 Stochastic Gradient Estimation Methods
    2.1.1 Indirect Gradients
      2.1.1.1 Finite Differences
      2.1.1.2 Simultaneous Perturbation
      2.1.1.3 Random Directions
    2.1.2 Direct Gradients
      2.1.2.1 Infinitesimal Perturbation Analysis
      2.1.2.2 Likelihood Ratio/Score Function
      2.1.2.3 Weak Derivatives
      2.1.2.4 General Extension
  2.2 Stochastic Approximation
    2.2.1 Classical Methods
      2.2.1.1 Robbins-Monro
      2.2.1.2 Kiefer-Wolfowitz
    2.2.2 Robust Gradient
      2.2.2.1 Simultaneous Perturbation Stochastic Approximation
      2.2.2.2 Gradient Averaging
    2.2.3 Adaptive Step Sizes
      2.2.3.1 Kesten's Rule
      2.2.3.2 Scaled-and-Shifted Kiefer-Wolfowitz
    2.2.4 Robust Output
      2.2.4.1 Averaging Iterates
      2.2.4.2 Robust Stochastic Approximation
      2.2.4.3 Acceleration Stochastic Approximation
      2.2.4.4 Numerical Comparison
    2.2.5 Varying Bounds
3 New Hybrid Stochastic Approximation Methods
  3.1 Motivation
  3.2 Secant-Tangents AveRaged Stochastic Approximation
    3.2.1 Optimal Convex Weight
      3.2.1.1 Homogeneous Noise
      3.2.1.2 Non-homogeneous Noise
    3.2.2 Convergence
    3.2.3 Numerical Experiments
      3.2.3.1 Experiment 1: vary initial value
      3.2.3.2 Experiment 2: vary steepness level
      3.2.3.3 Results Summary
  3.3 STAR-SPSA
    3.3.1 Optimal Deterministic Weights
    3.3.2 Convergence
    3.3.3 Numerical Experiments
      3.3.3.1 9-station Closed Jackson Network
  3.4 Summary and Future Work
4 Step Size Selection in Stochastic Approximation
  4.1 Sensitivity of Finite-time Performance to Step Size
    4.1.1 KW and its Variants
  4.2 Finite-time MSE Bound
  4.3 Numerical Experiments
    4.3.1 Tightness of the Finite-time MSE Bound for Quadratics
    4.3.2 Sensitivity of KW and its Variants
  4.4 PROX-step
  4.5 Adaptive PROX-step
  4.6 Numerical Experiments
    4.6.1 Deterministic Problem with Added Noise
    4.6.2 9-station Closed Jackson Queueing Network
  4.7 Summary and Future Work
5 Greek Kernel Estimators
  5.1 Introduction
  5.2 Problem Setting
  5.3 Generalized Pathwise Method
    5.3.1 First-Order Greeks
    5.3.2 Second-Order Greeks
    5.3.3 Kernel
    5.3.4 First- and Second-Order Greek Estimators
  5.4 Pilot Simulation
  5.5 Numerical Experiments
    5.5.1 Sensitivity to Bandwidth
    5.5.2 Sensitivity of Bandwidth Generator to Input Parameters
  5.6 Summary and Future Work
Bibliography

List of Figures

1.1 Illustration: Generating sample performance
2.1 MSE under AC-SA, RM, RM w/averaging, and RSA for $f(x) = \frac{1}{3}x^2$, $x_1 = 30.0$, $\sigma = 1.0$.
3.1 Illustration of the STAR gradient, where $\tilde{f}$ and $\tilde{f}'$ are estimates of $f$ and $f'$, respectively.
3.2 $f(x) = 0.1x^2$, $\Theta = [-50, 50]$, $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, $N = 1000$.
3.3 $f(x) = ax^2$, $\Theta = [-50, 50]$, $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, $N = 1000$.
3.4 9-station closed Jackson queueing network.
3.5 9-station closed Jackson queueing network, $\sum_{i=1}^{9} x(i) = 10$, $x(i) > 0$, $a_n = \theta_a/(n+1)$, $\alpha_n = c_n^2/(1+c_n^2)$, $N = 50$, customers serviced = 300, macroreplications = 20.
4.1 Illustration of sensitivity of SA to $\{a_n\}$.
4.2 MSE of the 10000th iterate of KW and Kesten for three parameter settings and SSKW for $f(x) = 0.001x^2$, $\sigma = 0.001$, $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$.
4.3 Sensitivity of KW to $\theta_a$ for $f(x) = 0.001x^2$, $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$, $n = 10000$.
4.4 Sensitivity of SSKW for $f(x) = 0.001x^2$ as a function of $a$, $n = 10000$.
4.5 MSE comparison of KW, Kesten, and SSKW for $f(x) = 100e^{-0.006x^2}$, $a_n = 1/n$, $c_n = 1/n^{1/4}$, $n = 10000$.
4.6 Illustration of PROX-step motivation.
4.7 $f(x) = 100e^{-0.006x^2}$, $\Theta = [-30, 70]$, $x_1 = 10$, $a_n^- = 10/n$, $a_n^+ = 5(k+1)/n$, $\sigma_f = \sigma_g = 0.1$, $\alpha = 1.025$, $n_L = 20$, $N = 50$, macroreplications = 20.
4.8 $f(x) = 100e^{-0.006x^2}$, $\Theta = [-30, 70]$, $x_1 = 10$, $a_n^- = 1/n$, $a_n^+ = 2k/n$, $\sigma_f = \sigma_g = 0.1$, $\alpha = 1.025$, $n_L = 20$, $N = 50$, macroreplications = 20.
4.9 MSE of $x_N$ and $f(x_N)$, $f(x) = 100e^{-0.006x^2}$, $\alpha = 1.5$, $\Theta = [-30, 70]$, $x_1 = 10$, $\sigma_f = \sigma_g = 0.1$, $n_L = 50$, $N = 50$, macroreplications = 20.
4.10 MSE of $x_N$ and $f(x_N)$, $f(x) = 100e^{-0.006x^2}$, $\alpha = 2$, $\Theta = [-30, 70]$, $x_1 = 10$, $\sigma_f = \sigma_g = 0.1$, $n_L = 20$, $N = 50$, macroreplications = 20.
4.11 MSE of $x_N$ and $f(x_N)$, $f(x) = x^2$, $\Theta = [-30, 70]$, $x_1 = 10$, $\sigma_f = 0.1$, $\sigma_g = 0.1$, $n_L = 20$, macroreplications = 20.
4.12 MSE of $x_N$ and $f(x_N)$, $f(x) = x^2$, $N = 50$, $x_1 = 10$, $\sigma_f = 0.1$, $\sigma_g = 1.0$, $n_L = 50$, $a_n^- = 1/n$, $a_n^+ = 20/n$.
4.13
4.14
5.1 95% confidence band for Asian delta (solid curves) for $n = 1000$ with 95% confidence interval for bandwidth (vertical dashed lines) generated using $s = 0.1$ and RRMSE for 100 sample paths.
5.2 95% confidence band for Asian vega (solid curves) for $n = 1000$ with 95% confidence interval for bandwidth (vertical dashed lines) generated using $s = 0.001$ and RRMSE for 100 sample paths.
5.3 95% confidence band for Asian gamma (solid curves) for $n = 1000$ with 95% confidence interval for bandwidth (vertical dashed lines) generated using $s = 0.01$ and RRMSE for 100 sample paths.
5.4 95% confidence band for barrier theta for $n = 10000$ and RRMSE for 100 sample paths.
5.5 Barrier gamma estimator for $n = 10000$, 100 sample paths.
5.6 Asian delta pilot 95% confidence interval for $\hat{c}$ for $k = 10$.
5.7 Asian delta pilot, $n = 500$, sample size = 100.

List of Notations

$\mathbb{R}$: the field of real numbers
$d$: number of dimensions
$\Theta$: parameter space (continuous)
$\theta$: continuous parameter
$f(\cdot)$: distribution of input parameters (excl. Ch. 3, 4)
$J(\cdot)$: function of interest
$c_n$: perturbation size
$\Delta_n$: random vector
$Y(\cdot, \cdot)$: sample performance measure (Ch. 1, 2)
$\tilde{f}$: sample performance measure (Ch. 3)
$a_n$: step or gain size
$e_i$: $i$th unit basis vector
$\xi$: random effects
$\|\cdot\|$: Euclidean norm
$\langle \cdot, \cdot \rangle$: dot product

Chapter 1
Introduction

Consider the stochastic optimization problem

$$\min_{x \in \Theta} J(x), \qquad (1.1)$$

where $J: \mathbb{R}^d \to \mathbb{R}$, $J(x) = E[Y(x, \xi)]$, and $\xi$ denotes the stochastic effects. In simulation optimization, the objective is to find $x^*$ that minimizes the function of interest $J(x)$. The search space $\Theta$ is, in general, either continuous or discrete, but we focus only on the continuous case, which restricts the range of methods that can be applied.

One of the most useful and well-known methods in simulation optimization for continuous parameters is stochastic approximation (SA), which generates a sequence of estimates $\{x_n\}$ converging to a solution of $\nabla J(x) = 0$ using the recursion

$$x_{n+1} = \Pi_\Theta\left(x_n - a_n \widehat{\nabla} J(x_n)\right), \qquad (1.2)$$

where $\Pi_\Theta(x)$ is a projection of $x$ back into the feasible region $\Theta$ if $x \notin \Theta$, $a_n$ is a positive step size or gain size, $\widehat{\nabla} J(x_n)$ is an estimate of the true gradient $\nabla J(x_n)$, and $x_N$ is the output, where $N$ is the stopping time; we denote the output by $x_N^*$.

The first SA algorithm introduced in [50] estimated the true gradient using an unbiased direct gradient, which unfortunately is not always available in practice, so [37] proposed the use of indirect gradients (i.e., finite differences), which only require sample performances. SA is simple and requires very little memory due to its recursive nature, and thus can be implemented as an online method. As a result, SA algorithms are used in a wide variety of application areas such as signal processing, statistics, operations research, and machine learning [9].
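To make recursion (1.2) concrete, the following is a minimal Python sketch of a projected SA loop. It assumes a box-shaped feasible region $\Theta = [l, u]^d$, so that the projection is a componentwise clip, and a user-supplied stochastic gradient routine grad_est; both names and signatures are illustrative, not part of the dissertation.

```python
import numpy as np

def projected_sa(grad_est, x1, lower, upper, N, theta_a=1.0):
    """Run the projected SA recursion x_{n+1} = Pi_Theta(x_n - a_n g_n).

    grad_est(x, n): returns a (possibly noisy) gradient estimate at x.
    The projection Pi_Theta is a componentwise clip onto the box
    [lower, upper]; a_n = theta_a / n satisfies the usual conditions
    (sum a_n = infinity, sum a_n^2 < infinity).
    """
    x = np.asarray(x1, dtype=float)
    for n in range(1, N):
        a_n = theta_a / n                   # step (gain) size
        x = x - a_n * grad_est(x, n)        # SA step
        x = np.clip(x, lower, upper)        # projection Pi_Theta
    return x                                # output x_N^* = x_N
```

The choice $a_n = \theta_a/n$ is only one admissible sequence; the sensitivity of finite-time performance to this choice is a recurring theme below.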
The landmark papers, [50] and [37], proved convergence in probability for the one-dimensional classical Robbins-Monro (RM) and Kiefer-Wolfowitz (KW) methods, respectively. Later, [5] modified the original conditions of the convergence theorems for both RM and KW to obtain almost sure (a.s.) convergence in higher dimensions. Then, [14] and [53] proved asymptotic normality, and [20] established asymptotic normality for general SA algorithms in the multidimensional setting. More recently, the focus has shifted to finite-time error bounds, such as the mean-squared error (MSE) of the estimate $E[\|x_N^* - x^*\|^2]$, where $\|\cdot\|$ denotes the Euclidean norm [7, 65], or the difference between the objective value at the estimate and the optimal objective value [31, 45, 46].

Stochastic approximation converges asymptotically under certain conditions, but the practical performance depends highly on the components of recursion (1.2). Methods have been introduced to increase its robustness, such as choosing appropriate deterministic or adaptive step sizes [7, 21, 36, 51, 59, 65], generating estimates based on a subset of iterates [48, 49, 52], projecting iterates back into the feasible region in a clever/strategic manner [2], and stabilizing gradients or increasing gradient accuracy. By exploiting previous gradient estimates, [27], [47], and [63] propose gradient averaging methods to increase gradient stability. Moreover, [1] and [3] present methods to prevent gradients from taking extreme values, and [7] introduce a technique that reduces the variance by adaptively increasing the perturbation size of finite differences.

[Figure 1.1: Illustration: generating a sample performance. The simulator takes an input $x$ and returns a sample performance $Y(x, \xi)$.]

The SA algorithms with the fastest convergence rates are not necessarily superior. Biased gradient estimates lead to slower asymptotic convergence rates, but could outperform unbiased direct gradients in finite time. One major factor in the performance is the choice of step size. It is difficult to select an appropriate step size since the function in (1.1) is unknown, and it is impossible to find a universally optimal step size for all situations. Therefore, adaptive step sizes that adjust based on the ongoing performance of the algorithm have been proposed to tackle the issue.

In the simulation optimization context, the gradient estimate $\widehat{\nabla} J(x_n)$ is assumed to be computationally expensive to generate. Each sample performance $Y(x, \xi)$ is generated through a simulator, as illustrated in Figure 1.1, which can be used to estimate the gradient. In some cases, the simulator can also generate an estimate of the sample performance gradient $\nabla Y(x, \xi)$ simultaneously, but this requires additional information about the system dynamics, which is not always available. Each simulation run is computationally heavy; therefore, the number of iterations $N$ is limited, so the final parameter estimate does not have any guarantees based on the asymptotic theory. The expensive gradient together with the step size sensitivity reinforces the importance of finite-time performance.

The sensitivity of the function of interest to the parameters is also critical, and the class of methods for such gradient approximations is called sensitivity analysis.
The main objective is to estimate the sensitivity of $J(x)$ accurately, so the key is to generate an accurate and reliable gradient estimate. Unfortunately, the function $J(x)$ could have undesirable properties such as discontinuities, which restrict the applicable methods. Common gradient estimates and their applicability will be discussed in Section 2.1.

In finance, the accuracy of the sensitivity estimates of hedging tools is critical. For example, an option is a financial instrument that can be used to hedge risk, and its sensitivity to market parameters is useful in making investment decisions. If the price of the underlying stock hits the strike price, then the option can be exercised, resulting in some positive payoff. However, if the stock does not reach the strike price, then it cannot be exercised, resulting in zero profit. The payoff function of an option may be discontinuous, which adds a level of difficulty in estimating the gradient with respect to market parameters, also known as Greeks. The well-known infinitesimal perturbation analysis (IPA) is not applicable in this case, but a modified version, smoothed perturbation analysis (SPA), can be applied to discontinuous functions [28]. More recently, kernel estimators have been used in combination with IPA estimators to generate a hybrid gradient, or modified IPA [43]. The kernel relies on a bandwidth parameter chosen by the user, and [43] proposed a selection method based on minimizing the asymptotic MSE, which has several input parameters, again determined by the user.

The commonality between stochastic approximation (SA) and sensitivity analysis lies in their requirement for a gradient estimate, although it serves different purposes. Many of the stochastic gradient estimation methods have associated asymptotic results, which are necessary, but in practice, finite-time performance is also essential. Both methods often require user-specified input parameters that must be selected, and the finite-time performance has a strong dependence on those chosen parameters. Take for example the widely known finite difference gradient estimate, which requires a perturbation size $c$. If the function is deterministic, a smaller $c$ would increase the accuracy of the gradient estimate. However, in the stochastic setting, where the function evaluations are noisy, perturbations that are too small can result in extremely noisy gradient estimates. "Too small" is also relative to the problem at hand. The input parameter $c$ has a significant impact on the quality of the estimate and must be chosen appropriately. Ideally, algorithms/methods are robust, and their performance should not be strongly correlated with user-specified parameters.

1.1 Motivating Examples

For motivation, we provide application examples where stochastic approximation and sensitivity analysis can be applied, two of which are more detailed.

Queueing networks. Many application areas, such as communication networks, production systems, and smart grids, can be modeled using queueing networks. A possible metric of interest is the total throughput of the system. Each node or server has a particular mean service time, which is a controllable parameter, and the objective is to find the optimal parameter values to maximize the total throughput. Furthermore, the sensitivity of the throughput to the mean service times can also be examined.

Inventory management. Production, manufacturing, and retail businesses require inventory management systems to help maintain operations and increase profitability.
A commonly known inventory policy is the $(s, S)$ policy, where an order is placed to replenish the supply up to $S$ after the inventory on hand falls below $s$. The goal is to minimize the total cost (e.g., ordering, holding, and shortage) of implementing such a policy.

Finance. Well-modeled stock prices are essential in making well-informed investment decisions. Recently, stochastic volatility models have been introduced with volatility varying in time. The parameters are estimated based on the maximum likelihood function, where SA can be easily applied. Furthermore, payoffs for options are based on stock price models, and optimal exercise strategies can be created to maximize the payoff.

Machine learning using "big data". Traditional regression models are based on batch data, where the analysis is conducted on a fixed data set. However, with the rise in the amount of data available, models must be generated/updated in an online manner, and stochastic approximation is a commonly applied method.

1.1.1 Stochastic Approximation

Consider a G/G/1 queue with one server, unlimited buffer or waiting capacity, and interarrival times and service times that follow general distributions with means $1/\lambda$ and $x$, respectively. In this system, there is only one service station with one server. The customers arrive from outside of the system with interarrival times following some general distribution with mean $1/\lambda$, and the server serves each customer on a first-come first-served basis with a service time also following a general distribution with mean $x$. This queueing system can be modeled easily with distributional information about the interarrival and service times, and SA can be directly applied. An important objective to consider is increasing customer satisfaction. Among the various correlated performance measures that can be used to quantify the satisfaction level, the total time spent in the system (i.e., time waiting in queue plus the actual service time) is arguably one of the most significant factors. The objective function could be the expected time in system $E[T(x, \xi)]$, where $T(\cdot, \cdot)$ is the time in system of a customer, plus a positive decreasing cost $C(x)$ associated with the mean service time $x$, e.g., $c/x$, where $c > 0$. The objective of stochastic approximation is to find

$$x^* = \arg\min_{x \in (0, 1/\lambda)} E[T(x, \xi)] + C(x),$$

by solving $\nabla\left(E[T(x, \xi)] + c/x\right) = 0$ iteratively. One could also be interested in the sensitivity of the function, but we will provide a different example for sensitivity analysis.

1.1.2 Sensitivity Analysis

Financial options are instruments that give option holders the right to purchase or sell stock for a particular strike price once certain conditions are met. Investors often use options to hedge financial risk and are especially interested in the expected payoff $E[g(S)]$, where $g$ is the sample payoff and $S$ is the stock price over a time period, which depend on market parameters such as the interest rate, initial stock price, volatility, etc., as well as in its sensitivity. The derivatives of the expected payoff with respect to market parameters, $\partial E[g(S)]/\partial x$, are referred to as Greeks. To implement Greek estimators, the stock price can be modeled as a stochastic process (e.g., Brownian motion or an Ornstein-Uhlenbeck process) over a pre-specified duration with certain parameters. When making financial decisions, it is essential that investors make them with accurate and reliable information, such as good Greek estimates.
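As a concrete instance of a Greek estimator, the sketch below estimates the delta $\partial E[g(S_T)]/\partial S_0$ of a European call by Monte Carlo, assuming geometric Brownian motion for the stock price; this standard textbook example is our illustration, not one taken from the dissertation. Because the call payoff $g(s) = \max(s - K, 0)$ is continuous (with only a kink), the pathwise estimator $\mathbf{1}\{S_T > K\}\, S_T/S_0$ is unbiased here; for discontinuous payoffs such as digital or barrier options, this is exactly where plain IPA breaks down and the kernel-smoothed estimators of Chapter 5 become relevant.

```python
import numpy as np

def call_delta_pathwise(s0, K, r, sigma, T, n_paths=100_000, seed=0):
    """Pathwise (IPA) Monte Carlo estimate of a European call delta under
    geometric Brownian motion: S_T = s0*exp((r - sigma^2/2)T + sigma*sqrt(T)Z).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    st = s0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    # d(discounted payoff)/dS0 along each path: 1{S_T > K} * dS_T/dS0,
    # where dS_T/dS0 = S_T/s0 for geometric Brownian motion.
    delta_samples = np.exp(-r * T) * (st > K) * st / s0
    return delta_samples.mean()

print(call_delta_pathwise(s0=100.0, K=105.0, r=0.05, sigma=0.2, T=1.0))
```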
1.2 Contributions

First, we propose two novel stochastic gradient estimation methods in stochastic approximation for univariate and multivariate problems. We consider a setting where direct gradients are also available, so we combine both direct and indirect gradient estimates using a convex weight to form a hybrid gradient. Currently, the existing SA algorithms consider either direct or indirect gradient estimates, but not both in conjunction. For the indirect gradient, we consider a symmetric difference-type gradient estimate, and for the direct gradient estimate, an average of the two associated direct gradients. In the one-dimensional case, we use a symmetric difference gradient estimate, which can be directly extended to higher dimensions; instead, however, we employ the well-known simultaneous perturbation gradient to exploit its potential computational efficiency. A critical component of our hybrid gradient estimate is the convex weight, which we derive to minimize the variance of the hybrid gradient, leading to favorable theoretical properties. Our new hybrid gradients are provably convergent in SA.

Second, we conduct preliminary experiments to investigate the sensitivity of the KW algorithm and two of its variants to the step size parameter and explore the tightness of a finite-time MSE bound. Then, we introduce two new adaptive step size sequences for SA, both of which adjust based on the perceived proximity of the current iterate to the optimum. Most existing adaptive step sizes consider only one adaptive sequence, either for each dimension or for all dimensions. Instead, our adaptive step size is based on two sequences, and the step size used in the current iteration depends on the current sample performance(s) compared to past observations. Although the method is adaptive, the two initial sequences are deterministic, so it also suffers from the same disadvantages as deterministic step sizes. Therefore, we introduce a method to adaptively adjust the two deterministic sequences.

Finally, we examine the sensitivity of two methods: 1) a modified pathwise method involving a kernel estimator, to the smoothing bandwidth parameter, and 2) a bandwidth selection method, to various input parameters.

1.3 Outline

The rest of the dissertation is as follows. In Chapter 2, we provide background information to better understand stochastic approximation and sensitivity analysis. In Section 2.1, we discuss the most common stochastic gradient estimation methods for both direct and indirect gradients. Then in Section 2.2, we introduce the classical SA algorithms in addition to modifications to the gradient, step size, output, and projection operator. Chapter 3 presents our two novel approaches to stochastic gradient estimation for one-dimensional and multidimensional problems, along with theoretical and numerical results. We derive optimal weight parameters and prove convergence for both cases. We test our new algorithms against well-known SA algorithms on a deterministic problem with added noise and on a queueing network. In Chapter 4, we explore step size selection techniques in SA, in addition to the tightness of a finite-time MSE bound, before introducing our new adaptive step sizes. We investigate the performance of our new adaptive step sizes on two contrasting deterministic functions with added noise for the one-dimensional case and on a queueing network for the multidimensional case.
Chapter 5 presents a proposed modified pathwise gradient estimation method that incorporates a kernel estimator for options with discontinuous payoff functions. We empirically test the sensitivity of the kernel to the bandwidth parameter as well as the sensitivity of a proposed bandwidth selection algorithm to its input parameters. At the end of Chapters 3, 4, and 5, we provide concluding summaries as well as future research directions.

Chapter 2
Background

2.1 Stochastic Gradient Estimation Methods

Stochastic gradient estimation is essential in stochastic approximation and sensitivity analysis, but it serves different purposes. In SA, the gradient estimate is only an intermediary step that provides a trajectory direction, so SA is more tolerant of less accurate estimates. Sensitivity analysis assesses the parameter effects on the function of interest, so the goal is to estimate the gradient itself, and accuracy is key. Certain stochastic gradient methods do not provide estimates accurate enough to be suitable for sensitivity analysis and are designed specifically for SA.

In general, stochastic gradient estimates fall under one of two main categories: direct or indirect. Indirect gradient methods approximate gradients using finite difference-type gradient estimates, which only require function evaluations, e.g., via the secant method in the one-dimensional case. Since indirect gradient estimates only require sample performances, they can be easily applied in any case but are biased. Direct gradient methods are known as gradient-based approaches, mainly because the techniques result in unbiased gradient estimates, which lead to faster convergence rates in SA.

2.1.1 Indirect Gradients

Indirect gradients are applicable as long as the simulator can generate sample performances. Unfortunately, indirect gradient estimates are biased, which leads to slower asymptotic convergence rates in stochastic approximation.

2.1.1.1 Finite Differences

Finite difference methods stem from the Taylor series expansion and require the additional task of selecting a perturbation sequence $\{c_n\}$, which impacts the variance. In the one-dimensional case, the derivative can be rewritten as a finite difference plus a bias term that diminishes as the perturbation size $c_n$ approaches zero as $n \to \infty$ (e.g., $f'(x_n) = [f(x_n + c_n) - f(x_n)]/c_n + O(c_n)$). The perturbation size $c_n$ influences the noise level of the finite difference gradient estimate, and if it is too small, the gradient estimates can be very noisy in the stochastic setting. If sensitivity analysis is the ultimate goal, then a relatively larger perturbation size is preferable. In higher dimensions, the idea is to slightly perturb one component while keeping all others constant and return a corresponding function value estimate. Two common examples of finite differences are

$$\widehat{\nabla} J_i(x_n) = \begin{cases} \dfrac{Y(x_n + c_n e_i, \xi_{n,i}^+) - Y(x_n - c_n e_i, \xi_{n,i}^-)}{2c_n} & \text{symmetric difference}, \\[2ex] \dfrac{Y(x_n + c_n e_i, \xi_{n,i}^+) - Y(x_n, \xi_n)}{c_n} & \text{forward difference}, \end{cases}$$

for $i = 1, \ldots, d$, where $\widehat{\nabla} J_i(x_n)$ is the $i$th component of the gradient estimate, $e_i$ denotes the $i$th unit basis vector, $c_n \in \mathbb{R}^+$ is the perturbation size, $\xi_{n,i}^\pm$ and $\xi_n$ denote the stochastic effects, and $Y(x, \xi)$ is an unbiased estimate of $J(x)$. Symmetric differences require estimates of the two function values $J(x_n + c_n e_i)$ and $J(x_n - c_n e_i)$ for each dimension $i$, and one-sided forward differences require $J(x_n)$ and $J(x_n + c_n e_i)$ for $i = 1, \ldots, d$; therefore, two-sided symmetric difference and one-sided forward difference estimates involve $2d$ and $d+1$ simulation replications, respectively.
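The following is a minimal Python sketch of these two estimators, assuming only a noisy sample-performance routine Y(x) standing in for one simulation replication; the name and signature are ours for illustration. Note the replication counts: $2d$ calls for symmetric differences versus $d + 1$ for forward differences.

```python
import numpy as np

def fd_gradient(Y, x, c, symmetric=True):
    """Finite-difference gradient estimate of J at x from noisy samples Y(x).

    symmetric=True : (Y(x + c e_i) - Y(x - c e_i)) / (2c)   -> 2d replications
    symmetric=False: (Y(x + c e_i) - Y(x)) / c              -> d+1 replications
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    g = np.empty(d)
    y0 = None if symmetric else Y(x)   # shared base sample for forward differences
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0                      # i-th unit basis vector
        if symmetric:
            g[i] = (Y(x + c * e) - Y(x - c * e)) / (2 * c)
        else:
            g[i] = (Y(x + c * e) - y0) / c
    return g
```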
Since the cost to generate each gradient is linear in the number of dimensions $d$, finite differences can be very inefficient in higher dimensions. The next indirect gradient estimate addresses this issue.

2.1.1.2 Simultaneous Perturbation

In 1992, [55] introduced the simultaneous perturbation (SP) gradient estimate specifically for multidimensional stochastic approximation problems. Similar to finite differences, SP is a gradient-free approach that only requires objective function values to approximate the underlying gradient, and it is therefore easy to implement. The $i$th component of the SP gradient has the form

$$\widehat{\nabla} J_i(x_n) = \frac{Y(x_n + c_n \Delta_n, \xi_n^+) - Y(x_n - c_n \Delta_n, \xi_n^-)}{2 c_n \Delta_{n,i}},$$

where $c_n \in \mathbb{R}^+$ is the perturbation size, $\xi_n^\pm$ denote the stochastic effects, $\Delta_n = (\Delta_{n,1}, \ldots, \Delta_{n,d})$, and the sequence $\{\Delta_n\}$ is generally assumed to be i.i.d. and independent across components. Notice that, unlike finite differences, the numerator of the SP gradient is independent of $i$, so a gradient only requires 2 simulation runs, regardless of the number of dimensions $d$, which can be efficient in higher dimensions. The only additional requirement is to generate $\Delta_n$, which is relatively inexpensive compared to simulating sample performances. The SP gradient is intended only for use in stochastic approximation algorithms and has limited relevance in sensitivity analysis.

2.1.1.3 Random Directions

The random directions (RD) gradient was also developed specifically for multidimensional problems and was the inspiration behind SP. The $i$th component of the RD gradient is

$$\widehat{\nabla} J_i(x_n) = \frac{\left[Y(x_n + c_n \Delta_n, \xi_n^+) - Y(x_n - c_n \Delta_n, \xi_n^-)\right] \Delta_{n,i}}{2 c_n},$$

with perturbation size $c_n \in \mathbb{R}^+$ and stochastic perturbation $\Delta_n = (\Delta_{n,1}, \ldots, \Delta_{n,d})$. The sequence $\{\Delta_n\}$ is generally assumed to be i.i.d. and independent across components, and $\xi_n^\pm$ denote the stochastic effects. Notice that RD is almost identical to SP, except that instead of dividing by the stochastic perturbation component $\Delta_{n,i}$, it multiplies the difference term. This change alters the restrictions on the stochastic sequence $\{\Delta_n\}$ needed to guarantee convergence in SA, translating to a bound on the second moment instead of the inverse moment, with zero mean, so the Gaussian distribution is applicable [23].
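A short sketch contrasting the two estimates, again assuming a noisy sample-performance routine Y(x) of our own naming. Both use one pair of simulations per gradient; SP divides componentwise by $\Delta$ (hence the symmetric Bernoulli $\pm 1$ perturbations, whose inverse moments are bounded), while RD multiplies by $\Delta$ (so Gaussian perturbations are admissible).

```python
import numpy as np

def sp_gradient(Y, x, c, rng):
    """Simultaneous perturbation estimate: two simulations for any dimension d."""
    delta = rng.choice([-1.0, 1.0], size=len(x))      # symmetric Bernoulli +/-1
    diff = Y(x + c * delta) - Y(x - c * delta)
    return diff / (2 * c * delta)                     # componentwise division

def rd_gradient(Y, x, c, rng):
    """Random directions estimate: multiplies by the perturbation instead."""
    delta = rng.standard_normal(len(x))               # Gaussian directions allowed
    diff = Y(x + c * delta) - Y(x - c * delta)
    return diff * delta / (2 * c)
```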
2.1.2 Direct Gradients

Direct gradients generally require more "offline" work and involve additional coding within the simulation model. However, if direct gradient methods are applicable, they usually provide unbiased gradient estimates, which lead to faster asymptotic convergence rates in stochastic approximation. In addition, direct gradient methods are computationally efficient, especially in higher dimensions, since they often require only one simulation replication for any dimension $d$. For notation, assume

$$J(\theta) = E[Y(X)] = E[Y(X_1, \ldots, X_d)], \qquad (2.1)$$

where $Y: \mathbb{R}^d \to \mathbb{R}$ and $X = (X_1, \ldots, X_d)$ is a vector of input random variables. The parameter $\theta$ has been purposely left out of the right-hand side of expression (2.1) to emphasize the point that it can appear in either the sample (pathwise) or the measure (distributional). Let $f$ denote the joint distribution of the input random variables $X$; then (2.1) can be written in the following two ways:

$$E[Y(X)] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} Y(x) f(x; \theta)\,dx & \text{distributional}, \\[2ex] \displaystyle\int_0^1 Y(X(\theta; u))\,du & \text{pathwise}, \end{cases} \qquad (2.2)$$

where $x$, $u$, and the integrals are $d$-dimensional. When the parameter is located in $Y(\cdot)$, the dependence on $\theta$ is pathwise and perturbation analysis is usually applied, whereas if $\theta$ is located in $f(\cdot)$, the dependence is distributional and the likelihood ratio/score function or weak derivative methods could potentially be applied. In some instances, $\theta$ can be pushed into and/or out of the distribution with an appropriate change of variables, also known as the push-in/push-out method, which changes the applicability of the estimation methods. If we assume that the interchange of integration and differentiation is valid, then differentiating (2.2) with respect to $\theta$ results in the following two cases:

$$\frac{dE[Y(X)]}{d\theta} = \begin{cases} \displaystyle\int_{-\infty}^{\infty} Y(x) \frac{df(x; \theta)}{d\theta}\,dx & \text{distributional}, \\[2ex] \displaystyle\int_0^1 \frac{dY(X(\theta; u))}{d\theta}\,du & \text{pathwise}. \end{cases} \qquad (2.3)$$

We assume each random number $u_i$ produces a random variate $x_i$ for $i = 1, \ldots, d$. For simplicity, in Sections 2.1.2.1 and 2.1.2.2, we assume that the parameter appears only in $X_1$, which is generated independently of the other random variables. For the simplified example, we will apply each method to the specific case where $d = 2$, $X_1 \sim \exp(\theta)$, and $X_2 \sim U(0, 1)$. The random variate $x_1$ is generated via the inverse transform method, where $X_1 = -\theta \ln U_1$ and $U_1 \sim U(0, 1)$.

2.1.2.1 Infinitesimal Perturbation Analysis

For the pathwise dependent case, the second integral can be expressed as

$$\frac{\partial E[Y(X)]}{\partial \theta} = \int_0^1 \frac{dY(X_1(\theta; u_1), X_2, \ldots, X_d)}{d\theta}\,du = \int_0^1 \frac{\partial Y(X)}{\partial X_1} \frac{dX_1(\theta; u)}{d\theta}\,du,$$

where $\frac{\partial Y}{\partial X_1}\frac{dX_1}{d\theta}$ is the infinitesimal perturbation analysis (IPA) estimator. In this particular case,

$$\frac{\partial E[Y(X)]}{\partial \theta} = \int_0^1 \int_0^1 \frac{\partial Y(X_1(\theta; u_1), u_2)}{\partial x_1} \frac{dX_1(u_1; \theta)}{d\theta}\,du_1\,du_2,$$

and $\frac{dX_1}{d\theta}$ depends on the construction of $X_1$; since $X_1 = -\theta \ln U_1$, we have $\frac{dX_1}{d\theta} = -\ln U_1 = \frac{X_1}{\theta}$, so the final IPA estimator is

$$\frac{\partial Y(X_1, X_2)}{\partial X_1} \frac{dX_1}{d\theta} = \frac{\partial Y(X_1, X_2)}{\partial X_1} \frac{X_1}{\theta}.$$

Implementation of the estimator requires more knowledge about how the $X_i$'s are generated in order to compute the derivative of the random variable, $\frac{dX_i}{d\theta}$; therefore the representation of $X_i$, which depends on $\theta$, is key. As a general rule, if the sample performance is continuous with respect to the parameter of interest, then IPA is a suitable gradient estimation method. However, in practice, discontinuities may occur in the sample performance (e.g., an indicator function), thereby forcing other estimation techniques to be applied. Refer to [23, 25] for detailed examples.
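For concreteness, the following is a minimal sketch of the IPA estimator above for the illustrative choice $Y(X_1, X_2) = X_1 X_2$ (our choice, since the text leaves $Y$ generic), for which $\partial Y/\partial X_1 = X_2$ and the estimator is $X_2 \cdot X_1/\theta$. Since $E[X_1 X_2] = \theta/2$, the true sensitivity is $1/2$, so the sample mean should be near 0.5.

```python
import numpy as np

def ipa_estimate(theta, n=200_000, seed=1):
    """IPA estimator of dE[Y]/dtheta for Y(X1, X2) = X1*X2, with
    X1 ~ exp(theta) via inverse transform X1 = -theta*ln(U1), X2 ~ U(0,1).
    Estimator along each path: (dY/dX1) * (dX1/dtheta) = X2 * X1/theta.
    """
    rng = np.random.default_rng(seed)
    u1 = rng.random(n)
    x1 = -theta * np.log(u1)        # dX1/dtheta = -ln(U1) = X1/theta
    x2 = rng.random(n)
    return np.mean(x2 * x1 / theta)

print(ipa_estimate(theta=2.0))      # true value: d/dtheta [theta/2] = 0.5
```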
Furthermore, this method usually has diculty with nondistributional parameters, but the issue might be overcome by using the push in or push out method to push the parameter of interest in or out of the distribution. If the parameter appears in more than one input random variable distribution, the variance will increase linearly as the number of times those random variables are used in the simulation, but this issue could be controlled through batching. For example, if the performance measure of interest of an M/M/1 queue is the average time in system of 1000 customers, we could look at the average of 100 samples, 10 customers each, instead. Furthermore, for higher derivative estimators, this method is easier to apply. 2.1.2.3 Weak Derivatives For the weak derivatives estimator, there exists c(✓) and densities f (2)1 and f (1) 1 such that @f1 @✓ = c(✓)[f (2)1 f (1) 1 ]. (2.4) This representation always exists, because if we let f (1)1 = 1c @f1@✓ and f (2)1 = 1c @f1@✓ +, where c = R @f1@✓ dx = R @f1@✓ + dx, then @f1@✓ is written as the di↵erence of two signed measures, otherwise known as the Hahn-Jordan decomposition, which is not unique. Therefore, using this representation, dE[Y (X)] d✓ = Z 1 1 Y (x) @f(x; ✓) @✓ dx = Z 1 1 Y (x)c(✓)[f (2)1 f (1) 1 ]dx, 18 where the weak derivatives estimator is of the form c(✓)[Y (X(2)1 , X2, . . . , Xd) Y (X(1)1 , X2, . . . , Xd)], where X(1)1 ⇠ f (1) 1 and X (2) 1 ⇠ f (2) 1 . This is called a weak derivative because the left hand side of (2.4) might not be proper, but when it is integrated against a test function, it is well-defined. These estimators are not unique and in our specific example, dE[Y (X)] d✓ = 1 ✓ Z 1 0 Z 1 0 Y (x1, x2)  1 ✓ ⇣x1 ✓ 1 ⌘ 1 ✓ e x1✓ dx1dx2, where the weak derivatives estimator is 1✓ [Y (X(2)1 , X2) Y (X(1)1 , X2)], with X(2)1 ⇠ Erl(2, ✓) and X(1)1 ⇠ exp(✓). Similar to the LR method, WD estimators variance grows linearly as the num- ber of random variables with ✓ dependency increases in the simulation. For example, in a queueing system, ✓ could appear in the interarrival and service times, which can be decreased through batching. In addition, WDs are not unique, so deciding which to use in generating the estimator could be a challenge. 2.1.2.4 General Extension We now generalize the estimators for situations where more Xi’s have a de- pendence on ✓. Let P ⇤ denote the set of indices where Xi is dependent on ✓. Then the following are the general estimators:IPA estimator: Xi2P ⇤ @Y (X)@Xi dXid✓ 19 LR/SF estimator (multivariate, independent): Y (X) @ ln f(X; ✓) @✓ , Y (X) Xi2P ⇤ @ ln fi(Xi; ✓)@✓WD estimator (multivariate, independent): c(✓)(L(X(2)) L(X(1))), Xi2P ⇤ ci(✓)(L(X1 . . . , X(2)i , . . . , Xd) L(X1 . . . , X(1)i , . . . , Xd)) 2.2 Stochastic Approximation 2.2.1 Classical Methods The two classical stochastic approximation methods, Robbins-Monro (RM) and Kiefer-Wolfowitz (KW), were first applied to the one-dimensional unconstrained stochastic optimization problem, and the recursive scheme follows xn+1 = xn an brJ(xn), (2.5) which is identical to (1.2) with the exception of the projection operator. The main di↵erence between RM an KW is the gradient estimate brJ . In RM, the gradient is estimated by an unbiased estimator, whereas in KW, the gradient estimate is only asymptotically unbiased. Both algorithms have their advantages and will be discussed in the next two sections. 
2.1.2.3 Weak Derivatives

For the weak derivative estimator, there exist $c(\theta)$ and densities $f_1^{(2)}$ and $f_1^{(1)}$ such that

$$\frac{\partial f_1}{\partial \theta} = c(\theta)\left[f_1^{(2)} - f_1^{(1)}\right]. \qquad (2.4)$$

This representation always exists, because if we let $f_1^{(1)} = \frac{1}{c}\left(\frac{\partial f_1}{\partial \theta}\right)^-$ and $f_1^{(2)} = \frac{1}{c}\left(\frac{\partial f_1}{\partial \theta}\right)^+$, where $c = \int \left(\frac{\partial f_1}{\partial \theta}\right)^- dx = \int \left(\frac{\partial f_1}{\partial \theta}\right)^+ dx$, then $\frac{\partial f_1}{\partial \theta}$ is written as the difference of two signed measures, otherwise known as the Hahn-Jordan decomposition, which is not unique. Therefore, using this representation,

$$\frac{dE[Y(X)]}{d\theta} = \int_{-\infty}^{\infty} Y(x) \frac{\partial f(x; \theta)}{\partial \theta}\,dx = \int_{-\infty}^{\infty} Y(x)\, c(\theta)\left[f_1^{(2)} - f_1^{(1)}\right]dx,$$

where the weak derivative estimator is of the form

$$c(\theta)\left[Y(X_1^{(2)}, X_2, \ldots, X_d) - Y(X_1^{(1)}, X_2, \ldots, X_d)\right],$$

with $X_1^{(1)} \sim f_1^{(1)}$ and $X_1^{(2)} \sim f_1^{(2)}$. This is called a weak derivative because the left-hand side of (2.4) might not be proper, but when it is integrated against a test function, it is well-defined. These estimators are not unique, and in our specific example,

$$\frac{dE[Y(X)]}{d\theta} = \frac{1}{\theta} \int_0^1 \int_0^\infty Y(x_1, x_2) \left[\frac{x_1}{\theta^2} e^{-x_1/\theta} - \frac{1}{\theta} e^{-x_1/\theta}\right] dx_1\,dx_2,$$

where the weak derivative estimator is $\frac{1}{\theta}\left[Y(X_1^{(2)}, X_2) - Y(X_1^{(1)}, X_2)\right]$, with $X_1^{(2)} \sim \mathrm{Erl}(2, \theta)$ and $X_1^{(1)} \sim \exp(\theta)$.

Similar to the LR method, the variance of WD estimators grows linearly in the number of random variables with $\theta$ dependence in the simulation. For example, in a queueing system, $\theta$ could appear in both the interarrival and service times; the variance can be decreased through batching. In addition, WDs are not unique, so deciding which representation to use in generating the estimator could be a challenge.

2.1.2.4 General Extension

We now generalize the estimators to situations where more of the $X_i$'s depend on $\theta$. Let $P^*$ denote the set of indices $i$ such that $X_i$ depends on $\theta$. Then the general estimators are as follows.

IPA estimator:
$$\sum_{i \in P^*} \frac{\partial Y(X)}{\partial X_i} \frac{dX_i}{d\theta}$$

LR/SF estimator (multivariate, independent):
$$Y(X) \frac{\partial \ln f(X; \theta)}{\partial \theta} = Y(X) \sum_{i \in P^*} \frac{\partial \ln f_i(X_i; \theta)}{\partial \theta}$$

WD estimator (multivariate, independent):
$$\sum_{i \in P^*} c_i(\theta)\left[Y(X_1, \ldots, X_i^{(2)}, \ldots, X_d) - Y(X_1, \ldots, X_i^{(1)}, \ldots, X_d)\right]$$

2.2 Stochastic Approximation

2.2.1 Classical Methods

The two classical stochastic approximation methods, Robbins-Monro (RM) and Kiefer-Wolfowitz (KW), were first applied to the one-dimensional unconstrained stochastic optimization problem, with the recursive scheme

$$x_{n+1} = x_n - a_n \widehat{\nabla} J(x_n), \qquad (2.5)$$

which is identical to (1.2) with the exception of the projection operator. The main difference between RM and KW is the gradient estimate $\widehat{\nabla} J$. In RM, the gradient is estimated by an unbiased estimator, whereas in KW, the gradient estimate is only asymptotically unbiased. Both algorithms have their advantages and will be discussed in the next two sections.

2.2.1.1 Robbins-Monro

Robbins and Monro pioneered the way for stochastic approximation in [50], and the number of literature citations since has grown to over 3500. RM was first intended to solve root-finding problems, i.e., $h(x^*) = 0$, where $h: \mathbb{R}^d \to \mathbb{R}$. The Robbins-Monro (RM) stochastic approximation algorithm is applied to a stochastic optimization problem with objective function $J$ by setting $h = \nabla J$, where the true gradient is estimated using an unbiased direct gradient. RM solves this problem iteratively as in (2.5) by choosing the gradient estimate $\widehat{\nabla} J(x_n)$ to be an unbiased estimator, i.e., $E[\widehat{\nabla} J(x_n)] = \nabla J(x_n)$, and the output is taken as the last iterate $x_N^*$, where $N$ is the stopping time. The direct gradient measurements are still only approximations to the actual gradients because of the noise (i.e., $\widehat{\nabla} J(x_n) = \nabla J(x_n) + \epsilon_n$, where $\epsilon_n$ is noise with zero mean). When used in SA algorithms, unbiased gradient estimates lead to faster convergence rates, and under certain conditions RM can achieve an asymptotic convergence rate of up to $O(n^{-1/2})$ [53]. The following is the original convergence theorem.

Theorem 2.2.1. (Theorem 2 [50]) Assume $\nabla J(x)$ has a unique root $x^*$ and suppose $\widehat{\nabla} J(x)$ is an unbiased gradient estimator, i.e., $E[\widehat{\nabla} J(x)] = \nabla J(x)$. If the sequence $\{x_n\}$ is generated from (2.5) and the following conditions hold:

1. $\{a_n\}$ is a sequence of positive constants such that $\sum_{n=1}^\infty a_n = \infty$ and $\sum_{n=1}^\infty a_n^2 < \infty$.
2. $\nabla J(x) \geq 0$ for $x > x^*$ and $\nabla J(x) \leq 0$ for $x < x^*$.
3. There exists a positive constant $C$ such that $P(|\widehat{\nabla} J(x)| \leq C) = 1$ for all $x$.

Then $x_n \xrightarrow{p} x^*$ as $n \to \infty$, where $\xrightarrow{p}$ denotes convergence in probability.

The objective function $J$ is assumed to have a global minimum with a bounded derivative. The most well-known conditions are the restrictions on the gain sequence $\{a_n\}$. Generally, the step size $a_n \to 0$ but $\sum_{n=1}^\infty a_n = \infty$, which prevents the step size from converging to zero too quickly, so the iterates are able to make progress towards $x^*$ and not get stuck at a poor estimate. The typical form for the step size is $a_n = \theta_a/(n + A)^\alpha$, where $\theta_a > 0$, $A \geq 0$, and $1/2 < \alpha \leq 1$, with $A = 0$ and $\alpha = 1$ as a commonly used choice. These restrictions still allow an uncountable number of step size options, and the finite-time performance of SA is notoriously sensitive to the choice of $a_n$.
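A minimal RM sketch on a toy problem of our own choosing: minimize $J(x) = (x - 5)^2$ given unbiased gradient observations $\nabla J(x) + \epsilon$ with Gaussian noise, using the common choice $a_n = \theta_a/n$.

```python
import numpy as np

def robbins_monro(x1=0.0, theta_a=1.0, N=10_000, noise_sd=1.0, seed=3):
    """RM iteration x_{n+1} = x_n - a_n*(grad J(x_n) + noise), a_n = theta_a/n,
    for the toy objective J(x) = (x - 5)^2 with unbiased noisy gradients.
    """
    rng = np.random.default_rng(seed)
    x = x1
    for n in range(1, N + 1):
        g = 2.0 * (x - 5.0) + noise_sd * rng.standard_normal()  # unbiased gradient
        x -= (theta_a / n) * g
    return x

print(robbins_monro())   # should be close to the optimum x* = 5
```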
Theoretically, unbiased gradients lead to faster convergence rates but are not always available, so the next method addresses this issue.

2.2.1.2 Kiefer-Wolfowitz

The Kiefer-Wolfowitz stochastic approximation algorithm requires only sample performance measurements to implement and does not require additional information on the system dynamics or input distributions, as RM does. The original KW iterative scheme

$$x_{n+1} = x_n - a_n \frac{Y(x_n + c_n, \xi_n^+) - Y(x_n - c_n, \xi_n^-)}{2c_n} \qquad (2.6)$$

estimates the gradient using a symmetric finite difference gradient estimate, and under certain conditions, KW can achieve an asymptotic convergence rate of $O(n^{-1/3})$. In addition, common random numbers (CRN) can be employed to decrease the variance of estimates, and KW can achieve an asymptotic convergence rate of $O(n^{-1/2})$ in certain settings [38].

Theorem 2.2.2. ([37]) Assume $J(x) = E[Y(x, \xi)]$. If the sequence $\{x_n\}$ is generated from (2.6) and the following conditions hold:

1. $\{a_n\}$ and $\{c_n\}$ are positive tuning sequences satisfying
$$c_n \to 0, \quad \sum a_n = \infty, \quad \sum a_n c_n < \infty, \quad \sum a_n^2 c_n^{-2} < \infty.$$
2. $J(x)$ is strictly decreasing for $x < x^*$ and strictly increasing for $x > x^*$.
3. $\mathrm{Var}(Y(x, \xi)) < \infty$, and $J$ satisfies the following regularity conditions:
   1) There exist positive constants $\beta$ and $B$ such that
   $$|x' - x^*| + |x'' - x^*| < \beta \implies |J(x') - J(x'')| < B|x' - x''|.$$
   2) There exist positive $\rho$ and $R$ such that
   $$|x' - x''| < \rho \implies |J(x') - J(x'')| < R.$$
   3) For every $\delta > 0$ there exists a positive $\pi(\delta)$ such that
   $$|x - x^*| > \delta \implies \inf_{\delta/2 > \epsilon > 0} \frac{|J(x + \epsilon) - J(x - \epsilon)|}{\epsilon} > \pi(\delta).$$

Then $x_n \xrightarrow{p} x^*$ as $n \to \infty$, where $\xrightarrow{p}$ denotes convergence in probability.

Condition 1 ensures that the step size $a_n$ does not converge to zero too fast, so the iterates do not get stuck at a poor estimate. In addition, the condition prohibits the finite difference step size $c_n$ from decreasing too quickly, preventing noisy gradients. The second condition ensures that there is a global optimum. The first regularity condition requires $J(x)$ to be locally Lipschitz in a neighborhood of $x^*$; the second prevents $J(x)$ from changing drastically in the feasible region; and the last prohibits the function from being very flat outside a neighborhood of $x^*$, so that the iterates approach the optimum. Although the KW algorithm converges asymptotically, its finite-time performance depends on the choice of tuning sequences $\{a_n\}$ and $\{c_n\}$. If the current iterate $x_n$ is in a relatively flat region of the function and $a_n$ is small, then convergence will be slow. On the other hand, if $x_n$ is located in a very steep region of the function and $\{a_n\}$ is large, then the iterates will experience a long oscillation period. If $\{c_n\}$ is too small, the gradient estimates using finite differences can be extremely noisy.

KW has been extended to higher dimensions, and two common gradients considered are symmetric differences and forward differences, as discussed in Section 2.1.1.1. Although the symmetric difference scheme is computationally more expensive, it has the potential to reach an asymptotic convergence rate of $O(n^{-1/3})$, compared to $O(n^{-1/4})$ for forward differences. For $d = 1$, the computational cost is identical for both gradient estimates. Even though both schemes are easy to implement, their convergence rates are typically inferior to the RM algorithm, although under certain conditions with CRN ($\xi_{n,i}^+ = \xi_{n,i}^-$) for the symmetric difference, they can also achieve the $O(n^{-1/2})$ asymptotic convergence rate. For simulation optimization, RM is not always applicable, since additional information is needed, which may not be readily available or may be difficult to obtain. For KW, there is the additional task of appropriately choosing the perturbation sequence $\{c_n\}$. In general, KW is a simple algorithm to implement for simulation optimization applications, albeit costly in high-dimensional settings.
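A minimal KW sketch under the same toy setup (our example): only noisy function evaluations are available, the gradient is the symmetric difference of (2.6), and the tuning sequences are the common $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$.

```python
import numpy as np

def kiefer_wolfowitz(x1=0.0, theta_a=1.0, theta_c=1.0, N=10_000,
                     noise_sd=1.0, seed=4):
    """KW recursion (2.6) for J(x) = (x - 5)^2 observed with additive noise:
    Y(x) = J(x) + noise. a_n = theta_a/n and c_n = theta_c/n^{1/4} satisfy
    the conditions of Theorem 2.2.2.
    """
    rng = np.random.default_rng(seed)
    Y = lambda x: (x - 5.0) ** 2 + noise_sd * rng.standard_normal()
    x = x1
    for n in range(1, N + 1):
        a_n, c_n = theta_a / n, theta_c / n ** 0.25
        g = (Y(x + c_n) - Y(x - c_n)) / (2 * c_n)   # symmetric difference
        x -= a_n * g
    return x

print(kiefer_wolfowitz())   # should approach x* = 5
```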
2.2.2 Robust Gradient

2.2.2.1 Simultaneous Perturbation Stochastic Approximation

Simultaneous perturbation stochastic approximation (SPSA) specifically addresses multivariate optimization problems [55]. Let $\epsilon_n^+$ and $\epsilon_n^-$ be the noise from the sample performances $Y(x_n + c_n \Delta_n, \xi_n^+)$ and $Y(x_n - c_n \Delta_n, \xi_n^-)$, respectively.

SPSA Algorithm
• Input. Choose $x_1 \in \Theta$, $\{a_n\}$, $\{c_n\}$, and stopping time $N$.
• Initialize. Let $n = 1$.
• While $n < N$:
  - Step 1. Generate a $d$-dimensional random perturbation vector $\Delta_n$.
  - Step 2. Generate an estimate of $\nabla J(x_n)$:
  $$\widehat{\nabla} J(x_n) = \frac{Y(x_n + c_n \Delta_n, \xi_n^+) - Y(x_n - c_n \Delta_n, \xi_n^-)}{2c_n} \begin{bmatrix} \Delta_{n,1}^{-1} \\ \vdots \\ \Delta_{n,d}^{-1} \end{bmatrix}$$
  - Step 3. Compute $x_{n+1} = x_n - a_n \widehat{\nabla} J(x_n)$.
  - Step 4. Let $n = n + 1$. Go to Step 1.
• Output. $x_N^* = x_N$.

Theorem 2.2.3. (Theorem 7.1 [57]) Suppose $J$ has a unique minimum $x^* \in \Theta$ and $\{x_n\}$ is generated using SPSA. If the following conditions hold:

1. The positive sequences of real numbers $\{a_n\}$ and $\{c_n\}$ converge to zero such that $\sum_{n=0}^\infty a_n = \infty$ and $\sum_{n=0}^\infty a_n^2 c_n^{-2} < \infty$.
2. The function $J(x) \in C^3$ and is bounded on $\mathbb{R}^d$.
3. $\|x_n\| < \infty$ for all $n$.
4. $E[\epsilon_n^+ - \epsilon_n^- \mid \Delta_n, \mathcal{F}_n] = 0$ and $E[(Y(x_n \pm c_n \Delta_n, \xi_n^\pm)/\Delta_{n,i})^2]$ is uniformly bounded for all $n, i$.
5. $x^*$ is an asymptotically stable solution of the differential equation $\partial x(t)/\partial t = -\nabla J(x(t))$.
6. For each $n$, $\{\Delta_{n,i}\}_{i=1}^d$ are identically distributed, and $\{\Delta_{n,i}\}$ are independent and symmetrically distributed with zero mean and uniformly bounded in magnitude for all $n, i$.

Then $x_n \to x^*$ a.s. as $n \to \infty$.

The optimal convergence rate for SPSA is $O(n^{-1/3})$ [55]. Various convergence proofs have been presented with slight modifications to the conditions (cf. [12, 17, 30, 55, 60]). The perturbation sequence $\{\Delta_n\}$, where $\Delta_n = (\Delta_{n,1}, \ldots, \Delta_{n,d})$, must have the $\{\Delta_{n,i}\}$ independent with mean zero and finite inverse moments (i.e., $E[\Delta_{n,i}] = 0$ and $E[|\Delta_{n,i}|^{-1}] < \infty$ for $i = 1, \ldots, d$) to guarantee a.s. convergence when applied to SA. As a result, the Gaussian distribution is not applicable. Instead, the most common distribution used is the symmetric Bernoulli, taking a positive and a negative value (e.g., $\pm 1$) each with probability 1/2. In addition, an appropriately scaled $x_n$ is approximately normal for large $n$, and the relative efficiency of SPSA depends on the geometric shape of $J(x)$, the choice of $\{a_n\}$ and $\{c_n\}$, the distribution of $\{\Delta_{n,i}\}$, and the noise level.

Many extensions to the original SPSA algorithm have been developed, for example, for the constrained setting using projection operators [26, 54]. A slight modification is the averaging of SPSA gradient estimators, where instead of generating one gradient estimate at each iteration, multiple gradient estimates can be generated at additional computational cost and averaged to reduce the noise. An accelerated form of SPSA approximates the second-order Hessian $\nabla^2 J(x)$ to accelerate convergence [57], analogous to the Newton-Raphson method. Iterate averaging in the SPSA setting has also been explored, but it performs relatively poorly in finite time [17, 56]. All in all, SPSA has been shown to be an effective SA method for tackling high-dimensional problems, with ease of implementation and the asymptotic theory to support it.
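A compact sketch of the SPSA loop above on an illustrative quadratic in $d = 10$ dimensions (our test function), with symmetric Bernoulli $\pm 1$ perturbations and the standard sequences $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$; note that each iteration costs two simulations regardless of $d$.

```python
import numpy as np

def spsa(Y, x1, N=5_000, theta_a=0.5, theta_c=1.0, seed=5):
    """SPSA: two noisy evaluations per iteration for any dimension d."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x1, dtype=float)
    for n in range(1, N + 1):
        a_n, c_n = theta_a / n, theta_c / n ** 0.25
        delta = rng.choice([-1.0, 1.0], size=x.size)   # symmetric Bernoulli
        g = (Y(x + c_n * delta) - Y(x - c_n * delta)) / (2 * c_n * delta)
        x -= a_n * g
    return x

rng = np.random.default_rng(6)
Y = lambda x: np.sum((x - 1.0) ** 2) + 0.1 * rng.standard_normal()  # noisy J
print(spsa(Y, x1=np.zeros(10)))   # should approach (1, ..., 1)
```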
2.2.2.2 Gradient Averaging

Gradient averaging can help stabilize the gradient estimate, especially if it is noisy. One straightforward method is to generate multiple gradient estimates at each iteration and average them to provide a better estimate, but this process can be expensive and, according to Robbins and Monro, is not worth the computational effort. In SA, the gradient estimate is only an intermediate step that provides a direction for the iterates to move; the goal is to find the optimum $x^*$, not the best estimate of the gradient, as in sensitivity analysis. The computational cost could be better expended on future iterations. Another form of gradient averaging uses previous gradient estimates in the current gradient approximation. One of the earliest proposed gradient averaging methods was introduced in [22] and [27], where "present" and "past" data are used to improve the convergence properties of SA. The method introduced in [22] considers all past gradients, with the gradient update of the form

$$d_{n+1} = d_n + b_n\left(\widehat{\nabla} J(x_n) - d_n\right),$$

where $d_n$ represents the previous direction, $\widehat{\nabla} J(x_n)$ is the new gradient estimate, $b_n$ is the averaging coefficient, and $d_{n+1}$ is the new updated direction used in the stochastic approximation algorithm. Later, [27] proposed a modified version of this gradient averaging technique, which involves an additional step:

$$d_{n+1} = \bar{d}_n + b_n\left(\widehat{\nabla} J(x_n) - \bar{d}_n\right), \qquad \bar{d}_n = d_n + T(x_n, \Delta_n),$$

where $\Delta_n = x_n - x_{n-1}$, $T(x_n, \Delta_n)$ represents an updating function involving higher derivatives, and $d_{n+1}$ is the new gradient estimate used in SA, which incorporates higher derivatives. In addition, [27] propose an updating method that averages past gradient estimates if the change in estimates does not surpass a certain value.

2.2.3 Adaptive Step Sizes

2.2.3.1 Kesten's Rule

It is well known that the classical SA algorithms are extremely sensitive to the step size sequence $\{a_n\}$. Therefore, it can be advantageous to consider adaptive step sizes that adjust based on the ongoing performance of the algorithm, in hopes of adapting to the characteristics of the function at the current location of the iterate and to the proximity of the current iterate to the optimum. Kesten's rule [36] decreases the step size only when there is a directional change in the iterates. The notion behind this adaptive step size is that if the iterates continue in the same direction, there is reason to believe they are approaching the optimum, and the pace should not be decreased, in order to accelerate convergence. If the errors in the estimates change signs, it is an indication either that the step size is "too large" and the iterates are experiencing long oscillation periods, or that the iterates are in the vicinity of the true optimum; either way, the step size should be reduced, to a more appropriate value in the first case or to hone in on $x^*$ in the second. The following algorithm is for the one-dimensional case $d = 1$.

SA Algorithm using Kesten's Rule
• Input. Choose $x_1 \in \Theta$, $\{a_n\}$, $\Pi_\Theta$, and stopping time $N$.
• Initialize.
  - Let $n = 2$ and $k = 1$.
  - Generate an estimate $\widehat{\nabla} J(x_1)$ of $\nabla J(x_1)$.
  - Compute $x_2 = \Pi_\Theta(x_1 - a_1 \widehat{\nabla} J(x_1))$.
• While $n < N$:
  - Step 1. Generate an estimate $\widehat{\nabla} J(x_n)$ of $\nabla J(x_n)$.
  - Step 2. Compute $x_{n+1} = \Pi_\Theta(x_n - a_k \widehat{\nabla} J(x_n))$. If $(x_{n+1} - x_n)(x_n - x_{n-1}) < 0$, go to Step 3. Otherwise, go to Step 4.
  - Step 3. Let $n = n + 1$ and $k = k + 1$. Go to Step 1.
  - Step 4. Let $n = n + 1$. Go to Step 1.
• Output. $x_N^* = x_N$.

Kesten's rule can be applied to both RM and KW and still guarantees convergence in probability, as long as $\{a_n\}$ satisfies condition 1 of Theorems 2.2.1 and 2.2.2 for RM and KW, respectively [36]. An extension of Kesten's rule to higher dimensions is discussed in [15]. See [29] for an extensive review of both deterministic and stochastic step sizes.
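The sketch below implements the rule above: the step size index $k$ advances only when consecutive increments change sign. The gradient routine grad_est and the interval projection are illustrative assumptions, as before.

```python
def sa_kesten(grad_est, x1, lower, upper, N, theta_a=1.0):
    """One-dimensional SA with Kesten's rule: a_k = theta_a/k, where the
    index k is incremented only when (x_{n+1}-x_n)(x_n-x_{n-1}) < 0,
    i.e., when the iterates change direction.
    """
    k = 1
    x_prev = x1
    x = min(max(x1 - theta_a * grad_est(x1), lower), upper)   # first step, a_1
    for n in range(2, N):
        x_next = x - (theta_a / k) * grad_est(x)
        x_next = min(max(x_next, lower), upper)               # projection Pi_Theta
        if (x_next - x) * (x - x_prev) < 0:                   # direction change
            k += 1                                            # decrease step size
        x_prev, x = x, x_next
    return x
```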
2.2.3.2 Scaled-and-Shifted Kiefer-Wolfowitz

The scaled-and-shifted Kiefer-Wolfowitz (SSKW) algorithm [7] adaptively adjusts $\{a_n\}$ and $\{c_n\}$ finitely many times during the course of the algorithm to adapt to the characteristics of the function and the noise level, in hopes of preventing slow convergence in finite time. The idea is to increase $\{a_n\}$ so the iterates are able to make noticeable progress towards the optimum, with the option of decreasing $\{a_n\}$ later if it is too large. Furthermore, if the direction of the gradient is classified as incorrect, then $\{c_n\}$ is increased to reduce the noise. Note that KW only requires two parameter choices, $\{a_n\}$ and $\{c_n\}$, whereas SSKW requires eleven, as seen in the algorithm below.

SSKW Algorithm

Scaling Phase

• Input. $\{a_n\}$, $\{c_n\}$, $\Theta = [l, u]$, $\Pi_\Theta$, stopping time $N$, and
  – $h_0$ = number of forced boundary hits,
  – $\gamma_0$ = scale-up factor for $\{c_n\}$,
  – $k_a$ = maximum number of shifts of $\{a_n\}$,
  – $v_a$ = initial upper bound of shift,
  – $\varphi_a$ = maximum scale-up factor for $\{a_n\}$,
  – $k_c$ = maximum number of scale-ups for $\{c_n\}$,
  – $c_0$ = maximum value of $\{c_n\}$ after scale-ups (i.e., $c_n \le c_{\max} = c_0(u - l)$),
  – $g_0$ = maximum number of gradient estimates in the scaling phase,
  – $m_{\max}$ = maximum number of adaptive iterations ($m_{\max} \le N$).
• Initialize.
  – Choose $x_1 \in [l + c_1, u - c_1]$.
  – Let $n = 1$, $m = 1$, $g = 1$, $sh = 0$, and $sc = 0$.
• Do while $m \le h_0$ and $g \le g_0$.
  – Step 1.
    * Generate an estimate $\widehat{\nabla}J(x_n)$ using symmetric differences.
    * Compute $x_{n+1} = \Pi_\Theta(x_n - a_n\widehat{\nabla}J(x_n))$.
      · If $x_{n+1} \in (l + c_n, x_n)$, go to Step 2.
      · If $x_{n+1} \in (x_n, u - c_n)$, go to Step 3.
      · If $x_{n+1} > u - c_{n+1}$ and $x_n = u - c_n$, or if $x_{n+1} < l + c_{n+1}$ and $x_n = l + c_n$, go to Step 4 if $sc \le k_c$.
      · If $x_{n+1} > u - c_{n+1}$ and $x_n = l + c_n$, or if $x_{n+1} < l + c_{n+1}$ and $x_n = u - c_n$, go to Step 5.
  – Step 2.
    * Scale $\{a_n\}$ up by $\alpha = \min\{\varphi_a, (l + c_{n+1} - x_n)/(x_{n+1} - x_n)\}$ and use $\{\alpha a_n\}$ for the remaining iterations.
    * Set $x_{n+1} = l + c_{n+1}$. Let $n = n + 1$, $m = m + 1$, $g = g + 1$ and go to Step 1.
  – Step 3.
    * Scale $\{a_n\}$ up by $\alpha = \min\{\varphi_a, (u - c_{n+1} - x_n)/(x_{n+1} - x_n)\}$ and use $\{\alpha a_n\}$ for the remaining iterations.
    * Set $x_{n+1} = u - c_{n+1}$. Let $n = n + 1$, $m = m + 1$, $g = g + 1$ and go to Step 1.
  – Step 4.
    * Scale $\{c_n\}$ up by $\gamma = \min\{\gamma_0, c_{\max}/c_n\}$ and use $\{\gamma c_n\}$ for the remaining iterations.
    * Let $sc = sc + 1$ and go to Step 5.
  – Step 5.
    * Set $x_{n+1} = \min\{u - c_{n+1}, \max\{x_{n+1}, l + c_{n+1}\}\}$.
    * Let $n = n + 1$, $g = g + 1$ and go to Step 1.

Shifting Phase

• While $n \le m_{\max}$ and $n \le N$,
  – Step 1.
    * Generate an estimate $\widehat{\nabla}J(x_n)$ using symmetric differences.
    * Compute $x_{n+1}$ using (1.2).
      · If $x_{n+1} > u - c_{n+1}$ and $x_n = l + c_n$, or if $x_{n+1} < l + c_{n+1}$ and $x_n = u - c_n$, go to Step 2 if $sh \le k_a$.
      · If $x_{n+1} > u - c_{n+1}$ and $x_n = u - c_n$, or if $x_{n+1} < l + c_{n+1}$ and $x_n = l + c_n$, go to Step 3 if $sc \le k_c$.
      · Otherwise, go to Step 4.
  – Step 2.
    * Find the smallest integer $\beta' \ge 0$ such that $x_{n+1} \in (l + c_n, u - c_n)$ when the update uses $a_{n+\beta'}$.
    * Set $\beta = \min(v_a, \beta')$ and shift $\{a_n\}$ to $\{a_{n+\beta}\}$. If $\beta = v_a$, set $v_a = 2v_a$.
    * Let $sh = sh + 1$ and go to Step 4.
  – Step 3.
    * Scale $\{c_n\}$ up by $\gamma = \min\{\gamma_0, c_{\max}/c_n\}$ and use $\{\gamma c_n\}$ for the remaining iterations.
    * Let $sc = sc + 1$ and go to Step 4.
  – Step 4.
    * Set $x_{n+1} = \min\{u - c_{n+1}, \max\{x_{n+1}, l + c_{n+1}\}\}$.
    * Let $n = n + 1$ and go to Step 1.

KW Algorithm

• If $n > m_{\max}$ and $n < N$, then SSKW reverts back to KW and stops when $n = N$.
• Output. $x_N^* = x_N$.

The SSKW algorithm has two pre-processing phases, scaling and shifting, which adjust the tuning sequences in order to improve finite-time performance before reverting back to the original KW algorithm. In the scaling phase, $\{a_n\}$ is scaled up by a factor $\alpha$, i.e., $\{a_n\}$ to $\{\alpha a_n\}$, so the iterates can move from one boundary to the other, ensuring the step sizes are not too small relative to the gradient. In the shifting phase, the sequence $\{a_n\}$ is decreased by shifting or "skipping" a finite number $\beta$ of terms, from $\{a_n\}$ to $\{a_{n+\beta}\}$, when an iterate falls outside the feasible region while the sign of the gradient is correct. This acts as a recourse stage and reduces the step size faster in case the step size sequence $\{a_n\}$ is too large. During both phases, $c_n$ is scaled up by $\gamma$, i.e., $\{c_n\}$ to $\{\gamma c_n\}$, if the previous iterate is at the boundary and the update falls outside the feasible region while moving in the wrong direction. This increase is an attempt to reduce the noise of the gradient estimate.
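The three sequence adjustments (scale up $\{a_n\}$, shift $\{a_n\}$, scale up $\{c_n\}$) are easy to express by treating the sequences as callables, as in the short Python sketch below; the particular values of $\alpha$, $\beta$, and $\gamma$ are the quantities SSKW computes internally and are shown here only as illustrative placeholders.

def scale(seq, factor):            # {a_n} -> {factor * a_n}
    return lambda n: factor * seq(n)

def shift(seq, beta):              # {a_n} -> {a_(n+beta)}: skip beta terms
    return lambda n: seq(n + beta)

a = lambda n: 1.0 / n              # illustrative base sequences
c = lambda n: 0.1 * n ** -0.25

a = scale(a, 4.0)                  # scaling phase: alpha = 4 (placeholder)
a = shift(a, 3)                    # shifting phase: beta = 3 (placeholder)
c = scale(c, 2.0)                  # noise reduction: gamma = 2 (placeholder)
print(a(1), c(1))                  # first adjusted step/perturbation sizes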
These adjustments do not affect the asymptotic convergence, since the scaling phase only scales the sequences by a constant, and the shifting phase only scales up $\{c_n\}$ finitely many times and skips a finite number of terms in $\{a_n\}$.

2.2.4 Robust Output

2.2.4.1 Averaging Iterates

Iterate averaging, introduced in [51] and [49], approaches SA from a different angle. Instead of fine-tuning the step sizes to adapt to the function characteristics, iterate averaging takes bigger steps (i.e., $a_n$ larger than $O(n^{-1})$) so that the estimates oscillate around the optimum, and the average of the iterates then provides a good approximation to the true optimum. The idea is simple, yet can be very effective. It is easy to see that for this method to be successful, it is essential that the iterates surround the optimum in a balanced manner and that the domain in which the iterates oscillate shrinks as $n$ increases. Averaging trajectories reduces the sensitivity to the initial step size choice. The algorithm follows recursion (1.2) for the RM case; however, instead of taking the last iterate $x_N$ as the output, the optimum is estimated by
$$x_N^* = \frac{1}{N}\sum_{n=1}^N x_n,$$
an average of $N$ iterates, where $N$ is the stopping time. Under "classic" assumptions, iterate averaging achieves the same convergence rate as the RM method. Furthermore, $\sqrt{n}(x_n^* - x^*)$ is asymptotically normal with mean zero and the smallest covariance matrix, which is the inverse of the average Fisher information matrix (cf. [49]). A constant step size can be applied and yields convergence in distribution [42]. A variation of this method is the "sliding window" average, which is based on the last $m$ iterates:
$$x_N^* = \frac{1}{m}\sum_{n=N-m+1}^N x_n. \qquad (2.7)$$
An advantage of (2.7) is that it ignores the first $N - m$ iterates, which may be poor estimates since the first iterate is arbitrary, and averages only the last $m$, which are assumed to be closer to $x^*$. Asymptotic normality for a growing window is shown in [40] and [42], where the latter also covers constant step sizes. Another modification of the original method incorporates the previous averaged output $x_{N-1}^*$ into the components being averaged, which is known as the feedback approach [41]. These methods are suited for problems where the iterates hover around the optimum. In an empirical study, iterate averaging was applied to SPSA [44]. The results suggest that averaging is ideal when the Hessian of $J(x)$ is large, since a large Hessian is associated with high variability in $J(x)$, which indicates the iterates are moving around the optimum. In general, averaging iterates leads to more robustness with respect to the step size sequence because of the reduced sensitivity, while converging at the same optimal asymptotic rate as RM. Inspired by iterate averaging, a weighted average for KW was presented that achieves the optimal asymptotic convergence rate $O(n^{-1/2})$ under certain conditions [17]. Under certain parameter settings, iterate averaging and weighted averaging produce the same estimator.

2.2.4.2 Robust Stochastic Approximation

The robust SA (RSA) method is intended to be relatively insensitive to the choice of the step size sequence, similar to Polyak-Ruppert iterate averaging. The form of RSA is identical to (1.2) with the exception of the output. Instead of $x_N^* = x_N$, where $x_N$ is the last iterate, $x_N^*$ is calculated as
$$x_N^* = \frac{\sum_{n=1}^N a_n x_n}{\sum_{n=1}^N a_n},$$
where $a_n > 0$ for all $n$. It is clear that if $a_n = a$ for some $a \in \mathbb{R}^+$ and all $n$, then $x_N^* = \frac{1}{N}\sum_{n=1}^N x_n$, giving the uniformly weighted average of Polyak-Ruppert.
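The three outputs just described, the uniform Polyak-Ruppert average, the sliding-window average (2.7), and the RSA weighted average, differ only in how the stored iterates are combined. The Python sketch below illustrates all three on an assumed noisy quadratic; the step size exponent 0.7 is an illustrative choice of a sequence decaying more slowly than $O(n^{-1})$.

import numpy as np

rng = np.random.default_rng(2)

def grad_est(x):
    return 2.0 * x + rng.normal(scale=1.0)   # assumed noisy quadratic J = x^2

N, m = 1000, 100
a = 10.0 * np.arange(1, N + 1) ** -0.7       # a_n larger than O(1/n)
xs = np.empty(N)
x = 30.0
for n in range(N):
    x = x - a[n] * grad_est(x)
    xs[n] = x

x_pr = xs.mean()                     # Polyak-Ruppert average of all iterates
x_win = xs[-m:].mean()               # sliding-window average of the last m
x_rsa = np.average(xs, weights=a)    # RSA output: step-size-weighted average
print(x_pr, x_win, x_rsa)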
As mentioned earlier, iterate averaging under a constant step size for a moving window is asymptotically normal [42]. A finite-time bound on $E[J(x_N^*) - J(x^*)]$ was derived for RSA when $J$ is assumed convex [45]. Assume there exists $C > 0$ such that $E[\|\widehat{\nabla}J(x)\|^2] \le C^2$ for all $x \in \Theta$. Then, for an $N$-step iteration policy,
$$E[J(x_N^*) - J(x^*)] \le \frac{\|x_0 - x^*\|^2 + C^2\sum_{n=1}^N a_n^2}{2\sum_{n=1}^N a_n}. \qquad (2.8)$$
For equal weights or iterate averaging, the bound on the right-hand side of (2.8) is minimized if
$$a_n = a := \frac{D_\Theta}{C\sqrt{N}},$$
where $D_\Theta = \max_{x,y\in\Theta}\|x - y\|$. Using the distance $\|x_0 - x^*\|$ in place of $D_\Theta$ tightens the bound in (2.8), but $x^*$ is unknown, so the improvement may not be practically meaningful. This step size requires the number of iterations $N$ to be fixed in advance. Similar to iterate averaging, a sliding-window average can also be employed in RSA. The estimate consists of the last $N - K + 1$ iterates and has the form
$$x_{N,K}^* = \frac{\sum_{n=K}^N a_n x_n}{\sum_{n=K}^N a_n}.$$
If we consider the varying step size $a_n = \theta D_\Theta/(C\sqrt{n})$ for $\theta > 0$, then we have the bound
$$E[J(x_{N,K}^*) - J(x^*)] \le \frac{D_\Theta C}{\sqrt{N}}\left[2\theta\left(\frac{N}{N-K+1}\right) + \frac{\theta}{2}\sqrt{\frac{N}{K}}\right]$$
for $1 \le K \le N$.

2.2.4.3 Accelerated Stochastic Approximation

The accelerated SA (AC-SA) algorithm [32] takes a similar approach to iterate averaging and RSA by taking long strides and incorporating each of the iterates into the output. The next two algorithms, AC-SA for strongly convex functions and AC-SA for convex functions, take advantage of the smoothness of the function when it exists. AC-SA for convex functions is a special case of AC-SA for strongly convex functions, so we first introduce AC-SA for strongly convex functions and then restrict the strong convexity parameter of $J(\cdot)$ for the convex case.

AC-SA is an example of a proximal method, which introduces a proximity function into the objective. The prox-function acts as a regularization term to prevent the next iterate $x_{n+1}$ from being too far from $x_n$, and it is built from a distance generating function (or Bregman function) $\omega: \Theta \to \mathbb{R}$, which is continuously differentiable and strongly convex with modulus $\nu > 0$, satisfying
$$\langle x - y, \nabla\omega(x) - \nabla\omega(y)\rangle \ge \nu\|x - y\|^2 \quad \forall x, y \in \Theta,$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product. The prox-function associated with the given distance generating function is
$$V(x, y) = V_\omega(x, y) = \omega(y) - [\omega(x) + \langle\nabla\omega(x), y - x\rangle].$$
As $x_n \to x^*$, the regularization term vanishes, so minimizing $J(x)$ plus the regularizer is asymptotically equivalent to minimizing $J(x)$ alone.

Consider a strongly convex function $J(\cdot)$ satisfying
$$\frac{\mu}{2}\|y - x\|^2 \le J(y) - J(x) - \langle\nabla J(x), y - x\rangle \le \frac{L}{2}\|y - x\|^2 + M\|y - x\|$$
for all $x, y \in \Theta$, where $\mu > 0$ is the strong convexity parameter. If $M > 0$ and $L = 0$, then $J$ is Lipschitz continuous with Lipschitz constant $M/2$. If $M = 0$ and $L > 0$, then $J$ has Lipschitz continuous gradients with Lipschitz constant $L$.

The AC-SA algorithm updates three sequences, $\{x_n^{md}\}$, $\{x_n^{ag}\}$, and $\{x_n\}$. Here, "md" and "ag" are abbreviations for median and aggregate, respectively, where median is used in a loose sense.

Accelerated SA Method for Strongly Convex Functions

• Input. Specify $V(x, y)$, a stopping time $N$, and sequences $\{\alpha_n\}$ and $\{\gamma_n\}$ such that $\alpha_1 = 1$, $\alpha_n \in (0, 1)$ for $n \ge 2$, and $\gamma_n > 0$ for $n \ge 1$.
• Initialize. Choose $x_0^{ag} = x_0 \in \Theta$ and let $n = 1$.
• While $n \le N$,
  – Step 1. Compute
    $$x_n^{md} = \frac{\alpha_n[(1-\alpha_n)\mu + \gamma_n]}{\gamma_n + (1-\alpha_n^2)\mu}\,x_{n-1} + \frac{(1-\alpha_n)(\mu+\gamma_n)}{\gamma_n + (1-\alpha_n^2)\mu}\,x_{n-1}^{ag}.$$
  – Step 2. Generate an estimate $\widehat{\nabla}J(x_n^{md})$ of $\nabla J(x_n^{md})$ and compute
    $$x_n = \arg\min_{x\in\Theta}\big\{\alpha_n[\langle\widehat{\nabla}J(x_n^{md}), x\rangle + \mu V(x_n^{md}, x)] + [(1-\alpha_n)\mu + \gamma_n]V(x_{n-1}, x)\big\},$$
    $$x_n^{ag} = \alpha_n x_n + (1-\alpha_n)x_{n-1}^{ag}.$$
  – Step 3. Let $n = n + 1$ and go to Step 1.
• Output. $x_N^* = x_N^{ag}$.
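To make the recursion concrete, the following is a minimal Python sketch of AC-SA on an interval, assuming the Euclidean prox-function $V(x, y) = \frac{1}{2}(x-y)^2$ (so $\nu = 1$; see the note below), for which the prox step has the closed form used in the code. The test problem $J(x) = x^2$ (so $\mu = 2$, $L = 2$, $M = 0$) and the parameter choices $\alpha_n = 2/(n+1)$, $\gamma_n = 4L/[\nu n(n+1)]$ discussed below are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(3)

mu, L, nu = 2.0, 2.0, 1.0
lo, hi = -50.0, 50.0
proj = lambda x: min(max(x, lo), hi)     # projection onto Theta = [lo, hi]

def grad_est(x):
    return 2.0 * x + rng.normal(scale=1.0)   # noisy gradient of J(x) = x^2

N = 1000
x = x_ag = 30.0
for n in range(1, N + 1):
    alpha = 2.0 / (n + 1)
    gamma = 4.0 * L / (nu * n * (n + 1))
    denom = gamma + (1.0 - alpha ** 2) * mu
    x_md = (alpha * ((1 - alpha) * mu + gamma) * x
            + (1 - alpha) * (mu + gamma) * x_ag) / denom
    g = grad_est(x_md)
    # closed-form prox step for V(x,y) = 0.5*(x-y)^2 on an interval
    x = proj((alpha * mu * x_md + ((1 - alpha) * mu + gamma) * x
              - alpha * g) / (mu + gamma))
    x_ag = alpha * x + (1 - alpha) * x_ag
print(x_ag)                              # AC-SA output x*_N = x^ag_N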
Note: $V(x, y) = \frac{1}{2}\|x - y\|^2$, using the Euclidean norm with $\nu = 1$, is a common prox-function. Refer to [31] for details.

Theorem 2.2.4. (Theorem 1 [31]) Assume $V(x, y) \le \frac{1}{2}\|x - y\|^2$ for all $x, y \in \Theta$ when $\mu > 0$, and $E[(\widehat{\nabla}J(x) - \nabla J(x))^2] \le \sigma^2$ for all $x \in \Theta$. Choose $\{\alpha_n\}$ and $\{\gamma_n\}$ such that
$$\nu(\mu + \gamma_n) > L\alpha_n^2, \qquad (2.9)$$
$$\gamma_n/\Gamma_n = \gamma_{n+1}/\Gamma_{n+1} \text{ for } n \ge 1, \qquad (2.10)$$
where
$$\Gamma_n = \begin{cases} 1 & \text{if } n = 1;\\ (1 - \alpha_n)\Gamma_{n-1} & \text{if } n \ge 2.\end{cases}$$
Then,
$$E[J(x_N^{ag}) - J(x^*)] \le \Gamma_N\left(\gamma_1 V(x_0, x^*) + \sum_{n=1}^N \frac{2(M^2 + \sigma^2)\alpha_n^2}{\Gamma_n[\nu(\mu + \gamma_n) - L\alpha_n^2]}\right). \qquad (2.11)$$

Consider $\alpha_n = 2/(n+1)$, $\gamma_n = 4L/[\nu n(n+1)]$, and $\Gamma_n = 2/[n(n+1)]$. It can easily be checked that these choices satisfy conditions (2.9) and (2.10). Under these choices, the right-hand side of (2.11) can be bounded by
$$\frac{4LV(x_0, x^*)}{\nu N(N+1)} + \frac{8(M^2 + \sigma^2)}{\nu\mu(N+1)} \qquad (2.12)$$
for $\mu > 0$. The bounds in (2.11) and (2.12) rely on additional information about the function and its gradient, which is unknown, so it must be approximated.

AC-SA for convex functions is a special case of AC-SA for strongly convex functions with $\mu = 0$. The algorithm is identical to AC-SA for strongly convex functions with the exception of the $x_n^{md}$ and $x_n$ updates, since $\mu = 0$. The resulting updates are
$$x_n^{md} = \alpha_n x_{n-1} + (1 - \alpha_n)x_{n-1}^{ag}, \qquad x_n = \arg\min_{x\in\Theta}\{\alpha_n\langle\widehat{\nabla}J(x_n^{md}), x\rangle + \gamma_n V(x_{n-1}, x)\}.$$
Interestingly, if $V(x, y) = \frac{1}{2}\|x - y\|^2$, then the update for $x_n$ simplifies to
$$x_n = \Pi_\Theta\left(x_{n-1} - \frac{\alpha_n}{\gamma_n}\widehat{\nabla}J(x_n^{md})\right), \qquad (2.13)$$
which has a similar form to the standard SA algorithm. Notice that in the update for $x_n$ in (2.13), $\alpha_n/\gamma_n$ takes the place of the step size $a_n$ in (1.2), and the gradient estimate $\widehat{\nabla}J$ is evaluated at $x_n^{md}$ as opposed to $x_{n-1}$. If we consider the same parameter setting as in the strongly convex case, the "step size" $\alpha_n/\gamma_n$ increases with $n$. Furthermore, lower and upper bounds for the optimal objective function value can be computed online, and their difference converges to 0 as the number of iterations increases to infinity [31].

Theorem 2.2.5. (Proposition 7 [31]) Assume that the assumptions in Theorem 2.2.4 hold for $\mu = 0$, and consider the sequences $\alpha_n = 2/(n+1)$ and $\gamma_n = 4\gamma/[\nu n(n+1)]$ for $\gamma \ge 2L$. Then
$$E[J(x_N^{ag}) - J(x^*)] \le \frac{4\gamma V(x_0, x^*)}{\nu N(N+1)} + \frac{4(M^2 + \sigma^2)(N+2)}{3\gamma}, \qquad (2.14)$$
and
$$\gamma = \max\left\{2L, \left[\frac{\nu(M^2 + \sigma^2)N(N+1)(N+2)}{3V(x_0, x^*)}\right]^{1/2}\right\} \qquad (2.15)$$
minimizes the bound in (2.14).

2.2.4.4 Numerical Comparison

We investigate the MSE performance of RSA and AC-SA using direct gradient estimates and compare the results against the classical RM algorithm and RM with iterate averaging. We consider the optimal parameter settings for RSA and AC-SA, which require additional knowledge of the function, its gradient, and the optimum, so in practice they must be approximated. We consider a simple quadratic function, $f(x) = \frac{1}{3}x^2$, on the truncated intervals $[-50, 50]$ and $[-5, 95]$, with $x_1 = 30.0$, $\sigma = 1.0$, and 1000 sample paths. For the RM and RM-with-iterate-averaging algorithms, we employ a common step size $a_n = \theta_a/n$, where $\theta_a = 10.0$. RM performed relatively well for a wide range of multiplicative constants. We chose $\theta_a = 10.0$, although it did not yield the lowest MSE at the 1000th or 10000th iteration in preliminary numerical tests. For RSA, we adopt the constant step size that minimizes the finite-time bound in (2.8), where $C = 100/3$ and $190/3$ for the intervals $[-50, 50]$ and $[-5, 95]$, respectively, and $D_\Theta = 100$. For the AC-SA algorithm, we consider $\alpha_n = 2/(n+1)$ and $\gamma_n = 4\gamma/[n(n+1)]$, where $\gamma$ is given in (2.15) with $\nu = 1$, $L = 2/3$, and $M = 0$. Figure 2.1 plots the MSE as a function of the number of iterations from 1 to 10000 on a log scale. The results for both the centered and skewed truncated intervals exhibit the same behavior across all four algorithms.
RM performs well with a good parameter choice, although not the best one here, and averaging the iterates improves the performance, resulting in a smoother, monotonically decreasing MSE curve as the number of iterations increases. Compared to a reasonably tuned RM and RM with iterate averaging, RSA appears to be inferior, at least in this simple numerical experiment.

Figure 2.1: MSE under AC-SA, RM, RM w/averaging, and RSA for $f(x) = \frac{1}{3}x^2$, $x_1 = 30.0$, $\sigma = 1.0$: (a) $\Theta = [-50, 50]$; (b) $\Theta = [-5, 95]$.

The most interesting curve is that of the AC-SA algorithm, where one can observe periodic oscillations that decrease in magnitude as the number of iterations increases. We investigated this behavior further by analyzing individual sample paths, and the estimates $\{x_n^{ag}\}$ exhibit the same behavior, following a smooth oscillating path. From Figure 2.1, the AC-SA curve appears to level off and hover slightly above the RSA curve. Because of the oscillations, the stopping time dictates the relative performance of AC-SA when the number of iterations is small. For the skewed interval, there is a small range of iterations where AC-SA outperforms RSA, RM, and RM with iterate averaging, as well as other small ranges where it outperforms RSA. Keep in mind that these experiments are for a simple quadratic function in a particular setting, so the relative performance will most likely change in a different setting. From our numerical experiments, one can conclude that RM and RM with iterate averaging have the potential to outperform RSA and AC-SA if the step size parameter is chosen appropriately, which holds for a wide range of choices. In this case, iterate averaging improves the performance of RM for all 10000 iterations. Both the AC-SA and RSA algorithms require additional knowledge to choose the optimal step size that minimizes the bounds in (2.11) and (2.8), respectively.

2.2.5 Varying Bounds

Initially, the asymptotic theory for SA only considered functions satisfying specific global conditions; however, it was subsequently shown that the requirements need only hold on a compact set $\Theta \subset \mathbb{R}^d$ containing the optimum. Therefore, the projection operator is particularly important in the constrained optimization setting. Since the optimum is unknown, the compact set should be large enough that $x^* \in \Theta$ with high probability; however, this may increase the potential for an algorithm to perform poorly due to the size of the parameter search space [2]. For instance, if the compact set is very large, the step size is extremely small, and the current iterate is extremely far from the optimum, then convergence is likely to be slow; however, if the compact set is small and contains the optimum, then the iterates will never be too far from the optimum. Even if the step sizes are small, convergence will be much faster in comparison to the algorithm restricted to a much larger set.

One of the first ideas was to project the iterates onto a predetermined fixed point once the magnitude of the iterate surpassed an arbitrarily specified threshold, with the threshold increasing after each time it is exceeded [13]. This method converges asymptotically, but in practice it has its pitfalls. When an iterate is projected onto an arbitrary fixed point, the algorithm in a sense restarts from this "initial" value with a smaller sequence of step sizes.
Not only does it lose all of the progress gained from the iterations prior to the projection, but the reduction in step size could hinder convergence by moving even more slowly towards the optimum. To circumvent this issue, it was shown that it suffices to project the iterates onto a predetermined bounded set [64]. This is a slight improvement, since the iterates do not restart from the same position with an even smaller step size. However, it still has its limitations, since the initial start values are restricted to the predetermined compact set. Later, an algorithm defined over a growing feasible region was introduced [2], writing $\Theta$ as an increasing sequence of compact sets (i.e., $\Theta_n \subseteq \Theta_{n+1}$, where $\Theta = \cup_n\Theta_n$). The orthogonal projection operator changes from $\Pi_{\Theta_k}$ to $\Pi_{\Theta_{k+1}}$ if $x_{n+1} \notin \Theta_k$. The idea is to start with a smaller feasible region $\Theta_1$ and only increase it when there is reason to believe the optimum $x^* \notin \Theta_1$ (i.e., when $x_k \notin \Theta_1$). Since the projection is made onto the current compact set $\Theta_m$, the progress gained up to that point is not lost. The feasible region $\Theta$ is written as $\cup_k\Theta_k$, so $x^* \in \Theta_k$ for some $k$. If $x^*$ is contained in one of the earlier compact sets and if the sets grow slowly, the empirical results can improve significantly. The key to the performance is choosing the sequence $\{\Theta_n\}$ appropriately. If it grows too quickly, the results might be very similar to those of the original SA. The following algorithm and convergence result are for the RM multidimensional case $d \ge 1$, where $\|\cdot\|$ denotes the Euclidean norm.

SA with Varying Bounds

• Input. Choose $x_1 \in \Theta_1$, $\{a_n\}$, and $\{\Theta_n\}$.
• Initialize. Let $n = 1$ and $m = 1$.
• While $n < N$,
  – Step 1. Generate an estimate $\widehat{\nabla}J(x_n)$ of $\nabla J(x_n)$.
  – Step 2. Compute $x'_{n+1} = x_n - a_n\widehat{\nabla}J(x_n)$. If $x'_{n+1} \in \Theta_m$, go to Step 3. Otherwise, go to Step 4.
  – Step 3. Let $x_{n+1} = x'_{n+1}$, $n = n + 1$ and go to Step 1.
  – Step 4. Let $x_{n+1} = \Pi_{\Theta_m}(x'_{n+1})$, $n = n + 1$, $m = m + 1$ and go to Step 1.
• Output. $x_N^* = x_N$.

Theorem 2.2.6. (Theorem 2 [2]) Let the sequence $\{x_n\}$ be generated using the above algorithm, and let $\epsilon_n = \widehat{\nabla}J(x_n) - E[\widehat{\nabla}J(x_n)|\mathcal{F}_n]$ and $\beta_n = E[\widehat{\nabla}J(x_n)|\mathcal{F}_n] - \nabla J(x_n)$, where $\mathcal{F}_n$ is the smallest $\sigma$-algebra used to generate $x_{n+1}$. If the following conditions hold:

1. The sequence $\{\Theta_k\}$ is a collection of compact convex sets such that $\Theta_k \subseteq \Theta_{k+1}$ for all $k$ and $\cup_{k=1}^\infty\Theta_k = \Theta$.
2. The positive sequences of real numbers $\{a_n\}$ and $\{c_n\}$ converge to zero such that $\sum_{n=1}^\infty a_n = \infty$, $\sum_{n=1}^\infty a_nc_n < \infty$, and $\sum_{n=1}^\infty a_n^2/c_n^2 < \infty$.
3. There exists $\sigma \ge 0$ such that $E[\|\epsilon_n\|^2|\mathcal{F}_n] \le \sigma^2c_n^{-2}(1 + \|x_n - x^*\|^2)$ a.s. for all $n$.
4. $\|\beta_n\|$ is bounded a.s. and $\sum_{n=1}^\infty a_n\|\beta_n\| < \infty$ a.s.
5. There exist a positive sequence of real numbers $\{M_n\}$ and an integer $N \ge 1$ such that $\sum_{n=1}^\infty a_n^2M_n^2 < \infty$ and, for all $n \ge N$, $\sup_{x\in\Theta_{n-1}}\|\nabla J(x)\| \le M_n$.
6. There exists a unique $x^* \in \Theta$ such that $\nabla J(x^*) = 0$, and for all $0 < \epsilon \le 1$, $\inf_{x\in\Theta:\,\epsilon\le\|x-x^*\|\le\epsilon^{-1}}\nabla J(x)^T(x - x^*) > 0$.

Then $x_n \to x^*$ a.s. as $n \to \infty$.

If an appropriate increasing sequence of compact sets is chosen, the finite-time performance can improve significantly, but this optimal choice is still an open problem.
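The mechanics of the algorithm are simple to sketch. The Python fragment below assumes nested boxes $\Theta_m = [-5m, 5m]^d$ and a noisy quadratic; both are illustrative choices, not from [2].

import numpy as np

rng = np.random.default_rng(4)

d = 2
def grad_est(x):
    return 2.0 * x + rng.normal(scale=1.0, size=d)  # assumed noisy quadratic

def proj_box(x, r):                  # orthogonal projection onto [-r, r]^d
    return np.clip(x, -r, r)

radius = lambda m: 5.0 * m           # Theta_m = [-5m, 5m]^d (assumption)
x = np.full(d, 3.0)                  # x_1 in Theta_1
m = 1
for n in range(1, 1001):
    y = x - (1.0 / n) * grad_est(x)
    if np.all(np.abs(y) <= radius(m)):
        x = y                        # update stays inside Theta_m
    else:
        x = proj_box(y, radius(m))   # project onto Theta_m, then grow it
        m += 1
print(x)                             # estimate of x* = 0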
Chapter 3

New Hybrid Stochastic Approximation Methods

3.1 Motivation

Theoretically, unbiased direct gradients lead to faster asymptotic convergence rates in SA algorithms compared to indirect gradient estimates, but this does not necessarily translate to better finite-time performance. In the deterministic setting, where the true gradient is known, indirect gradients are not useful, whereas in the stochastic setting, direct gradients are noisy, and indirect gradients contain information that can be used to better approximate the gradient and in turn accelerate the SA convergence to a neighborhood of the optimum. To the best of our knowledge, current SA algorithms use either direct gradients or indirect gradients, but not both simultaneously. Our algorithms exploit the additional information provided by indirect gradients by using a gradient that integrates both direct and indirect gradient estimates. In this chapter, we do the following:

1. We propose two new hybrid stochastic approximation algorithms, STAR-SA and STAR-SPSA, which incorporate both direct and indirect gradient estimates and are provably convergent.
2. We derive variance-minimizing weights for the STAR-SA and STAR-SPSA gradient estimates.
3. We show that the variance of the STAR-SA gradient is lower than that of RM and KW, and that the variance of the STAR-SPSA gradient is less than that of RM and SPSA, for a deterministic weight sequence.
4. For simple quadratic functions, we show that the MSE of the estimate generated using STAR-SA is strictly less than that of RM and KW under certain conditions.
5. We illustrate the robustness and effectiveness of STAR-SA and STAR-SPSA through numerical experiments.

3.2 Secant-Tangents AveRaged Stochastic Approximation

In this section, we introduce the one-dimensional Secant-Tangents AveRaged stochastic approximation (STAR-SA) algorithm, which incorporates both direct and indirect gradient estimates [10]. Figure 3.1 illustrates the STAR gradient estimator, which uses estimates of the sample performance $\tilde{f}$ and its gradient $\tilde{f}'$ at two points, $x_n + c_n$ and $x_n - c_n$, where the values $\tilde{f}(x_n + c_n)$ and $\tilde{f}(x_n - c_n)$ do not lie on the graph of the true function due to estimation noise. For the indirect gradient, the values $\tilde{f}(x_n + c_n)$ and $\tilde{f}(x_n - c_n)$ are used to compute a symmetric finite difference (Secant). The two direct gradient estimates, $\tilde{f}'(x_n + c_n)$ and $\tilde{f}'(x_n - c_n)$, can be averaged to provide another gradient estimate (Tangents AveRaged).

Figure 3.1: Illustration of the STAR gradient, where $\tilde{f}$ and $\tilde{f}'$ are estimates of $f$ and $f'$, respectively.
The Secant-Tangents AveRaged stochastic approximation (STAR-SA) gradient estimator is a convex combination of a symmetric finite difference gradient estimator (secant) and an average of two direct gradient estimators (tangents):
$$g_{STAR}(x_n) = \alpha_n\frac{\tilde{f}(x_n + c_n) - \tilde{f}(x_n - c_n)}{2c_n} + (1 - \alpha_n)\frac{\tilde{f}'(x_n + c_n) + \tilde{f}'(x_n - c_n)}{2}, \qquad (3.1)$$
where $\alpha_n \in [0, 1]$ and $c_n \in \mathbb{R}^+$ for all $n$, and $c_n \to 0$ as $n \to \infty$. The weight $\alpha_n$ in the convex combination is crucial to the performance of STAR-SA. In fact, for any non-zero and finite variance levels of the function and its gradient, there exists a sequence $\{\alpha_n\}$ such that the variance of the STAR-SA gradient estimator is less than that of the individual gradient estimates, the secant $g_S$ and the tangents averaged $g_{TAR}$. Now we present theoretical results pertaining to STAR-SA.
Let $\tilde{f}(x_n \pm c_n) = f(x_n \pm c_n) + \epsilon_n^\pm$ and $\tilde{f}'(x_n \pm c_n) = f'(x_n \pm c_n) + \delta_n^\pm$, where $\epsilon_n^\pm$ and $\delta_n^\pm$ are Gaussian noise with zero mean. For notational simplicity, let $Var(\tilde{f}(x_n)) = \sigma_f^2$, $Var(\tilde{f}'(x_n)) = \sigma_g^2$, $Var(g_{STAR}(x_n)) = \sigma_n^2$, $Var(g_S(x_n)) = Var\big(\frac{\tilde{f}(x_n+c_n) - \tilde{f}(x_n-c_n)}{2c_n}\big) = \frac{\sigma_f^2}{2c_n^2}$, $Var(g_{TAR}(x_n)) = Var\big(\frac{\tilde{f}'(x_n+c_n) + \tilde{f}'(x_n-c_n)}{2}\big) = \frac{\sigma_g^2}{2}$, and $Corr(\tilde{f}(x_n), \tilde{f}'(x_n)) = \rho$.

3.2.1 Optimal Convex Weight

3.2.1.1 Homogeneous Noise

The following result provides the particular weight that minimizes the variance of the STAR-SA gradient.

Lemma 3.2.1. For $g_{STAR}$ defined in (3.1), assume that for all $n$, $\epsilon_n^+ \perp \epsilon_n^-$ and $\delta_n^+ \perp \delta_n^-$. Then $Var(g_{STAR}(x_n))$ is minimized when
$$\alpha_n^* = \frac{\sigma_g^2c_n^2}{\sigma_f^2 + \sigma_g^2c_n^2}. \qquad (3.2)$$

Proof. By the orthogonality assumption, the variance of the STAR-SA gradient (3.1) can be written as
$$\sigma_n^2 = Var\left(\frac{\alpha_n}{2c_n}\tilde{f}(x_n+c_n) + \frac{1-\alpha_n}{2}\tilde{f}'(x_n+c_n)\right) + Var\left(-\frac{\alpha_n}{2c_n}\tilde{f}(x_n-c_n) + \frac{1-\alpha_n}{2}\tilde{f}'(x_n-c_n)\right)$$
$$= \left(\frac{\alpha_n}{2c_n}\right)^2\sigma_f^2 + \left(\frac{1-\alpha_n}{2}\right)^2\sigma_g^2 + 2\frac{\alpha_n}{2c_n}\frac{1-\alpha_n}{2}\rho\sigma_f\sigma_g + \left(\frac{\alpha_n}{2c_n}\right)^2\sigma_f^2 + \left(\frac{1-\alpha_n}{2}\right)^2\sigma_g^2 - 2\frac{\alpha_n}{2c_n}\frac{1-\alpha_n}{2}\rho\sigma_f\sigma_g$$
$$= \frac{\alpha_n^2}{2c_n^2}\sigma_f^2 + \frac{(1-\alpha_n)^2}{2}\sigma_g^2, \qquad (3.3)$$
where the first equality follows from the independence of the function/gradient estimates at different points. Differentiating (3.3) with respect to $\alpha_n$, we find
$$\frac{\partial\sigma_n^2}{\partial\alpha_n} = \frac{\alpha_n}{c_n^2}\sigma_f^2 - (1-\alpha_n)\sigma_g^2. \qquad (3.4)$$
Setting (3.4) equal to zero and solving for $\alpha_n$ yields (3.2). Since (3.4) is monotonically increasing in $\alpha_n$, (3.2) is the global minimum. ∎

The following result shows that the variance of the STAR-SA gradient is provably less than the variance of the gradients used in RM and KW.

Proposition 3.2.2. Let the conditions of Lemma 3.2.1 hold. If the convex weight in (3.1) is given by (3.2), then $Var(g_{STAR}(x_n)) \le \min\{Var(g_S(x_n)), Var(g_{TAR}(x_n))\}$ for all $n$, where the inequality is strict if $\sigma_f, \sigma_g > 0$.

Proof. Under the conditions of Lemma 3.2.1, the variance of the STAR-SA gradient (3.1) can be written as
$$\sigma_n^2 = \frac{\alpha_n^2}{2c_n^2}\sigma_f^2 + \frac{(1-\alpha_n)^2}{2}\sigma_g^2. \qquad (3.5)$$
Substituting $\alpha_n^*$ from (3.2) into (3.5) for $\alpha_n$ gives
$$\sigma_n^2 = \frac{1}{2c_n^2}\left(\frac{\sigma_g^2c_n^2}{\sigma_f^2 + \sigma_g^2c_n^2}\right)^2\sigma_f^2 + \frac{1}{2}\left(1 - \frac{\sigma_g^2c_n^2}{\sigma_f^2 + \sigma_g^2c_n^2}\right)^2\sigma_g^2 = \frac{\sigma_f^2\sigma_g^2}{2(\sigma_f^2 + \sigma_g^2c_n^2)}.$$
To show that $\sigma_n^2$ is strictly less than $\min\{Var(g_S(x_n)), Var(g_{TAR}(x_n))\}$, consider two cases: 1) $\sigma_n^2 < \frac{\sigma_f^2}{2c_n^2}$ and 2) $\sigma_n^2 < \frac{\sigma_g^2}{2}$. Since $c_n \to 0$ as $n \to \infty$, $\frac{\sigma_f^2}{2c_n^2} \to \infty$, so eventually we will be in the latter case.

Case 1: If $\sigma_n^2 < \frac{\sigma_f^2}{2c_n^2}$, then $\frac{\sigma_f^2\sigma_g^2}{2(\sigma_f^2 + \sigma_g^2c_n^2)} < \frac{\sigma_f^2}{2c_n^2}$, which holds when $0 < \sigma_f^2$.

Case 2: If $\sigma_n^2 < \frac{\sigma_g^2}{2}$, then $\frac{\sigma_f^2\sigma_g^2}{2(\sigma_f^2 + \sigma_g^2c_n^2)} < \frac{\sigma_g^2}{2}$, which holds when $0 < \sigma_g^2c_n^2$.

If $\sigma_f, \sigma_g > 0$, then the inequality is strict; however, if $\sigma_f\sigma_g = 0$, then the same arguments hold with "$<$" replaced by "$\le$" for the non-strict case. Therefore, $\sigma_n^2 \le \min\{\frac{\sigma_f^2}{2c_n^2}, \frac{\sigma_g^2}{2}\} = \min\{Var(g_S(x_n)), Var(g_{TAR}(x_n))\}$, and the inequality is strict for $\sigma_f, \sigma_g > 0$. ∎

The weight $\alpha_n^*$ depends on the function and gradient variances and on the finite difference perturbation size. These values, with the exception of the perturbation size, are generally unknown and must be estimated. Although the orthogonality condition in Lemma 3.2.1 allows the function and its gradient to be correlated, the optimal weight does not depend on the correlation coefficient, due to the homogeneous noise assumption. If we eliminate this assumption to allow for non-homogeneous noise, the weight will depend on the correlation. Also, the STAR-SA gradient estimate converges to the average of the direct gradient estimates, since $\alpha_n^* \to 0$ as $n \to \infty$.
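The following Python sketch evaluates the STAR gradient (3.1) with the optimal weight (3.2) inside an SA loop. The noise levels $\sigma_f$ and $\sigma_g$ are treated as known here; in practice they would have to be estimated, e.g., from replications. The quadratic test function, feasible interval, and gain sequences are illustrative assumptions matching the numerical setup later in the chapter.

import numpy as np

rng = np.random.default_rng(5)

sigma_f, sigma_g = 1.0, 0.1

def f_tilde(x):                      # noisy function value (f(x) = 0.1 x^2)
    return 0.1 * x ** 2 + rng.normal(scale=sigma_f)

def fprime_tilde(x):                 # noisy direct gradient
    return 0.2 * x + rng.normal(scale=sigma_g)

def star_gradient(x, c):
    alpha = sigma_g**2 * c**2 / (sigma_f**2 + sigma_g**2 * c**2)   # (3.2)
    secant = (f_tilde(x + c) - f_tilde(x - c)) / (2.0 * c)
    tangents = (fprime_tilde(x + c) + fprime_tilde(x - c)) / 2.0
    return alpha * secant + (1.0 - alpha) * tangents

x = 30.0
for n in range(1, 1001):
    c_n = 0.1 * (n + 1) ** -0.25
    x = np.clip(x - 10.0 / (n + 1) * star_gradient(x, c_n), -50.0, 50.0)
print(x)                             # estimate of x* = 0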
3.2.1.2 Non-homogeneous Noise

For non-homogeneous noise, $\sigma_f$ and $\sigma_g$ depend on $x$, and we write $Var(\tilde{f}_n^\pm) = \sigma_{f,n^\pm}^2$, $Var(\tilde{g}_n^\pm) = \sigma_{g,n^\pm}^2$, $Corr(\tilde{f}_n^-, \tilde{g}_n^+) = Corr(\tilde{f}_n^+, \tilde{g}_n^-) = 0$, $Corr(\tilde{f}_n^+, \tilde{g}_n^+) = \rho_n^+$, and $Corr(\tilde{f}_n^-, \tilde{g}_n^-) = \rho_n^-$, with $\tilde{f}_n^\pm = \tilde{f}(x_n \pm c_n) = f(x_n \pm c_n) + \epsilon_n^\pm$ and $\tilde{g}_n^\pm = \tilde{f}'(x_n \pm c_n) = f'(x_n \pm c_n) + \delta_n^\pm$, where $\epsilon_n^\pm$ and $\delta_n^\pm$ are Gaussian noise with zero mean.

Lemma 3.2.3. If the conditions of Lemma 3.2.1 hold and $\rho_n^-\sigma_{f,n^-}\sigma_{g,n^-} - \rho_n^+\sigma_{f,n^+}\sigma_{g,n^+} \ge 0$, then $Var(g_{STAR}(x_n))$ is minimized when
$$\alpha_n^* = \frac{c_n^2(\sigma_{g,n^+}^2 + \sigma_{g,n^-}^2) + 2c_n(\rho_n^-\sigma_{f,n^-}\sigma_{g,n^-} - \rho_n^+\sigma_{f,n^+}\sigma_{g,n^+})}{\sigma_{f,n^+}^2 + \sigma_{f,n^-}^2 + c_n^2(\sigma_{g,n^+}^2 + \sigma_{g,n^-}^2) + 2c_n(\rho_n^-\sigma_{f,n^-}\sigma_{g,n^-} - \rho_n^+\sigma_{f,n^+}\sigma_{g,n^+})}. \qquad (3.6)$$

Proof. The variance of the STAR-SA gradient can be written as
$$\sigma_n^2 = Var\left(\frac{\alpha_n}{2c_n}\tilde{f}_n^+ + \frac{1-\alpha_n}{2}\tilde{g}_n^+\right) + Var\left(-\frac{\alpha_n}{2c_n}\tilde{f}_n^- + \frac{1-\alpha_n}{2}\tilde{g}_n^-\right)$$
$$= \frac{\alpha_n^2}{4c_n^2}(\sigma_{f,n^+}^2 + \sigma_{f,n^-}^2) + \frac{(1-\alpha_n)^2}{4}(\sigma_{g,n^+}^2 + \sigma_{g,n^-}^2) + \frac{\alpha_n(1-\alpha_n)}{2c_n}(\rho_n^+\sigma_{f,n^+}\sigma_{g,n^+} - \rho_n^-\sigma_{f,n^-}\sigma_{g,n^-}), \qquad (3.7)$$
where the first equality follows from the independence of the function/gradient estimates at different points. Differentiating (3.7) with respect to the convex weight $\alpha_n$ gives
$$\frac{\partial\sigma_n^2}{\partial\alpha_n} = \frac{\alpha_n}{2c_n^2}(\sigma_{f,n^+}^2 + \sigma_{f,n^-}^2) - \frac{1-\alpha_n}{2}(\sigma_{g,n^+}^2 + \sigma_{g,n^-}^2) + \frac{1-2\alpha_n}{2c_n}(\rho_n^+\sigma_{f,n^+}\sigma_{g,n^+} - \rho_n^-\sigma_{f,n^-}\sigma_{g,n^-}),$$
which is monotonically increasing in $\alpha_n$. Setting $\partial\sigma_n^2/\partial\alpha_n = 0$ and solving for $\alpha_n$ yields (3.6). ∎

We now prove the analogue of Proposition 3.2.2 for the non-homogeneous setting. For notational simplicity, let $F_n = \sigma_{f,n^+}^2 + \sigma_{f,n^-}^2$, $G_n = \sigma_{g,n^+}^2 + \sigma_{g,n^-}^2$, $P_n = \rho_n^-\sigma_{f,n^-}\sigma_{g,n^-} - \rho_n^+\sigma_{f,n^+}\sigma_{g,n^+}$, and $D_n = F_n + c_n^2G_n + 2c_nP_n$. Let the conditions of Lemma 3.2.3 hold. Then the variance of the STAR-SA gradient (3.1) can be written as
$$\sigma_n^2 = \frac{\alpha_n^2}{4c_n^2}F_n + \frac{(1-\alpha_n)^2}{4}G_n - \frac{\alpha_n(1-\alpha_n)}{2c_n}P_n. \qquad (3.8)$$
Substituting the weight $\alpha_n^*$ from (3.6) into each of the three components in (3.8), and noting that $1 - \alpha_n^* = F_n/D_n$, yields
$$\sigma_n^2 = \frac{(c_n^2G_n + 2c_nP_n)^2F_n/c_n^2 + F_n^2G_n - 2(c_n^2G_n + 2c_nP_n)F_nP_n/c_n}{4D_n^2} = \frac{F_n^2G_n + 2c_nG_nP_nF_n + c_n^2G_n^2F_n}{4D_n^2}. \qquad (3.9)$$
To show that $\sigma_n^2$ is strictly less than $\min\{Var(g_S(x_n)), Var(g_{TAR}(x_n))\}$, consider two cases:

Case 1: $Var(g_{STAR}(x_n)) < Var(g_S(x_n))$ requires $\frac{F_n^2G_n + 2c_nG_nP_nF_n + c_n^2G_n^2F_n}{4D_n^2} < \frac{F_n}{4c_n^2}$, which can be simplified to $0 < F_n^3 + 4c_nF_n^2P_n + c_n^2(4P_n^2F_n + F_n^2G_n) + 4c_n^3G_nP_nF_n$, and is satisfied if $F_n > 0$ and $P_n \ge 0$.

Case 2: $Var(g_{STAR}(x_n)) < Var(g_{TAR}(x_n))$ requires $\frac{F_n^2G_n + 2c_nG_nP_nF_n + c_n^2G_n^2F_n}{4D_n^2} < \frac{G_n}{4}$, which simplifies to $0 < c_n^4G_n^3 + 4c_n^3P_nG_n^2 + c_n^2(4P_n^2G_n + F_nG_n^2) + 2c_nF_nP_nG_n$, and holds for $G_n > 0$ and $P_n \ge 0$.

The same argument holds for the non-strict case by replacing "$<$" with "$\le$" in both cases, and both are satisfied for $P_n \ge 0$. Therefore, $\sigma_n^2 \le \min\{Var(g_S(x_n)), Var(g_{TAR}(x_n))\}$, where the inequality is strict for $F_n, G_n > 0$. ∎
3.2.2 Convergence

The next result establishes mean-squared convergence of the STAR-SA algorithm. The proof closely follows [18] and [62].

Theorem 3.2.4. Let $\{x_n\}$ be a sequence following recursion (1.2) with $\Theta = \mathbb{R}$ and $\widehat{\nabla}f(x_n)$ the STAR gradient estimate defined in (3.1). If the following conditions hold:

1. There exist positive sequences $\{a_n\}$ and $\{c_n\}$ such that $\sum_{n=1}^\infty a_n = \infty$, $\sum_{n=1}^\infty a_nc_n < \infty$, $\sum_{n=1}^\infty a_n^2 < \infty$, and $\sum_{n=1}^\infty a_n^2/c_n^2 < \infty$.
2. There exist $K_0, K_1 > 0$ such that $K_0|x - x^*| \le |f'(x)| \le K_1|x - x^*|$ for all $x \in \Theta$.
3. $f'(x)(x - x^*) > 0$ for all $x \in \Theta\setminus\{x^*\}$.
4. For any $c > 0$, $\sigma^2 = \sup_{x\in\Theta}Var[\tilde{f}(x + c) - \tilde{f}(x - c)\,|\,x] < \infty$.
5. $\epsilon_n^+, \epsilon_n^-, \delta_n^+, \delta_n^-$ are i.i.d. with mean zero for all $n$.

Then $x_n$ converges to $x^*$ in $L^2$ as $n \to \infty$.

Proof. Fix positive sequences $\{a_n\}$ and $\{c_n\}$ satisfying condition 1. Let $\{x_n\}$ be a sequence of estimates following recursion (1.2) with $\Theta = \mathbb{R}$ and $\widehat{\nabla}f(x_n) = g_{STAR}(x_n)$. As defined in Section 3.2, $g_{STAR} = \alpha g_S + (1 - \alpha)g_{TAR}$. The mean-squared error of the $n$th estimate can be expressed as
$$E[(x_{n+1} - x^*)^2] = E[(x_n - x^* - a_ng_{STAR}(x_n))^2] = E[(x_n - x^*)^2] - 2a_nE[(x_n - x^*)g_{STAR}(x_n)] + a_n^2E[g_{STAR}^2(x_n)]. \qquad (3.10)$$
We now establish the bounds needed to generate an upper bound for (3.10). By the mean value theorem, there exist $\xi_1$ and $\xi_2$ with $0 \le \xi_1, \xi_2 \le 1$ such that $f(x_n + c_n) = f(x_n) + f'(x_n + \xi_1c_n)c_n$ and $f(x_n - c_n) = f(x_n) - f'(x_n - \xi_2c_n)c_n$. Then
$$E[g_S(x_n)|x_n] = \frac{f(x_n + c_n) - f(x_n - c_n)}{2c_n} = \frac{f'(x_n + \xi_1c_n) + f'(x_n - \xi_2c_n)}{2} \qquad (3.11)$$
$$= (x_n - x^*)\frac{1}{2}\left[\frac{f'(x_n + \xi_1c_n)}{x_n - x^* + \xi_1c_n} + \frac{f'(x_n - \xi_2c_n)}{x_n - x^* - \xi_2c_n}\right] + \frac{c_n}{2}\left[\xi_1\frac{f'(x_n + \xi_1c_n)}{x_n - x^* + \xi_1c_n} - \xi_2\frac{f'(x_n - \xi_2c_n)}{x_n - x^* - \xi_2c_n}\right]$$
$$\le |x_n - x^*|K_1 + c_nK_1, \qquad (3.12)$$
where the last inequality follows from
$$K_0 \le \frac{f'(x)}{x - x^*} = \frac{|f'(x)|}{|x - x^*|} \le K_1, \qquad (3.13)$$
a consequence of combining conditions 2 and 3. Multiplying the representation in (3.12) by $(x_n - x^*)$ and applying (3.13) yields
$$E[(x_n - x^*)g_S(x_n)|x_n] \ge (x_n - x^*)^2K_0 - |x_n - x^*|c_nK_1. \qquad (3.14)$$
Using the conditional variance decomposition, condition 4, and the inequality $|a + b|^r \le 2^{r-1}(|a|^r + |b|^r)$ for $r > 1$ applied to the square of (3.12) yields
$$E[g_S(x_n)^2|x_n] = Var(g_S(x_n)|x_n) + E[g_S(x_n)|x_n]^2 \le \frac{\sigma^2}{4c_n^2} + 2|x_n - x^*|^2K_1^2 + 2c_n^2K_1^2. \qquad (3.15)$$
The bounds for $g_{TAR}(x_n)$ are identical to those for $g_S(x_n)$, since $g_{TAR}(x_n)$ equals the right-hand side of (3.11) with $\xi_1 = \xi_2 = 1$. Using (3.12), (3.14), and (3.15) yields
$$E[(x_{n+1} - x^*)^2|x_n] \le (x_n - x^*)^2 - 2a_n[(x_n - x^*)^2K_0 - |x_n - x^*|c_nK_1] + 4a_n^2\left(\frac{\sigma^2}{4c_n^2} + 2|x_n - x^*|^2K_1^2 + 2c_n^2K_1^2\right)$$
$$\le (x_n - x^*)^2(1 - 2a_nK_0 + 8a_n^2K_1^2) + 2a_nc_nK_1|x_n - x^*| + \frac{a_n^2\sigma^2}{c_n^2} + 8a_n^2c_n^2K_1^2. \qquad (3.16)$$
Define $b_n = E[(x_n - x^*)^2]$. Taking the expectation of (3.16) and applying the inequalities $E[\sqrt{b_n}] \le \sqrt{E[b_n]}$ and $\sqrt{b_n} \le b_n + 1$ gives
$$b_{n+1} \le b_n(1 - 2a_nK_0 + 8a_n^2K_1^2) + 2a_nc_nK_1\sqrt{b_n} + \frac{a_n^2\sigma^2}{c_n^2} + 8a_n^2c_n^2K_1^2$$
$$\le b_n(1 - 2a_nK_0 + 2a_nc_nK_1 + 8a_n^2K_1^2) + 2a_nc_nK_1 + \frac{a_n^2\sigma^2}{c_n^2} + 8a_n^2c_n^2K_1^2 = b_nR_n + T_n,
where $R_n = 1 - 2a_nK_0 + 2a_nc_nK_1 + 8a_n^2K_1^2 \ge 1 - 2a_nK_0 + 8a_n^2K_1^2 > 0$; the lower bound is quadratic in $a_n$, and its discriminant $4K_0^2 - 32K_1^2$ is negative if $K_0 < 2\sqrt{2}K_1$, which holds by condition 2. The coefficient of $a_n$ in $R_n$, $2c_nK_1 - 2K_0$, is negative for sufficiently large $n$, since $c_n \to 0$ as $n \to \infty$. By condition 1, $\prod_{n=1}^\infty R_n < \infty$ and $\sum_{n=1}^\infty T_n < \infty$. Since $R_i > 0$ for all $i$, we have the recursion
$$b_{n+1} \le b_1\prod_{i=1}^n R_i + \sum_{i=2}^{n-1}\left(T_i\prod_{j=i+1}^n R_j\right) + T_n.$$
As a result, all partial products are uniformly bounded, so $b_n \le M < \infty$. Applying $b_n \le M$ to the terms arising from (3.14) and (3.15) yields
$$b_{n+1} \le b_n(1 - 2a_nK_0) + 8a_n^2K_1^2M + 2a_nc_nK_1\sqrt{M} + \frac{a_n^2\sigma^2}{c_n^2} + 8a_n^2c_n^2K_1^2 = b_nR'_n + T'_n.$$
Since $a_n \to 0$, there exists $n_0$ such that $a_n < 1/(2K_0)$ for all $n \ge n_0$, which implies $R'_n > 0$. Then we have the recursion
$$b_{n+1} \le b_{n_0}\prod_{i=n_0}^n R'_i + \sum_{i=n_0}^{n-1}\left(T'_i\prod_{j=i+1}^n R'_j\right) + T'_n$$
for $n = n_0 + 1, n_0 + 2, \ldots$. Now $\prod_{i=n_0}^n R'_i \to 0$ as $n \to \infty$, since $\sum a_n = \infty$. By condition 1 and Kronecker's lemma, $\sum_{i=n_0}^{n-1}(T'_i\prod_{j=i+1}^n R'_j) \to 0$ as $n \to \infty$. Therefore, for $\epsilon > 0$, there exists $n_1$ such that $\prod_{j=n_0}^n R'_j < \frac{\epsilon}{3M^2}$ for all $n \ge n_1$, there exists $n_2$ such that $\sum_{i=n_0}^{n-1}(T'_i\prod_{j=i+1}^n R'_j) < \epsilon/3$ for all $n \ge n_2$, and there exists $n_3$ such that $T'_n < \epsilon/3$ for all $n \ge n_3$. Hence, for all $n > \max(n_1, n_2, n_3) > n_0$,
$$b_{n+1} \le M^2\cdot\frac{\epsilon}{3M^2} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon. \qquad ∎$$

We now investigate the exact MSE for quadratic functions of the form $f(x) = ax^2$, where $a > 0$ and $x^* = 0$. The finite-time MSE for quadratic functions of this form can be computed explicitly, which allows us to compare the performance of different algorithms analytically.

Proposition 3.2.5. For $f(x) = ax^2$, $a > 0$, the respective MSEs after $n$ iterations following recursion (1.2) can be expressed as
$$MSE_{RM} = A_n + \sigma_g^2\sum_{k=1}^{n-2}a_k^2\prod_{j=k+1}^{n-1}(1 - 2a_ja)^2 + \sigma_g^2a_{n-1}^2,$$
$$MSE_{KW} = A_n + \frac{\sigma_f^2}{2}\sum_{k=1}^{n-2}\frac{a_k^2}{c_k^2}\prod_{j=k+1}^{n-1}(1 - 2a_ja)^2 + \frac{\sigma_f^2a_{n-1}^2}{2c_{n-1}^2},$$
$$MSE_{STAR} = A_n + \sum_{k=1}^{n-2}\frac{a_k^2}{2}\left(\frac{\alpha_k^2\sigma_f^2}{c_k^2} + (1 - \alpha_k)^2\sigma_g^2\right)\prod_{j=k+1}^{n-1}(1 - 2a_ja)^2 + \frac{a_{n-1}^2}{2}\left(\frac{\alpha_{n-1}^2\sigma_f^2}{c_{n-1}^2} + (1 - \alpha_{n-1})^2\sigma_g^2\right),$$
where $A_n = x_0^2\prod_{i=1}^{n-1}(1 - 2a_ia)^2$ and $x_0$ is the initial iterate.

Proof. We can express the $n$th estimates of RM, KW, and STAR-SA, respectively, as
$$x_n^{RM} = x_{n-1} - a_{n-1}(2ax_{n-1} + \delta_{n-1}) = x_{n-1}(1 - 2a_{n-1}a) - a_{n-1}\delta_{n-1},$$
$$x_n^{KW} = x_{n-1} - a_{n-1}\left(2ax_{n-1} + \frac{\epsilon_{n-1}^+ - \epsilon_{n-1}^-}{2c_{n-1}}\right) = x_{n-1}(1 - 2a_{n-1}a) - a_{n-1}\frac{\epsilon_{n-1}^+ - \epsilon_{n-1}^-}{2c_{n-1}},$$
$$x_n^{STAR} = x_{n-1}(1 - 2a_{n-1}a) - a_{n-1}\left(\alpha_{n-1}\frac{\epsilon_{n-1}^+ - \epsilon_{n-1}^-}{2c_{n-1}} + (1 - \alpha_{n-1})\frac{\delta_{n-1}^+ + \delta_{n-1}^-}{2}\right).$$
The recursions can be rewritten in the general form $x_n = x_{n-1}B_{n-1} + C_{n-1}$, where $B_{n-1} = 1 - 2a_{n-1}a$ and $C_{n-1}$ depends on the algorithm. Unrolling the recursion,
$$x_n = x_{n-1}B_{n-1} + C_{n-1} = x_{n-2}B_{n-2}B_{n-1} + C_{n-2}B_{n-1} + C_{n-1} = \cdots = x_0\prod_{i=0}^{n-1}B_i + \sum_{j=0}^{n-2}C_j\prod_{k=j+1}^{n-1}B_k + C_{n-1},$$
so the MSE of the $n$th estimate can be written as
$$MSE = E[(x_n - x^*)^2] = E[x_n^2] = x_0^2\prod_{i=0}^{n-1}B_i^2 + \sum_{j=0}^{n-2}E[C_j^2]\prod_{k=j+1}^{n-1}B_k^2 + E[C_{n-1}^2],$$
where the cross terms vanish because the noise terms are independent with zero mean, and
$$E[C_k^2] = \begin{cases} a_k^2\sigma_g^2 & \text{for RM};\\[2pt] \dfrac{\sigma_f^2a_k^2}{2c_k^2} & \text{for KW};\\[2pt] \dfrac{a_k^2}{2}\left(\dfrac{\alpha_k^2\sigma_f^2}{c_k^2} + (1 - \alpha_k)^2\sigma_g^2\right) & \text{for STAR},\end{cases}$$
which concludes the proof. ∎

Proposition 3.2.6. If $\alpha_k^2\sigma_f^2 < \sigma_g^2c_k^2$ for all $k$, then $MSE_{STAR} < MSE_{RM}$. If $2\sigma_g^2c_k^2 < \sigma_f^2$ for all $k$, then $MSE_{RM} < MSE_{KW}$. If $2\sigma_g^2c_k^2 < \sigma_f^2$ and $\alpha_k^2\sigma_f^2 < \sigma_g^2c_k^2$ for all $k$, then $MSE_{STAR} < MSE_{KW}$.

Proof. From Proposition 3.2.5,
$$MSE_{STAR} < MSE_{RM} \iff \frac{a_k^2}{2}\left(\frac{\alpha_k^2\sigma_f^2}{c_k^2} + (1 - \alpha_k)^2\sigma_g^2\right) < \sigma_g^2a_k^2 \iff \alpha_k^2\sigma_f^2 < \big(2 - (1 - \alpha_k)^2\big)\sigma_g^2c_k^2,$$
which is implied by $\alpha_k^2\sigma_f^2 < \sigma_g^2c_k^2$, since $1 \le 2 - (1 - \alpha_k)^2$ for $\alpha_k \in [0, 1]$. Similarly,
$$MSE_{RM} < MSE_{KW} \iff \sigma_g^2a_k^2 < \frac{\sigma_f^2a_k^2}{2c_k^2} \iff 2\sigma_g^2c_k^2 < \sigma_f^2$$
for all $k$. If $\alpha_k^2\sigma_f^2 < \sigma_g^2c_k^2$ and $2\sigma_g^2c_k^2 < \sigma_f^2$, then $MSE_{STAR} < MSE_{KW}$. ∎
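The closed-form expressions in Proposition 3.2.5 are straightforward to evaluate numerically. The Python sketch below does so; the parameter values, the gain sequences $a_k = 10(k+1)^{-1}$ and $c_k = 0.1(k+1)^{-1/4}$, and the use of the weight (3.2) for STAR are illustrative assumptions.

import numpy as np

a, x0, n = 0.1, 30.0, 1000
sigma_f, sigma_g = 1.0, 0.1
a_k = lambda k: 10.0 / (k + 1)
c_k = lambda k: 0.1 * (k + 1) ** -0.25

def mse(noise_var):
    """MSE after n iterations; noise_var(k) = E[C_k^2] for the algorithm."""
    B2 = np.array([(1.0 - 2.0 * a_k(i) * a) ** 2 for i in range(1, n)])
    total = x0 ** 2 * B2.prod()                  # A_n term
    for k in range(1, n - 1):
        total += noise_var(k) * B2[k:].prod()    # product over j = k+1..n-1
    return total + noise_var(n - 1)

var_rm = lambda k: sigma_g ** 2 * a_k(k) ** 2
var_kw = lambda k: sigma_f ** 2 * a_k(k) ** 2 / (2.0 * c_k(k) ** 2)
def var_star(k):
    alpha = sigma_g**2 * c_k(k)**2 / (sigma_f**2 + sigma_g**2 * c_k(k)**2)
    return a_k(k) ** 2 / 2.0 * (alpha ** 2 * sigma_f ** 2 / c_k(k) ** 2
                                + (1 - alpha) ** 2 * sigma_g ** 2)

print(mse(var_rm), mse(var_kw), mse(var_star))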
The exact MSE of the estimators resulting from each of the algorithms has three components. The first term is identical for all three algorithms and is the only component that contains the initial value, so it does not affect our comparison of the MSE across algorithms in the unconstrained quadratic case. From Proposition 3.2.6, if certain conditions are satisfied, the MSE resulting from STAR-SA is less than the MSE resulting from RM and KW, for all steepness levels and the same number of iterations.

3.2.3 Numerical Experiments

We conduct two sets of numerical experiments to compare the performance of the STAR-SA algorithm against the classical RM and KW methods under various combinations of noise levels $\sigma_f$ and $\sigma_g$ [11]. We implement all three algorithms and compute the finite-time MSE of the estimate $x_N$ for quadratic functions of the form $f(x) = ax^2$, $a > 0$, with $\Theta = [-50, 50]$. The gain sequence is chosen as $a_n = \theta_a(n+1)^{-1}$ and the finite difference perturbation sequence as $c_n = \theta_c(n+1)^{-1/4}$. To illustrate the performance of the proposed algorithm, we choose parameter values $\theta_a \in \{1, 10, 100\}$, $\theta_c \in \{0.1, 1.0\}$, and $N \in \{100, 1000, 10000\}$, as well as steepness levels $a \in \{10^k \mid k = -3, -2.5, \ldots, 1.5, 2\}$ and initial values $x_0 \in \{-50 + 5k \mid k = 1, \ldots, 10\}$. In a practical setting, the noise levels of the function and its gradient could differ, so we consider a variety of noise level combinations, $\sigma_f \in \{10^k \mid k = -3, \ldots, 1\}$ and $\sigma_g \in \{10^k \mid k = -3, \ldots, 1\}$. Although we implemented RM, KW, and STAR-SA for all of the parameter settings mentioned above, we discuss only two representative subsets of our results. For both sets of numerical experiments, we choose $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, and $N = 1000$. The MSE of the 1000th iterate is computed for the algorithms based on 2000 sample paths. In this section, MSE refers to the MSE of the 1000th iterate for STAR-SA and KW, and of the 2000th iterate for RM.

3.2.3.1 Experiment 1: Vary Initial Value

For the first set of numerical experiments, we fix the steepness level and vary the initial value $x_0$ for different combinations of noise levels $\sigma_f$ and $\sigma_g$. We present only the results for $f(x) = 0.1x^2$ with $\sigma_f \in \{0.001, 0.1, 1.0\}$ and $\sigma_g \in \{0.001, 0.1, 1.0\}$, to illustrate the potential gains of STAR-SA. The STAR-SA algorithm outperforms KW and RM in six of the nine combinations for all initial values, which can be seen in the last two rows of Figure 3.2 (i.e., 3.2d, 3.2e, 3.2f, 3.2g, 3.2h, and 3.2i). The commonality among these six cases is the higher function noise, i.e., $\sigma_f = 0.1, 1.0$. With the exception of Figure 3.2f, RM performs significantly better than KW, but STAR-SA has the lowest MSE among the three SA algorithms. Figure 3.2f depicts a case where RM and KW perform similarly, with intertwining lines, but combining the secant and tangents averaged gradients in STAR-SA results in a much lower MSE.

In two of the remaining three cases, STAR-SA does not necessarily outperform RM or KW, but it performs as well as the better of the two algorithms, as seen in Figures 3.2b and 3.2c. There, the function noise level is very low and the gradient noise is higher, i.e., $\sigma_f = 0.001$ and $\sigma_g > \sigma_f$, which increases the accuracy of the indirect gradient and has the opposite effect on the direct gradient. Not surprisingly, KW outperforms RM in these two cases, but the performance of STAR-SA is on par with KW.

Figure 3.2: MSE for $f(x) = 0.1x^2$, $\Theta = [-50, 50]$, $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, $N = 1000$; panels (a)-(i) correspond to $\sigma_f \in \{0.001, 0.1, 1.0\}$ crossed with $\sigma_g \in \{0.001, 0.1, 1.0\}$.
From the eight figures discussed, it appears that integrating indirect and direct gradients either improves the SA performance, sometimes significantly, or in the worst case does not hurt it. The only case where the MSE of the STAR-SA algorithm is not approximately less than or equal to the MSE of KW and RM is when both noise levels are very low, i.e., $\sigma_f = \sigma_g = 0.001$, shown in Figure 3.2a. In this instance, RM performs better than STAR-SA except when the initial value is close to the optimum $x^* = 0$. In fact, the MSE of STAR-SA decreases as the initial value $x_0$ approaches $x^*$.

3.2.3.2 Experiment 2: Vary Steepness Level

In practice, the geometry of the function is unknown, so our second set of numerical experiments investigates the performance of the STAR-SA algorithm compared to RM and KW as a function of the steepness level $a$ for different combinations of noise. We set the initial value far from the optimum and consider parameters identical to those of the first experiment, i.e., $x_0 = 30$, $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, and vary the steepness level $a$. We implement RM, KW, and STAR-SA and compute the MSE for $N = 1000$ and all combinations of $\sigma_f \in \{0.001, 0.1, 1.0\}$ and $\sigma_g \in \{0.001, 0.1, 1.0\}$, illustrated in Figure 3.3.

Figure 3.3: MSE for $f(x) = ax^2$, $\Theta = [-50, 50]$, $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, $N = 1000$; panels (a)-(i) correspond to $\sigma_f \in \{0.001, 0.1, 1.0\}$ crossed with $\sigma_g \in \{0.001, 0.1, 1.0\}$.

For a fixed step size $\{a_n\}$, the MSE is much higher for both flat and steep quadratic functions across all algorithms, most likely because the selected step size is too small for extremely flat functions and too large for steep ones. If the step size is not chosen appropriately, all three algorithms perform equally poorly, with the exception of the case where both noise levels are low, i.e., $\sigma_f = \sigma_g = 0.001$, where KW results in a lower MSE. The fixed step size and stopping time only appear to be appropriate for $a = 10^{-1}, 10^{-1/2}$. For all noise levels under these two steepness levels, the MSE of STAR-SA is either less than or equal to that of RM and KW. The only instance where combining direct and indirect gradients is inferior is when both noise levels are low, i.e., $\sigma_f = \sigma_g = 0.001$, for $a = 10^{-1}$. Otherwise, STAR-SA shows promise, since it either outperforms RM and KW or at least never performs worse than either. When the variance of the function is high, i.e., $\sigma_f \in \{0.1, 1.0\}$, KW performs poorly across steepness levels.

3.2.3.3 Results Summary

The following conclusions are specific to the fixed parameter setting $f(x) = ax^2$, $a_n = 10n^{-1}$, $c_n = 0.1n^{-1/4}$.

• If the function is neither too steep nor too flat:
  – All algorithms are insensitive to the initial start value.
  – If $\sigma_f$ is large, STAR-SA has a significantly lower MSE than KW.
  – If $\sigma_f$ and $\sigma_g$ are high, the MSE of STAR-SA is lower than that of RM, but the difference is not as prominent as the difference compared to KW.
• For very flat functions:
  – If $\sigma_f < \sigma_g$, then $MSE_{STAR} \approx MSE_{KW}$ and $MSE_{KW} < MSE_{RM}$, and this gap decreases as the function flattens.
  – If $\sigma_g < \sigma_f$, or if $\sigma_f$ and $\sigma_g$ are both high, then $MSE_{STAR} \approx MSE_{RM}$ and $MSE_{RM} < MSE_{KW}$, and this difference shrinks as the steepness level decreases.
• For very steep functions, the performances of the algorithms are equally poor.
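For reference, the following is a minimal Monte Carlo harness along the lines of these experiments, reusing the stated setup $f(x) = ax^2$, $a_n = 10(n+1)^{-1}$, $c_n = 0.1(n+1)^{-1/4}$, $\Theta = [-50, 50]$; the number of sample paths is reduced purely for illustration, and the noise levels are one assumed combination.

import numpy as np

rng = np.random.default_rng(6)
a, sigma_f, sigma_g, x0, N = 0.1, 1.0, 0.1, 30.0, 1000

def grad(x, c, method):
    if method == "RM":
        return 2*a*x + sigma_g*rng.standard_normal()
    fp = a*(x + c)**2 + sigma_f*rng.standard_normal()
    fm = a*(x - c)**2 + sigma_f*rng.standard_normal()
    secant = (fp - fm) / (2*c)
    if method == "KW":
        return secant
    gp = 2*a*(x + c) + sigma_g*rng.standard_normal()
    gm = 2*a*(x - c) + sigma_g*rng.standard_normal()
    alpha = sigma_g**2*c**2 / (sigma_f**2 + sigma_g**2*c**2)  # weight (3.2)
    return alpha*secant + (1 - alpha)*(gp + gm)/2             # STAR

def mse(method, paths=200):
    err = 0.0
    for _ in range(paths):
        x = x0
        for n in range(1, N + 1):
            c = 0.1*(n + 1)**-0.25
            x = np.clip(x - 10/(n + 1)*grad(x, c, method), -50, 50)
        err += x**2 / paths          # x* = 0 for this quadratic
    return err

for method in ("RM", "KW", "STAR"):
    print(method, mse(method))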
Overall, from both sets of numerical tests, the STAR-SA algorithm either performs significantly better than both KW and RM, or its MSE is approximately equal to that of the better of the two. Therefore, the new STAR-SA algorithm is competitive with the classical KW and RM methods.

3.3 STAR-SPSA

We now extend STAR-SA to the multidimensional case. The direct extension using finite differences is computationally inefficient for high-dimensional problems, so instead we consider the SP gradient, which requires only the two performance estimates $\tilde{f}(x_n + c_n\Delta_n)$ and $\tilde{f}(x_n - c_n\Delta_n)$ for each indirect gradient (Secant). The available direct gradient estimates $\nabla\tilde{f}(x_n + c_n\Delta_n)$ and $\nabla\tilde{f}(x_n - c_n\Delta_n)$ can be averaged to provide another gradient estimate (Tangents AveRaged). For multivariate stochastic optimization problems, we present the Secant-Tangents AveRaged simultaneous perturbation stochastic approximation (STAR-SPSA) gradient estimate, which is a convex combination of an SP gradient estimate and an average of two associated direct gradients. For each dimension $i \in \{1, \ldots, d\}$, the $i$th component of the STAR-SPSA gradient can be written as
$$g_{STAR,i}(x_n) = \alpha_{n,i}\frac{\tilde{f}(x_n + c_n\Delta_n) - \tilde{f}(x_n - c_n\Delta_n)}{2c_n\Delta_{n,i}} + (1 - \alpha_{n,i})\frac{\nabla\tilde{f}_i(x_n + c_n\Delta_n) + \nabla\tilde{f}_i(x_n - c_n\Delta_n)}{2}, \qquad (3.17)$$
where $\Delta_n = (\Delta_{n,1}, \ldots, \Delta_{n,d})$, $\alpha_{n,i} \in [0, 1]$, and $c_n \in \mathbb{R}^+$ for $n \in \mathbb{N}$, and $c_n \to 0$ as $n \to \infty$. Let $\tilde{f}(x_n \pm c_n\Delta_n) = f(x_n \pm c_n\Delta_n) + \epsilon_n^\pm$ and $\nabla\tilde{f}_i(x_n \pm c_n\Delta_n) = \nabla f_i(x_n \pm c_n\Delta_n) + \delta_{n,i}^\pm$, where $\epsilon_n^\pm$ and $\delta_{n,i}^\pm$ are Gaussian noise with zero mean for $i = 1, \ldots, d$. For notational simplicity, let $Var(\tilde{f}(x_n \pm c_n\Delta_n)) = \sigma_f^2$, $Var(\nabla\tilde{f}_i(x_n \pm c_n\Delta_n)) = \sigma_{g,i}^2$, $Corr(\tilde{f}(x_n \pm c_n\Delta_n), \nabla\tilde{f}_i(x_n \pm c_n\Delta_n)) = \rho_i$, $Var(g_{SPSA,i}(x_n)) = Var\big(\frac{\tilde{f}(x_n + c_n\Delta_n) - \tilde{f}(x_n - c_n\Delta_n)}{2c_n\Delta_{n,i}}\big) = \frac{\sigma_f^2}{2c_n^2\Delta_{n,i}^2}$, $Var(g_{TAR,i}(x_n)) = Var\big(\frac{\nabla\tilde{f}_i(x_n + c_n\Delta_n) + \nabla\tilde{f}_i(x_n - c_n\Delta_n)}{2}\big) = \frac{\sigma_{g,i}^2}{2}$, and $Var(g_{STAR,i}(x_n)) = \sigma_{n,i}^2$ for $i = 1, \ldots, d$.

Analogous to Lemma 3.2.1 and Proposition 3.2.2, the next two results provide a variance-minimizing deterministic weight sequence for the STAR-SPSA gradient that guarantees its variance is strictly less than that of RM and SPSA.

3.3.1 Optimal Deterministic Weights

Lemma 3.3.1. For $g_{STAR}$ defined in (3.17), assume $\epsilon_n^+ \perp \epsilon_n^-$ and $\delta_{n,i}^+ \perp \delta_{n,i}^-$ for all $n$ and $i = 1, \ldots, d$. Then $Var(g_{STAR,i}(x_n))$ is minimized when
$$\alpha_{n,i}^* = \frac{\sigma_{g,i}^2c_n^2\Delta_{n,i}^2}{\sigma_f^2 + \sigma_{g,i}^2c_n^2\Delta_{n,i}^2}. \qquad (3.18)$$

Proof. For notational simplicity, let $\tilde{f}_n^\pm = \tilde{f}(x_n \pm c_n\Delta_n)$ and $\tilde{g}_{n,i}^\pm = \nabla\tilde{f}_i(x_n \pm c_n\Delta_n)$. By the orthogonality assumption, the $i$th component of the variance can be written as
$$\sigma_{n,i}^2 = Var\left(\frac{\alpha_{n,i}}{2c_n\Delta_{n,i}}\tilde{f}_n^+ + \frac{1 - \alpha_{n,i}}{2}\tilde{g}_{n,i}^+\right) + Var\left(-\frac{\alpha_{n,i}}{2c_n\Delta_{n,i}}\tilde{f}_n^- + \frac{1 - \alpha_{n,i}}{2}\tilde{g}_{n,i}^-\right)$$
$$= \frac{\alpha_{n,i}^2}{2c_n^2\Delta_{n,i}^2}\sigma_f^2 + \frac{(1 - \alpha_{n,i})^2}{2}\sigma_{g,i}^2, \qquad (3.19)$$
where the first equality follows from the independence of the function/gradient estimates at different points, and the cross-covariance terms cancel as in (3.3). Differentiating (3.19) with respect to $\alpha_{n,i}$ gives
$$\frac{\partial\sigma_{n,i}^2}{\partial\alpha_{n,i}} = \frac{\alpha_{n,i}}{c_n^2\Delta_{n,i}^2}\sigma_f^2 - (1 - \alpha_{n,i})\sigma_{g,i}^2. \qquad (3.20)$$
Solving for the zero of (3.20) yields (3.18), which is a global minimum since (3.20) is monotonically increasing in $\alpha_{n,i}$. ∎
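The following Python sketch evaluates one STAR-SPSA gradient (3.17) with symmetric Bernoulli perturbations; for $\Delta_{n,i} = \pm 1$ we have $\Delta_{n,i}^2 = 1$, so the weight (3.18) is the same for every component. The quadratic test function and noise levels are assumptions.

import numpy as np

rng = np.random.default_rng(7)

d, c_n = 5, 0.1
sigma_f, sigma_g = 1.0, 0.1

def f_tilde(x):                          # noisy value of f(x) = sum(x^2)
    return np.sum(x ** 2) + sigma_f * rng.standard_normal()

def grad_tilde(x):                       # noisy direct gradient
    return 2.0 * x + sigma_g * rng.standard_normal(d)

def star_spsa_gradient(x, alpha):
    delta = rng.choice([-1.0, 1.0], size=d)          # symmetric Bernoulli
    secant = (f_tilde(x + c_n * delta)
              - f_tilde(x - c_n * delta)) / (2.0 * c_n * delta)
    tangents = (grad_tilde(x + c_n * delta)
                + grad_tilde(x - c_n * delta)) / 2.0
    return alpha * secant + (1.0 - alpha) * tangents

x = np.full(d, 3.0)
alpha = sigma_g**2 * c_n**2 / (sigma_f**2 + sigma_g**2 * c_n**2)  # (3.18)
print(star_spsa_gradient(x, alpha))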
Proposition 3.3.2. Let $\widehat{\nabla}f(x_n)$ be defined as in (3.17) with the convex weight defined in (3.18). If the conditions of Lemma 3.3.1 hold, then $Var(g_{STAR,i}(x_n)) \le \min\{Var(g_{SPSA,i}(x_n)), Var(g_{TAR,i}(x_n))\}$ for $i = 1, \ldots, d$ and all $n$, where the inequality is strict if $0 < \sigma_f^2$ and $0 < \sigma_{g,i}^2c_n^2\Delta_{n,i}^2$.

Proof. By the conditions of Lemma 3.3.1, the $i$th component of the variance can be written as
$$\sigma_{n,i}^2 = \frac{\alpha_{n,i}^2}{2c_n^2\Delta_{n,i}^2}\sigma_f^2 + \frac{(1 - \alpha_{n,i})^2}{2}\sigma_{g,i}^2. \qquad (3.21)$$
Substituting $\alpha_{n,i}^*$ from (3.18) into (3.21) gives
$$\sigma_{n,i}^2 = \frac{1}{2c_n^2\Delta_{n,i}^2}\left(\frac{\sigma_{g,i}^2c_n^2\Delta_{n,i}^2}{\sigma_f^2 + \sigma_{g,i}^2c_n^2\Delta_{n,i}^2}\right)^2\sigma_f^2 + \frac{1}{2}\left(1 - \frac{\sigma_{g,i}^2c_n^2\Delta_{n,i}^2}{\sigma_f^2 + \sigma_{g,i}^2c_n^2\Delta_{n,i}^2}\right)^2\sigma_{g,i}^2 = \frac{\sigma_f^2\sigma_{g,i}^2}{2(\sigma_f^2 + \sigma_{g,i}^2c_n^2\Delta_{n,i}^2)}.$$
To show that $\sigma_{n,i}^2$ is strictly less than $\min\{Var(g_{SPSA,i}(x_n)), Var(g_{TAR,i}(x_n))\}$, we consider two cases: 1) $\sigma_{n,i}^2 < \frac{\sigma_f^2}{2c_n^2\Delta_{n,i}^2}$ and 2) $\sigma_{n,i}^2 < \frac{\sigma_{g,i}^2}{2}$. Since $c_n \to 0$ as $n \to \infty$, $\frac{\sigma_f^2}{2c_n^2\Delta_{n,i}^2} \to \infty$, so eventually we will be in the latter case.

Case 1: If $\sigma_{n,i}^2 < \frac{\sigma_f^2}{2c_n^2\Delta_{n,i}^2}$, then $\frac{\sigma_f^2\sigma_{g,i}^2}{2(\sigma_f^2 + \sigma_{g,i}^2c_n^2\Delta_{n,i}^2)} < \frac{\sigma_f^2}{2c_n^2\Delta_{n,i}^2}$, which holds when $0 < \sigma_f^2$.

Case 2: If $\sigma_{n,i}^2 < \frac{\sigma_{g,i}^2}{2}$, then $\frac{\sigma_f^2\sigma_{g,i}^2}{2(\sigma_f^2 + \sigma_{g,i}^2c_n^2\Delta_{n,i}^2)} < \frac{\sigma_{g,i}^2}{2}$, which holds when $0 < \sigma_{g,i}^2c_n^2\Delta_{n,i}^2$.

If $0 < \sigma_f^2$ and $0 < \sigma_{g,i}^2c_n^2\Delta_{n,i}^2$, then the strict inequality holds; if either $\sigma_f = 0$ or $\sigma_{g,i}^2c_n^2\Delta_{n,i}^2 = 0$, the same argument holds with "$<$" replaced by "$\le$", resulting in a non-strict inequality. Therefore, $\sigma_{n,i}^2 \le \min\{\frac{\sigma_f^2}{2c_n^2\Delta_{n,i}^2}, \frac{\sigma_{g,i}^2}{2}\} = \min\{Var(g_{SPSA,i}(x_n)), Var(g_{TAR,i}(x_n))\}$, where the inequality is strict for $0 < \sigma_f^2$ and $0 < \sigma_{g,i}^2c_n^2\Delta_{n,i}^2$. ∎

As in the STAR-SA result, the weight $\alpha_{n,i}^*$ for STAR-SPSA is independent of the correlation coefficient because of the homogeneous noise assumption.

3.3.2 Convergence

We define the bias and error of the STAR-SPSA gradient estimate respectively as
$$b_n(x_n) = E[g_{STAR}(x_n) - \nabla f(x_n)|x_n], \qquad e_n(x_n) = g_{STAR}(x_n) - E[g_{STAR}(x_n)|x_n],$$
where $b_n(x_n) = (b_{n,1}(x_n), \ldots, b_{n,d}(x_n))$ and $e_n(x_n) = (e_{n,1}(x_n), \ldots, e_{n,d}(x_n))$. The following lemma is used in the convergence proof of Theorem 3.3.4.

Lemma 3.3.3. Suppose $\{x_n\}$ is generated under recursion (1.2) with $\Theta = \mathbb{R}^d$ using the STAR-SPSA gradient estimate defined in (3.17). If $\{\Delta_{n,i}\}_{i=1}^d$ are i.i.d., symmetrically distributed with mean zero, uniformly bounded with finite inverse moments, and $f(\cdot)$ is three-times continuously differentiable with $f^{(3)}(\cdot)$ uniformly bounded, then $b_n(x_n) \to 0$ as $n \to \infty$.

Proof. We use proof techniques similar to those in [55], applying results from [39]. By assumption, for each $n \in \mathbb{N}$ there exist positive constants $K_1$, $K_2$, and $K_3$ such that $|\Delta_{n,l}| \le K_1$ a.s. and $E[|\Delta_{n,l}^{-1}|] \le K_2$ for $l = 1, \ldots, d$, and $|f_{i,j,k}^{(3)}(\cdot)| \le K_3$ for all $i, j, k$. Write
$$b_{n,i}(x_n) = \alpha_{n,i}E[g_{SPSA,i}(x_n) - \nabla f_i(x_n)|x_n] + (1 - \alpha_{n,i})E[g_{TAR,i}(x_n) - \nabla f_i(x_n)|x_n].$$
Since $E[\epsilon_n^+ - \epsilon_n^-|x_n] = 0$ a.s. and $f^{(3)}(\cdot)$ exists and is continuous, Taylor series expansions give
$$f(x_n \pm c_n\Delta_n) = f(x_n) \pm c_n\langle f'(x_n), \Delta_n\rangle + \frac{c_n^2}{2}\langle f''(x_n)\Delta_n, \Delta_n\rangle \pm \frac{c_n^3}{6}f^{(3)}(t_n^\pm)[\Delta_n\otimes\Delta_n\otimes\Delta_n], \qquad (3.22)$$
where $t_n^+$ and $t_n^-$ lie on the line segments between $x_n$ and $x_n \pm c_n\Delta_n$, respectively, and $\otimes$ denotes the Kronecker product. Using (3.22),
$$E[g_{SPSA,i}(x_n) - \nabla f_i(x_n)|x_n] = E\left[\frac{f(x_n + c_n\Delta_n) - f(x_n - c_n\Delta_n)}{2c_n\Delta_{n,i}} - \Delta_{n,i}^{-1}\langle f'(x_n), \Delta_n\rangle\Big|x_n\right]$$
$$= \frac{c_n^2}{12}E\big[\Delta_{n,i}^{-1}(f^{(3)}(t_n^+) + f^{(3)}(t_n^-))[\Delta_n\otimes\Delta_n\otimes\Delta_n]\,\big|\,x_n\big]. \qquad (3.23)$$
The magnitude of the right-hand side of (3.23) is bounded by
$$\frac{c_n^2K_3}{6}\sum_{j=1}^d\sum_{k=1}^d\sum_{l=1}^d E\left|\frac{\Delta_{n,j}\Delta_{n,k}\Delta_{n,l}}{\Delta_{n,i}}\right| \le \frac{c_n^2K_3}{6}\big\{[d^3 - (d-1)^3]K_1^2 + (d-1)^3K_2K_1^3\big\}. \qquad (3.24)$$
Similarly, since $E[\delta_{n,i}^+ + \delta_{n,i}^-|x_n] = 0$ a.s. for $i = 1, \ldots, d$, Taylor series expansions give
$$\nabla f_i(x_n \pm c_n\Delta_n) = \nabla f_i(x_n) \pm c_n\langle\nabla^2 f_i(x_n), \Delta_n\rangle + \frac{c_n^2}{2}\langle\nabla^3 f_i(s_n^\pm)\Delta_n, \Delta_n\rangle, \qquad (3.25)$$
where $s_n^+$ and $s_n^-$ lie on the line segments between $x_n$ and $x_n \pm c_n\Delta_n$, respectively.
Using (3.25),
$$E[g_{TAR,i}(x_n) - \nabla f_i(x_n)|x_n] = E\left[\frac{\nabla f_i(x_n + c_n\Delta_n) + \nabla f_i(x_n - c_n\Delta_n)}{2} - \nabla f_i(x_n)\Big|x_n\right]$$
$$= \frac{c_n^2}{4}E\big[\Delta_n^T(\nabla^3 f_i(s_n^+) + \nabla^3 f_i(s_n^-))\Delta_n\,\big|\,x_n\big]. \qquad (3.26)$$
The magnitude of the right-hand side of (3.26) is bounded by
$$c_n^2K_3\sum_{i=1}^d E[\Delta_{n,i}^2] \le c_n^2K_3dK_1^2. \qquad (3.27)$$
Applying the bounds (3.24) and (3.27), we obtain
$$|b_{n,i}(x_n)| \le (1 - \alpha_{n,i})c_n^2K_3dK_1^2 + \frac{\alpha_{n,i}c_n^2K_3}{6}\big\{[d^3 - (d-1)^3]K_1^2 + (d-1)^3K_2K_1^3\big\}. \qquad (3.28)$$
The right-hand side of (3.28) converges to zero since $c_n \to 0$ as $n \to \infty$, and this holds for all $i$, which concludes the proof. ∎

Theorem 3.3.4. Let $\{x_n\}$ be a sequence generated using recursion (1.2) with $\Theta = \mathbb{R}^d$ and the STAR-SPSA gradient estimate defined in (3.17). Assume the conditions of Lemma 3.3.3 hold, in addition to the following:

1. There exist positive sequences $\{a_n\}$ and $\{c_n\}$ such that $c_n \to 0$ as $n \to \infty$, $\sum_{n=1}^\infty a_n = \infty$, and $\sum_{n=1}^\infty a_n^2/c_n^2 < \infty$.
2. There exist positive constants $C_1, C_2, C_3$ such that $E[\nabla\tilde{f}_i(x_n \pm c_n\Delta_n)^2] \le C_1$, $E[\tilde{f}(x_n \pm c_n\Delta_n)^4] \le C_2^2$, and $E[\Delta_{n,i}^{-4}] \le C_3^2$ for $i = 1, \ldots, d$.
3. For all $n$, $\|x_n\| < \infty$ a.s.
4. $x^*$ is an asymptotically stable solution of the ODE $\partial x(t)/\partial t = -\nabla f(x)$.
5. There exists a compact set $D \subseteq D(x^*)$ such that $x_n \in D$ infinitely often, where $D(x^*)$ is the domain of attraction (i.e., $D(x^*) = \{x_0 \mid \lim_{t\to\infty}x(t|x_0) = x^*\}$).

Then $x_n \to x^*$ a.s. as $n \to \infty$.

Proof. By Lemma 2.2.1 and Theorem 2.3.1 in [39], if conditions 1-5 are satisfied, then $x_n \to x^*$ a.s. as $n \to \infty$ provided
(a) $\|b_n(x_n)\| < \infty$ for all $n$ and $b_n(x_n) \to 0$ a.s. as $n \to \infty$, and
(b) $\lim_{n\to\infty}P\big(\sup_{m\ge n}\|\sum_{k=n}^m a_ke_k(x_k)\| \ge \eta\big) = 0$ for all $\eta > 0$.

Since (a) follows directly from Lemma 3.3.3, it remains to establish (b). Applying Doob's inequality to the martingale sequence $\{\sum_{i=n}^m a_ie_i(x_i)\}_{m\ge n}$ yields
$$P\left(\sup_{m\ge n}\Big\|\sum_{k=n}^m a_ke_k\Big\| \ge \eta\right) \le \eta^{-2}E\left[\Big\|\sum_{k=n}^\infty a_ke_k\Big\|^2\right] = \eta^{-2}\sum_{k=n}^\infty a_k^2E\|e_k\|^2. \qquad (3.29)$$
Consider $i \in \{1, \ldots, d\}$. The second moment of the $i$th gradient component can be expressed as
$$E[g_{STAR,i}^2(x_n)] = \alpha_{n,i}^2E[g_{SPSA,i}^2(x_n)] + 2\alpha_{n,i}(1 - \alpha_{n,i})E[g_{TAR,i}(x_n)g_{SPSA,i}(x_n)] + (1 - \alpha_{n,i})^2E[g_{TAR,i}^2(x_n)]. \qquad (3.30)$$
It remains to show that each of the three terms on the right-hand side of (3.30) is bounded. Applying $|a + b|^r \le 2^{r-1}(|a|^r + |b|^r)$ and Hölder's inequality, we obtain
$$E[g_{TAR,i}^2(x_n)] \le \frac{1}{2}E\big[\nabla\tilde{f}_i(x_n + c_n\Delta_n)^2 + \nabla\tilde{f}_i(x_n - c_n\Delta_n)^2\big] \le C_1 \qquad (3.31)$$
and
$$E[g_{SPSA,i}^2(x_n)] \le \frac{1}{2c_n^2}E\left[\frac{\tilde{f}(x_n + c_n\Delta_n)^2 + \tilde{f}(x_n - c_n\Delta_n)^2}{\Delta_{n,i}^2}\right]$$
$$\le \frac{1}{2c_n^2}\left(E[|\tilde{f}(x_n + c_n\Delta_n)|^4]^{1/2} + E[|\tilde{f}(x_n - c_n\Delta_n)|^4]^{1/2}\right)E[\Delta_{n,i}^{-4}]^{1/2} \le \frac{C_2C_3}{c_n^2}. \qquad (3.32)$$
Then, applying the Cauchy-Schwarz inequality together with (3.31) and (3.32) gives
$$E[g_{TAR,i}(x_n)g_{SPSA,i}(x_n)] \le E[g_{TAR,i}^2(x_n)]^{1/2}E[g_{SPSA,i}^2(x_n)]^{1/2} \le \frac{\sqrt{C_1C_2C_3}}{c_n}. \qquad (3.33)$$
Combining (3.31), (3.32), and (3.33) yields
$$E[\|e_n\|^2] \le d\left(\frac{C_2C_3}{4c_n^2} + \frac{\sqrt{C_1C_2C_3}}{2c_n} + \frac{C_1}{4}\right). \qquad (3.34)$$
By condition 1, (3.34), and (3.29), (b) holds. ∎

3.3.3 Numerical Experiments

3.3.3.1 9-Station Closed Jackson Network

For the multidimensional case, we implement RM, SPSA, and STAR-SPSA on a 9-station closed Jackson queueing network problem from [35], depicted in Figure 3.4 with the associated transition probabilities. In a closed Jackson queueing network, the number of customers in the system remains fixed, hence the absence of arrows from outside the network.

Figure 3.4: 9-station closed Jackson queueing network.
The objective is to maximize the total throughput $TP(x)$ of the system (total number of customers served divided by total time), given restrictions on the mean service times $x^{(i)}$, $i = 1, \ldots, d$, which can be summarized as
$$\max_{x \in \Theta} TP(x) \quad \text{s.t.} \quad \sum_{i=1}^d x^{(i)} = M, \quad x^{(i)} > 0 \ \text{for } i = 1, \ldots, d,$$
where $M = 10$ and $d = 9$ (the number of stations). The network consists of 9 first-come, first-served stations, where the service time at station $i$ is exponentially distributed with mean $x^{(i)}$, and there are 10 customers in the system.

The equality and positivity constraints prevent straightforward implementation of the gradients and the projection operator. For the direct gradient, we use the IPA gradient estimate designed specifically for the total throughput of a closed Jackson queueing network proposed in [34], with a slight modification: we rewrite one of the components as a function of the others and apply the chain rule to account for the equality constraint (i.e., $x^{(r)} = M - \sum_{i=1, i \ne r}^d x^{(i)}$, where $r$ is an arbitrary integer between 1 and $d$, inclusive). For the SP gradient estimator, the components of the random vector $\Delta_n$ must sum to zero to ensure the perturbed point satisfies the equality constraint (i.e., $\Delta_{n,r} = 0$ and $\Delta_{n,i} = \pm 1$ for $i \ne r$, with $\sum_{i=1, i \ne r}^9 \Delta_{n,i} = 0$). In addition, each perturbed component must remain positive, i.e., $x_n^{(i)} > c_n$ for all $i$. After each iteration, a projection must be applied to maintain feasibility; however, an orthogonal projection onto the hyperplane could violate the $x_n^{(i)} > c_n$ condition for some $i$, so we make a slight adjustment by projecting the current iterate back to the previous iterate if the condition is violated.

In the SA algorithms, we use the parameters $a_n = \theta_a(n+1)^{-1}$, $c_n = \theta_c(n+1)^{-1/4}$, $\alpha_{n,i} = c_n^2/(1+c_n^2)$, uniform starting values $x_0^{(i)} = 10/9$ for $i = 1, \ldots, 9$, 20 macroreplications, and a stopping time of $N = 50$ for SPSA and STAR-SPSA and $N = 100$ for RM. For simplicity, we drop the second subscript $i$ from the convex weight $\alpha_{n,i}$, since the weights are identical for all $i$. The update occurs for only eight components, since the $r$th component is determined by the binding equality constraint. In our preliminary experiments, however, the SA algorithms often performed poorly when $r$ was fixed across iterations, so instead we let $r$ be a uniform random integer between 1 and 9, inclusive. For each simulation run, we generate the throughput and gradient estimates using 300 customers, which is large enough for a valid nonzero direct gradient; if the direct IPA gradient is generated with an insufficient number of customers, the gradient can be exactly zero or invalid because its denominator is zero.

Figure 3.5: 9-station closed Jackson queueing network, $\sum_{i=1}^9 x^{(i)} = 10$, $x^{(i)} > 0$, $a_n = \theta_a/(n+1)$, $\alpha_n = c_n^2/(1+c_n^2)$, $N = 50$, customers serviced = 300, macroreplications = 20; panels (a) $c_n = 0.1(n+1)^{-1/4}$ and (b) $c_n = 0.3(n+1)^{-1/4}$, each plotting the MSE of $x_N$, the MSE of $TP(x_N)$, and the mean throughput against $\theta_a$ for RM, SPSA, and STAR-SPSA.

We compute the total MSE of the estimates $x_N^{(i)}$, $i = 1, \ldots, 9$,
and the MSE of the throughput $TP(x_N)$, using $x^* = (1.66, .68, .983, 1.66, .68, .983, 1.66, .68, .983)$ and $TP(x^*) = 4.82$, as well as the mean throughput with a 95% confidence band over the 20 macroreplications, for a range of step sizes and two different perturbation sizes, i.e., $\theta_a \in \{5k \mid k = 1, \ldots, 10\}$ and $\theta_c \in \{0.1, 0.3\}$. Figures 3.5a and 3.5b illustrate performance as a function of $\theta_a$ for the two perturbation settings $\theta_c = 0.1$ and $0.3$, respectively.

We vary the step size and perturbation size to investigate the sensitivity of STAR-SPSA. The leftmost graphs in both figures show that STAR-SPSA significantly outperforms SPSA for all choices of $a_n$, in terms of both the MSE of the estimates and the throughput. Similarly, STAR-SPSA performs better than RM in both metrics, with the exception of the case $\theta_a = 15$, where RM attains a slightly lower MSE of $x_N$ and a higher MSE of $TP(x_N)$, though the results are fairly close. The standard errors are within $\pm 0.1$. In both Figures 3.5a and 3.5b, the STAR-SPSA results do not appear to be very sensitive to the step size, especially the throughput. In fact, combining the direct and indirect gradients often improves performance, and the difference can be significant, as in the case $\theta_a = 50$, $c_n = 0.1(n+1)^{-1/4}$. Furthermore, the STAR-SPSA mean throughput is always greater than that of SPSA, and is either greater than RM's or very close to it. In addition, the confidence bands are much tighter for STAR-SPSA for both $\theta_c = 0.1$ and $0.3$. From these results, we conclude that incorporating the SP gradient will either improve finite-time performance or, in the worst case, not hurt it significantly.

Orthogonal Projection Algorithm
• Step 0. Input: $X_n$, $M$, $c_n$.
• Step 1. Set $X_{n+1} = X_n$.
• Step 2. For $i = 1, \ldots, d$, if $X_{n,i} < c_n$, set $X_{n+1,i} = c_n$.
• Step 3. Let
  - $P_0 = (M/d)\,\mathbf{1}$,
  - $\mathcal{N} = (1/\sqrt{d})\,\mathbf{1}$ (the unit normal to the constraint hyperplane),
  - $T = X_{n+1} - P_0$,
  - $t = \langle T, \mathcal{N} \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the dot product,
  - $X_{n+1} = X_{n+1} - t\,\mathcal{N}$.
• Step 4. If $X_{n+1,i} \ge c_n$ for all $i$, return $X_{n+1}$. Otherwise, return $X_n$.

Random Perturbation $\Delta_n$
• Step 0. Input: $d$.
• Step 1. If $d$ is odd, generate $r \sim \mathrm{Unif}\{1, \ldots, d\}$ and set $\Delta_{n,r} = 0$. If $d$ is even, let $r = 0$.
• Step 2. Generate $r_i \sim \mathrm{Unif}(0,1)$ for $i = 1, \ldots, d$, $i \ne r$, and let $m$ be the median of $\{r_i : i \ne r\}$.
• Step 3. For $i = 1, \ldots, d$, $i \ne r$, set $\Delta_{n,i} = -1$ if $r_i < m$ and $\Delta_{n,i} = 1$ otherwise.
• Step 4. Return $\Delta_n$.

IPA Algorithm: Closed Jackson Queueing Network
• Step 0. Set $A = 0$, where $0$ is a $d \times d$ zero matrix, $n = 1$, and $N$ = number of customers serviced.
• Step 1. Suppose the next customer to finish service is currently located at node $i$. Increment $A_{i,i}$ by the service time $s_i$.
• Step 2. Suppose the customer leaving node $i$ moves to node $j$. If node $j$ is empty upon arrival, set $A_{:,j} = A_{:,i}$.
• Step 3. Set $n = n + 1$. If $n < N$, go back to Step 1. If $n = N$, set $T$ = current time and go to Step 4.
• Step 4. Calculate the minimum and maximum values of each row, and collect them as $A^- = (A_1^-, \ldots, A_d^-)$ and $A^+ = (A_1^+, \ldots, A_d^+)$, respectively.
• Step 5. Compute
$$\frac{S_k}{TP} \cdot \frac{\partial TP}{\partial S_k} = \frac{A_k^-}{T - A_k^+ + A_k^-},$$
where $S_k$ is the mean service time of server $k$ and $TP = N/(T - A_k^+ + A_k^-)$.

IPA Algorithm: Closed Jackson Queueing Network with Equality Constraint
• Step 1. Apply the IPA gradient method for the unconstrained closed Jackson queueing network throughput.
• Step 2. Generate $r \sim \mathrm{Unif}\{1, \ldots, d\}$.
• Step 3. For $k = 1, \ldots, d$, set $\partial TP_c/\partial X_k = \partial TP/\partial X_k - \partial TP/\partial X_r$.
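The perturbation and projection listings above translate directly into code. The following Python sketch (ours, for illustration; NumPy is assumed, and the function names are hypothetical) implements the sum-to-zero perturbation and the constrained orthogonal projection:

```python
import numpy as np

def sum_zero_perturbation(d, rng):
    """Random +/-1 perturbation summing to zero (one coordinate zeroed if d is odd)."""
    delta = np.empty(d)
    idx = np.arange(d)
    r = rng.integers(d) if d % 2 == 1 else None
    if r is not None:
        delta[r] = 0.0
        idx = idx[idx != r]
    u = rng.random(idx.size)            # idx.size is even, so the median splits it
    m = np.median(u)
    delta[idx] = np.where(u < m, -1.0, 1.0)
    return delta

def project_feasible(x_new, x_old, M, c_n):
    """Orthogonal projection onto the hyperplane sum(x) = M; fall back to the
    previous iterate if any projected component violates the margin c_n."""
    d = x_new.size
    y = np.maximum(x_new, c_n)                 # Step 2: lift small components
    n_vec = np.ones(d) / np.sqrt(d)            # unit normal of the hyperplane
    y = y - np.dot(y - M / d, n_vec) * n_vec   # Step 3: remove normal component
    return y if np.all(y >= c_n) else x_old    # Step 4: revert if infeasible
```

Because the all-ones direction is the normal of the hyperplane $\sum_i x^{(i)} = M$, subtracting the normal component restores the equality constraint exactly, while the fallback reproduces the project-back-to-previous-iterate safeguard described above.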
3.4 Summary and Future Work

We have introduced the first set of stochastic approximation algorithms that integrate both direct and indirect gradient estimates, for single and multidimensional problems. Our new hybrid gradient estimates use a symmetric finite difference-type gradient estimate for the indirect gradient and an average of the associated direct gradients for the direct gradient. The crux of the algorithm lies in the convex weight, derived to minimize the variance of the hybrid gradient estimate. Under mild conditions, we have analytically shown the improvement over the individual direct and indirect gradient estimates in terms of variance, which in turn can improve SA performance. We have demonstrated the promise of STAR-SA numerically on a one-dimensional stylized problem, and of STAR-SPSA on a multidimensional queueing network. Previous work had considered SA with either direct or indirect gradients only, whereas we also exploit the information contained in the indirect gradient estimates, which we have shown to be theoretically and practically beneficial.

Our hybrid technique can be generalized to include other gradient estimates within the two general gradient categories, e.g., IPA with LR/SF. In addition, we can adaptively choose both the weight sequence and the perturbation sequence. Currently, the variance-minimizing weights (3.2) for the homogeneous case are deterministic and depend on both the gradient and function noise, which are unknown. As $c_n \to 0$, the influence of the finite difference gradient estimate diminishes and the weight on the Tangents AveRaged gradient estimate approaches 1. Ideally, independent of the noise level, the weights would put more emphasis on the gradient estimate with higher accuracy. Since we are in a setting where both direct and indirect gradients are available, we have access to various gradient approximations, i.e., $f'(x_n \pm c_n)$, $\left(f'(x_n + c_n) + f'(x_n - c_n)\right)/2$, and $\left(f(x_n + c_n) - f(x_n - c_n)\right)/(2c_n)$. At each iteration, these values can be used to determine which gradient is more accurate, $g_S$ or $g_{TAR}$. If the sample performances are noisy and the perturbation size is very small, then the finite difference gradient could be very inaccurate, so ideally less emphasis is placed on $g_S$; yet $c_n$ should not be decreased further, since small perturbation sizes can lead to noisy gradients. Instead, we propose to decrease $\alpha_n$ in two ways: 1) decrease $c_n$ and 2) increase the constant $(\sigma_f/\sigma_g)^2$. How best to make these adjustments remains to be explored.

Chapter 4

Step Size Selection in Stochastic Approximation

4.1 Sensitivity of Finite-time Performance to Step Size

The asymptotic theory of stochastic approximation algorithms guarantees almost sure convergence under certain conditions. The most common requirements restrict the step size sequence $\{a_n\}$, but even this reduced set of admissible step sizes still allows an uncountable number of choices. It is impossible to find a universally optimal deterministic step size unless more is known about the geometric structure of the function: given any step size, one can always find a function on which it performs poorly. For example, in the one-dimensional case, if a large step size (chosen to ensure the iterates have an opportunity to move around the feasible region) is applied to a very steep function, the iterates might oscillate back and forth, moving further and further away from the optimum for an extended period before making any progress towards the true solution, as seen in Figure 4.1a.
The opposite could also occur: the function is extremely flat and the step size is relatively small by comparison, so as the number of iterations increases, the iterates barely make any progress, and once the stopping time is reached the algorithm returns a poor estimate in the region where it is currently "stuck," as illustrated in Figure 4.1b. The finite-time performance of stochastic approximation algorithms is thus sensitive to the step size, and there are clear disadvantages to using an arbitrarily chosen deterministic step size.

Figure 4.1: Illustration of the sensitivity of SA to $\{a_n\}$: (a) $a_n$ "too large" relative to the gradient; (b) $a_n$ "too small" relative to the gradient.

To circumvent this issue, adaptive step sizes have been proposed that adjust based on the ongoing performance of the algorithm. Ideally, an adaptive step size algorithm recognizes the current estimate's proximity to the true solution, its path behavior and trend, and the geometric structure near the current iterate, and adjusts the step size accordingly. Before proposing our new adaptive step sizes, we conduct preliminary tests on existing finite-time theory for MSE bounds and on two adaptive step sizes, Kesten's rule and SSKW.

4.1.1 KW and its Variants

For our preliminary numerical experiments, we focus on the one-dimensional KW algorithm, which generates $\widehat{\nabla} f(X_n)$ in (1.2) using finite differences. Although theoretical convergence can be guaranteed by satisfying certain requirements, practical performance depends on the choice of tuning sequences. In addition to selecting a gain sequence $\{a_n\}$ in (1.2), the KW algorithm requires the additional task of choosing a finite difference perturbation sequence $\{c_n\}$ for the gradient, and its finite-time performance depends on both sequences. Because of the sensitivity of the KW algorithm to the tuning sequences, it is essential to choose an appropriate pair. In practice, KW can exhibit the following shortcomings: a long oscillatory period if the gain sequence $\{a_n\}$ is "too large," a degraded convergence rate if $\{a_n\}$ is "too small," and poor gradient estimates if the perturbation sequence $\{c_n\}$ is "too small."

We conduct an empirical investigation of the sensitivity of KW and two of its adaptive variants, namely Kesten's rule and the scaled-and-shifted KW (SSKW) algorithm of [7]. Our goal is to identify problem characteristics that exert a strong impact on algorithm performance, even in the presence of theoretical guarantees. For example, in the numerical results reported in [7], SSKW outperforms the KW algorithm in terms of both MSE and oscillatory behavior in finite time; however, this result is obtained using what seem to be nearly worst-case parameter settings for KW. We replicate these results, but we also find that the performance of KW can be significantly improved over a fairly wide choice of parameter settings. Although the worst-case performance of SSKW is much better than that of KW, KW provides the best performance in a significant proportion of problem instances.
In addition, we find that Kesten's rule performs similarly to KW, and sometimes better, when both algorithms begin from the same initial value. We also investigate the finite-time MSE bound in [7] and characterize instances where this bound is tight. These results underscore the well-known difficulty of tuning, even for adaptive versions of KW.

4.2 Finite-time MSE Bound

Asymptotic convergence properties of the KW algorithm and its variations have been a major research focus in SA. Convergence proofs in MSE appear in [18], [16], [19], [58], and [49] for various assumptions and modifications of the KW algorithm. However, in practice, where the run length is finite, a good finite-time bound on the MSE is useful. Applying a technique similar to that of [18], [7] derived a finite-time bound for the MSE of the KW algorithm. The bound depends on certain problem-dependent constants, which are typically difficult to calculate in practice. However, in the special case where $f$ is quadratic, the bound can be computed in closed form, allowing us to observe its tightness. We briefly summarize the bound as follows.

First, we make the following assumptions on the function $f(x)$:

1. There exist positive constants $K_0$, $K_1$, and $C_0$ such that for every $c \in [0, C_0]$,
$$K_1(x - x^*)^2 \le -\frac{f(x+c) - f(x-c)}{2c}\,(x - x^*) \le K_0(x - x^*)^2.$$
2. $f'(x)(x - x^*) < 0$ for all $x \in \mathbb{R}\setminus\{x^*\}$.

We also assume that the tuning sequences satisfy:

1. $a_n/c_n^2 \le (a_{n+1}/c_{n+1}^2)(1 + A a_{n+1})$ for all $n \ge 1$,
2. $a_n \to 0$ as $n \to \infty$,

with $0 < A < 2K_0$. Then
$$E(X_{n+1} - x^*)^2 \le C\, a_n/c_n^2 \quad \text{for all } n \ge 1, \qquad (4.1)$$
where $C$ is a constant explicitly defined as
$$C = \max\left\{\frac{-\sigma^2}{\xi},\ \max_{1 \le n \le n_0}\left\{\frac{c_n^2}{a_n} B_{n+1}\right\}\right\},$$
and
$$D_n = K_1^2 A a_n^2 + (K_1^2 - 2AK_0) a_n - (2K_0 - A),$$
$$n_0 = 1 \ \text{if } D_n < 0 \text{ for all } n \ge 1, \ \text{and } \sup\{n \ge 1 : (K_1^2 - 2AK_0)a_n + K_1^2 A a_n^2 \ge 2K_0 - A\} + 1 \ \text{otherwise},$$
$$\xi = \sup\{A - 2K_0 + (K_1^2 - 2AK_0)a_n + K_1^2 A a_n^2 : n \ge n_0\},$$
$$B_n = X_1^2 \prod_{i=1}^n p_i + \sum_{i=2}^{n-1} q_i \prod_{j=i+1}^n p_j + q_n,$$
$$p_i = 1 - 2a_i K_0 + K_1^2 a_i^2, \qquad q_i = \frac{a_i^2}{c_i^2}\,\sigma^2, \qquad i = 1, 2, \ldots, n,$$
$$\sigma^2 = \sup_{x \in \Theta} \mathrm{Var}\big[\tilde f(X_n + c_n) - \tilde f(X_n - c_n) \mid X_n = x\big].$$
Equation (4.1) does not guarantee convergence in MSE, but rather establishes a finite bound at each iteration. The bound is thus more useful when it is tight. In Section 4.3.1, we investigate the tightness of (4.1) by comparing the bound with the exact MSE for simple quadratic functions of the form $f(x) = \alpha x^2$, where $\alpha < 0$ and the optimum is $x^* = 0$. The exact MSE can be computed as
$$E(X_{n+1} - x^*)^2 = X_1^2 \prod_{i=1}^n (1 + 2\alpha a_i)^2 + \frac{\sigma^2}{2} \sum_{k=1}^{n-1} \frac{a_k^2}{c_k^2} \prod_{j=k+1}^n (1 + 2\alpha a_j)^2 + \frac{a_n^2 \sigma^2}{2 c_n^2}. \qquad (4.2)$$

4.3 Numerical Experiments

4.3.1 Tightness of the Finite-time MSE Bound for Quadratics

We generated the MSE bound in (4.1) and the exact MSE in (4.2) for quadratic functions with various noise levels and initial values for three cases: 1) $f(x) = -0.001x^2$, $c_n = 1/n^{1/2}$; 2) $f(x) = -0.15x^2$, $c_n = 1/n^{1/2}$; and 3) $f(x) = -0.15x^2$, $c_n = 1/n^{1/4}$. The MSE bound is a function of constants that are not unique, provided they satisfy the function and sequence assumptions above. We picked the largest $K_0$ and smallest $K_1$ satisfying function assumption 1, and $A$ slightly less than $2K_0$. Table 4.2 lists the constants used in our calculations of the MSE bound (4.1), and the exact MSE and MSE bound are listed in Table 4.1.

The exact MSE (4.2) is the sum of three components. The first term on the right-hand side (RHS) of (4.2) is independent of $\sigma$ and is dominated by the initial value $X_1$. The second and third terms are dominated by $\sigma$. When $\sigma \in \{0.001, 0.01, 0.1, 1.0\}$, both terms are small ($< 1$), but when $\sigma = 10$, the RHS is dominated by the second term. Therefore, the exact MSE increases with $|X_1|$ and $\sigma$.
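The exact MSE (4.2) is easy to compute by unrolling it as a forward recursion. The following Python sketch (ours, for illustration; it assumes independent observation noise with standard deviation $\sigma$ at each function evaluation, matching the convention in (4.2)):

```python
import numpy as np

def exact_mse(alpha, X1, sigma, n, a=lambda k: 1.0 / k, c=lambda k: k ** -0.25):
    """Exact MSE (4.2) of the KW iterate for f(x) = alpha*x**2 (alpha < 0, x* = 0),
    via the equivalent forward recursion
        E[X_{k+1}^2] = (1 + 2*alpha*a_k)^2 E[X_k^2] + a_k^2 sigma^2 / (2 c_k^2)."""
    m = X1 ** 2
    for k in range(1, n + 1):
        m = (1.0 + 2.0 * alpha * a(k)) ** 2 * m + a(k) ** 2 * sigma ** 2 / (2.0 * c(k) ** 2)
    return m

# e.g., the f(x) = -0.001x^2, c_n = 1/n^{1/2} column of Table 4.1 with X1 = -10:
print(exact_mse(alpha=-0.001, X1=-10.0, sigma=0.001, n=10000, c=lambda k: k ** -0.5))
```

For the parameters shown, the noise terms are negligible and the value is driven by $X_1^2 \prod_i (1 + 2\alpha a_i)^2$, giving approximately 96.16, consistent with the corresponding entry in Table 4.1.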
Using the parameters in Table 4.2, the constant $C$ in the MSE bound can be expressed as
$$C = \max\left\{\frac{-\sigma^2}{\xi},\ \frac{c_1^2}{a_1}\left(X_1^2\,(1 - 2a_1K_0 + K_1^2 a_1^2)(1 - 2a_2K_0 + K_1^2 a_2^2) + \frac{a_2^2}{c_2^2}\,\sigma^2\right)\right\}. \qquad (4.3)$$
The first term in (4.3) dominates when $\sigma$ is large, since $\xi = -0.001$ and $-0.1$ for $f(x) = -0.001x^2$ and $-0.15x^2$, respectively. Therefore, the MSE bound, and the difference between the exact MSE and the bound, increase significantly when $\sigma$ increases from 1.0 to 10.0. Otherwise, the MSE bound equals the second term, which increases with $|X_1|$ and $\sigma$.

Table 4.1: Finite-time MSE bound and exact MSE for KW with n = 10000, a_n = 1/n.

                    f(x) = -0.001x^2,         f(x) = -0.15x^2,          f(x) = -0.15x^2,
                    c_n = 1/n^{1/2}           c_n = 1/n^{1/2}           c_n = 1/n^{1/4}
σ       X_1         Exact       Bound         Exact      Bound          Exact      Bound
0.001   0           0.00        0.00          0.00       0.00           0.00       0.00
        -5          24.04       24.85         0.06       8.85           0.06       0.09
        -10         96.16       99.40         0.24       35.40          0.24       0.35
        -20         384.64      397.61        0.95       141.61         0.95       1.42
        -40         1538.56     1590.42       3.78       566.44         3.78       5.66
0.01    0           0.00        0.10          0.00       0.00           0.00       0.00
        -5          24.04       24.85         0.06       8.85           0.06       0.09
        -10         96.16       99.40         0.24       35.40          0.24       0.35
        -20         384.64      397.61        0.95       141.61         0.95       1.42
        -40         1538.56     1590.42       3.78       566.44         3.78       5.66
0.1     0           0.05        10.0          0.01       0.10           0.00       0.00
        -5          24.09       24.86         0.07       8.86           0.06       0.09
        -10         96.21       99.41         0.24       35.41          0.24       0.35
        -20         384.69      397.61        0.95       141.62         0.95       1.42
        -40         1538.61     1590.43       3.79       566.45         3.78       5.66
1.0     0           4.8         999.99        0.83       10.00          0.03       0.10
        -5          28.84       999.99        0.89       10.00          0.09       0.10
        -10         100.96      999.99        1.07       35.90          0.27       0.36
        -20         389.44      999.99        1.78       142.11         0.98       1.42
        -40         1543.36     1590.92       4.61       566.94         3.81       5.67
10.0    0           480.08      99999.20      83.23      999.79         3.20       9.99
        -5          504.12      99999.20      83.29      999.79         3.26       9.99
        -10         576.24      99999.20      83.47      999.79         3.43       9.99
        -20         864.72      99999.20      84.18      999.79         4.14       9.99
        -40         2018.64     99999.20      87.01      999.79         6.98       9.99

Table 4.2: Finite-time MSE bound parameters for KW.

f(x)          a_n     c_n           K_0      K_1      A        n_0     ξ
-0.001x^2     1/n     1/n^{1/2}     0.002    0.002    0.003    1       -0.001
-0.15x^2      1/n     1/n^{1/2}     0.3      0.3      0.5      1       -0.1
-0.15x^2      1/n     1/n^{1/4}     0.3      0.3      0.5      1       -0.1

Table 4.1 contains the exact MSE and MSE bound for the three parameter settings and for $\sigma \in \{0.001, 0.01, 0.1, 1.0, 10.0\}$. The first column pair presents results for $f(x) = -0.001x^2$, $c_n = 1/n^{1/2}$. In the presence of more noise, i.e., $\sigma = 10.0$, the MSE bound is 99,999.20, which is the first term in (4.3), for every initial value; the difference between this bound and the exact MSE is significant, greater than 97,500. For $\sigma = 1.0$, the MSE bound takes the second term in (4.3) only when the starting position is farthest from the optimum, i.e., $X_1 = -40$, where it is tight. When the initial value is closer to the optimum, i.e., $X_1 \in \{0, -5, -10, -20\}$, the MSE bound equals 999.99, the first term in (4.3), and is thus significantly greater than the exact MSE. The MSE bound is very tight for the rest of the cases, with the exception of $\sigma = 0.1$ and $X_1 = 0$. For the second column pair, $f(x) = -0.15x^2$ with $c_n = 1/n^{1/2}$, the MSE bound is significantly greater than the exact MSE across the board. The third column pair reports results for $f(x) = -0.15x^2$ with $c_n = 1/n^{1/4}$; there the MSE bound is tight in all cases except $\sigma = 10.0$. The bound thus appears to be a useful guideline for problems with low variance, but becomes less tight as the noise level increases.

4.3.2 Sensitivity of KW and its Variants

We compare the MSE performance of KW and two of its variants described in Section 2.2.3, Kesten's rule and SSKW. All experiments were implemented with $a_n = \theta_a/n$, $c_n = \theta_c/n^s$, where $s \in \{1/4, 1/2\}$, $\theta_a > 0$, $\theta_c > 0$, 10000 iterations, and 1000 sample paths.
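For reference, the Kesten variant used in these comparisons can be sketched compactly. The following Python code is an illustration written by analogy with the description in Section 2.2.3, not the code used in the experiments; it assumes maximization on the truncation interval $[-50, 50]$ of the experiments, and the function name is ours.

```python
import numpy as np

def kw_kesten(f_noisy, x1, theta_a, theta_c, s=0.25, n_iter=10000, lo=-50.0, hi=50.0):
    """One-dimensional KW with Kesten's rule (sketch): the step-size index k
    advances only when successive gradient estimates change sign, so the step
    size stays large while the iterate is still trending in one direction."""
    x, k, g_prev = x1, 1, None
    for n in range(1, n_iter + 1):
        c_n = theta_c / n ** s
        g = (f_noisy(x + c_n) - f_noisy(x - c_n)) / (2.0 * c_n)  # FD gradient
        if g_prev is not None and g * g_prev < 0:
            k += 1                      # sign change: presumed near optimum
        x = min(max(x + (theta_a / k) * g, lo), hi)  # ascent step, truncated
        g_prev = g
    return x
```

The truncation mirrors the interval used for SSKW in [7], so the three algorithms can be run on the same feasible region.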
Not surprisingly, the performance of SSKW relative to KW depends heavily on the chosen parameters, such as the truncation interval length, the initial value, and the tuning sequences. Our analysis replicates the results of [7], where SSKW performs significantly better than KW in terms of MSE and oscillatory period, but we find that the parameters chosen for that experiment are among the worst possible for KW, as illustrated in Figure 4.2 for KW and SSKW under $\theta_a = \theta_c = 1$. By choosing a different initial position, the performance of KW can be improved significantly, as demonstrated in Table 4.3 for the two functions $f(x) = -0.001x^2$ and $f(x) = 100e^{-0.006x^2}$. To offer a contrast with the quadratic, the second function is very steep with flat tails.

[7] compared the SSKW performance with that of KW, whose MSE is highly reliant on the tuning sequences and initial value. The MSE results for $f(x) = -0.001x^2$ using KW in [7] were poor because the initial position was chosen far from the optimum and the gain $a_n$ was too small to make noticeable progress towards it within 10000 iterations, so the iterates hover around the initial position. In our numerical experiments, we also consider $a_n = \theta_a/n$ and $c_n = \theta_c/n^{1/4}$ for $\theta_a, \theta_c > 0$. If $\theta_a = \theta_c = 1$ as in [7], but the initial value is 0.01 instead of 30, then the MSE of KW is significantly lower than that of SSKW.

Figure 4.2: MSE of the 10000th iterate of KW and Kesten's rule for three parameter settings each, and of SSKW, for $f(x) = -0.001x^2$, $\sigma = 0.001$, $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$ (KW: $\theta_a = 1, 500, 90$ with $\theta_c = 1, 4, 5$; Kesten: $\theta_a = 1, 10, 100$ with $\theta_c = 1, 5, 1$; SSKW: $\theta_a = \theta_c = 1$).

The first column of Table 4.3 compares the MSE of all three algorithms with $X_1 = 0.01$, and clearly KW outperforms SSKW in almost all cases. Of course, a practitioner would have no way of knowing whether the starting iterate was close to the true optimum, so these results do not indicate that KW will always perform well. They do indicate, however, that KW exhibits substantial variation in performance.

We also conduct a sensitivity analysis for $f(x) = -0.001x^2$ with various starting positions $X_1$ and multiplicative constants $\theta_a$ and $\theta_c$, and implement SSKW and KW using Kesten's rule. For the sensitivity analysis, we considered a wide selection of parameters: 19 initial values uniformly spaced within the truncation
interval, $X_1 \in \{-50 + 5k \mid k = 1, 2, \ldots, 19\}$; 45 different $\theta_a$ values, $\theta_a \in \{10^s k \mid k = 1, 2, \ldots, 9,\ s = 0, 1, \ldots, 4\}$; and 10 different $\theta_c$ values, $\theta_c \in \{10^s k \mid k = 1, 2, \ldots, 5,\ s = 0, 1\}$. In total, there are 8550 possible parameter combinations.

Table 4.3: MSE of the 100th, 1000th, and 10000th iterates for KW and its variants, with a_n = 1/n, c_n = 1/n^{1/4}.

                      f(x) = -0.001x^2, [-50, 50], X_1 = 0.01        f(x) = 100e^{-0.006x^2}, [-50, 50], X_1 = 30
σ       Alg.          100           1000          10000              100           1000          10000
0.001   SSKW          5.10x10^-2    1.70x10^-2    5.00x10^-3         5.07x10^-2    1.68x10^-2    4.84x10^-3
        KW            10^-4         10^-4         10^-4              763.8         653.3         431.4
        Kesten        1.12x10^-4    1.08x10^-4    1.04x10^-4         10^-7         3x10^-8       10^-8
0.01    SSKW          5.10x10^-2    1.70x10^-2    5.00x10^-3         5.07          1.68          4.90x10^-1
        KW            10^-4         10^-4         10^-4              763.8         653.3         431.2
        Kesten        2.10x10^-3    2.11x10^-3    2.05x10^-3         9.54x10^-6    2.76x10^-6    8.41x10^-7
0.1     SSKW          5.10x10^-2    1.70x10^-2    5.00x10^-3         165.8         57.4          16.0
        KW            10^-4         10^-4         10^-4              763.4         651.4         418.2
        Kesten        2.01x10^-1    2.03x10^-1    1.97x10^-1         5.65x10^-2    2.76x10^-4    8.41x10^-5
1.0     SSKW          5.10x10^-2    1.70x10^-2    5.00x10^-3         187.2         57.8          18.7
        KW            10^-4         10^-4         10^-4              722.5         562.5         415.7
        Kesten        20.1          20.3          19.7               456.9         315.1         239.7

The results show that KW and Kesten's rule are sensitive to the parameter choice, but near-optimal performance can be obtained with tuning. Figure 4.2 plots the MSE of KW for $f(x) = -0.001x^2$, $\sigma = 0.001$, against the initial value $X_1$ for different parameter choices; these cases are a good representation of the majority of MSE behaviors in the entire set of results. The case $\theta_a = \theta_c = 1$ is among the worst for KW and Kesten's rule, with the MSE represented by a nearly vertical line for both algorithms. For this parameter setting, SSKW beats KW and Kesten's rule significantly for all initial values, with the exception of $X_1 = 0$. For $\theta_a = 90$, $\theta_c = 5$, KW outperforms SSKW in a neighborhood of the optimum. However, there are cases, such as $\theta_a = 500$, $\theta_c = 4$ for KW and $\theta_a = 100$, $\theta_c = 1$ for Kesten's rule, that outperform SSKW for all initial values. Of the 8550 combinations varying all parameters, and of the 450 combinations with $X_1 = 30$, KW performs better than SSKW in 4275 and 215 cases, respectively, suggesting that KW requires some tuning to perform well, but that a fairly wide range of tunable parameters yields good performance. When KW performs better than SSKW, the difference is not as pronounced as when SSKW outperforms KW, but careful tuning can partially mitigate KW's sensitivity to parameters such as the initial iterate.

Figure 4.3: Sensitivity of KW to $\theta_a$ for $f(x) = -0.001x^2$, $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$, $n = 10000$; panels $\sigma = 0.01, 0.1, 1.0, 10.0$, with curves $\theta_c = 1, 10, 40$.

Figure 4.4: Sensitivity of SSKW for $f(x) = -0.001x^2$ as a function of $\phi_a$, $n = 10000$; panels $\sigma = 0.01, 0.1, 1.0, 10.0$.

Figure 4.3 plots the MSE of the 10000th iterate for $f(x) = -0.001x^2$, $a_n = \theta_a/n$, $c_n = \theta_c/n^{1/4}$, as a function of $\log\theta_a$ for $\theta_c = 1, 10, 40$. The case $\sigma = 0.001$ is omitted, because the results are similar to those for $\sigma = 0.01$.
For $\log\theta_a < 4$, the MSE decreases for each given value of $\theta_c$. For $\log\theta_a \ge 4$, however, the MSE behaves differently across noise levels, while the overall behavior as a function of $\theta_c$ is similar across noise levels. The MSE decreases for all $\theta_a$ as $\theta_c$ increases, so in the case $\theta_c = 40$ there is a wide range of $\theta_a$ values for which the MSE of KW is lower than that of SSKW. But the MSE of KW can also be extremely high if the tuning sequences are not chosen well. Moreover, we investigate the sensitivity of SSKW to $\phi_a$, the upper bound on the scale-up factor for $\{a_n\}$, as depicted in Figure 4.4. The MSE decreases until $\phi_a = 4$; thereafter it increases for $\sigma = 0.01, 0.1$, while it levels off for $\sigma \in \{1.0, 10.0\}$. It appears that for the lower noise levels, i.e., $\sigma \in \{0.01, 0.1\}$, $\phi_a = 4$ is the better choice, while $\phi_a = 10$ leads to lower MSE for $\sigma \in \{1.0, 10.0\}$.

Figure 4.5: MSE comparison of KW, Kesten's rule, and SSKW for $f(x) = 100e^{-0.006x^2}$, $a_n = 1/n$, $c_n = 1/n^{1/4}$, $n = 10000$; panels $\sigma = 0.01, 0.1, 1.0, 10.0$.

In addition, we implement KW and its variants using the same parameters as in [7] (i.e., $a_n = 1/n$, $c_n = 1/n^{1/4}$, $X_1 = 30$) on $f(x) = 100e^{-0.006x^2}$ to test the algorithms in the same setting on a different function. Figure 4.5 plots the MSE of the 10000th iterate as a function of the initial value. KW and Kesten's rule outperform SSKW within certain intervals around the optimum for $\sigma \in \{0.001, 0.01, 0.1, 1.0\}$, and the intervals where Kesten's rule performs better overlap those of KW. Where the intervals overlap, however, KW with the deterministic step size $1/n$ performs better than KW with Kesten's step size, as can be seen in Figure 4.5. Unfortunately, outside of those intervals, both algorithms tend to perform poorly. For the other noise levels, the intervals where KW and Kesten's rule outperform SSKW are larger. There is a tradeoff, however: if by chance the initial value is close to the boundary, the difference in performance can be drastic.

4.4 PROX-step

The idea behind our adaptive step size is easily conveyed by the two examples illustrated in Figures 4.6a and 4.6b. For a well-behaved function without drastic changes in its gradient between neighborhoods, as depicted in Figure 4.6b, it is possible to select a step size that leads to good finite-time performance, where the iterates approach the minimum at a good pace without heavy oscillation. However, the function in Figure 4.6a has a steep valley around the optimum, while the rest of the function is extremely flat. This type of function arises in likelihood functions typically used to find maximum likelihood estimators. If we use a single step size sequence, the performance will often be poor, since the two extreme regions require drastically different step sizes. For instance, a larger step size is appropriate in the flat area to avoid getting stuck at a poor estimate; however, once the iterate reaches the valley region, the larger step size is no longer appropriate, and the recursion will overshoot and launch the estimate back into a flat region.
This cycle will continue until the step size is sufficiently small that the iterates make progress towards the minimum, which is only possible if the step size was already small before entering the valley region, meaning many iterations were needed to reach that point.

Figure 4.6: Illustration of the PROX-step motivation: (a) contrasting regions; (b) well-behaved.

We propose an adaptive step size, inspired by both Kesten's rule and SSKW, that adjusts based on sample performances, gradient estimates, and parameter estimates, taking advantage of past data. Kesten's rule is Markovian-like, since the next step size depends on the previous gradient (or estimate) as well as the current gradient (or estimate). On the other hand, SSKW's adaptive mechanisms depend largely on the feasible search region and the two most recent parameter estimates. While these adaptive step sizes have significantly advanced SA algorithms, they have their pitfalls. Kesten's rule starts with an arbitrary non-increasing step size, which can suffer the same shortcomings as in the deterministic case if $a_n$ is too large; it does, however, prevent the step size from decreasing when $a_n$ is too small (assuming consecutive gradients have the same sign). An extension by [53] allows the step size to increase, but the increase can be slow. SSKW certainly addresses the issue of implementing SA with an inappropriately small or large $a_n$, but different regions of the feasible search space call for different step sizes. For instance, if the current iterate is in a very steep portion of the function, then SSKW will reduce the step size faster. However, in the next iteration after the step size reduction, the estimate could be located in a flat region, which requires a larger $a_n$ to accelerate convergence to a neighborhood of the optimum. In the worst case, the estimate gets stuck in a region because the newly reduced $a_n$ is too small, since aside from the initial phase, SSKW can only reduce $a_n$ and cannot increase it.

Although the underlying notion of our adaptive step size is motivated by Kesten's rule and SSKW, we approach adaptivity from a different angle. Our method adopts a memory-based approach, where the step size is adjusted based on the perceived proximity to the optimum, as in Kesten's rule, and is increased or decreased when deemed necessary, as in SSKW, but the adjustments are determined by the sample performance(s) in addition to the gradient(s) and parameter estimate(s) used in other methods. Each simulation run used to obtain the sample performance and gradient estimate (if available) is expensive, and the values generated contain information that can be used in future iterations to improve SA performance. We efficiently exploit the data collected in previous iterations to aid in selecting an appropriate step size for the current update.

For illustrative purposes, consider the simple one-dimensional KW algorithm. At each iteration, two sample performances are generated to compute the finite difference gradient estimate for the recursive update. After collecting a sufficient number of observations, we can roughly gauge how close the current estimate is to the optimum by comparing the current observation to the moving average of past sample performances. If the current iterate is considered close, then a smaller step size could prevent the next iterate from completely overshooting.
In contrast, if the estimate is believed to be farther away, then a larger step size could decrease the effort required to bring the iterate into closer range of the optimum. For a minimization problem, the estimate is perceived to be closer to the optimum if the current objective value is less than the mean of past objective values, and farther from the optimum if it is greater than the mean, i.e.,

• $f_n > \bar f_n$ (farther),
• $f_n < \bar f_n$ (closer),

where $f_n$ is the average of the sample performances generated at the $n$th iteration and $\bar f_n$ is the running average of all sample performances (i.e., for KW with symmetric differences, $f_n^+ = \tilde f(x_n + c_n)$, $f_n^- = \tilde f(x_n - c_n)$, $f_n = (f_n^+ + f_n^-)/2$, and $\bar f_n = \frac{1}{n}\sum_{i=1}^n f_i$).

Our adaptive step size begins with two positive fixed sequences $\{a_n^+\}$ and $\{a_n^-\}$, where $a_n^- < a_n^+$ for all $n$. We first tailor the adaptive step size to SA algorithms that involve a symmetric finite difference gradient (e.g., KW or STAR-SA), and then generalize it.

PROX-step for KW or STAR-SA
• Step 0. Input: $x_1 \in \Theta$, $\{a_n^+\}$, $\{a_n^-\}$, $\{c_n\}$, $N$.
• Step 1. Initialize $\bar f_0 = 0$, $n = 1$.
• Step 2. Generate $f_n^+$ and $f_n^-$ to compute $\widehat{\nabla} f(x_n)$.
• Step 3. Update the mean objective value $\bar f_n = \frac{n-1}{n}\bar f_{n-1} + \frac{1}{n} f_n$.
• Step 4. Set $a_n = a_n^-$ if $f_n < \bar f_n$, and $a_n = a_n^+$ if $f_n > \bar f_n$.
• Step 5. Update $x_{n+1} = \Pi_\Theta\left(x_n - a_n \widehat{\nabla} f(x_n)\right)$.
• Step 6. If $n < N$, set $n = n + 1$ and go to Step 2.
• Step 7. Output: $x^*_N = x_N$.

Generalized PROX-step for SA
• Step 0. Input: $x_1 \in \Theta$, $\{a_n^+\}$, $\{a_n^-\}$, $N$.
• Step 1. Initialize $\bar f_0 = 0$, $n = 1$.
• Step 2. Generate $n_d$ sample performances $f_n^{(i)}$, $i = 1, \ldots, n_d$, to compute $\widehat{\nabla} f(x_n)$.
• Step 3. Update the mean objective value $\bar f_n = \frac{n-1}{n}\bar f_{n-1} + \frac{1}{n} f_n$, where $f_n = \frac{1}{n_d}\sum_{i=1}^{n_d} f_n^{(i)}$.
• Step 4. Set $a_n = a_n^-$ if $f_n < \bar f_n$, and $a_n = a_n^+$ if $f_n > \bar f_n$.
• Step 5. Update $x_{n+1} = \Pi_\Theta\left(x_n - a_n \widehat{\nabla} f(x_n)\right)$.
• Step 6. If $n < N$, set $n = n + 1$ and go to Step 2.
• Step 7. Output: $x^*_N = x_N$.

4.5 Adaptive PROX-step

Although the PROX-step is adaptive, the two sequences used to generate it are deterministic, so it can still face some of the same shortcomings as a deterministic step size: each sequence $\{a_n^+\}$ and $\{a_n^-\}$ could be inappropriately selected for its corresponding region of the feasible space. We therefore propose a method to adjust the sequences based on past events. In particular, throughout the recursion, in addition to the information needed to implement the PROX-step, we track the step size used at each iteration to determine whether the smaller sequence $\{a_n^-\}$ is appropriate. If the step sizes used alternate consistently between $\{a_n^+\}$ and $\{a_n^-\}$, it is an indication either that we are close to the optimum or that the smaller step size is too big; in either case, it would be ideal to decrease the smaller step size sequence. Furthermore, in applying Kesten's rule, we do not have separate step sizes for each dimension, so instead we look at the total number of sign changes in the gradient: if more than half of the dimensions keep the same sign, we keep the same two step size sequences for the next iteration (i.e., $\{a_n^{\pm}\} = \{a_{n-1}^{\pm}\}$). Lastly, we look at the magnitude of each projection violation $\|x_{n+1} - \Pi(x_{n+1})\|_2$ and of each iteration change $\|x_n - \Pi(x_{n+1})\|_2$, and reduce or increase the step size sequences accordingly, while maintaining $a_n^- < a_n^+$.
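Before specifying the adjustment rules in detail, the core PROX-step selection is worth making concrete. The following Python sketch (ours, illustrative only; the callables and names are hypothetical) implements the generalized PROX-step above, with a comment marking where the adaptive adjustments of this section would act:

```python
import numpy as np

def prox_step_sa(grad_est, project, x1, a_minus, a_plus, N):
    """Generalized PROX-step SA (sketch).

    grad_est(x) -> (g_hat, f_n): gradient estimate and the average of the
                   sample performances generated at this iteration.
    project(x)  -> projection onto the feasible set Theta.
    a_minus(n), a_plus(n): the two deterministic step-size sequences, a- < a+.
    """
    x, f_running = np.asarray(x1, dtype=float), 0.0
    for n in range(1, N + 1):
        g_hat, f_n = grad_est(x)
        f_running = (n - 1) / n * f_running + f_n / n      # Step 3: running mean
        # Step 4: perceived proximity -- below the running mean means "closer"
        a_n = a_minus(n) if f_n < f_running else a_plus(n)
        # (the adaptive PROX-step would rescale a_minus / a_plus here, based on
        #  alternation, gradient sign changes, and projection violations)
        x = project(x - a_n * g_hat)                       # Step 5: projected step
    return x
```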
Step size adjustments can be done in two ways:

• scaling;
• Kesten's rule.

Scaling decreases or increases the step size by dividing or multiplying by the scale factor, respectively, while Kesten's rule keeps the step size constant. Additional input parameters must be chosen:

• $\alpha > 1$, the scale factor;
• $\{d_n\}$, the projection violation threshold;
• $\{e_n\}$, the no-movement threshold;
• $n_L$, the number of allowable scalings.

The asymptotic theory still applies, since the scaling and shifting occur only finitely many times.

Adaptive PROX-step
• Step 0. Input: $x_1 \in \Theta$, $\{a_n^+\}$, $\{a_n^-\}$, $\{d_n\}$, $\{e_n\}$, $\alpha$, $n_L$, $N$.
• Step 1. Initialize $\bar f_0 = 0$, $n = 1$, $n_S = 0$.
• Step 2. Generate $n_d$ sample performances $f_n^{(i)}$, $i = 1, \ldots, n_d$, to compute $\widehat{\nabla} f(x_n)$.
• Step 3. Update the mean objective value $\bar f_n = \frac{n-1}{n}\bar f_{n-1} + \frac{1}{n} f_n$.
• Step 4. Set $a_n = a_n^-$ if $f_n < \bar f_n$, and $a_n = a_n^+$ if $f_n > \bar f_n$.
• Step 5. Set $a[n] = -1$ if $f_n < \bar f_n$, and $a[n] = 1$ if $f_n > \bar f_n$.
• Step 6. Update $x_{n+1} = \Pi_\Theta\left(x_n - a_n \widehat{\nabla} f(x_n)\right)$.
• Step 7. Modify the step size sequences.
  - Scaling down.
    * If $n_S < n_L$ and $\|x_{n+1} - \Pi(x_{n+1})\|_2 > d_n$, then:
      · if $a_n = a_n^+$, set $\{a_n^+\} = \{a_n^+/\alpha\}$ and $n_S = n_S + 1$;
      · if $a_n = a_n^-$, set $\{a_n^-\} = \{a_n^-/\alpha\}$ and $n_S = n_S + 1$.
    * If $n > 3$, $n_S < n_L$, $a[n]a[n-1] < 0$, $a[n-1]a[n-2] < 0$, and $a[n-2]a[n-3] < 0$, then set $\{a_n^-\} = \{a_n^-/\alpha\}$ and $n_S = n_S + 1$.
  - Scaling up.
    * If $n_S < n_L$ and $0 < \|x_n - \Pi(x_{n+1})\|_2 < e_n$, then:
      · if $a_n = a_n^+$, set $\{a_n^+\} = \{\alpha a_n^+\}$ and $n_S = n_S + 1$;
      · if $a_n = a_n^-$, set $\{a_n^-\} = \{\alpha a_n^-\}$ and $n_S = n_S + 1$.
  - Kesten.
    * If $\sum_{i=1}^d I\{\widehat{\nabla} f_i(x_n) \cdot \widehat{\nabla} f_i(x_{n+1}) > 0\} > d/2$, then $\{a_n^+\} = \{a_{n-1}^+\}$ and $\{a_n^-\} = \{a_{n-1}^-\}$.
• Step 8. If $n < N$, set $n = n + 1$ and go to Step 2.
• Step 9. Output: $x^*_N = x_N$.

Unfortunately, the input parameters must be chosen by the user, and they can have a significant effect on performance. In our numerical experiments, we choose the parameters conservatively, so that the original two sequences do not change dramatically from one iteration to the next.

4.6 Numerical Experiments

We investigate the performance of the PROX-step and aPROX-step on two contrasting functions with added noise, as well as on the queueing network example described in Section 3.4.

4.6.1 Deterministic Problem with Added Noise

For the one-dimensional case, we consider two deterministic functions: 1) an exponential that is very steep near the optimum with very flat tails, and 2) a simple quadratic. We compare the standard decreasing step size, Kesten's rule, and the PROX-step and adaptive PROX-step applied to RM. For the fixed decreasing step sizes, we look at three different step size parameters taken from the PROX-step (Big: $\{a_n^+\}$; Small: $\{a_n^-\}$; Average: $\{(a_n^+ + a_n^-)/2\}$). We apply Kesten's rule to the average step size sequence $\{(a_n^+ + a_n^-)/2\}$, and use PROX and aPROX with the parameter $d_n = 10$ for all $n$. The initial value $x_1 = 10$ and feasible region $\Theta = [-30, 70]$ were arbitrarily chosen. We use a stopping time of $N = 50$, 20 macroreplications, and two step size settings for $a_n^{\pm}$.

To demonstrate the robustness of PROX and aPROX, we fix the smaller sequence $\{a_n^-\}$, vary the larger sequence $\{a_n^+\}$, and plot the MSE of the estimate $x_N$ and of $f(x_N)$ for both $f(x) = 100e^{-0.006x^2}$ and $f(x) = -x^2$. Figure 4.7 presents results for the exponential case with $a_n^- = 10/n$ and $a_n^+ = 5(k+1)/n$, where $k$ is the x-axis; Figure 4.7a is the full plot and Figure 4.7b is a magnified version.
Clearly, Kesten's rule applied to the average step size $\{(a_n^+ + a_n^-)/2\}$ and the larger step size $\{a_n^+\}$ perform poorly as $a_n^+$ increases: their MSEs of the estimate $x_N$ are much higher than the rest in the left panel, and their function values $f(x_N)$ are much lower in the right panel (higher is better, since we are maximizing). The performance of the fixed small step size $\{a_n^-\}$, represented by horizontal lines since it is independent of $a_n^+$, is better than Big and KestenAvg, but both PROX and aPROX appear to outperform it in almost all cases.

Figure 4.7: $f(x) = 100e^{-0.006x^2}$, $\Theta = [-30, 70]$, $x_1 = 10$, $a_n^- = 10/n$, $a_n^+ = 5(k+1)/n$, $\sigma_f = \sigma_g = 0.1$, $\alpha = 1.025$, $n_L = 20$, $N = 50$, macroreplications = 20; (a) MSE of $x_N$ and $f(x_N)$, (b) magnified version.

Figure 4.8: $f(x) = 100e^{-0.006x^2}$, $\Theta = [-30, 70]$, $x_1 = 10$, $a_n^- = 1/n$, $a_n^+ = 2k/n$, $\sigma_f = \sigma_g = 0.1$, $\alpha = 1.025$, $n_L = 20$, $N = 50$, macroreplications = 20; (a) MSE of $x_N$ and $f(x_N)$, (b) magnified version.

Figure 4.9: MSE of $x_N$ and $f(x_N)$, $f(x) = 100e^{-0.006x^2}$, $\Theta = [-30, 70]$, $x_1 = 10$, $\sigma_f = \sigma_g = 0.1$, $\alpha = 1.5$, $n_L = 50$, $N = 50$, macroreplications = 20; panels (a) $a_n^- = 1/n$, $a_n^+ = 10/n$ and (b) $a_n^- = 10/n$, $a_n^+ = 50/n$.

Figure 4.7b zooms in on the better performances for a clearer comparison. Figure 4.8 presents results for a similar setting except for the step sizes, $a_n^- = 1/n$ and $a_n^+ = 2k/n$. With the reduced step sizes, the same two choices (Big and KestenAvg) perform poorly, but the average step size sequence also joins that group. From Figure 4.8b, aPROX outperforms all algorithms over the range of step sizes $\{a_n^+\}$, and the PROX results are not far behind.
We also plot the MSE of $x_N$ and the function value $f(x_N)$ as a function of the iteration in Figure 4.9 for two different step size settings.

Figure 4.10: MSE of $x_N$ and $f(x_N)$, $f(x) = 100e^{-0.006x^2}$, $\alpha = 2$, $\Theta = [-30, 70]$, $x_1 = 10$, $\sigma_f = \sigma_g = 0.1$, $n_L = 20$, $N = 50$, macroreplications = 20; panels (a) $a_n^- = 0.1/n$, (b) $a_n^- = 1/n$, (c) $a_n^- = 10/n$, with $a_n^+ = 50/n$ throughout.

Figure 4.11: MSE of $x_N$ and $f(x_N)$, $f(x) = -x^2$, $\Theta = [-30, 70]$, $x_1 = 10$, $\sigma_f = 0.1$, $\sigma_g = 0.1$, $n_L = 20$, macroreplications = 20; panels (a) $N = 10$, $a_n^- = 1/n$, $a_n^+ = 2k/n$ and (b) $N = 50$, $a_n^- = 10/n$, $a_n^+ = 5(k+1)/n$.

Figures 4.9a and 4.9b illustrate the results for the exponential function with the smaller and larger step sizes, respectively. For the smaller step sizes (i.e., $a_n^- = 1/n$ and $a_n^+ = 10/n$), all of the algorithms except Big appear to approach the optimum, but PROX, aPROX, and Small do so earlier, followed by KestenAvg and then Avg. With the larger step sizes, KestenAvg and Big perform very poorly, while the others make progress as the number of iterations increases. From Figure 4.9b, PROX appears to outperform all algorithms, followed by aPROX. Moreover, Figure 4.10 plots the MSE of $x_N$ and $f(x_N)$ for a similar setting, but with the scale factor $\alpha$ increased from 1.5 to 2. In all three subfigures $a_n^+ = 50/n$, while the smaller step size $a_n^-$ increases from $0.1/n$ to $1/n$ to $10/n$. In Figures 4.10a and 4.10c, aPROX performs best, followed by PROX, and both outperform the rest. These figures show that PROX has promise and that aPROX can improve on PROX in certain cases; the performances of PROX and aPROX in Figure 4.10b are almost identical.

Figure 4.12: MSE of $x_N$ and $f(x_N)$, $f(x) = -x^2$, $N = 50$, $x_1 = 10$, $\sigma_f = 0.1$, $\sigma_g = 1.0$, $n_L = 50$, $a_n^- = 1/n$, $a_n^+ = 20/n$.

We also test the adaptive step sizes on the simple quadratic function $f(x) = -x^2$ for the same two step size settings. The results are similar to those in the exponential case, as can be seen in Figures 4.11a, 4.11b, and 4.12. These results show that PROX and aPROX can be applied to both well-behaved and ill-behaved functions.
In all of the cases, PROX and aPROX perform similarly, which is not surprising, since we used conservative parameter settings.

4.6.2 9-station Closed Jackson Queueing Network

We empirically test the PROX-step and aPROX-step on the closed queueing network problem from Section 3.4 with identical settings, except for the step size parameters. Figures 4.13a-4.13c and 4.14a-4.14c illustrate the MSE of $x_N$ as well as the throughput $TP(x_N)$ for six different step size parameter settings. In all of the cases, PROX and aPROX are among the top two or three algorithms at the stopping time. Figure 4.14c depicts an extreme case where the two step sizes are either extremely small or extremely large, yet PROX and aPROX still outperform the other algorithms.

4.7 Summary and Future Work

We investigated the sensitivity of SA algorithms to step size sequences, as well as the tightness of a finite-time MSE bound. In addition, we introduced the first adaptive step size algorithm based on two deterministic sequences, from which the current step size is selected according to the current and past sample performances. We also proposed an adaptive method to adjust the two deterministic sequences based on current and past events. We empirically demonstrated the promise of these methods on stylized problems and a queueing network. Unfortunately, the adaptive method for modifying the two initial sequences relies on user-chosen parameters, so in the future we would like to find ways to select these parameters automatically. In addition, this step size could be applied to other algorithms and numerical examples.

Figure 4.13: 9-station closed Jackson network, MSE of $x_N$ and throughput $TP(x_N)$ versus iteration: (a) $a_n^- = 0.1/n$, $a_n^+ = 10/n$, $\alpha = 1.5$, $d_n = 5$; (b) $a_n^- = 0.01/n$, $a_n^+ = 20/n$, $\alpha = 1.5$, $d_n = 5$; (c) $a_n^- = 10/n$, $a_n^+ = 20/n$, $\alpha = 1.5$, $d_n = 5$.

Figure 4.14: 9-station closed Jackson network, MSE of $x_N$ and throughput $TP(x_N)$ versus iteration: (a) $a_n^- = 10/n$, $a_n^+ = 50/n$, $\alpha = 1.5$, $d_n = 5$; (b) $a_n^- = 0.1/n$, $a_n^+ = 10/n$, $\alpha = 2$, $n_L = 20$; (c) $a_n^- = 0.001/n$, $a_n^+ = 50/n$, $\alpha = 2$, $n_L = 20$.

Chapter 5

Greek Kernel Estimators

5.1 Introduction

Financial derivatives have become an integral part of risk management.
Therefore, assessing and quantifying the risk of these assets has sparked great interest in the finance industry. The Greeks are derivatives of (financial) derivative prices with respect to underlying market parameters, and each Greek measures a different dimension of market risk. For instance, "delta," "vega," and "theta" are first-order derivatives of the expected payoff with respect to price, volatility, and time, respectively, and "gamma" is the second-order derivative with respect to price.

One type of financial derivative used for hedging is an option, which gives the holder the right to exercise once certain conditions, depending on the option type, are satisfied. For example, the exercise condition for an Asian digital option depends on the average price over a pre-specified period, and it pays a lump sum if the option is "in the money" and zero otherwise. An up-and-out barrier call option can be exercised once the price reaches the strike price, provided it never crosses the upper barrier. Both of these options have discontinuous payoff functions, and the option price can be written as
$$E\big[g(S) \cdot 1\{h(S) \ge 0\}\big], \qquad (5.1)$$
where the expectation is taken under the risk-neutral measure, $S$ is the price modeled by a stochastic process, $1\{\cdot\}$ is the indicator function, and $g(S)$ is the discounted payoff when the exercise condition, $h(S) \ge 0$, is satisfied. The price $S$ depends on the market parameters $\theta \in \Theta$, where $\Theta$ is an open set. For simplicity, assume $\theta$ is one-dimensional, since we can focus on one market parameter and hold all else constant. The derivative of the option price (5.1) with respect to $\theta$ is a first-order Greek, and the derivative of the first-order Greek with respect to $\theta$ is a second-order Greek, both of which we wish to estimate. Some well-known derivative estimation methods include finite difference approximation, perturbation analysis (also known as the pathwise method), the score function or likelihood ratio (SF/LR) method, and the weak derivative (WD) method. Each method has its advantages and drawbacks.

The most straightforward approach to estimating Greeks is the finite difference method. Although this method is easy to implement, the resulting estimators are biased and generally have a high mean-squared error (MSE) [33]. In addition, a perturbation sequence $\{\Delta\theta\}$ must be selected, which affects the bias/variance tradeoff, although symmetric differences can be used to reduce the order of the bias [24]; moreover, smaller perturbations can lead to more noise because of the stochasticity. The pathwise method, generally known as infinitesimal perturbation analysis (IPA) in the simulation community, is easy to implement and often performs well in comparison with other methods, but unfortunately it is not always applicable; for instance, it is suitable for neither discontinuous payoffs nor second-order Greeks [43]. Smoothed perturbation analysis (SPA) is able to overcome the discontinuity issue, and [28] suggests using SPA to smooth out the discontinuity by conditioning on appropriate random variables (e.g., [61]); the key, and the most difficult part of this method, is choosing the random variable on which to condition. By contrast, the weak derivative (WD) method is applicable to discontinuous functions, but may require a large number of simulations [24]. The likelihood ratio method can also approximate the gradient of discontinuous functions, but usually leads to high variance.
For a comprehensive review of gradient estimation methods, refer to [33], [23], and [4].

[43] introduces a modified pathwise method to estimate first- and second-order Greeks for options with discontinuous payoffs. It is a generalization of the classical pathwise method, extended to options with discontinuous payoffs (5.1) by rewriting the Greek as the sum of an expectation and derivative(s) with respect to an auxiliary parameter; the general pathwise method and a kernel method are applied to the expectation and derivative terms, respectively. The accuracy of the kernel estimator hinges on the chosen kernel function, which relies on a bandwidth (smoothing) parameter, described in more detail in Section 5.3.3. A pilot simulation proposed in [43] generates an "optimal" bandwidth parameter prior to applying the modified pathwise method; the pilot simulation involves various input parameters, which affect the bandwidth it outputs. The modified pathwise method has been applied to estimate delta, vega, theta, and gamma of the Asian digital option and the up-and-out barrier call option.

In this chapter, we examine the accuracy of the Greeks estimated using the modified pathwise gradient estimation method for an Asian digital option and an up-and-out barrier call option by conducting two sets of numerical experiments. The first set investigates the sensitivity of the Greek estimators to the bandwidth parameter(s), and the second explores the sensitivity of the proposed method used to generate "optimal" bandwidths for the modified pathwise method to its various input parameters. Our numerical results show that the Greek estimators are quite sensitive to the bandwidth, so the performance of the pilot simulation is critical.

5.2 Problem Setting

We focus on options with a discontinuous payoff function of the form (5.1), where $g(\cdot)$ and $h(\cdot)$ are differentiable functions and $S$ is a vector of discretized prices. Let $S_{t_i}$ denote the price of the security at time $t_i \ge 0$, where $t_i = iT/k$ for $i = 0, 1, 2, \ldots, k$ are evenly spaced time points between 0 and $T$. For simplicity, denote $S_{t_i}$ by $S_i$, and let $S = (S_0, S_1, \ldots, S_k)'$. Let $p(\theta) = E[f(S)]$, where $f(S) = g(S)\cdot 1\{h(S) \ge 0\}$, $p'(\theta) = \partial p(\theta)/\partial\theta$ is the first-order Greek with respect to $\theta$, and $p''(\theta) = \partial^2 p(\theta)/\partial\theta^2$ is the second-order Greek. If $\theta = S_0$, $\theta = \sigma$, or $\theta = \tau$, then $p'(\theta)$ is called delta, vega, or theta, respectively, and if $\theta = S_0$, then $p''(\theta)$ is called gamma.

We explore Asian digital options and up-and-out barrier call options, both of which have discontinuous payoffs. For the Asian digital option, the discounted payoff is $g(S) = e^{-rT}$, where $r$ is the risk-free rate and $T$ is the time horizon, and the exercise criterion is based on the average of the $k$ discretized prices, $\bar S = \frac{1}{k}\sum_{i=1}^k S_i$, exceeding a pre-specified threshold $K$ over the predetermined period, i.e., whether or not $h(S) \triangleq \bar S - K \ge 0$. The up-and-out barrier call option can be exercised if the final price $S_k$ is greater than or equal to the strike price $K$ while the maximum price $S_{\max} = \max\{S_1, \ldots, S_k\}$ never exceeds an upper barrier $U$, i.e., whether or not $h(S) = \min\{S_k - K,\ U - S_{\max}\} \ge 0$. The payoff is the amount by which the final price $S_k$ exceeds the strike price $K$, discounted by $e^{-rT}$, i.e., $g(S) = e^{-rT}(S_k - K)$.
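The two exercise conditions and payoffs translate directly into code. The following Python sketch (ours, for illustration; it assumes the price path $S = (S_0, S_1, \ldots, S_k)$ is stored as a NumPy array, and the function names are hypothetical) returns the discounted payoff and the exercise quantity $h(S)$ for each option:

```python
import numpy as np

def asian_digital_payoff(S, K, r, T):
    """g and h for the Asian digital option: pays e^{-rT} when the average
    of the discretized prices clears the threshold K (h >= 0)."""
    h = S[1:].mean() - K          # S[0] is the initial price S_0
    g = np.exp(-r * T)
    return g * (h >= 0.0), h

def up_and_out_call_payoff(S, K, U, r, T):
    """g and h for the up-and-out barrier call: exercised when the final
    price exceeds K and the path never crosses the barrier U."""
    h = min(S[-1] - K, U - S[1:].max())
    g = np.exp(-r * T) * (S[-1] - K)
    return g * (h >= 0.0), h
```

In both cases the discontinuity of the payoff sits entirely in the indicator on $h$, which is what the kernel smoothing of Section 5.3 targets.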
In the numerical experiments, the prices for the Asian digital option and the up-and-out barrier call option follow an Ornstein-Uhlenbeck process and a geometric Brownian motion, generated, respectively, using

\[ S_i = S_{i-1} e^{-b\tau} + \mu\left(1 - e^{-b\tau}\right) + \sigma\sqrt{\frac{1 - e^{-2b\tau}}{2b}}\, Z_i \quad \text{for } i = 1, 2, \ldots, k, \]
\[ S_t = S_0\, e^{(r - \sigma^2/2)t + \sigma B_t} \quad \text{for } t = T/k,\, 2T/k,\, \ldots,\, T, \]

where b, σ, µ, r, k, and τ are the mean reversion rate, volatility, mean return, risk-free interest rate, number of discretized intervals, and incremental time step (τ = T/k), respectively. The Z_i are independent standard normal random variables, and B_t is a standard Brownian motion. The parameter values considered are listed in Table 5.1.

5.3 Generalized Pathwise Method

The pathwise method is in general applicable neither to second-order Greeks nor to discontinuous payoff functions, but the modified kernel version circumvents both issues.

5.3.1 First-Order Greeks

The first-order Greek estimator of the modified pathwise method rests on the following theoretical result from [43].

Assumption 1. For any θ ∈ Θ, g(S) and h(S) are differentiable with respect to θ with probability 1 (w.p.1), and there exist random variables K_g and K_h with finite second moments, which may depend on θ, such that |g(S(θ + Δθ)) − g(S(θ))| ≤ K_g|Δθ| and |h(S(θ + Δθ)) − h(S(θ))| ≤ K_h|Δθ| when |Δθ| is sufficiently small.

Assumption 2. For any θ ∈ Θ, ∂_θ φ(θ, y) exists and is continuous at (θ, 0), where φ(θ, y) = E[g(S) · 1_{{h(S) ≤ y}}].

Theorem 5.3.1. Suppose g(·) and h(·) satisfy Assumptions 1 and 2, and E[|g(S)|²] < +∞ and E[|h(S)|²] < +∞. Then

\[ p'(\theta) = E[\partial_\theta g(S) \cdot 1_{\{h(S) \ge 0\}}] - \partial_y E[g(S)\, \partial_\theta h(S) \cdot 1_{\{h(S) \le y\}}]\big|_{y=0}. \tag{5.2} \]

The two terms on the right-hand side of (5.2) can be estimated easily, because the first term is an ordinary expectation, where g(S) is differentiable with respect to θ, and the second term is a derivative with respect to y, which is independent of g(S) and h(S). Thus, the first term can be estimated using the law of large numbers, and finite differences can be used to estimate the second term based on

\[
\partial_y E[g(S)\,\partial_\theta h(S) \cdot 1_{\{h(S) \le y\}}]\big|_{y=0}
= \lim_{\delta \to 0} \frac{1}{\delta}\left\{ E\!\left[g(S)\,\partial_\theta h(S) \cdot 1_{\{h(S) \le \delta/2\}}\right] - E\!\left[g(S)\,\partial_\theta h(S) \cdot 1_{\{h(S) \le -\delta/2\}}\right] \right\}
\]
\[
= \lim_{\delta \to 0} \frac{1}{\delta}\, E\!\left[g(S)\,\partial_\theta h(S) \cdot 1_{\{-\delta/2 \le h(S) \le \delta/2\}}\right]
= \lim_{\delta \to 0} \frac{1}{\delta}\, E\!\left[g(S)\,\partial_\theta h(S) \cdot Z\!\left(\frac{h(S)}{\delta}\right)\right],
\]

where Z(u) = 1_{{−1/2 ≤ u ≤ 1/2}} is known as the uniform or naive kernel, because it is the U(−1/2, 1/2) density. It can be replaced with any other kernel, which we elaborate on in Section 5.3.3. In practice, Assumptions 1 and 2 are typically satisfied. Refer to [43] for the proof and a discussion of the assumptions.
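The two-term structure of (5.2) translates directly into a sample-path estimator (anticipating G_n of Section 5.3.4). Below is a minimal sketch, assuming arrays of simulated values and pathwise derivatives are available; all names are illustrative.

```python
import numpy as np

def uniform_kernel(u):
    """Naive kernel Z(u) = 1{-1/2 <= u <= 1/2}, the U(-1/2, 1/2) density."""
    u = np.asarray(u)
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def first_order_greek(g, dg, h, dh, delta_n, kernel=uniform_kernel):
    """Estimate p'(theta) from n i.i.d. observations of
    (g(S), dg/dtheta, h(S), dh/dtheta): a sample mean for the first term
    of (5.2) and a kernel-smoothed finite difference in y for the second."""
    g, dg, h, dh = map(np.asarray, (g, dg, h, dh))
    term1 = np.mean(dg * (h >= 0.0))
    term2 = np.mean(g * dh * kernel(h / delta_n)) / delta_n
    return term1 - term2
```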
5.3.2 Second-Order Greeks

The second-order Greeks are estimated using a theorem from [43] based on the following two assumptions.

Assumption 3. For any (θ₁, θ₂) ∈ Θ, g(S) and h(S) are differentiable with respect to θ₁ with probability 1 (w.p.1), ∂_{θ₁}g(S) and ∂_{θ₁}h(S) are differentiable with respect to θ₂ w.p.1, and there exist random variables K_g, K_h, L_g, and L_c with finite fourth moments, which may depend on (θ₁, θ₂), such that |g(S(θ₁ + Δθ₁, θ₂)) − g(S(θ₁, θ₂))| ≤ K_g|Δθ₁|, |g(S(θ₁, θ₂ + Δθ₂)) − g(S(θ₁, θ₂))| ≤ K_g|Δθ₂|, |h(S(θ₁ + Δθ₁, θ₂)) − h(S(θ₁, θ₂))| ≤ K_h|Δθ₁|, |h(S(θ₁, θ₂ + Δθ₂)) − h(S(θ₁, θ₂))| ≤ K_h|Δθ₂|, |∂_{θ₁}g(S(θ₁, θ₂ + Δθ₂)) − ∂_{θ₁}g(S(θ₁, θ₂))| ≤ L_g|Δθ₂|, and |g(S(θ₁, θ₂ + Δθ₂)) ∂_{θ₁}h(S(θ₁, θ₂ + Δθ₂)) − g(S(θ₁, θ₂)) ∂_{θ₁}h(S(θ₁, θ₂))| ≤ L_c|Δθ₂|, when |Δθ₁| and |Δθ₂| are sufficiently small.

Assumption 4. For any (θ₁, θ₂) ∈ Θ, ∂_{θ₁}∂_{θ₂}φ(θ₁, θ₂, y) exists and is continuous at (θ₁, θ₂, 0), where φ(θ₁, θ₂, y) = E[g(S) · 1_{{h(S) ≤ y}}].

Theorem 5.3.2. Suppose g(·) and h(·) satisfy Assumptions 3 and 4, and E[|g(S)|⁴] < +∞ and E[|h(S)|⁴] < +∞. Then

\[
\partial_{\theta_1}\partial_{\theta_2} p(\theta_1, \theta_2) = E[\partial_{\theta_1}\partial_{\theta_2} g(S) \cdot 1_{\{h(S) \ge 0\}}]
- \partial_y E\!\left[\left(g(S)\,\partial_{\theta_1}\partial_{\theta_2} h(S) + \partial_{\theta_1} g(S)\,\partial_{\theta_2} h(S) + \partial_{\theta_2} g(S)\,\partial_{\theta_1} h(S)\right) \cdot 1_{\{h(S) \le y\}}\right]\big|_{y=0}
+ \partial_y^2 E[g(S)\,\partial_{\theta_1} h(S)\,\partial_{\theta_2} h(S) \cdot 1_{\{h(S) \le y\}}]\big|_{y=0}.
\]

We focus on the second-order Greek gamma, where θ = S₀; therefore,

\[
p''(\theta) = E[\partial_\theta^2 g(S) \cdot 1_{\{h(S) \ge 0\}}]
- \partial_y E\!\left[\left(g(S)\,\partial_\theta^2 h(S) + 2\,\partial_\theta g(S)\,\partial_\theta h(S)\right) \cdot 1_{\{h(S) \le y\}}\right]\big|_{y=0}
+ \partial_y^2 E[g(S)\,(\partial_\theta h(S))^2 \cdot 1_{\{h(S) \le y\}}]\big|_{y=0}.
\]

Again, the first term on the right-hand side can be estimated using a sample mean. The finite difference derivation for the second term is similar to that of the first-order case, and for the finite difference representation of the third term, observe that

\[
\partial_y E[g(S)(\partial_\theta h(S))^2 \cdot 1_{\{h(S) \le y\}}]
= \lim_{\delta \to 0} \frac{1}{\delta}\, E\!\left[g(S)(\partial_\theta h(S))^2 \cdot Z\!\left(\frac{h(S) - y}{\delta}\right)\right],
\]

and

\[
\partial_y^2 E[g(S)(\partial_\theta h(S))^2 \cdot 1_{\{h(S) \le y\}}]\big|_{y=0}
= \lim_{\delta \to 0} \frac{1}{\delta^2}\, E\!\left[g(S)(\partial_\theta h(S))^2 \cdot Z'\!\left(\frac{h(S)}{\delta}\right)\right], \tag{5.3}
\]

where Z(·) is the kernel function. The approximation of the second-order auxiliary term (5.3), along with that of the first-order term in (5.2), appears in the estimator (5.5) below.

5.3.3 Kernel

A d-dimensional kernel, K : R^d → R, is a bounded symmetric density with respect to the Lebesgue measure such that lim_{‖u‖→∞} ‖u‖^d K(u) = 0 and ∫_{R^d} ‖u‖² K(u) du < ∞, where ‖·‖ is any norm on R^d [6]. The associated scaled kernel has the form

\[ K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right), \quad u \in \mathbb{R}, \]

where h is the bandwidth, also known as the smoothing parameter [6]. Here, we focus on the case d = 1. According to [6], any reasonable kernel leaves the asymptotic behavior of the estimator unaffected; however, the bandwidth significantly influences its accuracy. Large bandwidths decrease the variance but increase the bias, whereas small bandwidths do the exact opposite; hence, there is a tradeoff between variance and bias, and careful attention should be given to the selection process. Although there are other ways to quantify the accuracy of the kernel estimator, [43] measure it by the MSE and attempt to select the bandwidth that minimizes the asymptotic MSE using an iterative pilot simulation.

5.3.4 First- and Second-Order Greek Estimators

Based on Theorems 5.3.1 and 5.3.2, the following estimators are used to approximate the first- and second-order Greeks, respectively:

\[ G_n = \frac{1}{n}\sum_{l=1}^{n} g'_l \cdot 1_{\{h_l \ge 0\}} - \frac{1}{n\delta_n}\sum_{l=1}^{n} g_l\, h'_l \cdot Z\!\left(\frac{h_l}{\delta_n}\right), \tag{5.4} \]

\[ H_n = \frac{1}{n}\sum_{l=1}^{n} g''_l \cdot 1_{\{h_l \ge 0\}} - \frac{1}{n\delta_n}\sum_{l=1}^{n} \left\{g_l\, h''_l + 2\, g'_l\, h'_l\right\} \cdot Z\!\left(\frac{h_l}{\delta_n}\right) + \frac{1}{n\lambda_n^2}\sum_{l=1}^{n} g_l\, (h'_l)^2 \cdot Z'\!\left(\frac{h_l}{\lambda_n}\right), \tag{5.5} \]

where (g_l, h_l, g′_l, h′_l, g″_l, h″_l) is the l-th observation of (g(S), h(S), ∂_θg(S), ∂_θh(S), ∂²_θg(S), ∂²_θh(S)), Z(·) is a kernel, and δ_n and λ_n are constant bandwidth parameters generated using a pilot simulation. The second-order estimator (5.5) is the special case in which both derivatives are taken with respect to the same parameter.
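Since Z′ in (5.5) requires a differentiable kernel, a smooth choice such as the standard normal density (also used in Section 5.5) is convenient. A hedged sketch of (5.5) for gamma, with illustrative names:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density; smooth, so its derivative Z'(u) exists."""
    return np.exp(-0.5 * np.asarray(u) ** 2) / np.sqrt(2.0 * np.pi)

def gaussian_kernel_deriv(u):
    """Z'(u) for the standard normal density."""
    u = np.asarray(u)
    return -u * gaussian_kernel(u)

def gamma_estimator(g, dg, d2g, h, dh, d2h, delta_n, lam_n):
    """Estimate a second-order Greek via (5.5) with theta_1 = theta_2
    (e.g., gamma), with bandwidths delta_n and lambda_n."""
    g, dg, d2g, h, dh, d2h = map(np.asarray, (g, dg, d2g, h, dh, d2h))
    term1 = np.mean(d2g * (h >= 0.0))
    term2 = np.mean((g * d2h + 2.0 * dg * dh)
                    * gaussian_kernel(h / delta_n)) / delta_n
    term3 = np.mean(g * dh ** 2
                    * gaussian_kernel_deriv(h / lam_n)) / lam_n ** 2
    return term1 - term2 + term3
```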
5.4 Pilot Simulation

Ideally, the optimal bandwidth parameter would minimize the error between the Greek estimator and its true value, and the pilot simulation proposed in [43] employs an iterative method that generates an "optimal" bandwidth minimizing the asymptotic mean-squared error (MSE). The MSE of the first-order estimator can be expressed as

\[
\mathrm{MSE}(G_n) = \left[E(G_n) - \partial_\theta E[g(S) \cdot 1_{\{h(S) \ge 0\}}]\right]^2 + \mathrm{Var}(G_n)
= \delta_n^4 \left[\frac{\psi''(0)}{2}\int_{-\infty}^{\infty} u^2 Z(u)\,du + \epsilon_n\right]^2 + \frac{1}{n\delta_n}\left(\sigma_1^2 + \xi_n\right),
\]

where ψ(u) = f_h(u) · E[g(S) ∂_θh(S) | h(S) = u], ψ₂(u) = f_h(u) · E[|g(S) ∂_θh(S)|² | h(S) = u], σ₁² = ψ₂(0) ∫_{−∞}^{∞} Z²(u) du, and f_h is the density of h(S). Since ε_n → 0 and ξ_n → 0 as n → ∞, the optimal bandwidth choice is c · n^{−1/5}, where

\[ c = \left[\frac{\sigma_1^2}{\left(\psi''(0)\int_{-\infty}^{\infty} u^2 Z(u)\,du\right)^2}\right]^{1/5}. \]

This constant c is estimated by

\[ \hat{c} = \left[\frac{\bar{V}_n}{\left(\hat{\psi}''(0)\int_{-\infty}^{\infty} u^2 Z(u)\,du\right)^2}\right]^{1/5}, \]

where V̄_n is the sample variance of G_n and ψ̂″(0) is the finite difference approximation of ψ″(0):

\[ \hat{\psi}''(0) = \frac{\bar{G}_n(s) + \bar{G}_n(-s) - 2\bar{G}_n(0)}{s^2}, \]

where s is a sufficiently small step size and

\[ \bar{G}_n(u) = \frac{1}{n\delta_n}\sum_{l=1}^{n} g_l\, h'_l \cdot Z\!\left(\frac{h_l - u}{\delta_n}\right). \]

Therefore, the optimal bandwidth is approximated by δ*_n = ĉ · n^{−1/5}. The pilot simulation begins with ĉ = 1, which is updated after each iteration using the same sample. More specifically, at each iteration, V̄_n and ψ̂″(0) are recomputed using the updated δ*_n with n = 500. This process continues for a pre-specified number of iterations, and the bandwidth selected is the one generated after 30 pilot iterations, as in [43]. The pilot simulation algorithm does not specify the variance sample size n_v used to generate V̄_n, the step size s used to estimate ψ″(0), or the number of pilot iterations n_p, all of which must be predetermined. Refer to the e-companion of [43] for the algorithm details and for the pilot simulation for the second-order Greeks.
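The iteration above can be sketched as follows; this is a simplified, hedged reading of the procedure, not the exact algorithm of [43]. In particular, the interpretation of V̄_n below (δ_n times the sample variance of the kernel-term summands, as a stand-in for σ₁²) is one plausible reading of "the sample variance of G_n"; see the e-companion of [43] for the precise recipe.

```python
import numpy as np

def uniform_kernel(u):
    u = np.asarray(u)
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def pilot_bandwidth(g, dg, h, dh, step_s, n_pilot=30):
    """Sketch of the pilot iteration: starting from c_hat = 1, alternately
    set delta_n = c_hat * n**(-1/5) and re-estimate c_hat from the same
    sample of (g, dg/dtheta, h, dh/dtheta) observations."""
    g, dg, h, dh = map(np.asarray, (g, dg, h, dh))
    n = g.size
    int_u2Z = 1.0 / 12.0     # integral of u^2 Z(u) du for the uniform kernel
    c_hat = 1.0
    for _ in range(n_pilot):
        delta_n = c_hat * n ** (-0.2)

        def G_bar(u):
            # kernel term of G_n with the indicator shifted by u
            return np.mean(g * dh * uniform_kernel((h - u) / delta_n)) / delta_n

        # finite difference approximation of psi''(0) with step s
        psi2 = (G_bar(step_s) + G_bar(-step_s) - 2.0 * G_bar(0.0)) / step_s ** 2
        # assumed stand-in for sigma_1^2 (see lead-in above)
        summand = g * dh * uniform_kernel(h / delta_n) / delta_n
        v_bar = delta_n * np.var(summand, ddof=1)
        c_hat = (v_bar / (psi2 * int_u2Z) ** 2) ** 0.2
    return c_hat * n ** (-0.2)   # approximately optimal bandwidth
```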
5.5 Numerical Experiments

We conduct two sets of numerical experiments to test the robustness of the modified pathwise method and of the pilot simulation used to generate its bandwidth input [8]. The first is a sensitivity analysis of the Greek estimators for both the Asian digital option and the up-and-out barrier call option with respect to the bandwidth parameters, and the second explores the effect of the input parameters on the bandwidth generated by the pilot simulation. We consider settings identical to those in [43], shown in Table 5.1 and described in Section 5.2. However, the sample size considered here is 100, as opposed to 1000 in [43]. In our preliminary results, the larger sample size only tightens the confidence band and smooths out the curves, while the overall behavior is similar across estimators, so we opt for the smaller sample size. Furthermore, we also consider the standard normal density as the kernel to increase the robustness of the resulting estimator.

5.5.1 Sensitivity to Bandwidth

In our experiment for each Greek, we generate 100 estimators for incrementally increasing bandwidths and observe the effect on the relative root mean-squared error (RRMSE), i.e., the root MSE divided by the absolute value of the true Greek, and on the 95% confidence band (CB). Figures 5.1, 5.2, 5.3, and 5.4 plot the 95% confidence band for the Greek estimators generated using 1000 sample paths for the Asian Greeks, 10000 sample paths for the barrier Greeks, and a sample size of 100, for incrementally increasing ĉ. The black curve in between is the mean, and the horizontal dashed line represents the exact Greek value. The three vertical lines in Figures 5.1, 5.2, and 5.3 represent the 95% confidence interval for the ĉ generated using the pilot simulation with n_p = 30 and n_v = 100, which we elaborate on in the next section. In the same figures, the second row of graphs plots the RRMSE of the 100 generated Greeks for incrementally increasing ĉ, where the additional vertical dashed line denotes the c* that minimizes the RRMSE. The only difference when n = 1000 increases to n = 10000 for the Asian Greek estimators is the increased smoothness of the curves and tighter confidence bands; again, the overall behavior is the same. [43] reports results for n = 10³, 10⁴, 10⁵ for the Asian Greeks, but we only test the case n = 10³. Not surprisingly, the bandwidth affects the accuracy and precision of the Greek estimator.

Generally, as the bandwidth decreases, the variance increases, which can be seen in the widening of the 95% confidence band as ĉ → 0 and is observable in all of the estimators. Conversely, the 95% confidence band narrows as the bandwidth increases, but eventually the confidence band no longer contains the exact Greek value. Figures 5.2, 5.3, and 5.4, for the Asian vega, Asian gamma, and barrier theta estimators, respectively, illustrate this behavior. In addition, the minimum RRMSE of these estimates occurs in one particular region for all discretization levels. Figure 5.2 shows that the Asian vega estimator is rather close to the exact value for a range of ĉ values less than 0.1. The barrier theta estimator behaves similarly but over a different range of ĉ values, as seen in Figure 5.4. In contrast, the Asian gamma estimator is extremely sensitive to ĉ, and the RRMSE increases significantly with even slight deviations from the optimal bandwidth. Moreover, the 95% confidence interval (CI) for the bandwidth parameter is very narrow and does not contain the exact Asian gamma value.

The numerical results for the Asian delta are quite different, as illustrated in Figure 5.1. The mean of the delta estimator intersects the exact/analytical value at least twice for k = 10 and k = 50. For the case k = 10, the RRMSE has a minimum and a nearly flat region. Unlike for the Asian vega, Asian gamma, and barrier theta, the minimum of the RRMSE occurs at a different point for each discretization case, for a given number of sample paths n. The ĉ values at which the minima occur (c*), the resulting RRMSE, and the 95% confidence intervals of the Asian delta estimators generated using those c* are listed in Table 5.2, along with the bandwidths δ_n produced by the pilot simulation for each perturbation size s.

Table 5.2: Asian delta.

                                                                  δ_n for s =
   n      k    c*    RRMSE   lower   upper   range   δ*_n     0.1    0.01   0.001
   500    10   0.64  0.028   0.915   1.021   0.106   0.185    0.14   0.06   0.06
          20   0.46  0.038   0.975   1.132   0.156   0.133    0.15   0.05   0.05
          50   0.54  0.033   0.998   1.130   0.132   0.156    0.14   0.06   0.06
   1000   10   0.74  0.020   0.931   1.005   0.074   0.186    0.14   0.05   0.04
          20   0.54  0.028   0.996   1.104   0.108   0.136    0.13   0.05   0.05
          50   0.60  0.024   1.019   1.117   0.099   0.151    0.13   0.05   0.05
   2000   10   0.84  0.014   0.943   0.996   0.053   0.184    0.13   0.03   0.03
          20   0.56  0.021   1.012   1.097   0.084   0.122    0.13   0.03   0.02
          50   0.68  0.017   1.034   1.105   0.071   0.149    0.13   0.03   0.02
   5000   10   1.00  0.009   0.954   0.988   0.035   0.182    0.15   0.06   0.06
          20   0.66  0.014   1.027   1.083   0.055   0.120    0.16   0.07   0.07
          50   0.80  0.011   1.050   1.094   0.044   0.146    0.16   0.06   0.06
   10000  10   1.14  0.007   0.960   0.985   0.020   0.181    0.15   0.03   0.02
          20   0.72  0.011   1.036   1.076   0.040   0.114    0.12   0.03   0.02
          50   0.92  0.008   1.056   1.088   0.032   0.146    0.13   0.03   0.02

Figure 5.1: 95% confidence band for Asian delta (solid curves) for n = 1000, with 95% confidence interval for the bandwidth (vertical dashed lines) generated using s = 0.1, and RRMSE, for 100 sample paths.
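The bandwidth sweeps behind Table 5.2 and Figures 5.1-5.4 follow a simple pattern; a hedged sketch, where `sample_greek` is an assumed callable returning one Greek estimate from n fresh sample paths:

```python
import numpy as np

def bandwidth_sweep(sample_greek, exact, c_grid, n, reps=100, seed=0):
    """For each candidate c, form the bandwidth c * n**(-1/5), recompute
    the Greek estimator `reps` times, and record the mean, a normal-theory
    95% confidence band, and the RRMSE against the exact value."""
    results = []
    for c in c_grid:
        rng = np.random.default_rng(seed)
        est = np.array([sample_greek(c * n ** (-0.2), rng)
                        for _ in range(reps)])
        rrmse = np.sqrt(np.mean((est - exact) ** 2)) / abs(exact)
        half = 1.96 * est.std(ddof=1)    # half-width of the band
        results.append((c, est.mean(), est.mean() - half,
                        est.mean() + half, rrmse))
    return results
```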
Figure 5.2: 95% confidence band for Asian vega (solid curves) for n = 1000, with 95% confidence interval for the bandwidth (vertical dashed lines) generated using s = 0.001, and RRMSE, for 100 sample paths.

Figure 5.3: 95% confidence band for Asian gamma (solid curves) for n = 1000, with 95% confidence interval for the bandwidth (vertical dashed lines) generated using s = 0.01, and RRMSE, for 100 sample paths.

Another interesting case is the gamma estimator for the up-and-out barrier call option, which relies on two bandwidth parameters, δ_n and λ_n. The Asian gamma is also a second-order Greek, but the second term in (5.5), containing δ_n, vanishes under the given settings. Figure 5.5 plots the barrier gamma as a function of the bandwidth parameters δ_n (visible horizontal axis) and λ_n (not visible), where the vertical planes represent the approximately exact gamma for k = 20 and k = 50, respectively. It is clear that gamma is insensitive to δ_n but extremely sensitive to λ_n, especially when λ_n < 0.03. Unfortunately, this is the region in which the estimator attains the value closest to the true value. Since an analytical expression is not available for comparison, the true values are approximated using finite differences with a large sample of 10⁹. In this case, the steepness of the gamma estimator near the "exact" value calls for a very accurate and precise λ_n; otherwise, the estimate could be drastically different from its true value. As expected, the margin of error decreases as the number of sample paths n increases. For each discretization level k, ĉ increases with n, but the resulting δ_n values are similar for each k regardless of n.

Figure 5.4: 95% confidence band for barrier theta for n = 10000 and RRMSE for 100 sample paths.

The cases discussed in this section are representative of all of the estimators tested. The Asian delta is the only estimator whose mean matches the exact Greek value at two different bandwidth values, and the barrier gamma is the only estimator that requires two bandwidth parameters as input. The Asian vega, whose 95% confidence band narrows and whose mean decreases as ĉ increases, matching the exact Greek value at one particular ĉ, with a single global minimum of the RRMSE, represents the behavior of the remaining Greek estimators.
5.5.2 Sensitivity of Bandwidth Generator to Input Parameters

Our sensitivity analysis of the modified pathwise method applied to the Greeks concludes that it is, in general, very sensitive to the bandwidth parameter(s), so the key to a good estimator lies in the bandwidth selection process. Therefore, we investigate the sensitivity of the pilot simulation to its various input parameters, namely the variance sample size n_v, the perturbation size s, the number of sample paths n, and the number of pilot iterations n_p, as described in Section 5.4.

Figure 5.5: Barrier gamma estimator for n = 10000 and 100 sample paths: (a) k = 20; (b) k = 50.

We considered the following settings: n_v ∈ {50, 100, 500, 1000}, s ∈ {0.001, 0.01, 0.1}, n ∈ {500, 1000, 2000, 5000}, and n_p ∈ {1, ..., 100}. For each setting, we compute the 95% confidence interval for the bandwidth output and compare it against the optimal bandwidth that minimizes the RRMSE in Table 5.2. The pilot simulation experiments use a variance sample size of 100 (i.e., n_v = 100) and 30 pilot iterations (i.e., n_p = 30), unless stated otherwise. Over an extensive set of parameter settings, the results are quite similar, so we limit our discussion to representative cases.

First, we discuss the results of the pilot simulation for the Asian delta. For any fixed perturbation size, the average ĉ under the Asian delta estimator is quite similar across numbers of sample paths and discretization levels k, with a maximum range of 0.04. As n increases, the average ĉ decreases ever so slightly. One would expect the variance of ĉ to decrease with n, i.e., the confidence band to narrow as n grows; however, the perturbation size s appears to be the determining factor, as seen in Figure 5.6. In this case, the larger perturbation size s = 0.1 leads to wider confidence intervals, which is interesting, because smaller perturbations tend to generate noisier estimates. Therefore, it could be advantageous to consider an average or a window average, as opposed to selecting the ĉ generated after a fixed number of iterations (see the sketch following Figure 5.6). For instance, the average ĉ fluctuates, especially in the case s = 0.1, so further averaging over multiple pilot iterations at a later stage could generate a more stable parameter.

Figure 5.6: Asian delta pilot: 95% confidence interval for ĉ for k = 10.
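The window-averaging suggestion above is straightforward to prototype. A minimal hedged sketch, assuming an array `pilot_c_hats` of per-iteration ĉ values such as those produced by the pilot iteration; the function name is illustrative:

```python
import numpy as np

def window_averaged_c(pilot_c_hats, window=10):
    """Average the last `window` pilot iterates of c_hat instead of taking
    only the final one, damping the fluctuations visible in Figure 5.6."""
    pilot_c_hats = np.asarray(pilot_c_hats, dtype=float)
    return float(pilot_c_hats[-window:].mean())
```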
Another input parameter that was not specified is the sample size n_v used to generate the sample variance V̄_n, so we investigate the sensitivity of the bandwidth parameter to n_v for each of the perturbation sizes s. Again, we focus on the Asian delta case with discretization level k = 10 and n = 1000 sample paths, varying n_v ∈ {50, 100, 500, 1000} and s ∈ {0.001, 0.01, 0.1}. Figure 5.7 plots the generated ĉ against the number of pilot iterations n_p for a sample size of 100. The results of the sensitivity to n_v are inconclusive, since the behavior is inconsistent across different parameter settings. For this particular example, the larger perturbation size s = 0.1 leads to a higher bandwidth parameter, which can also be seen in Figure 5.6.

Figure 5.7: Asian delta pilot, n = 500, sample size = 100.

Clearly, the input parameters have a significant impact on the bandwidth parameter generated from the pilot simulation, and we investigate further whether or not the confidence interval of the bandwidth parameter captures the bandwidth that minimizes the RRMSE, for parameters n = 1000, n_v = 100, n_p = 30, sample size 100, and s = 0.1 for delta, s = 0.01 for gamma, and s = 0.001 for both the vega and theta Asian Greek estimators. Values for the perturbation parameters were not specified in [43] but were obtained from the authors via personal communication; the sizes were chosen relative to the parameter of interest. Our results vary significantly, from successfully capturing the optimal bandwidth to completely missing it. Figures 5.1, 5.2, and 5.3 contain vertical dotted lines representing the 95% confidence interval for ĉ under the parameters specified above; the dotted line denotes the mean, and the dashed line is the c* that minimizes the RRMSE from our first set of numerical experiments. None of the 95% confidence intervals of ĉ capture the minimum-RRMSE c* for the Asian delta Greek estimator; the CIs contain only the mean of the generated ĉ.
The confidence interval for the ĉ generated from the pilot simulation for the Asian vega does cover the minimum-RRMSE bandwidth, as can be seen in Figure 5.2.

5.6 Summary and Future Work

In this chapter, we analyzed the sensitivity of the Greek kernel estimators of [43] to the bandwidth parameters, as well as the sensitivity and performance of the pilot simulation that selects the "optimal" bandwidth parameters with respect to its input parameters. The simulation experiments show that the kernel estimators are quite sensitive to the bandwidth choice, and the pilot simulation is sensitive to its input parameters. Of the input parameters, the perturbation step size s has the most significant impact on the generated bandwidth. If s is too small relative to the parameter of interest, then the pilot simulation can generate bandwidths that lead to poor gradient estimates. However, even if s is chosen proportional to the parameter of interest, the 95% confidence interval for ĉ might not capture the c* that minimizes the RRMSE. The input parameters play a vital role in the performance of the estimator, so the input selection process of the pilot simulation and other bandwidth selection methods are important future research areas.

Bibliography

[1] S. Andradóttir. A new algorithm for stochastic approximation. Proceedings of the 1990 Winter Simulation Conference, pages 364–366, 1990.
[2] S. Andradóttir. A stochastic approximation algorithm with varying bounds. Operations Research, 43(6):1037–1048, 1995.
[3] S. Andradóttir. A scaled stochastic approximation algorithm. Management Science, 42(4):475–498, 1996.
[4] S. Asmussen and P. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer, New York, 2007.
[5] J. Blum. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, 25(4):737–744, 1954.
[6] D. Bosq. Nonparametric Statistics for Stochastic Processes. Springer, second edition, 1998.
[7] M. Broadie, D. Cicek, and A. Zeevi. General bounds and finite-time improvement for the Kiefer-Wolfowitz stochastic approximation algorithm. Operations Research, 59(5):1211–1224, 2011.
[8] M. Chau and M. Fu. On the sensitivity of Greek kernel estimators to bandwidth parameters. Proceedings of the 2014 Winter Simulation Conference, pages 3845–3856, 2014.
[9] M. Chau and M. Fu. An overview of stochastic approximation. In M. Fu, editor, Handbook of Simulation Optimization, chapter 6, pages 149–178. Springer, 2014.
[10] M. Chau, H. Qu, and M. Fu. A new hybrid stochastic approximation algorithm. In Proceedings of the 2014 Workshop on Discrete Event Systems, volume 12, pages 241–246, 2014.
[11] M. Chau, H. Qu, M. Fu, and I. Ryzhov. An empirical sensitivity analysis of the Kiefer-Wolfowitz algorithm and its variants. Proceedings of the 2013 Winter Simulation Conference, pages 945–965, 2013.
[12] H. Chen, T. Duncan, and B. Pasik-Duncan. A Kiefer-Wolfowitz algorithm with randomized differences. IEEE Transactions on Automatic Control, 44(3):442–453, 1999.
[13] H. Chen and Y. Zhu. Stochastic approximation procedure with randomly varying truncations. Scientia Sinica Series A, 29:914–926, 1986.
[14] K. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
[15] B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 3(4):868–881, 1993.
[16] C. Derman. An application of Chung's lemma to the Kiefer-Wolfowitz stochastic approximation procedure.
The Annals of Mathematical Statistics, 27(2):532–536, 1956.
[17] J. Dippon and J. Renz. Weighted means in stochastic approximation of minima. SIAM Journal on Control and Optimization, 35(5):1811–1827, 1997.
[18] V. Dupač. On the Kiefer-Wolfowitz approximation method. Časopis pro pěstování matematiky, 82(1):47–75, 1957.
[19] V. Fabian. Stochastic approximation of minima with improved asymptotic speed. The Annals of Mathematical Statistics, 38(1):191–200, 1967.
[20] V. Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1107–1380, 1968.
[21] V. Fabian. Stochastic approximation. In J. Rustagi, editor, Optimizing Methods in Statistics, pages 439–470. Academic Press, New York, 1971.
[22] E. Fogel. A fundamental approach to the convergence analysis of least squares algorithms. IEEE Transactions on Automatic Control, 26(3):646–655, 1981.
[23] M. Fu. Gradient estimation. In S. Henderson and B. Nelson, editors, Handbooks in Operations Research and Management Science: Simulation, chapter 19, pages 575–616. Elsevier, Amsterdam, 2006.
[24] M. Fu. What you should know about simulation and derivatives. Naval Research Logistics, 55(8):723–736, 2008.
[25] M. Fu. Stochastic gradient estimation. In M. Fu, editor, Handbook of Simulation Optimization, chapter 5, pages 105–147. Springer, 2014.
[26] M. Fu and S. Hill. Optimization of discrete event systems via simultaneous perturbation stochastic approximation. IIE Transactions, 29(3):233–243, 1997.
[27] M. Fu and Y. Ho. Using perturbation analysis for gradient estimation, averaging and updating in a stochastic approximation algorithm. Proceedings of the 1988 Winter Simulation Conference, pages 509–517, 1988.
[28] M. Fu and J. Hu. Conditional Monte Carlo: Gradient Estimation and Optimization Applications. Kluwer, Boston, MA, 1997.
[29] A. George and W. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167–198, 2006.
[30] L. Gerencsér. Convergence rates of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting. IEEE Transactions on Automatic Control, 44(5):894–905, 1998.
[31] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[32] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
[33] P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer, New York, NY, 2004.
[34] Y. Ho and X. Cao. Perturbation analysis and optimization of queueing networks. Journal of Optimization Theory and Applications, 40(4):550–582, 1983.
[35] Y. Ho, L. Shi, L. Dai, and W.-B. Gong. Optimizing discrete event dynamic systems via the gradient surface method. Discrete Event Dynamic Systems: Theory and Applications, 2(2):99–120, 1992.
[36] H. Kesten. Accelerated stochastic approximation. The Annals of Mathematical Statistics, 29(1):41–59, 1958.
[37] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[38] N. Kleinman, J. Spall, and D. Naiman. Simulation-based optimization with stochastic approximation using common random numbers.
Management Science, 45(11):1570–1578, 1999.
[39] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, New York, NY, 1987.
[40] H. Kushner and J. Yang. Stochastic approximation with averaging of the iterates: Optimal asymptotic rates of convergence for general processes. SIAM Journal on Control and Optimization, 31:1045–1062, 1993.
[41] H. Kushner and J. Yang. Stochastic approximation with averaging and feedback: Rapidly convergent 'on-line' algorithms. IEEE Transactions on Automatic Control, 40:24–34, 1995.
[42] H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York, NY, 2003.
[43] G. Liu and L. Hong. Kernel estimation of the Greeks for options with discontinuous payoffs. Operations Research, 59(1):96–108, 2011.
[44] J. Maryak. Some guidelines for using iterate averaging in stochastic approximation. Proceedings of the 36th IEEE Conference on Decision and Control, 3:2287–2290, 1997.
[45] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[46] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York, NY, 1983.
[47] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
[48] B. Polyak. New stochastic approximation type procedures. Avtomatika i Telemekhanika, 51:98–105, 1990.
[49] B. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[50] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
[51] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
[52] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, Ithaca, NY, 1988.
[53] J. Sacks. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):351–634, 1958.
[54] P. Sadegh. Constrained optimization via stochastic approximation with a simultaneous perturbation gradient approximation. Automatica, 33(5):889–892, 1997.
[55] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
[56] J. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10):1839–1853, 2000.
[57] J. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley, Hoboken, NJ, 2003.
[58] A. Tsybakov and B. Polyak. Optimal order of accuracy of search algorithms in stochastic optimization. Problemy Peredachi Informatsii, 26(2):45–63, 1990.
[59] J. Venter. An extension of the Robbins-Monro procedure. The Annals of Mathematical Statistics, 38(1):181–190, 1967.
[60] I. Wang and E. Chong. A deterministic analysis of stochastic approximation with randomized differences. IEEE Transactions on Automatic Control, 43(12):1745–1749, 1998.
[61] Y. Wang, M. Fu, and S. Marcus. Sensitivity analysis for barrier options. In M. Rossetti, R. Hill, B. Johansson, and R.
Ingalls, editors, Proceedings of the 2009 Winter Simulation Conference, pages 1272–1282, Piscataway, New Jersey, 2009. Institute of Electrical and Electronics Engineers, Inc.
[62] M. Wasan. Stochastic Approximation. Cambridge Tracts in Mathematics and Mathematical Physics. Cambridge University Press, London, 1969.
[63] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[64] G. Yin and Y. Zhu. Almost sure convergence of stochastic approximation algorithms with nonadditive noise. International Journal of Control, 49(4):1361–1376, 1989.
[65] F. Yousefian, A. Nedić, and U. Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56–67, 2011.