ABSTRACT
Title of dissertation: STOCHASTIC SIMULATION:
NEW STOCHASTIC APPROXIMATION
METHODS AND SENSITIVITY ANALYSES
Marie Chau, Doctor of Philosophy, 2015
Dissertation directed by: Prof. Michael C. Fu
R.H. Smith School of Business
& Institute of Systems Research
In this dissertation, we propose two new types of stochastic approximation
(SA) methods and study the sensitivity of SA and of a stochastic gradient method
to various input parameters. First, we summarize the most common stochastic
gradient estimation techniques, both direct and indirect, as well as the two classical
SA algorithms, Robbins-Monro (RM) and Kiefer-Wolfowitz (KW), followed by some
well-known modifications to the step size, output, gradient, and projection operator.
Second, we introduce two new stochastic gradient methods in SA for univari-
ate and multivariate stochastic optimization problems. Under a setting where both
direct and indirect gradients are available, our new SA algorithms estimate the gra-
dient using a hybrid estimator, which is a convex combination of a symmetric finite
di↵erence-type gradient estimate and an average of two associated direct gradient
estimates. We derive variance minimizing weights that lead to desirable theoretical
properties and prove convergence of the SA algorithms.
Next, we study the finite-time performance of the KW algorithm and its sen-
sitivity to the step size parameter, along with two of its adaptive variants, namely
Kesten’s rule and scale-and-shifted KW (SSKW). We conduct a sensitivity anal-
ysis of KW and explore the tightness of an mean-squared error (MSE) bound for
quadratic functions, a relevant issue for determining how long to run an SA algo-
rithm. Then, we propose two new adaptive step size sequences inspired by both
Kesten’s rule and SSKW, which address some of their weaknesses. Instead of us-
ing one step size sequence, our adaptive step size is based on two deterministic
sequences, and the step size used in the current iteration depends on the perceived
proximity of the current iterate to the optimum. In addition, we introduce a method
to adaptively adjust the two deterministic sequences.
Lastly, we investigate the performance of a modified pathwise gradient esti-
mation method that is applied to financial options with discontinuous payo↵s, and
in particular, used to estimate the Greeks, which measure the rate of change of
(financial) derivative prices with respect to underlying market parameters and are
central to financial risk management. The newly proposed kernel estimator relies
on a smoothing bandwidth parameter. We explore the accuracy of the Greeks with
varying bandwidths and investigate the sensitivity of a proposed iterative scheme
that generates an estimate of the optimal bandwidth.
STOCHASTIC SIMULATION:
NEW STOCHASTIC APPROXIMATION METHODS
AND SENSITIVITY ANALYSES
by
Marie Chau
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2015
Advisory Committee:
Prof. Michael C. Fu, Chair/Advisor
Prof. Steven I. Marcus
Prof. Kasso A. Okoudjou
Prof. Ilya O. Ryzhov
Prof. Paul J. Smith
c  Copyright by
Marie Chau
2015
To my loving and supportive parents.
ii
Acknowledgments
First, I would like to express my sincerest gratitude and deepest apprecia-
tion to my advisor, Prof. Michael C. Fu, for the tremendous amount of support,
guidance, feedback, and faith throughout this entire journey. Without him, this dis-
sertation would not be possible. Prof. Fu is not only a highly-respected scholar and
an excellent teacher who genuinely cares about his students, but also an amazing
mentor, always with the best intentions. I admire his breadth and depth of knowl-
edge, patience, humility, authenticity, honesty, compassion, optimism, and innate
good-hearted nature. For this experience and the opportunity to work with him, I
will be forever grateful.
Next, I would like to thank the other committee members - Professors Steve I.
Marcus, Kasso A. Okoudjou, Ilya O. Ryzhov, and Paul J. Smith - for taking the time
to read this dissertation and attend my defense. Special thanks to Professors Patrick
M. Fitzpatrick, Eric V. Slud, and David H. Hamilton for their encouragement and
willingness to help in their classes as well as in my job search.
Last, but certainly not least, I would like to thank the people I’ve met at UMD
who have since become some of my nearest and dearest friends. A very special thanks
to Patrick Sodre´ Carlos who I met randomly in di↵erential equations during summer
school in 2007 and is now one of my closest friends and biggest supporters. He’s
always willing to help not only myself but others as well, and is one of the main
reasons why I’m still here. Another special thanks to Karamatou Yacoubou Djima
for her loyal friendship from day one of boot camp. Graduate school would not
iii
have been the same without her, from our countless food outings, endless hours of
studying for quals, interesting conversations, fun times, and many laughs. Also a
very special thanks to Huashuai Qu, Xuan Liu, Jong Jun Lee, Zhixin Lu, Ran Ji,
and Anusha Dandapani for being amazing! Each of you have made a dent in my life
in your own special way. Thanks to the rest of my 1305 o cemates. Many thanks
to Rhyneta Gumbs who has a very kind heart, always looking out for the students.
Thanks to the other girl’s night math ladies, Jennifer Clarkson, Clare Wickman, and
Hana Ueda for your emotional support and regular dinner outings. Also, thanks to
the rest of the boot camp class of 2008 for unforgettable memories. Thanks to
Hisham Talukder for the many laughs, Temba and Joe for our fun outings, Lucia
Simonelli for our salsa nights, Yimei Fan for fun times, Dana Botesteanu for good
conversations over dinner/drinks, Changhui Tan and Wenqing Hu for their patience
and invaluable math discussions. Also, thanks to Alverda McCoy for always being
so pleasant and flexible, saving me from many administrative disasters.
Most of all, I thank my parents. Words can’t begin to explain how grateful I am
for all they’ve done throughout my entire life. Their morals, values, determination,
patience, and supportive nature have shaped me into the person I am today.
iv
Table of Contents
List of Figures viii
List of Notations x
1 Introduction 1
1.1 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Stochastic Approximation . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Background 11
2.1 Stochastic Gradient Estimation Methods . . . . . . . . . . . . . . . . 11
2.1.1 Indirect Gradients . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1.1 Finite Di↵erences . . . . . . . . . . . . . . . . . . . . 12
2.1.1.2 Simultaneous Perturbation . . . . . . . . . . . . . . . 13
2.1.1.3 Random Directions . . . . . . . . . . . . . . . . . . . 14
2.1.2 Direct Gradients . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2.1 Infinitesimal Perturbation Analysis . . . . . . . . . . 16
2.1.2.2 Likelihood Ratio/Score Function . . . . . . . . . . . 17
2.1.2.3 Weak Derivatives . . . . . . . . . . . . . . . . . . . . 18
2.1.2.4 General Extension . . . . . . . . . . . . . . . . . . . 19
2.2 Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Classical Methods . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1.1 Robbins-Monro . . . . . . . . . . . . . . . . . . . . . 20
2.2.1.2 Kiefer-Wolfowitz . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Robust Gradient . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2.1 Simultaneous Perturbation Stochastic Approximation 25
2.2.2.2 Gradient Averaging . . . . . . . . . . . . . . . . . . 27
2.2.3 Adaptive Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3.1 Kesten’s Rule . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3.2 Scaled-and-Shifted Kiefer-Wolfowitz . . . . . . . . . 30
2.2.4 Robust Output . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.4.1 Averaging Iterates . . . . . . . . . . . . . . . . . . . 35
2.2.4.2 Robust Stochastic Approximation . . . . . . . . . . . 37
2.2.4.3 Acceleration Stochastic Approximation . . . . . . . . 38
2.2.4.4 Numerical Comparison . . . . . . . . . . . . . . . . . 43
2.2.5 Varying Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . 45
v
3 New Hybrid Stochastic Approximation Methods 49
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Secant-Tangents AveRaged Stochastic Approximation . . . . . . . . . 50
3.2.1 Optimal Convex Weight . . . . . . . . . . . . . . . . . . . . . 52
3.2.1.1 Homogeneous Noise . . . . . . . . . . . . . . . . . . 52
3.2.1.2 Non-homogeneous Noise . . . . . . . . . . . . . . . . 54
3.2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . 64
3.2.3.1 Experiment 1: vary initial value . . . . . . . . . . . . 65
3.2.3.2 Experiment 2: vary steepness level . . . . . . . . . . 67
3.2.3.3 Results Summary . . . . . . . . . . . . . . . . . . . . 69
3.3 STAR-SPSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.1 Optimal Deterministic Weights . . . . . . . . . . . . . . . . . 71
3.3.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . 78
3.3.3.1 9-station Closed Jackson Network . . . . . . . . . . . 78
3.4 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 85
4 Step Size Selection in Stochastic Approximation 87
4.1 Sensitivity of Finite-time Performance to Step Size . . . . . . . . . . 87
4.1.1 KW and its Variants . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Finite-time MSE Bound . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.1 Tightness of the Finite-time MSE Bound for Quadratics . . . 92
4.3.2 Sensitivity of KW and its Variants . . . . . . . . . . . . . . . 95
4.4 PROX-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Adaptive PROX-step . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6.1 Deterministic Problem with Added Noise . . . . . . . . . . . . 110
4.6.2 9-station Closed Jackson Queueing Network . . . . . . . . . . 116
4.7 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 117
5 Greek Kernel Estimators 120
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Generalized Pathwise Method . . . . . . . . . . . . . . . . . . . . . . 125
5.3.1 First-Order Greeks . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3.2 Second-Order Greeks . . . . . . . . . . . . . . . . . . . . . . . 126
5.3.3 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.3.4 First- and Second-Order Greek Estimators . . . . . . . . . . . 129
5.4 Pilot Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.5.1 Sensitivity to Bandwidth . . . . . . . . . . . . . . . . . . . . . 131
5.5.2 Sensitivity of Bandwidth Generator to Input Parameters . . . 136
5.6 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 141
vi
Bibliography 142
vii
List of Figures
1.1 Illustration: Generating sample performance . . . . . . . . . . . . . . 3
2.1 MSE under AC-SA, RM, RM w/averaging and RSA for f(x) =  13x
2,
x1 = 30.0,   = 1.0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Illustration of STAR gradient, where f˜ and f˜ 0 are estimates of f and
f 0, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 f(x) =  0.1x2,⇥ = [ 50, 50], an = 10(n+1) 1, cn = 0.1(n+1) 1/4, N =
1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 f(x) =  ax2,⇥ = [ 50, 50], an = 10(n+1) 1, cn = 0.1(n+1) 1/4, N =
1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 9-station closed Jackson queueing network. . . . . . . . . . . . . . . 79
3.5 9-station closed Jackson queueing network,
P9i=1 x(i) = 10, x(i) > 0,
an = ✓a/(n+1), ↵n = c2n/(1+c2n), N = 50, customers serviced = 300,
macroreplications = 20. . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1 Illustration of Sensitivity of SA to {an}. . . . . . . . . . . . . . . . . 88
4.2 MSE of the 10000th iterate of KW and Kesten for three parameter settings
and SSKW for f(x) =  0.001x2,   = 0.001, an = ✓a/n, cn = ✓c/n1/4. . . . 96
4.3 Sensitivity of KW to ✓a for f(x) =  0.001x2, an = ✓a/n, cn =
✓c/n1/4, n = 10000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.4 Sensitivity of SSKW for f(x) =  0.001x2 as a function of  a, n =
10000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 MSE Comparison of KW, Kesten, and SSKW for f(x) = 100e 0.006x2 ,
an = 1/n, cn = 1/n1/4, n = 10000. . . . . . . . . . . . . . . . . . . . 100
4.6 Illustration of PROX-Step Motivational . . . . . . . . . . . . . . . . 102
4.7 f(x) = 100e 0.006x2 , ✓ = [ 30, 70], x1 = 10, a n = 10/n, a+n = 5(k +
1)/n,  f =  g = 0.1, ↵ = 1.025, nL = 20, N = 50, macroreplications
= 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.8 f(x) = 100e 0.006x2 , ✓ = [ 30, 70], x1 = 10, a n = 1/n, a+n = 2k/n,
 f =  g = 0.1, ↵ = 1.025, nL = 20, N = 50, macroreplications = 20. 112
4.9 MSE of xN and f(xN), f(x) = 100 exp 0.006x2 , ↵ = 1.5, ✓ = [ 30, 70],
x1 = 10,  f =  g = 0.1,, ↵ = 1.5, nL = 50, N = 50, macroreplications
= 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.10 MSE of xN and f(xN), f(x) = 100 exp 0.006x2 , ↵ = 2 ✓ = [ 30, 70],
x1 = 10,  f =  g = 0.1, nL = 20, N = 50, macroreplications = 20. . . 114
4.11 MSE of xN and f(xN), f(x) =  x2, ✓ = [ 30, 70], x1 = 10,  f = 0.1,
 g = 0.1, nL = 20, macroreplications = 20. . . . . . . . . . . . . . . 115
4.12 MSE of xN and f(xN), f(x) =  x2, N = 50, x1 = 10,  f = 0.1,
 g = 1.0, nL = 50, a n = 1/n, a+n = 20/n . . . . . . . . . . . . . . . . 116
4.13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
viii
5.1 95% confidence band for Asian delta (solid curves) for n = 1000
with w/95% confidence interval for bandwidth (vertical dashed lines)
generated using s = 0.1 and RRMSE for 100 sample paths. . . . . . 134
5.2 95% confidence band for Asian vega (solid curves) for n = 1000
w/95% confidence interval for bandwidth (vertical dashed lines) gen-
erated using s = 0.001 and RRMSE for 100 sample paths. . . . . . . 134
5.3 95% confidence band for Asian gamma (solid curves) for n = 1000
w/95% confidence interval for bandwidth (vertical dashed lines) gen-
erated using s = 0.01 and RRMSE for 100 sample paths. . . . . . . . 135
5.4 95% confidence band for barrier theta for n = 10000 and RRMSE for
100 sample paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5 Barrier gamma estimator for n = 10000, 100 sample paths. . . . . . . 137
5.6 Asian delta pilot 95% confidence interval for cˆ for k = 10. . . . . . . . 139
5.7 Asian delta pilot, n = 500, sample size = 100. . . . . . . . . . . . . . 140
ix
List of Notations
R the field of real numbers
d number of dimensions
⇥ parameter space (continuous)
✓ continuous parameter
f(·) distribution of input parameters (excl. ch. 3, 4)
J(·) function of interest
cn perturbation size n random vector
Y (·, ·) sample performance measure (ch. 1, 2)
f˜ sample performance measure (ch. 3)
an step or gain sizeei ith unit basis vector
⇠ random e↵ects
|| · || Euclidean norm
h·, ·i dot product
x
Chapter 1
Introduction
Consider the stochastic optimization problem
minx2⇥ J(x), (1.1)
where J : Rd ! R, J(x) = E[Y (x, ⇠)], and ⇠ denotes the stochastic e↵ects. In
simulation optimization, the objective is to find x⇤ that minimizes the function of
interest J(x). The search space ⇥ is, in general, either continuous or discrete, but
we only focus on the continuous case, which restricts the range of methods that
can be applied. One of the most useful and well-known methods in simulation
optimization for continuous parameters is stochastic approximation (SA), which
generates a sequence of estimates {xn} converging to a solution of rJ(x) = 0 using
the recursion xn+1 = ⇧⇥ ⇣xn   an brJ(xn)⌘ , (1.2)
where ⇧⇥(x) is a projection of x back into the feasible region ⇥ if x /2 ⇥, an is a
positive step size or gain size, brJ(xn) is an estimate of the true gradient rJ(xn),
and xN is the output, where N is the stopping time, which we denote by x⇤N .
The first SA algorithm introduced in [50] estimated the true gradient using an
unbiased direct gradient, which unfortunately, is not always available in practice,
so [37] proposed the use of indirect gradients (i.e., finite di↵erences), which only
1
require sample performances. SA is simple and requires very little memory due
to its recursive nature, and thus can be implemented as an online method. As a
result, SA algorithms are used in a wide variety of application areas such as signal
processing, statistics, operations research, and machine learning [9].
The landmark papers, [50] and [37], proved convergence in probability for the
one-dimensional classical Robbins-Monro (RM) and Kiefer-Wolfowitz (KW) meth-
ods, respectively. Later, [5] modified the original conditions of the convergence
theorems for both RM and KW to obtain almost sure (a.s.) convergence in higher
dimensions. Then, [14] and [53] proved asymptotic normality, and [20] established
asymptotic normality for general SA algorithms in the multidimensional setting.
More recently, the focus has shifted to finite-time error bounds such as the mean-
squared error (MSE) of the estimate E[||x⇤N x⇤||2], where ||·|| denotes the Euclidean
norm [7, 65], or the di↵erence between the objective value at the estimate and the
optimal objective value [31, 45,46].
Stochastic approximation converges asymptotically under certain conditions,
but the practical performance depends highly on the components of recursion (1.2).
Methods have been introduced to increase its robustness, such as choosing appro-
priate deterministic or adaptive step sizes [7, 21, 36, 51, 59, 65], generating estimates
based on a subset of iterates [48, 49, 52], projecting iterates back into the feasible
region in a clever/strategic manner [2], and stabilizing gradients or increasing gradi-
ent accuracy. By exploiting previous gradient estimates, [27], [47], and [63] propose
gradient averaging methods to increase gradient stability. Moreover, [1] and [3]
present methods to prevent gradients from taking extreme values, and [7] introduce
2
SIMULATOR
x Y (x, ⇠)Figure 1.1: Illustration: Generating sample performance
a technique that reduces the variance by increasing the perturbation size of finite
di↵erences adaptively.
The SA algorithms with the fastest convergence rates are not necessarily su-
perior. Biased gradient estimates lead to slower asymptotic convergence rates, but
could outperform unbiased direct gradients in finite-time. One major factor in the
performance is the choice in step size. It is di cult to select an appropriate step
size since the function (1.1) is unknown, and it is impossible to find a universally
optimal step size for all situations. Therefore, adaptive step sizes that adjust based
on the ongoing performance of the algorithm have been proposed to tackle the issue.
In the simulation optimization context, the gradient estimate brJ(xn) is as-
sumed to be computationally expensive to generate. Each sample performance
Y (x, ⇠) is generated through a simulator, as illustrated in Figure 1.1, which can
be used to estimate the gradient. In some cases, the simulator can also generate
an estimate of the sample performance gradient rY (x, ⇠) simultaneously, but it
requires additional information about the system dynamics, which is not always
available. Each simulation run is computationally heavy; therefore, the number of
iterations N is limited, so the final parameter estimate does not have any guarantees
based on the asymptotic theory. The expensive gradient together with the step size
sensitivity reinforces the importance of finite-time performance.
3
The sensitivity of the function of interest to the parameters is also critical, and
the category/class of methods for such gradient approximations is called sensitivity
analysis. The main objective is to estimate the sensitivity of J(x) accurately, so
the key is to generate an accurate and reliable gradient estimate. Unfortunately,
the function J(x) could have undesirable properties such as discontinuities, which
restricts the applicable methods. Common gradient estimates and their applicability
will be discussed in Section 2.1.
In finance, the accuracy of the sensitivity estimates of hedging tools is critical.
For example, an option is a financial instrument that can be used to hedge risk, and
its sensitivity to market parameters is useful in making investment decisions. If the
price of the underlying stock hits the strike price, then the option can be exercised,
resulting in some positive payo↵. However, if the stock does not reach the strike
price, then it cannot be exercised, resulting in zero profit. The payo↵ function of
an option may be discontinuous, which adds a level of di culty in estimating the
gradient with respect to market parameters, also known as Greeks. The well-known
infinitesimal perturbation analysis (IPA) is not applicable in this case, but a modified
version, smoothed perturbation analysis (SPA), can be applied to discontinuous
functions [28]. More recently, kernel estimators have been used in combination with
IPA estimators to generate a hybrid gradient or modified IPA [43]. The kernel relies
on a bandwidth parameter chosen by the user, and [43] proposed a selection method
based on minimizing the asymptotic MSE, which has several input parameters, again
determined by the user.
The commonality between stochastic approximation (SA) and sensitivity anal-
4
ysis lies in their requirement for a gradient estimate, although serving di↵erent pur-
poses. Many of the stochastic gradient estimation methods have associated asymp-
totic results, which are necessary, but in practice, finite-time performance is also
essential. Both methods often require user specified input parameters that must be
selected, and the finite-time performance has a strong dependence on those chosen
parameters. Take for example the widely-known finite di↵erence gradient estimate,
which requires a perturbation size c. If the function is deterministic, a smaller c
would increase the accuracy of the gradient estimate. However, in the stochastic
setting, where the function evaluations are noisy, perturbations which are too small
can result in extremely noisy gradient estimates. “Too small” is also relative to the
problem at hand. The input parameter c has a significant impact on the quality
of the estimate and must be chosen appropriately. Ideally, algorithms/methods are
robust, and their performance should not be strongly correlated with user specified
parameters.
1.1 Motivating Examples
For motivation, we provide application examples where stochastic approxima-
tion and sensitivity analysis can be applied, two of which are more detailed.
Queueing networks. Many application areas such as communication networks,
production systems, and smartgrids can be modeled using queueing networks. A
possible metric of interest is the total throughput of the system. Each node or
server has a particular mean service time, which is a controllable parameter, and the
5
objective is to find the optimal parameter values to maximize the total throughput.
Furthermore, the sensitivity of the throughput to the mean service time also be
examined.
Inventory management. Production/manufacturing/retail businesses require
inventory management systems to help maintain operations and increase profitabil-
ity. A commonly known inventory policy is the (s, S) policy, where an order is
placed to replenish the supply up to S after the inventory on hand falls below s.
The goal is to minimize the total cost (e.g., ordering, holding, and shortage) for
implementing such a policy.
Finance. Well-modeled stock prices are essential in making well-informed in-
vestment decisions. Recently, stochastic volatility models have been introduced with
volatility varying in time. The parameters are estimated based on the maximum
likelihood function, where SA can be easily applied. Furthermore, payo↵s for op-
tions are based on stock price models, and optimal exercise strategies can be created
to maximize the payo↵.
Machine learning using “big data”. Traditional regression models are based
on batch data, where the analysis is conducted on a fixed data set. However, with
the rise in the amount of data available, models must be generated/updated in an
online manner, and stochastic approximation is a commonly applied method.
6
1.1.1 Stochastic Approximation
Consider a G/G/1 queue with one server, unlimited bu↵er or waiting capacity,
interarrival times and service times that follow general distributions with means 1/ 
and x, respectively. In this system, there is only one service station with one server.
The customers arrive from outside of the system with interarrival times following
some general distribution with mean 1/ , and the server serves each customer on a
first-come first-served basis with a service time also following a general distribution
with mean x. This queueing system can be modeled easily with distributional in-
formation about the interarrival and service times, and SA can be directly applied.
An important objective to consider is increasing customer satisfaction. Among the
various correlated performance measures that can be used to quantify the satisfac-
tion level, the total time spent in the system (i.e., time in waiting in queue plus the
actual service time) is arguably one of the most significant factors. The objective
function could be the expected time in system E[T (x, ⇠)], where T (·, ·) is the time in
system of a customer, plus a positive decreasing cost C(x) associated with the mean
service time x, e.g., c/x, where c > 0. The objective of stochastic approximation is
to find x⇤ = argminx2(0,1/ ) E[T (x, ⇠)] + C(x), by solving r(E[T (x, ⇠)] + c/x) = 0
iteratively. One could also be interested in the sensitivity of the function, but we
will provide a di↵erent example for sensitivity analysis.
7
1.1.2 Sensitivity Analysis
Financial options are instruments that give option holders the right to purchase
or sell stock for a particular strike price once certain conditions are met. Investors
often use options to hedge financial risk and are especially interested in the expected
payo↵ E[g(S)], where g is the sample payo↵ and S is the stock price over a time
period, which depend on market parameters such as interest rate, initial stock price,
volatility, etc., as well as its sensitivity. The derivatives of the expected payo↵ with
respect to market parameters @E[g(S)]/@x are referred to as Greeks. To implement
Greek estimators, the stock price can be modeled as a stochastic process (e.g.,
Brownian motion and OrnsteinUhlenbeck process) over a pre-specific duration with
certain parameters. When making financial decisions, it is essential that investors
make them with accurate and reliable information, such as good Greek estimates.
1.2 Contributions
First, we propose two novel stochastic gradient estimation methods in stochas-
tic approximation for univariate and multivariate problems. We consider a setting
where direct gradients are also available, so we combine both direct and indirect
gradient estimates using a convex weight to form a hybrid gradient. Currently, the
existing SA algorithms only consider either direct or indirect gradient estimates, but
not in conjunction. For the indirect gradient, we consider a symmetric di↵erence-
type gradient estimate, and an average of the two associated direct gradients for
the direct gradient estimate. In the one-dimensional case, we use a symmetric dif-
8
ference gradient estimate, which can be directly extended to higher dimensions, but
instead, we employ the well-known simultaneous perturbation gradient to exploit
its potential computational e ciency. A critical component of our hybrid gradient
estimate is the convex weight, which we derived to minimize the variance of the hy-
brid gradient, leading to favorable theoretical properties. Our new hybrid gradients
are provably convergent in SA.
Second, we conduct preliminary experiments to investigate the sensitivity of
the KW algorithm and two of its variants to the step size parameter and explore
the tightness of a finite-time MSE bound. Then, we introduce two new adaptive
step size sequences for SA, both of which adjust based on the perceived proximity
of the current iterate to the optimum. Most of the present adaptive step sizes
only consider one adaptive sequence for either each dimension or for all dimensions.
Instead, our adaptive step size is based on two sequences, and the step size used
in the current iteration depends on the current sample performance(s) compared to
the past observations. Although the method is adaptive, the two initial sequences
are deterministic, so it also su↵ers from the same disadvantages of deterministic step
sizes. Therefore, we introduce a method to adaptively adjust the two deterministic
sequences.
Finally, we examine the sensitivity of two methods: 1) a modified pathwise
method involving a kernel estimator to the smoothing bandwidth parameter and 2)
a bandwidth selection method to various input parameters.
9
1.3 Outline
The rest of the dissertation is as follows. In Chapter 2, we provide background
information to better understand stochastic approximation and sensitivity analysis.
In Section 2.1, we discuss the most common stochastic gradient estimation methods
for both direct and indirect gradients. Then in Section 2.2, we introduce the classical
SA algorithms in addition to modifications to the gradient, step size, output, and
projection operator. Chapter 3 presents our two novel approaches to stochastic
gradient estimation methods for single and multidimensional problems along with
theoretical and numerical results. We derive optimal weight parameters and prove
convergence for both cases. We test our new algorithms against well-known SA
algorithms on a deterministic problem with added noise and on a queueing network.
In Chapter 4, we explore step size selection techniques in SA, in addition to the
tightness of a finite-time MSE bound, before introducing our new adaptive step sizes.
We investigate the performance of our new adaptive step sizes on two contrasting
deterministic functions with added noise for the one-dimensional case and on a
queueing network for the multidimensional case. Chapter 5 presents a proposed
modified pathwise gradient estimation method that incorporates a kernel estimator
for options with discontinuous payo↵ functions. We empirically test the sensitivity
of the kernel to the bandwidth parameter as well as the sensitivity of a proposed
bandwidth selection algorithm to its input parameters. At the end of Chapters 3,
4, and 5, we provide concluding summaries as well as future research directions.
10
Chapter 2
Background
2.1 Stochastic Gradient Estimation Methods
Stochastic gradient estimation is essential in stochastic approximation and
sensitivity analysis but serves di↵erent purposes. In SA, the gradient estimate is
only an intermediary step that provides a trajectory direction, so it is more tolerant
of less accurate estimates. Sensitivity analysis accesses the parameter e↵ects on the
function of interest, so the goal is to accurately estimate the gradient, and accuracy is
key. Certain stochastic gradient methods do not provide estimates accurate enough
to be suitable for sensitivity analysis and are designed specifically for SA.
In general, stochastic gradient estimates fall under one of two main categories:direct or indirect. Indirect gradient methods approximate gradients using finite
di↵erence-type gradient estimates, which only require function evaluations, e.g., via
the secant method in the one-dimensional case. Since indirect gradient estimates
only require sample performances, they can be easily applied to any case but are
biased. Direct gradient methods are known as gradient-based approaches, mainly
because the techniques results in unbiased gradient estimates, which lead to faster
convergence rates in SA.
11
2.1.1 Indirect Gradients
Indirect gradients are applicable as long as the simulator can generate sample
performances. Unfortunately, indirect gradient estimates are biased, which lead to
slower asymptotic convergence rates in stochastic approximation.
2.1.1.1 Finite Di↵erences
Finite di↵erence methods stem from Taylor’s series expansion, and require
the additional task of selecting a perturbation sequence {cn}, which impacts the
variance. In the one-dimensional case, the derivative can be rewritten as a finite
di↵erence plus a bias term that diminishes as the perturbation size cn approaches
zero as n ! 1 (e.g., f 0(xn) = [f(xn + cn)  f(xn)]/cn + O(cn)). The perturbation
size cn influences the noise level of the finite di↵erence gradient estimate, and if it
is too small, the gradient estimates can be very noisy in the stochastic setting. If
sensitivity analysis is the ultimate goal, then a relatively larger perturbation size is
preferable. In higher dimensions, the idea is to slightly perturb one component while
keeping all others constant and return a corresponding function value estimate.
Here are two common examples of finite di↵erences:brJi(xn) = 8>><>>: Y (xn + cnei, ⇠+n,i)  Y (xn   cnei, ⇠ n,i)2cn symmetric di↵erence,Y (xn + cnei, ⇠+n,i)  Y (xn, ⇠n)
cn forward di↵erence,
for i = 1, . . . , d, where brJi(xn) is the ith component of the gradient estimate,ei denotes the ith unit basis vector, cn 2 R+ is the perturbation size, ⇠±n,i and ⇠n,i
denote the stochastic e↵ects, and Y (x, ⇠) is an unbiased estimate of J(x). Symmetric
12
di↵erences require the estimate of two function values J(xn+ cnei) and J(xn  cnei)
for each dimension i and one-sided forward di↵erences require J(xn) and J(xn cnei)
for i = 1, . . . , d; therefore, two-sided symmetric di↵erences and one-sided forward
di↵erence estimates involve 2d and d+1 simulation replications, respectively. Since
the cost to generate each gradient is linear in the number of dimensions d, finite
di↵erences can be very ine cient in higher dimensions. The next indirect gradient
estimate addresses this issue.
2.1.1.2 Simultaneous Perturbation
In 1992, [55] introduced the simultaneous perturbation (SP) gradient estimate
specifically for multidimensional stochastic approximation problems. Similar to fi-
nite di↵erences, SP is a gradient-free approach that only requires objective function
values to approximate the underlying gradient, and is therefore easy to implement.
The ith component of the SP gradient has the formbrJi(xn) = Y (xn + cn n, ⇠+n )  Y (xn   cn n, ⇠ n )2cn n,i ,
where cn 2 R+ is the perturbation size, ⇠±n denotes the stochastic e↵ects,  n =
( n,1, . . . , n,d), and the sequence { n} is generally assumed to be i.i.d. and inde-
pendent across components. Notice that, unlike finite di↵erences, the numerator of
the SP gradient is independent of i, so a gradient only requires 2 simulation runs, re-
gardless of the number of dimensions d, which can be e cient in higher dimensions.
The only additional requirement is to generate  n, which is relatively inexpensive
compared to simulating sample performances. The SP gradient is only intended for
13
use in stochastic approximation algorithms and has limited relevance in sensitivity
analysis.
2.1.1.3 Random Directions
The random directions (RD) gradient was also developed specifically for mul-
tidimensional problems and was the inspiration behind SP. The ith component of
the RD gradient isbrJi(xn) = [Y (xn + cn n, ⇠+n )  Y (xn   cn n, ⇠ n )] n,i2cn ,
with perturbation size cn 2 R+ and stochastic perturbation  n = ( n,1, . . . , n,d).
The sequence { n} is generally assumed to be i.i.d. and independent across compo-
nents, and ⇠±n denote the stochastic e↵ects. Notice that RD is almost identical to SP,
except instead of dividing by the stochastic perturbation component  n,i, it multi-
plies the di↵erence term. This di↵erence changes the restrictions on the stochastic
sequence { n} to guarentee convergence in SA and translates to a bound on the
second moment, instead of the inverse moment, with zero mean, so the Gaussian
distribution is applicable [23].
2.1.2 Direct Gradients
Direct gradients generally require more “o✏ine” work and involve additional
coding within the simulation model. However, if direct gradient methods are applica-
ble, they usually provide unbiased gradient estimates, which lead to faster asymp-
totic convergence rates in stochastic approximation. In addition, direct gradient
14
methods are computationally e cient, especially in higher dimensions, since they
often only require one simulation replication for any dimension d.
For notation, assume
J(X) = E[Y (✓)] = E[Y (X1, . . . , Xd)], (2.1)
where Y : Rd ! R and X = (X1, . . . , Xd) is a vector of input random variables.
The parameter ✓ has been purposely left out of the right-hand side of expression
(2.1) to emphasize the point that it can appear in either the sample (pathwise) or
measure (distributional). Let f denote the joint distribution of the input random
variables X, then (2.1) can be written in the following two ways:
E[Y (X)] =
8>>><>>>: Z 1 1 Y (x)f(x; ✓)dx distributional,Z 1
0
Y (X(✓; u))du pathwise,
(2.2)
where x, u, and the integrals are d-dimensional. When the parameter is located
in Y (·), then the dependency of ✓ is pathwise and perturbation analysis is usually
applied; whereas, if ✓ is located in f(·), then its ✓ dependency is distributional
and the likelihood ratio or score function and weak derivatives could potentially be
applied. In some instances, the ✓ can be pushed in and/or out of the distribution
with an appropriate change of variables, also known as the push in/out method,
which will change the applicability of the estimation method.
If we assume that the interchange of the integration and di↵erentiation is valid,
then di↵erentiating (2.2) with respect to ✓ results in the following two cases:
dE[Y (X)]
d✓
=
8>>><>>>: Z 1 1 Y (x)df(x; ✓)d✓ dx distributional,Z 1
0
dY (X(✓; u))
d✓
du pathwise.
(2.3)
15
We assume each random number ui produces a random variate xi for i = 1, . . . , d.
For simplicity, in Sections 2.1.2.1 and 2.1.2.2, we assume that the parameter
only appears in X1, which is generated independently of the other random variables.
For the simplified example, we will apply the method to the specific case where d = 2,
X1 ⇠ exp(✓), and X2 ⇠ U(0, 1). The random variate x1 is generated via inverse
transform method where Xi =  ✓ lnUi and Ui ⇠ U(0, 1).
2.1.2.1 Infinitesimal Perturbation Analysis
For the pathwise dependent case, the second integral can be expressed as
@E[Y (X)]
@✓
=
Z 1
0
dY (X1(✓; u1), X2, . . . , Xd)
d✓
du
=
Z 1
0
@Y (X)
@X1
dX1(✓; u)
d✓
du,
where @Y@X1 @X1@✓ is the infinitesimal perturbation analysis (IPA) estimator.
In this particular case,
@E[Y (X)]
@✓
=
Z 1
0
Z 1
0
@Y (X1(✓; u1), u2)
@x1
dX1(u1; ✓)
d✓
du1du2,
and dX1d✓ depends on the construction of X1 and since X1 =  ✓ lnU, then dX1d✓ =
  lnU = X1✓ , so the final IPA estimator is
@Y (X1, X2)
@X1
dX1
d✓
=
@Y (X1, X2)
@X1
X1
✓
.
Implementation of the estimator requires more knowledge about how the Xi’s
are generated in order to compute the derivative of the random variable dXid✓ ; there-
fore the representation of Xi, which depends on ✓, is key. As a general rule, if the
16
sample performance is continuous with respect to the parameter of interest, then
IPA is a suitable gradient estimation method. However, in practice, discontinu-
ities may occur in the sample performance (e.g., indicator function), thereby forcing
other estimation techniques to be applied. Refer to [23,25] for detailed examples.
The next two types of estimators pertain to the distribution dependent case.
2.1.2.2 Likelihood Ratio/Score Function
Assume fi(·; ✓) is the marginal p.d.f. of Xi, so the joint density can be written
as f(x) = ⇧di=1fi(xi). Therefore, (2.3) can be expressed as:
dE[Y (X)]
d✓
=
Z 1
 1
Y (x)
@f1(x1; ✓)
@✓
⇧dk=2fk(xk)dx
=
Z 1
 1
Y (x)
@ ln f1(x1; ✓)
@✓
f(x)dx,
where Y (X)@ ln f1(X1;✓)@✓ is the score function or likelihood ratio estimator.
In our particular example,
dE[Y (X)]
d✓
=
Z 1
0
Z 1
0
Y (x1, x2)
@f1(x1; ✓)
@✓
dx1dx2
=
Z 1
0
Z 1
0
Y (x1, x2)

1
✓
⇣x1
✓
  1
⌘  1
✓
e 
x1✓ dx1dx2.
SF/LR estimators are generally simple to implement when they are applicable.
However, this method is not applicable to distributions where the underlying support
depends on ✓. For instance, U(0, ✓) and Ber(p; ✓, b) are examples of continuous and
discrete distributions, respectively, with ✓ dependent support, so LR/SF estimators
do not exist. However, they exist for distributions where the underlying probabilities
rely on the parameter of interest, such as bin(n, ✓), Ber(✓; a, b), Poisson(✓), exp(✓),
17
and N(✓,  2). Furthermore, this method usually has di culty with nondistributional
parameters, but the issue might be overcome by using the push in or push out
method to push the parameter of interest in or out of the distribution. If the
parameter appears in more than one input random variable distribution, the variance
will increase linearly as the number of times those random variables are used in the
simulation, but this issue could be controlled through batching. For example, if the
performance measure of interest of an M/M/1 queue is the average time in system
of 1000 customers, we could look at the average of 100 samples, 10 customers each,
instead. Furthermore, for higher derivative estimators, this method is easier to
apply.
2.1.2.3 Weak Derivatives
For the weak derivatives estimator, there exists c(✓) and densities f (2)1 and f
(1)
1
such that
@f1
@✓
= c(✓)[f (2)1   f
(1)
1 ]. (2.4)
This representation always exists, because if we let f (1)1 =
1c  @f1@✓    and f (2)1 =
1c  @f1@✓  +, where c = R  @f1@✓    dx = R  @f1@✓  + dx, then @f1@✓ is written as the di↵erence
of two signed measures, otherwise known as the Hahn-Jordan decomposition, which
is not unique.
Therefore, using this representation,
dE[Y (X)]
d✓
=
Z 1
 1
Y (x)
@f(x; ✓)
@✓
dx
=
Z 1
 1
Y (x)c(✓)[f (2)1   f
(1)
1 ]dx,
18
where the weak derivatives estimator is of the form
c(✓)[Y (X(2)1 , X2, . . . , Xd)  Y (X(1)1 , X2, . . . , Xd)],
where X(1)1 ⇠ f
(1)
1 and X
(2)
1 ⇠ f
(2)
1 .
This is called a weak derivative because the left hand side of (2.4) might not
be proper, but when it is integrated against a test function, it is well-defined. These
estimators are not unique and in our specific example,
dE[Y (X)]
d✓
=
1
✓
Z 1
0
Z 1
0
Y (x1, x2)

1
✓
⇣x1
✓
  1
⌘  1
✓
e 
x1✓ dx1dx2,
where the weak derivatives estimator is 1✓ [Y (X(2)1 , X2)   Y (X(1)1 , X2)], with X(2)1 ⇠
Erl(2, ✓) and X(1)1 ⇠ exp(✓).
Similar to the LR method, WD estimators variance grows linearly as the num-
ber of random variables with ✓ dependency increases in the simulation. For example,
in a queueing system, ✓ could appear in the interarrival and service times, which
can be decreased through batching. In addition, WDs are not unique, so deciding
which to use in generating the estimator could be a challenge.
2.1.2.4 General Extension
We now generalize the estimators for situations where more Xi’s have a de-
pendence on ✓. Let P ⇤ denote the set of indices where Xi is dependent on ✓. Then
the following are the general estimators:IPA estimator: Xi2P ⇤ @Y (X)@Xi dXid✓
19
LR/SF estimator (multivariate, independent):
Y (X)
@ ln f(X; ✓)
@✓
, Y (X)
Xi2P ⇤ @ ln fi(Xi; ✓)@✓WD estimator (multivariate, independent):
c(✓)(L(X(2))  L(X(1))),
Xi2P ⇤ ci(✓)(L(X1 . . . , X(2)i , . . . , Xd)  L(X1 . . . , X(1)i , . . . , Xd))
2.2 Stochastic Approximation
2.2.1 Classical Methods
The two classical stochastic approximation methods, Robbins-Monro (RM)
and Kiefer-Wolfowitz (KW), were first applied to the one-dimensional unconstrained
stochastic optimization problem, and the recursive scheme follows
xn+1 = xn   an brJ(xn), (2.5)
which is identical to (1.2) with the exception of the projection operator. The main
di↵erence between RM an KW is the gradient estimate brJ . In RM, the gradient
is estimated by an unbiased estimator, whereas in KW, the gradient estimate is
only asymptotically unbiased. Both algorithms have their advantages and will be
discussed in the next two sections.
2.2.1.1 Robbins-Monro
Robbins and Monro pioneered the way for stochastic approximation in [50],
and the number of literature citations since has grown to over 3500. RM was first
20
intended to solve root-finding problems, i.e., h(x⇤) = 0, where h : Rd ! R. The
Robbins-Monro (RM) stochastic approximation algorithm was applied to a stochas-
tic optimization problem with the objective function J by setting h = rJ , where
the true gradient is estimated using an unbiased direct gradient. RM solves this
problem iteratively as in (2.5) by choosing the gradient estimate brJ(xn) to be
an unbiased estimator, i.e., E[brJ(xn)] = rJ(xn), and the output is taken as the
last iterate x⇤N , where N is the stopping time. Unfortunately, the direct gradient
measurements are still approximations to the actual gradients because of the noise
(i.e., brJ(xn) = rJ(xn) +  n, where ✏n is noise with zero mean). When used in
SA algorithms, unbiased gradients estimates lead to faster convergence rates, and
under certain conditions RM can achieve an asymptotic convergence rate of up to
O(n 1/2) [53]. The following is the original convergence theorem.Theorem 2.2.1. (Theorem 2 [50]) Assume rJ(x) has a unique root x⇤ and supposebrJ(x) is an unbiased gradient estimator, i.e., E[brJ(x)] = rJ(x). If the sequence
{xn} is generated from (2.5) and the following conditions hold:
1. {an} is a sequence of positive constants such that P1n=1 an = 1, P1n=1 a2n <
1.
2. rJ(x)   0 for x > x⇤ and rJ(x)  0 for x < x⇤.
3. There exists a positive constant C such that P (|brJ(x)|  C) = 1 8x.
Then xn p! x⇤ as n !1, where p! denotes convergence in probability.
The objective function J is assumed to have a global minimum with a bounded
21
derivative. The most well-known conditions are restrictions on the gain sequence
{an}. Generally, the step size an ! 0 but P1n=1 an = 1, which prevents the step
size from converging to zero too quickly, so the iterates are able to make progress
towards x⇤ and not get stuck at a poor estimate. The typical form for the step
size is an = ✓a/(n + A)↵, where ✓a > 0, A   0, and 12 < ↵  1, with A = 0 and
↵ = 1 as a commonly used choice. The restrictions still allow for an uncountable
number of step size options, and the finite-time performance of SA is notoriously
known to be sensitive to the an choice. Theoretically, unbiased gradients lead to
faster convergence rates but are not always available, so the next method addresses
this issue.
2.2.1.2 Kiefer-Wolfowitz
The Kiefer-Wolfowitz stochastic approximation algorithm only requires sample
performances measurements to implement and does not require additional informa-
tion on the system dynamics or input distributions as in RM. The original KW
iterative scheme
xn+1 = xn   anY (xn + cn, ⇠+n )  Y (xn   cn, ⇠ n )2cn , (2.6)
estimates the gradient using a symmetric finite di↵erence gradient estimate, and un-
der certain conditions, KW can achieve an asymptotic convergence rate of O(n 1/3).
In addition, common random numbers (CRN) can be employed to decrease the vari-
ance of estimates, and KW can achieve an asymptotic convergence rate of O(n 1/2)
in certain settings [38].
22
Theorem 2.2.2. ( [37]) Assume J(x) = E[Y (x, ⇠)]. If the sequence {xn} is gener-
ated from (2.6) and the following conditions hold:
1. Let {an} and {cn} be positive tuning sequences satisfying the conditions
cn ! 0, X an = 1, X ancn < 1, X a2nc 2n < 1.
2. J(x) is strictly decreasing for x < x⇤ and strictly increasing for x > x⇤.
3. V ar(Y (x, ⇠)) < 1 and satisfies the following regularity conditions:
1) There exist positive constants   and B such that
|x0   x⇤| + |x00   x⇤| <   =) |J(x0)  J(x00)| < B|x0   x00|.
2) There exist positive ⇢ and R such that
|x0   x00| < ⇢ =) |J(x0)  J(x00)| < R.
3) For every   > 0 there exists a positive ⇡( ) such that
|x  x⇤| >   =) inf /2>✏>0 |J(x+ ✏)  J(x  ✏)|✏ > ⇡( ).
Then xn p! x⇤ as n !1, where p! denotes convergence in probability.
Condition 1 assures that the step size an does not converge to zero too fast, so
the iterates do not get stuck at a poor estimate. In addition, the condition prohibits
the finite di↵erence step size cn from decreasing too quickly, as well as to prevent
noisy gradients. The second condition insures that there is a global optimum. The
first regularity condition requires J(x) to be locally Lipschitz in a neighborhood of
23
x⇤; the second one prevents J(x) from changing drastically in the feasible region;
and the last one prohibits the function from being very flat outside a neighborhood
of x⇤ so that the iterates approach the optimum. Although the KW algorithm
converges asymptotically, its finite-time performance is dependent on the choice of
tuning sequences, {an} and {cn}. If the current xn is in a relatively flat region of
the function and an is small, then the convergence will be slow. On the other hand,
if the xn is located in a very steep region of the function and {an} is large, then the
iterates will experience a long oscillation period. If {cn} is too small, the gradient
estimates using finite di↵erences could be extremely noisy.
KW has been extended to higher dimensions, and two common gradients con-
sidered are symmetric di↵erences and forward di↵erences as discussed in Section
2.1.1.1. Although using the symmetric di↵erence scheme is computationally more
expensive, it has the potential to reach an asymptotic convergence rate of O(n 1/3)
compared to O(n 1/4) for forward di↵erences. For d = 1, the computational cost is
identical for both gradient estimates. Even though both schemes are easy to imple-
ment, their convergence rates are typically inferior to the RM algorithm, although
under certain conditions with CRN ⇠+n,i = ⇠ n,i for the symmetric di↵erence, they also
can achieve the O(n 1/2) asymptotic convergence rate. For simulation optimization,
RM is not always applicable since additional information is needed, which may not
be readily available or is di cult to obtain. For KW, there is an additional task of
appropriately choosing the perturbation sequence {cn}. In general, KW is a simple
algorithm to implement for simulation optimization applications, albeit costly in
high-dimensional settings.
24
2.2.2 Robust Gradient
2.2.2.1 Simultaneous Perturbation Stochastic Approximation
Simultaneous perturbation stochastic approximation (SPSA) specifically ad-
dresses multivariate optimization problems [55]. Let ✏+n and ✏ n be the noise from
the sample performances J(xn + cn n) and J(xn   cn n), respectively.SPSA Algorithm
• Input. Choose x1 2 ⇥, {an}, {cn}, and stopping time N .
• Initialize. Let n = 1.
• While n < N,
– Step 1. Generate a d-dimensional random perturbation vector  n.
– Step 2. Generate an estimate of rJ(xn):brJ(xn) = Y (xn + cn n, ⇠+n )  Y (xn   cn n, ⇠ n )2cn 26666664  1n,1...
  1n,d37777775
– Step 3. Compute xn+1 = xn   an brJ(xn).
– Step 4. Let n = n+ 1. Go to Step 1.
• Output. x⇤N = xN .
25
Theorem 2.2.3. (Theorem 7.1 [57]) Suppose J has a unique minimum x⇤ 2 ⇥ and
{xn} is generated using SPSA. If the following conditions hold:
1. The positive sequences of real numbers {an} and {cn} converge to zero such
that
P1n=0 an = 1 and P1n=0 a2nc 2n < 1.
2. The function J(x) 2 C3 and bounded on Rd.
3. ||xn|| < 1 for all n.
4. E[✏+n   ✏ n | n,Fn] = 0 and E[(Y (xn ± cn n, ⇠±n )/ n,i)2] is uniformly bounded
for all n, i.
5. x⇤ is an asymptotically stable solution of the di↵erential equation @x(t)/@t =
 rJ(x(t)).
6. For each n, { n,i}di=1 are identically distributed, { n,i} are independent and
symmetrically distributed with zero mean and uniformly bounded in magnitude
for all n, i.
Then xn ! x⇤ a.s. as n !1.
The optimal convergence rate for SPSA is O(n 1/3) [55]. Various convergence
proofs have been presented with slight modifications to the conditions (cf. [12, 17,
30, 55, 60]). The perturbation sequence { n} where  n = ( n,1, . . . , n,d) with
the sequence { n,i} is independent with mean zero and finite inverse moments (i.e.,
E[ n,i] = 0 and E[| n,i| 1] < 1 for i = 1, . . . , d) to guarantee a.s. convergence
when applied to SA. As a result, the Gaussian distribution is not applicable. Instead,
26
the most common distribution used is the symmetric Bernoulli taking a positive and
negative value (e.g., ±1) each with probability 1/2. In addition, an appropriately
scaled xn is approximately normal for large n, and the relative e ciency of SPSA
depends on the geometric shape of J(x), choice of {an}, {cn}, distribution of { n,i},
and noise level.
Many extensions to the original SPSA algorithm have been developed, for
example, the constrained setting using projection operators [26, 54]. A slight mod-
ification is the averaging of the SPSA gradient estimators, where instead of gen-
erating one gradient estimate at each iteration, multiple gradient estimates can
be generated at additional computational cost and averaged to reduce the noise.
An accelerated form of SPSA approximates the second-order Hessian r2J(x) to
accelerate the convergence [57], analogous to the Newton-Raphson method. Iter-
ate averaging in the SPSA setting has also been explored, but performs relatively
poor in finite-time [17, 56]. All in all, SPSA has been shown to be an e↵ective SA
method for tackling high-dimensional problems, with ease of implementation and
the asymptotic theory to support it.
2.2.2.2 Gradient Averaging
Gradient averaging can help stabilize the gradient estimate, especially if it is
noisy. One obvious straightforward method is to generate multiple gradient esti-
mates at each iteration and average them to provide a better estimate, but this
process can be expensive and is not worth the computational e↵ort, according to
27
Robbins and Monro. In SA, the gradient estimate is only an intermediate step
that provides a direction for the iterates to move, but the goal is to find the opti-
mum x⇤ and not the best estimate of the gradient as in sensitivity analysis. The
computational cost could be better expended in future iterations. Another form of
gradient averaging uses previous gradient estimates in the current gradient approx-
imation. One of the earliest proposed gradient averaging methods was introduced
in [22] and [27], where “present” and “past” data is used to improve convergence
properties of SA. The method introduced in [22] considers all past gradients with
the gradient update of the formdn+1 = dn + bn(brJ(xn)  dn),
where dn represents the previous direction, brJ(xn, ⇠n) is the new gradient, bn is the
averaging coe cient, and dn+1 is the new updated direction used in the stochastic
approximation algorithm. Later, [27] proposed a modified version of this gradient
averaging technique, which involves an additional step.dn+1 = dn + bn(brJ(xn)  dn),dn = dn + T (xn, n),
where  n = xn   xn 1, T (xn, n) represents a updating function with higher
derivatives, and dn+1 is the new gradient estimate used in SA, which incorporates
higher derivatives. In addition, [27] propose an updating method that averages past
gradient estimates if the change in estimates does not surpass a certain value.
28
2.2.3 Adaptive Step Sizes
2.2.3.1 Kesten’s Rule
It is well-known that the classical SA algorithms are extremely sensitive to the
step size sequence {an}. Therefore, it could be advantageous to consider adaptive
step sizes that adjust based on the ongoing performance of the algorithm, in hopes
of adapting them to the characteristics of the function at the current location of
the iterate and proximity of the current iterate to the optimum. Kesten’s rule [36]
decreases the step size only when there is a directional change in the iterates. The
notion behind this adaptive step size is that, if the iterates continue in the same
direction, there is reason to believe they are approaching the optimum and the pace
should not be decreased in order to accelerate the convergence. If the errors in the
estimate values change signs, it is an indication that either the step size is “too
large” and the iterates are experiencing long oscillation periods or the iterates are
in the vicinity of the true optimum; either way, the step size should be reduced to
a more appropriate step size or to hone in on x⇤. The following algorithm is for the
one-dimensional case d = 1.SA Algorithm using Kesten’s rule
• Input. Choose x1 2 ⇥, {an}, ⇧⇥, and stopping time N .
• Initialize.
– Let n = 2 and k = 1.
29
– Generate an estimate brJ(x1) of rJ(x1).
– Compute x2 = ⇧⇥(x1   a1 brJ(x1)).
• While n < N ,
– Step 1. Generate an estimate brJ(xn) of rJ(xn).
– Step 2. Compute xn+1 = ⇧⇥(xn ak brJ(xn)). If (xn+1 xn)(xn xn 1) <
0, go to Step 3. Otherwise, go to Step 4.
– Step 3. Let n = n+ 1 and k = k + 1. Go to Step 1.
– Step 4. Let n = n+ 1. Go to Step 1.
• Output. x⇤N = xN .
Kesten’s rule can be applied to both RM and KW and still guarantee convergence
in probability, as long as {an} satisfies condition 1 in Theorem 2.2.1 and 2.2.2 for
RM and KW, respectively [36]. An extension of Kesten’s rule to higher dimensions
is discussed in [15]. See [29] for an extensive review of both deterministic and
stochastic step sizes.
2.2.3.2 Scaled-and-Shifted Kiefer-Wolfowitz
The scaled-and-shifted Kiefer-Wolfowitz (SSKW) algorithm [7] adaptively ad-
justs {an} and {cn} finitely many times during the course of the algorithm to adapt
to the characteristics of the function and noise level in hopes of preventing slow
convergence in finite-time. The idea is to increase {an} so the iterates are able to
30
make noticeable progress towards the optimum with the option of decreasing {an}
later if it is too large. Furthermore, if the direction of the gradient is classified as
incorrect, then {cn} is increased to reduce the noise. Note that KW only requires
two parameter choices {an} and {cn}, whereas SSKW requires eleven, as seen in the
algorithm below.SSKW AlgorithmScaling Phase
• Input. {an}, {cn}, ⇥ = [l, u], ⇧⇥, stopping time N , and
– h0 = number of forced boundary hits,
–  0 = scale up factor for {cn},
– ka = maximum number of shifts of {an},
– va = initial upper bound of shift,
–  a = maximum scale up factor for {an},
– kc = maximum number of scale ups for {cn},
– c0 = maximum value of {cn} after scale ups (i.e., cn  cmax = c0(u  l)),
– g0 = maximum number of gradient estimates in scaling phase,
– mmax = maximum number of adaptive iterations (mmax  N).
• Initialize.
– Choose x1 2 [l + c1, u  c1].
31
– Let n = 1, m = 1, g = 1, sh = 0, and sc = 0.
• Do while m  h0 and g  g0.
– Step 1.
⇤ Generate an estimate brJ(xn) using symmetric di↵erences.
⇤ Compute xn+1 = ⇧⇥ ⇣xn   an brJ(xn)⌘.
· If xn+1 2 (l + cn, xn), go to Step 2.
· If xn+1 2 (xn, u  cn, ), go to Step 3.
· If xn+1 > u   cn+1 and xn = u   cn or if xn+1 < l + cn+1 and
xn = l + cn, go to Step 4, if sc  kc.
· If xn+1 > u   cn+1 and xn = l + cn or if xn+1 < l + cn+1 and
xn = u  cn, go to Step 5.
– Step 2.
⇤ Scale {an} up by ↵ = min{ a, (u  cn+1   xn)/(xn+1   xn)} and use
{↵an} for the remaining iterations.
⇤ Set xn+1 = l + cn+1. Let n = n+ 1, m = m+ 1, g = g + 1 and go to
Step 1.
– Step 3.
⇤ Scale {an} up by ↵ = min{ a, (l + cn+1   xn)/(xn+1   xn)} and use
{↵an} for the remaining iterations.
⇤ Set xn+1 = u  cn+1. Let n = n+1, m = m+1, g = g+1 and go to
Step 1.
32
– Step 4.
⇤ Scale {cn} up by   = min{ 0, cmax/cn} and use { cn} for the remain-
ing iterations.
⇤ Let sc = sc+ 1 and go to Step 5.
– Step 5.
⇤ Set xn+1 = min{u  cn+1,max{xn+1, l + cn+1}}.
⇤ Let n = n+ 1, g = g + 1 and go to Step 1.Shifting Phase
• While n  mmax and n  N ,
– Step 1.
⇤ Generate an estimate brJ(xn) using symmetric di↵erences.
⇤ Compute xn+1 using (1.2).
· If xn+1 > u   cn+1 and xn = l + cn or if xn+1 < l + cn+1 and
xn = u  cn, go to Step 2, if sh  ka.
· If xn+1 > u   cn+1 and xn = u   cn or if xn+1 < l + cn+1 and
xn = l + cn, go to Step 3, if sc  kc.
· Otherwise, go to Step 4.
– Step 2.
⇤ Find smallest integer  0 such that xn+1 2 (l+ cn, u  cn) with an+ 0 .
⇤ Set   = min(va,  0) and shift {an} to {an+ }. If   = va, set va = 2va.
33
⇤ Let sh = sh+ 1 and go to Step 4.
– Step 3.
⇤ Scale {cn} up by   = min{ 0, cmax/cn} and use { cn} for the remain-
ing iterations.
⇤ Let sc = sc+ 1 and go to Step 4.
– Step 4.
⇤ Set xn+1 = min{u  cn+1,max{xn+1, l + cn+1}}.
⇤ Let n = n+ 1 and go to Step 1.KW Algorithm
• If n > mmax and n < N , then SSKW reverts back to KW and stops when
n = N .
• Output. x⇤N = xN .
The SSKW algorithm has two pre-processing phases, scaling and shifting,
which adjust the tuning sequences in order to improve the finite-time performance,
before reverting back to the original KW algorithm. In the scaling phase, the {an}
is scaled up by a factor ↵, i.e., {an} to {↵an}, so the iterates can move from one
boundary to the other to ensure the step sizes are not too small relative to the gra-
dient. In the shifting phase, the sequence {an} is decreased by shifting or “skipping”
a finite number ( ) of terms from {an} to {an+ }, when the iterates fall outside of
the feasible region when the sign of the gradient is correct. This acts as a recourse
34
stage and reduces the step size faster in case the step size sequence {an} is too large.
During both phases, cn is scaled up by  , i.e., {cn} to { cn}, if the previous iterate
is at the boundary and the update falls outside the feasible region but is moving in
the wrong direction. This increase is an attempt to reduce the noise of the gradient
estimate. These adjustments do not a↵ect the asymptotic convergence, since the
scaling phase only scales the sequences by a constant and the shifting phase only
scales up the {cn} finitely many times and skips a finite number of terms in {an}.
2.2.4 Robust Output
2.2.4.1 Averaging Iterates
Iterate averaging, introduced in [51] and [49], approaches SA from a di↵erent
angle. Instead of fine-tuning the step sizes to adapt to the function characteristics,
iterate averaging takes bigger steps (i.e., an larger than O(n 1)) for the estimates
to oscillate around the optimum, so the average of the iterates will result in a
good approximation to the true optimum. The idea is simple, and yet can be very
e↵ective. It is easy to see that for this method to be successful, it is essential that
the iterates surround the optimum in a balanced manner, and that the domain in
which the iterates oscillate shrinks as n increases. Averaging trajectories reduces
the sensitivity to the initial step size choice. The algorithm follows recursion (1.2)
for the RM case; however, instead of the taking the last iterate xN as the output,
35
the optimum is estimated by x⇤N = 1N NXn=1 xn,
which is an average of N iterates, where N is the stopping time. Under “clas-
sic” assumptions, iterate averaging achieves the same convergence rate as the RM
method. Furthermore,
p
n(x⇤n   x⇤) is asymptotically normal with mean zero and
the smallest covariance matrix, which is the inverse of the average Fisher informa-
tion matrix. (cf. [49]). A constant step size can be applied and yields convergence
in distribution [42].
A variation of this method is called the “sliding window” average, which is
based on the last m iterates: x⇤N = 1m NXn=N m+1xn. (2.7)
An advantage of (2.7) is it ignores the first N   m iterates, which may be poor
estimates, since the first iterate is arbitrary, and averages only the last m, which are
assumed to be closer to x⇤. Asymptotic normality for a growing window is shown
in [40] and [42], which also includes constant step sizes. Another modification of the
original method incorporates xN in the components being averaged x⇤N 1, which is
known as the feedback approach [41]. These methods are suited for problems where
the iterates hover around the optimum. In an empirical study, iterate averaging
was applied to SPSA [44]. The results suggest that if the Hessian of J(x) is large,
averaging is considered ideal, since it is associated with a high variability in J(x),
which indicates the iterates are moving around the optimum. In general, averaging
iterates leads to more robustness with respect to step size sequence because of the
36
reduced sensitivity, while converging at the same optimal asymptotic rate as RM.
Inspired by iterate averaging, weighted averages for KW was presented to achieve the
optimal asymptotic convergence rate O(n 1/2) under certain conditions [17]. Under
certain parameter settings, iterate averaging and weighted averaging produce the
same estimator.
2.2.4.2 Robust Stochastic Approximation
The robust SA (RSA) method is intended to be relatively insensitive to the
choose of the step size sequence, similar to Polyak-Ruppert iterate averaging. The
form of RSA is identical to (1.2) with the exception of the output. Instead ofx⇤N = xN , where xN is the last iterate, x⇤N is calculated asx⇤N = PNn=1 anxnPNn=1 an ,
where an > 0 for all n. It is clear that if an = a, where a 2 R+ for all n, thenx⇤N = 1N PNn=1 xn, giving the uniformly weighted average of Polyak-Ruppert. As
mentioned earlier, iterate averaging under a constant step size for a moving window
is asymptotically normal [42]. A finite-time bound for E[J(x⇤N) J(x)] was derived
under RSA when J is assumed convex [45]. Assume there exists C > 0 such that
E[||rJ(x)||]  C2 for all x 2 ⇥. Then for an N -step iteration policy,
E[J(x⇤N)  J(x)]  ||x0   x⇤||2 + C2 PNn=1 a2n
2
PNn=1 an . (2.8)
37
For equal weights or iterate averaging, the bound on the right hand side of (2.8) can
be minimized if
an = a := D⇥
C
p
N
,
where D⇥ = maxx,y2⇥ ||x   y||. The distance ||x0   x⇤|| in place of D⇥ tightens
the bound in (2.8), but x⇤ is unknown so the improvement may not be practically
meaningful. The step size requires the number of iterations N to be fixed. Similar
to iterate averaging, a sliding window average can also be employed in RSA. the
estimate consists of the last N  K + 1 estimates and has the formx⇤N,K = PNn=K anxnPNn=K an .
If we consider the varying step size
an = ✓D⇥Cpn,
for ✓ > 0, then we have the bound
E[J(x⇤N,K)  J(x)]  D⇥CpN "2✓ ✓ NN  K + 1◆ ✓2rNK# ,
for 1  K  N .
2.2.4.3 Acceleration Stochastic Approximation
The Accelerated SA (AC-SA) algorithm [32] takes a similar approach to iterate
averaging and RSA by taking long strides and incorporating each of the iterates into
the output. The next two algorithms, Accelerated SA for strongly convex and convex
38
functions, take advantage of the smoothness factor of the function if it exists. AC-
SA for convex functions is a special case of AC-SA for strongly convex functions, so
we first introduce AC-SA for strongly convex functions and then restrict the strong
convexity parameter for J(·) for the convex case.
AC-SA is an example of a proximal method, which introduces a proximity
function into the objective function. The prox-function acts as a regularization
term to prevent the next iterate update xn+1 from being too far from xn and is
comprised of a distance generating function or Bregman function ! : ⇥! R, which
is continuously di↵erentiable and strongly convex with modulus ⌫ > 0 satisfying
hx  y,r!(x) r!(y)i   ⌫||x  y||2 8x,y 2 ⇥,
where h·, ·i denotes the inner product. A prox-function with the given distance
generating function is
V (x,y) = V!(x,y) = !(y)  [!(x) + hr!(x),y   xi].
As xn ! x⇤, the regularization term disappears, so minimizing f(x) plus a regular-
izer is equivalent to minimizing the function J(x).
Consider a strongly convex function J(·) satisfying
µ
2
||y   x||2  J(y)  J(x)  hrJ(x),y   xi  L
2
||y   x||2 +M ||y   x||,
for all x,y 2 ⇥ where µ > 0 is the strong convexity parameter. If M > 0 and L = 0,
then J is Lipschitz continuous with Lipschitz constant M/2. If M = 0 and L > 0,
then J has Lipschitz continuous gradients with Lipschitz constant L.
39
The AC-SA algorithm updates three sequences, {xmdn }, {xagn }, and {xn}. Here,
“md” and “ag” are abbreviations for median and aggregate, respectively, and median
is used in a loose sense.Accelerated SA Method for Strongly Convex Functions
• Input.
– Specify V (x,y), {↵n} and { n} be given such that ↵1 = 1, ↵n 2 (0, 1)
for n   2, and  n > 0 for n   1 and a stopping time N .
• Initialize. Choose xag0 = x0 2 ⇥ and let n = 1.
• While n < N ,
– Step 1. Generate an estimate brJ(xn) of rJ(xn).
– Step 2. Computexmdn = ↵n[(1  ↵n)µ+  n] n + (1  ↵2n)µ xn 1 + (1  ↵n)(1  ↵n)(µ+  n) n + (1  ↵2n)µ xagn 1xn = argminx2⇥{↵n[hrJ(xmdn ),xi+ µV (xmdn )] + [(1  ↵n)µ+  n]V (xn 1,x)}xagn = ↵nxn + (1  ↵n)xagn 1
– Step 3. Let n = n+ 1 and go to Step 1.
• Output. x⇤N = xagN .
Note: V (x,y) = 12 ||x   y||2 using the Euclidean norm with ⌫ = 1 is a common
prox-function. Refer to [31] for details.
40
Theorem 2.2.4. (Theorem 1 [31]) Assume V (x,y)  12 ||x  y||2 for all x,y 2 ⇥
when µ > 0 and E[(brJ(x)) rJ(x))2]   2 8x 2 ⇥. Choose {↵n} and { n} such
that
⌫(µ+  n) > L↵2n, (2.9)
 n/ n =  n+1/ n+1 for n   1, (2.10)
where
 n = 8>><>>: 1 if n = 1;(1  ↵n) n 1 if n   2.
Then,
E[J(xagN )  J(x⇤)]   N   1V (x0,x⇤) + NXn=1 2(M2 +  2)↵2n n[⌫(µ+  n)  L↵2n]! . (2.11)
Consider ↵n = 2/(n + 1),  n = 4L/[⌫n(n + 1)], and  n = 2/[n(n + 1)]. It can be
easily checked that these choices satisfy conditions (2.9) and (2.10). Under these
conditions, the right hand side of (2.12) can be bounded by
4LV (x0,x⇤)
⌫N(N + 1)
+
8(M2 +  2)
⌫µ(N + 1)
, (2.12)
for µ > 0. The bounds in (2.11) and (2.12) rely on additional information of the
function and gradient, which are unknown, so they must be approximated.
AC-SA for convex functions is a special case of AC-SA for strongly convex
functions with µ = 0. The algorithm is identical to AC-SA for strongly convex
function with the exception of the xmdn and xn update since µ = 0. The resulting
41
updates are xmdn = ↵nxn 1 + (1  ↵n)xagn 1,xn = argminx2⇥{↵nhrJ(xmdn ),xi+  nV (xn 1,x)}.
Interestingly, if V (x,y) = 12 ||x  y||2, then the update for xn simplifies toxn = ⇧⇥✓xn 1   ↵n n brJ(xmdn )◆ , (2.13)
which has a similar form to the standard SA algorithm. Notice in the update for xn in
(2.13), ↵n/ n takes the place of the step size an in (1.2) and the gradient estimate brJ
is evaluated at xmdn as opposed to xn 1. If we consider the same parameter setting
as in the strongly convex case, the “step size” ↵n/ n increases with n. Furthermore,
the lower and upper bounds for the optimal objective function can be computed
online and the di↵erence converges to 0 as the number of iterations increases to
infinity [31].Theorem 2.2.5. (Proposition 7 [31]) Assume that the assumptions in Theorem
3.3.2 hold for µ = 0 and the sequences ↵n = 1/(n+ 1) and  n = 4 /[⌫n(n+ 1)] for
    2L. Then
E[J(xagN )  J(x⇤)]  4 V (x0,x⇤)⌫N(N + 1) + 4(M2 +  2)(N + 2)3  , (2.14)
where
  = max
(
2L,

⌫(M2 +  2)N(N + 1)(N + 2)
3V (x0,x⇤)  1/2) .
minimizes the bound in (2.14).
42
2.2.4.4 Numerical Comparison
We investigate the MSE performance of RSA and AC-SA using direct gradient
estimates and compare the results against the classical RM algorithm and RM with
iterate averaging. We consider the optimal parameter settings for RSA and AC-SA,
which require additional knowledge of the function, its gradient, and the optimum,
so in practice, they must be approximated.
We consider a simple quadratic function, f(x) =  13x
2, on the truncated
intervals [ 50, 50] and [ 5, 95] with x1 = 30.0,   = 1.0, and 1000 sample paths.
For the RM and RM with iterate averaging algorithm, we employ a common step
size an = ✓a/n, where ✓a = 10.0. RM performed relatively well for a wide range of
multiplicative constants. We chose to use ✓a = 10.0, although it did not yield the
lowest MSE at the 1000th or 10000th iteration from preliminary numerical tests.
For RSA, we adopt a constant step size that minimizes the finite-time bound in
(2.8), where C = 100/3, 190/3 for the intervals [-50, 50] and [-5, 95], respectively,
and D⇥ = 100. For the AC-SA algorithm, we consider ↵n = 2/(n + 1) and  n =
4 /[n(n+ 1)], where   is given in (2.15) with ⌫ = 1, L = 2/3 and M = 0.
Figure 2.1 plots the MSE as a function of the number of iterations from 1 to
10000 on a log scale. The results for both the centered and skew truncated intervals
appear to have the same behavior across all four algorithms. RM performs well
with a good parameter choice, although it is not the best, but averaging the iterates
improves the performance, resulting in a smoother monotonically decreasing MSE
curve as the number of iterations increase. Compared to a decently/reasonably
43
(a) [  50, 50] (b) [  5, 95]Figure 2.1: MSE under AC-SA, RM, RM w/averaging and RSA for
f(x) =  13x
2, x1 = 30.0,   = 1.0.
tuned RM and RM with iterate averaging algorithm, RSA appears to be inferior,
at least in this simple numerical experiment. The most interesting curve is from
the AC-SA algorithm, where one can observe periodic oscillations, which decrease
in magnitude as the number of iterations increase. We further investigated this
behavior by analyzing individual sample paths, and the estimates {xagn } appear to
have the same behavior, following a smooth oscillating path/curve. From Figure
2.1, the AC-SA curve appears to level o↵ and hover slightly over the RSA curve.
The stopping time dictates relative performance of AC-SA when there are a smaller
number of iterations because of the oscillations. For the case of the skewed interval,
there is a small range of iterations where AC-SA outperforms RSA, RM, and RM
with iterate averaging, as well as other small ranges where it outperforms RSA. Keep
in mind that these experiments are for a simple quadratic function for a particular
setting, so the relative performance will most likely change in a di↵erent setting.
From our numerical experiments, one can conclude that RM and RM with
44
iterate averaging have the potential to outperform RSA and AC-SA if the step size
parameter is chosen appropriately for a wide range of choices. In this case, iterate
averaging improves the performance of RM for all 10000 iterations. Both the AC-SA
and RSA algorithms require additional knowledge to choose the optimal step size
that minimizes the bound in (2.11) and (2.8) for AC-SA and RSA, respectively.
2.2.5 Varying Bounds
Initially, the asymptotic theory for SA only considered functions satisfying
specific global conditions; however, subsequently it was shown the requirements
need only hold on a compact set ⇥ 2 Rd containing the optimum. Therefore,
the projection operator is particularly important in the constrained optimization
setting. Since the optimum is unknown, the compact set should be large enough so
that x⇤ 2 ⇥ with high probability; however, this may increase the potential of an
algorithm to perform poorly due to the size of the parameter search space [2] . For
instance, if the compact set is very large, the step size is extremely small, and the
current iterate is extremely far from the optimum, then the convergence is likely to
be slow; however, if the compact set is small and contains the optimum, then the
iterates will never be too far from the optimum. Even if the step sizes are small,
the convergence will be much faster in comparison to the algorithm restricted to a
much larger set.
One of the first ideas was to project the iterates onto a predetermined fixed
point once the magnitude of the iterate surpassed an arbitrarily specified thresh-
45
old, with the threshold increasing after it is exceeded [13]. This method converges
asymptotically, but in practice, it has its pitfalls. When an iterate is projected onto
an arbitrary fixed point, in a sense, the algorithm restarts from this “initial” value
with a smaller sequence of step sizes. Not only does it lose all of the progress gained
from the iterations prior to the projection, but the reduction in step size could hin-
der the convergence by moving even slower towards the optimum. To circumvent
this issue, it was shown that it su ces to project the iterates onto a predetermined
bounded set [64]. This is a slight improvement since the iterates do not start from
the same position with an even smaller step size. However, it still has its limitations
since the initial start values are restricted to the predetermined compact set. Later,
an algorithm defined over a growing feasible region by writing ⇥ as an increasing
sequence of compact sets (i.e., ⇥n ✓ ⇥n+1, where ⇥ = [⇥n) was introduced [2]. The
orthogonal projection operator changes from ⇧⇥k to ⇧⇥k+1 if xn+1 /2 ⇥k. The idea
is to start with a smaller feasible region ⇥1 and only increase when there is reason
to believe the optimum x⇤ /2 ⇥1 (i.e., when the xk /2 ⇥1). Since the projection is
made onto the current compact set ⇥m, the progress gained up to that point is not
lost. The feasible region ⇥ is written as [k⇥k, so it is impossible for x⇤ /2 ⇥k for
some k. If x⇤ is contained in one of the earlier compact sets and if they grow slowly,
the empirical results could improve significantly. The key in the performance is to
choose the sequence {⇥n} appropriately. If it grows too quickly, the results might
be very similar to that of the original SA. The following algorithm and convergence
result are for the RM multidimensional case d   1, where || · || denotes the Euclidean
norm.
46
SA with Varying Bounds
• Input. Choose x1 2 ⇥1, {an} and {⇥n}.
• Initialize. Let n = 1 and m = 1.
• While n < N ,
– Step 1. Generate an estimate brJ(xn) of rJ(xn).
– Step 2. Compute x0n+1 = xn   an brJ(xn). If x0n+1 2 ⇥m, go to Step 3.
Otherwise, go to Step 4.
– Step 3. Let xn+1 = x0n+1, n = n+ 1 and go to Step 1.
– Step 4. Let xn+1 = ⇧⇥m(x0n+1), n = n+ 1, m = m+ 1 and go to Step 1.
• Output. x⇤N = xN .Theorem 2.2.6. (Theorem 2 [2]) Let the sequence {xn} be generated using the
above algorithm, ✏n = brJ(x)   E[brJ(x)|Fn], and  n = E[brJ(x)|Fn]   rJ(x),
where Fn is the smallest  -algebra used to generate xn+1. If the following conditions
hold:
1. The sequence {⇥k} is a set of compact convex sets such that ⇥k ✓ ⇥k+1 for
all k and [1k=1⇥k = ⇥.
2. The positive sequences of real numbers {an} and {cn} converge to zero such
that
P1n=1 an = 1, P1n=1 ancn < 1, and P1n=1 a2nc 2n < 1.
47
3. There exists    0 such that E[||✏n||2|Fn]  c2n (1 + ||xn   x⇤||2) a.s. for all n.
4. || n|| is bounded a.s. and P1n=1 an|| n|| < 1 a.s.
5. There exist a positive sequence of real numbers {Mn} an integer N   1 such
that
P1n=1 a2nM2n < 1 and for all n   N , supx2⇥n 1 ||J(x)||  Mn.
6. There exists a unique x⇤ 2 ⇥ such that rJ(x⇤) = 0, and for all 0 <    1,
infx2⇥: ||x x⇤||  1 J(x)T (x  x⇤) > 0.
Then xn ! x⇤ a.s. as n !1.
If an appropriate increasing sequence of compact sets is chosen, the finite-
time performance can improve significantly, but this optimal choice is still an open
problem.
48
Chapter 3
New Hybrid Stochastic Approximation Methods
3.1 Motivation
Theoretically, unbiased direct gradients lead to faster asymptotic convergence
rates in SA algorithms compared to indirect gradient estimates, but this does not
necessarily translate to better finite-time performance. In the deterministic setting,
where the true gradient is known, indirect gradients are not useful, whereas in
the stochastic setting, direct gradients are noisy and indirect gradients actually
contain information that can be used to better approximate the gradient and in turn
accelerate the SA convergence to a neighborhood of the optimum. To the best of our
knowledge, the current SA algorithms either use direct gradients or indirect gradients
but not both simultaneously. Our algorithms exploit the additional information
provided by indirect gradients by using a gradient that integrates both direct and
indirect gradient estimates.
In this chapter, we do the following:
1. We propose two new hybrid stochastic approximation algorithms, STAR-SA
and STAR-SPSA, which incorporate both direct and indirect gradient esti-
mates, and are provably convergent.
2. We derive variance minimizing weights for the STAR-SA and STAR-SPSA
49
gradient estimates.
3. We show that the variance of the STAR-SA gradient is lower than that of RM
and KW, and that the variance of the STAR-SPSA gradient is less than that
of RM and SPSA, for a deterministic weight sequence.
4. For simple quadratic functions, we show that the MSE of the estimate gener-
ated using STAR-SA is strictly less than that of RM and KW under certain
conditions.
5. We illustrate the robustness/e↵ectiveness of STAR-SA and STAR-SPSA through
numerical experiments.
3.2 Secant-Tangents AveRaged Stochastic Approximation
In this section, we introduce the one-dimensional Secant-Tangents AveRaged
stochastic approximation (STAR-SA) algorithm, which incorporates both direct and
indirect gradient estimates [?, 10]. Figure 3.1 illustrates the STAR gradient estima-
tor, which uses estimates of the sample performance f˜ and its gradient f˜ 0 at two
points, xn + cn and xn   cn, where the values f˜(xn + cn) and f˜(xn   cn) do not lie
on the graph of the true function due to estimation noise error. For the indirect
gradient, the values f˜(xn+cn) and f˜(xn cn) are used to compute a symmetric finite
di↵erence (Secant). The two direct gradient estimates, f˜ 0(xn + cn) and f˜ 0(xn   cn),
can be averaged to provide another gradient estimate (Tangents AveRaged).
The Secant-Tangents AveRaged stochastic approximation (STAR-SA) gra-
50
Introduction
Classical SA methods
STAR-SA & STAR-SPSA
Summary and future work
Motivation
Algorithms
Theoretical results
Numerical experiments
STAR illustration
M. Chau, M.C. Fu, H. Qu: Multivariate SA using a Secant-Tangents AverRaged (STAR) Gradient Estimator
Article submitted to Operations Research; manuscript no. (Please, provide the manuscript number!) 5
M. Chau, M.C. Fu, H. Qu: Multivariate SA using a Secant-Tangents AverRaged (STAR) Gradient Estimator
Article submitted to Operations Research; manuscript no. (Please, provide the manuscript number!) 5
Introduction
Classical SA methods
STAR-SA
Numerical experiments
Conclusion and future work
Motivation
Algorithm
Theoretical results
STAR illustration
xn + cnxn   cn
f (x)
f˜ (xn   cn)
f˜ (xn + cn)
 2
 1.5
 1
 0.5
0
·104
Secant
f˜ (xn + cn)  f˜ (xn   cn)
2cn
Tangents AveRaged
f˜ 0(xn + cn) + f˜
0(xn   cn)
2
Marie Chau, Huashuai Qu & Michael C. Fu A New Hybrid Stochastic Approximation Algorithm
Figure 1 Ill stration of STAR gardent, where f˜ nd f˜ 0 are estimates of f and f 0
y
x
f(x)
˜
f(x+ c)
˜
f(x  c)
x + cx   c
The Secant-Tangents AveRaged stochastic approximation (STAR-SA) gradient estimator is a
convex combination of a symmet c finite di erence gradient estimator (secant) and an average of
two direct gradient estimators (tangents):
gSTAR(xn) = ↵n
˜
f(xn + cn)  ˜f(xn  cn)
2cn
+(1 ↵n)
˜
f
 
(xn + cn)+
˜
f
 
(xn  cn)
2
, (2)
where ↵n 2 [0,1] and cn 2R for all n, and cn! 0 as n!1.
The weight of the convex combination is crucial in the performance of STAR-SA. In fact, for any
non-zero and finite variance level of the function and its gradient, there exists a sequence {↵n} such
that the variance of the STAR-SA gradient estimator is less than the individual gradient estimates,Secant gS and Tangents AveRaged gTAR.
Now we present theoretical results pertaining to STAR-SA. Let
˜
f(xn± cn) = f(xn± cn)+  
±
n and
˜
f
 
(xn± cn) = f
 
(xn± cn)+  
±
n , where  
±
n and  
±
n are Gaussian noise with zero mean. For notational
Introduction
Classical SA methods
STAR-SA
Numerical experiments
Conclusion and future work
Motivation
Algorithm
Theoretical results
STAR illus ration
xn + cnxn   cn
f (x)
f˜ (xn   cn)
f˜ (xn + cn)
 2
 1.5
 1
 0.5
0
·104
Secant
f˜ (xn + cn)  f˜ (xn   cn)
2cn
Tangents AveRaged
f˜ 0(xn + cn) + f˜
0(xn   cn)
2
Marie Chau, Huashuai Qu & Michael C. Fu A New Hybrid Stochastic Approximation Algorithm
Figure 1 Illustration of STAR gardent, where f˜ and f˜ 0 are estimates of f and f 0
˜
f(xn + cn) and
˜
f(xn   cn) are used to compute a ymmetric finite di erence (Secant). The two
direct gradients estimates,
˜
f
 
(xn+cn) and
˜
f
 
(xn cn), can be averaged to provide another gradient
estimate (Tangents AveRaged).
y
x
f(x)
˜
f(xn + cn)
˜
f(xn  cn)
xn + cnxn   cn
The Secant-Tangents AveRaged stochastic approximation (STAR-SA) gradient estimator is a
convex combination of a symmetric finite di erence gradient estimator (secant) and an average of
two direct gradient estimators (tangents):
gSTAR(xn) = ↵n
˜
f(xn + cn)  ˜f(xn  cn)
2cn
+(1 ↵n)
˜
f
 
(xn + cn)+
˜
f
 
(xn  cn)
2
, (2)
where ↵n 2 [0,1] and cn 2R for all n, and cn! 0 as n!1.
The weight of the convex combination is crucial in the performance of STAR-SA. In fact, for any
non-zero and finite variance level of the function and its gradient, there exists a sequence {↵n} such
Secant
gS  
f˜ (xn + cn)  f˜ (xn   cn)
2cn
Tangents AveRaged
gTAR  
f˜ 0(xn + cn) + f˜
0(xn   cn)
2
Figure: Illustration of STAR gradient, where f˜ and f˜ 0 are estimates
of f and f 0, respectively.
Marie Chau, Michael C. Fu & Huashuai Qu Secant-Tangents AveRaged (STAR) stochastic approximation
Figure 3.1: Illustration of STAR gradient, where f˜ and f˜ 0 are esti-
mates of f and f 0, re pectiv ly.
dient estimator is a convex combination of a symmetric finite di↵erence gradient
estimator (secant) and an average of two direct gradient estimators (tangents):
gSTAR(xn) = ↵n f˜(xn + cn)  f˜(xn   cn)2cn
+ (1  ↵n) f˜ 0(xn + cn) + f˜ 0(xn   cn)2 , (3.1)
where ↵n 2 [0, 1] and cn 2 R+ for all n, and cn ! 0 as n !1.
The weight ↵n in the convex combination is crucial in the performance of
STAR-SA. In fact, for any non-zero and finite variance level of the function and
its gradient, there exists a sequence {↵n} such that the variance of the STAR-SA
gradient estimator is less than the individual gradient estimates, Secant gS andTangents AveRaged gTAR.
Now we present theoretical results pertaining to STAR-SA. Let f˜(xn ± cn) =
f(xn±cn)+✏±n and f˜ 0(xn±cn) = f 0(xn±cn)+ ±n , where ✏±n and  ±n are Gaussian noise
51
with zero mean. For notational simplicity, let V ar(f˜(xn)) =  2f , V ar(f˜ 0(xn)) =  2g ,
V ar(gSTAR(xn)) =  2n, V ar(gS(xn)) = V ar ⇣ f˜(xn+cn) f˜(xn cn)2cn ⌘=  2f2c2n , V ar(gTAR(xn))
= V ar
⇣ f˜ 0(xn+cn)+f˜ 0(xn cn)
2
⌘
=  2g2 , and Corr(f˜(xn), f˜ 0(xn)) = ⇢.
3.2.1 Optimal Convex Weight
3.2.1.1 Homogeneous Noise
The following provides a particular weight for minimizing the variance of the
STAR-SA gradient.Lemma 3.2.1. For gSTAR defined in (3.1), assume for all n, ✏+n ? ✏ n and  +n ?   n .
Then V ar(gSTAR(xn)) is minimized when
↵⇤n =  2gc2n 2f +  2gc2n . (3.2)Proof. By the orthogonality assumption, the variance of the STAR-SA gra-
dient (3.1) can be written as
 2n = V ar ↵n f˜(xn + cn)  f˜(xn   cn)2cn + (1  ↵n) f˜ 0(xn + cn) + f˜ 0(xn   cn)2 !
= V ar
✓
↵n
2cn f˜(xn + cn) + 1  ↵n2 f˜ 0(xn + cn)◆
+ V ar
✓
 
↵n
2cn f˜(xn   cn) + 1  ↵n2 f˜ 0(xn   cn)◆
=
✓
↵n
2cn◆2  2f + ✓1  ↵n2 ◆2  2g + 2 ↵n2cn 1  ↵n2 ⇢ f g
+
✓
↵n
2cn◆2  2f + ✓1  ↵n2 ◆2  2g   2 ↵n2cn 1  ↵n2 ⇢ f g
=
↵2n
2c2n 2f + (1  ↵n)22  2g , (3.3)
52
where the second equality follows from the independence of the function/gradient
at di↵erent points. Di↵erentiating (3.3) with respect to ↵n, we find
@ 2n
@↵n = ↵nc2n  2f   (1  ↵n) 2g . (3.4)
Setting (3.4) equal to zero and solving for ↵n yields (3.2). Since (3.4) is monotoni-
cally increasing in ↵n, (3.2) is the global minimum. ⇤
The following result shows that the variance of the STAR-SA gradient is prov-
ably less than the variance of the gradients generated under RM and KW.Proposition 3.2.2. Let the conditions of Lemma 3.2.1 hold. If the convex weight in
(3.1) is given by (3.2), then V ar(gSTAR(xn))  min {V ar(gS(xn)), V ar(gTAR(xn))}
for all n, where the inequality is strict if  f ,  g > 0.Proof. Under the conditions of Lemma 3.2.1, the variance of the STAR-SA
gradient (3.1) can be written as
 2n = ↵2n2c2n 2f + (1  ↵n)22  2g . (3.5)
Substitute ↵⇤n from (3.2) into (3.5) for ↵n to obtain
 2n = 12c2n   2gc2n 2f +  2gc2n!2  2f + 12  1   2gc2n 2f +  2gc2n!2  2g =  2f 2g2   2f +  2gc2n  .
To show  2n is strictly less than min{V ar(gS(xn)), V ar(gTAR(xn))}, consider two
cases: 1)  2n <  2f2c2n and 2)  2n <  2g2 . However, since cn ! 0 as n ! 1, then 2f
2c2n !1, so eventually we will be in the latter case.Case 1: If  2n <  2f2c2n , then  2f 2g2( 2f+ 2gc2n) <  2f2c2n , which holds when 0 <  2f .Case 2: If  2n <  2g2 , then  2f 2g2( 2f+ 2gc2n) <  2g2 , which holds when 0 <  2gc2n.
53
If  f ,  g > 0, then the inequality is strict; however, if  f g = 0, then the same
arguments hold with “ < ” replaced with “  ” for the non-strict case. There-
fore,  2n  minn  2f2c2n ,  2g2 o = min{V ar(gS(xn)), V ar(gTAR(xn))}, and the inequality
is strict for  f ,  g > 0. ⇤
The weight ↵⇤n depends on the function and gradient variances, and the finite
di↵erence perturbation size. These values, with the exception of the perturbation
size, are generally unknown and must be estimated. Although the orthogonality con-
dition in Lemma 3.2.1 allows the function and its gradient to be correlated, the op-
timal weight does not depend on the correlation coe cient, due to the homogeneous
noise assumption. If we eliminate this assumption to allow for non-homogeneous
noise, the weight will depend on the correlation. Also, the STAR-SA gradient es-
timate converges to the average of the direct gradient estimates, since ↵⇤n ! 0 as
n ! 0.
3.2.1.2 Non-homogeneous Noise
For non-homogeneous noise,  f and  g depend on x, and we write V ar(f˜±n ) =
 2f,n± , V ar(g˜±n ) =  2g,n± , Corr(f˜ n , g˜+n ) = Corr(f˜+n , g˜ n ) = 0, Corr(f˜+n , g˜+n ) = ⇢+n and
Corr(f˜ n , g˜ n ) = ⇢ n , with f˜±n = f˜(xn ± cn) = f(xn ± cn)+ ✏±n and g˜±n = f˜ 0(xn ± cn) =
f 0(xn ± cn) +  ±n , where ✏± and  ±n is Gaussian noise with zero mean.Lemma 3.2.3. If the conditions of Lemma 3.2.1 and ⇢ n f,n  g,n  ⇢+n f,n+ g,n+  
0 hold, then V ar(gSTAR(xn)) is minimized when
↵⇤n = c2n( 2g,n+ +  2g,n ) + 2cn(⇢ n f,n  g,n    ⇢+n f,n+ g,n+) 2f,n+ +  2f,n  + c2n( 2g,n+ +  2g,n ) + 2cn(⇢ n f,n  g,n    ⇢+n f,n+ g,n+) . (3.6)
54
Proof. The variance of the STAR-SA gradient can be written as
 2n = V ar ↵n f˜+n   f˜ n2cn + (1  ↵n) g˜+n + g˜ n2 !
= V ar
✓
↵n
2cn f˜+n + 1  ↵2 g˜+n◆+ V ar✓  ↵n2cn f˜ n + 1  ↵2 g˜ n◆
=
✓
↵n
2cn◆2 V ar(f˜+n ) + ✓1  ↵n2 ◆2 V ar(g˜+n ) + 2 ↵n2cn 1  ↵n2 Cov(f˜+n , g˜+n )
+
✓
↵n
2cn◆2 V ar(f˜ n ) + ✓1  ↵n2 ◆2 V ar(g˜ n )  2 ↵n2cn 1  ↵n2 Cov(f˜ n , g˜ n )
=
↵2n
4c2n   2f,n+ +  2f,n  + (1  ↵n)24 ( 2g,n+ +  2g,n )
+
↵n(1  ↵n)
2cn  ⇢+n f,n+ g,n+   ⇢ n f,n  g,n   , (3.7)
where the second equality follows from the independence of the function/gradient at
di↵erent points. Di↵erentiate (3.7) with respect to the convex weight ↵n to obtain
@V
@↵n = ↵n2c2n   2f,n+ +  2f,n    (1  ↵n)2 ( 2g,n+ +  2g,n )
+
✓
1  2↵n
2cn ◆ (⇢+n f,n+ g,n+   ⇢ n f,n  g,n ),
which is monotonically increasing in ↵n. Set @V@↵n = 0 to obtain
↵n  2f,n+ +  2f,n 2c2n +  2g,n+ +  2g,n 2 + ⇢ n f,n  g,n    ⇢+n f,n+ g,n+cn !
=
 2g,n+ +  2g,n 
2
+
⇢ n f,n  g,n    ⇢+n f,n+ g,n+
cn ,
so the optimal weight is
↵⇤n = 2c2n 2f,n+ +  2f,n  + c2n( 2g,n+ +  2g,n ) + 2cn(⇢ n f,n  g,n    ⇢+n f,n+ g,n+) ·
cn( 2g,n+ +  2g,n ) + 2⇢ n f,n  g,n    2⇢+n f,n+ g,n+
2cn
=
c2n( 2g,n+ +  2g,n ) + 2cn(⇢ n f,n  g,n    ⇢+n f,n+ g,n+)
 2f,n+ +  2f,n  + c2n( 2g,n+ +  2g,n ) + 2cn(⇢ n f,n  g,n    ⇢+n f,n+ g,n+) .⇤
55
Now we prove Proposition 3.2.2 for the non-homogeneous setting. For notation
simplicity, Fn =  2f,n+ + 2f,n  , Gn =  2g,n+ + 2g,n  , Pn = ⇢ n f,n  g,n  ⇢+n f,n+ g,n+ ,
and Dn =  2f,n+ +  2f,n  + c2n( 2g,n+ +  2g,n ) + 2cn(⇢ n f,n  g,n    ⇢+n f,n+ g,n+).
Let the conditions of Lemma 3.2.1 hold, then the variance of the STAR-SA
gradient (3.1) can be written as
 2n = ↵2n4c2n   2f,n+ +  2f,n  + (1  ↵n)24 ( 2g,n+ +  2g,n )
+
↵n(1  ↵n)
2cn  ⇢+n f,n+ g,n+   ⇢ n f,n  g,n  
=
↵2n
4c2nFn + (1  ↵n)24 Gn   ↵n(1  ↵n)2cn Pn. (3.8)
Substitute the weight ↵⇤n from (3.6) into each of the three components in (3.8) to
obtain
 2n = c4nG2n + 4c3nGnPn + 4c2nP 2n4c2nD2n Fn + F 2nGn4D2n   c2nGnFn + 2cnPnFn2cnD2n Pn
=
c2nG2nFn + 4cnGnPnFn + 4P 2nFn
4D2n + F 2nGn4D2n   2cnGnFnPn + 4P 2nFn4D2n
=
F 2nGn + 2cnGnPnFn + c2nG2nFn
4D2n . (3.9)
To show  2n is strictly less than min{V ar(gS(xn)), V ar(gTAR(xn))}, consider two
cases:Case 1: If V ar(gSTAR(xn)) < V ar(gS(xn)), then F 2nGn+2cnGnPnFn+c2nG2nF4D2n < Fn4c2n ,
which can be simplified to 0 < F 3n + 4cnF 2nPn + c2n(4P 2nFn + F 2nGn) + 4c3nGnPnFn,
and is satisfied if Fn > 0 and Pn   0.Case 2: If V ar(gSTAR(xn)) < V ar(gTAR(xn), then F 2nGn+2cnGnPnFn+c2nG2nFn4D2n < Gn4 ,
which simplifies to 0 < c4nG3n + 4c3nPnG2n + c2n(4P 2nGn + FnG2n) + 2cnFnPnGn, and
holds for Gn > 0 and Pn   0.
56
The same argument holds for the non-strict case by replacing “ < ” with “  ”
in both cases and are both satisfied for Pn   0. Therefore,  2n  min{V ar(gS(xn)),
V ar(gTAR(xn))}, where the inequality is strict for Fn, Gn > 0. ⇤
3.2.2 Convergence
The next result establishes mean-squared convergence of the STAR-SA algo-
rithm. The proof closely follows [18] and [62].Theorem 3.2.4. Let {xn} be a sequence following recursion (1.2) with ⇥ = R andbrf(xn) as the STAR gradient estimate defined in (3.1). If the following conditions
hold:
1. There exist positive sequences {an} and {cn} such that P1n=1 an = 1,P1n=1 ancn < 1, P1n=1 a2n < 1, and P1n=1 a2nc 2n < 1.
2. There exist K0, K1 > 0 such that K0|x  x⇤|  |f 0(x)|  K1|x  x⇤|
for all x 2 ⇥.
3. f 0(x)(x  x⇤) > 0 for all x 2 ⇥\{x⇤}.
4. For c > 0,  2 = supx2⇥ V ar[f˜(x+ c)  f˜(x  c)|x] < 1 for all x 2 ⇥.
5. ✏+n , ✏ n ,  +n ,   n are i.i.d. with mean zero for all n.
Then xn converges to x⇤ in L2 as n !1.Proof. Fix positive sequences {an} and {cn} satisfying condition 1. Let {xn}
be a sequence of estimates following recursion (1.2) with ⇥ = R and brf(xn) =
gSTAR(xn). As defined in Section 3.2, gSTAR = ↵gS + (1  ↵)gTAR.
57
The mean-squared error of the nth estimate can be expressed as
E
⇥
(xn+1   x⇤)2⇤ = E[(xn   x⇤   angSTAR(xn))2]
= E[(xn   x⇤)2]  2anE[(xn   x⇤)gSTAR(xn)] + a2nE[g2STAR(xn)]. (3.10)
Now we establish necessary bounds in order to generate an upper bound for
(3.10). By the mean-value theorem, there exist ⇠1 and ⇠2 such that 0  ⇠1, ⇠2  1
where f(xn+cn) = f(xn)+f 0(xn+⇠1cn)cn and f(xn cn) = f(xn) f 0(xn ⇠2cn)cn.
E[gS(xn)|xn] = f(xn + cn)  f(xn   cn)2cn
=
f 0(xn + ⇠1cn) + f 0(xn   ⇠2cn)
2
(3.11)
= (xn   x⇤)12  f 0(xn + ⇠1cn)xn   x⇤ + ⇠1cn + f 0(xn   ⇠2cn)xn   x⇤   ⇠2cn 
+
cn
2

⇠1
f 0(xn + ⇠1cn)
xn   x⇤ + ⇠1cn   ⇠2 f 0(xn   ⇠2cn)xn   x⇤   ⇠2cn 
 |xn   x⇤|K1 + cnK1, (3.12)
where the last inequality follows from
K0 
    f 0(x)
x  x⇤
    = f 0(x)
x  x⇤
 K1, (3.13)
as a result of combining conditions 2 and 3. By applying (3.13),
E[(xn   x⇤)gS(xn)|xn] = (xn   x⇤)212  f 0(xn + ⇠1cn)xn   x⇤ + ⇠1cn + f 0(xn   ⇠2cn)xn   x⇤   ⇠2cn 
+
cn
2
(xn   x⇤) ⇠1 f 0(xn + ⇠1cn)xn   x⇤ + ⇠1cn   ⇠2 f 0(xn   ⇠2cn)xn   x⇤   ⇠2cn 
  (xn   x⇤)2K0   |xn   x⇤|cnK1. (3.14)
Using conditional variance, condition 4, and after applying |a+b|r  2r 1(|a|r+ |b|r)
58
for r > 1 to the square of equation (3.12) yields
E[gS(xn)2|xn] = V ar(gS(xn)|xn) + E[gS(xn)|xn]2

 2
4c2n + 2|xn   x⇤|2K21 + 2c2nK21 . (3.15)
The bounds regarding gTAR(xn) are identical to those of gS(xn), since gTAR(xn) is
equal to the RHS of (3.11) with ⇠1 = ⇠2 = 1. Using (3.12), (3.14), and (3.15) yields
E
⇥
(xn+1   x⇤)2|xn⇤ = E[(xn   x⇤   angSTAR(xn))2|xn]
= E[(xn   x⇤)2|xn]  2anE[(xn   x⇤)gSTAR(xn)|xn] + a2nE[g2STAR(xn)|xn]
 (xn   x⇤)2   2an  (xn   x⇤)2K0   |xn   x⇤|cnK1 
+ 4a2n✓  24c2n + 2|xn   x⇤|2K21 + 2c2nK21◆
 (xn   x⇤)2  1  2anK0 + 8a2nK21 + 2ancnK1|xn   x⇤| + a2n 2c2n
+ 8a2nc2nK21 . (3.16)
Define bn = E[(xn x⇤)2]. Take the expectation of (3.16) and apply the inequalities
E[
p
bn] pE[bn] and pbn  bn + 1 to obtain
E[(xn+1   x⇤)2]  E[(xn   x⇤)2]  1  2anK0 + 8a2nK21 + 2ancnK1E[|xn   x⇤|]
+
a2n 2
c2n + 8a2nc2nK21
 bn  1  2anK0 + 8a2nK21 + 2ancnK1pbn + a2n 2c2n + 8a2nc2nK21
 bn  1  2anK0 + 8a2nK21 + 2ancnK1(bn + 1) + a2n 2c2n + 8a2nc2nK21
= bn  1  2anK0 + 2ancnK1 + 8a2nK21 + 2ancnK1 + a2n 2c2n + 8a2nc2nK21
= bnRn + Tn,
where Rn = 1   2anK0 + 2ancnK1 + 8a2nK21   1   2anK0 + 8a2nK21 > 0, which
59
is quadratic in an, and the discriminant 4K20   16K21 is negative if K0 < 2p2K1,
which holds by condition 2. The coe cient for an in Rn, 2cnK1   2K0, is negative
for su ciently large n since cn ! 0 as n ! 1. By condition 1, Q1n=1 Rn < 1
and
P1n=1 Tn < 1. As a result, all partial products are uniformly bounded. Thus,
bn  M < 1. Since Ri > 0 for all i, we have the recursion
bn+1  b1 nYi=1 Ri + n 1Xi=2  Ti nYj=i+1Rj!+ Tn.
Applying bn  M to the second terms in both (3.14) and (3.15) yields
E[(xn+1   x⇤)2]
 E[(xn   x⇤)2]  1  2anK0 + 8a2nK21 + 2ancnK1E[|xn   x⇤|] + a2n 2c2n + 8a2nc2nK21
= E[(xn   x⇤)2] (1  2anK0) + 8a2nK21E[(xn   x⇤)2] + 2ancnK1E[|xn   x⇤|]
+
a2n 2
c2n + 8a2nc2nK21
 E[(xn   x⇤)2] (1  2anK0) + 8a2nK21M + 2ancnK1pM + a2n 2c2n + 8a2nc2nK21
= bnR0n + T 0n.
Since an ! 0, there exists n0 such that an < 1/(2K0) for all n   n0, which implies
R0n > 0. Then we have the recursion
bn+1  bn0 nYi=n0 R0i + n 1Xi=n0  T 0i nYj=i+1R0j!+ T 0n for n = n0 + 1, n0 + 2, ...Qni=n0 R0i ! 0 as n !1 since P an = 1. By condition 1 and Kronecker’s lemma,Pn 1i=n0 ⇣T 0i Qnj=i+1 R0j⌘! 0 as n !1. Therefore, for ✏ > 0, there exists n1 such thatQnj=n0 R0j < ✏3M2 for all n   n1, there exists n2 such that Pn 1i=n0 ⇣T 0i Qnj=i+1 R0j⌘ <
✏/3 for all n   n2, and there exists n3 such that T 0n < ✏/3 for all n   n3. Hence, for
all n > max (n1, n2, n3) > n0, bn+1  M2 ✏3M2 + ✏3 + ✏3 = ✏. ⇤
60
We investigate the exact MSE for quadratic functions of the form f(x) =  ax2,
where a > 0 and x⇤ = 0. The finite-time MSE of quadratic functions in this form
can be computed explicitly, which allows us to compare the performance of di↵erent
algorithms analytically.Proposition 3.2.5. For f(x) =  ax2, a > 0, the respective MSEs after n iterations
following recursion (1.2) can be expressed as
MSERM = An +  2g n 2Xk=1 a2k⇧n 1j=k+1 (1  2aja)2 +  2ga2n 1,
MSEKW = An +  2f2 n 2Xk=1 ✓a2kc2k◆⇧n 1j=k+1 (1  2aja)2 +  2fa2n 12c2n 1 ,
MSESTAR = An + n 2Xk=1 a2k2 ✓↵2k 2fc2k + (1  ↵k)2 2g◆ · ⇧n 1j=k+1(1  2aja)2
+
a2n 1
2
✓
↵2n 1 2f
c2n 1 + (1  ↵n 1)2 2g◆ ,
where An = x20⇧n 1i=1 (1  2aia)2 and x0 is the initial iterate.Proof. We can express the nth estimates of RM, KW, and STAR-SA respec-
tively as
xRMn = xn 1 + an 1f˜(xn 1)
= xn 1 + an 1( 2axn 1 +  n 1)
= xn 1(1  2an 1a) + an 1 n 1
xKWn = xn 1 + an 1f˜(xn 1)
= xn 1 + an 1 ✓ 2axn 1 + ✏+n 1   ✏ n 12cn 1 ◆
= xn 1(1  2an 1axn 1) + an 1 ✏+n 1   ✏ n 12cn 1
61
xSTARn = xn 1 + an 1f˜(xn 1)
= xn 1 + an 1 ✓ 2axn 1 + ↵n 1 ✏+n 1   ✏ n 12cn 1 + (1  ↵n 1) n 1◆
= xn 1(1  2an 1axn 1) + an 1 ✓↵n 1 ✏+n 1   ✏ n 12cn 1 + (1  ↵n 1) n 1◆ .
The recursions can be rewritten in the general form
xn = xn 1Bn 1 + Cn 1,
where Bn 1 = 1  2an 1a and Cn 1 depends on the algorithm. Then
xn = xn 1Bn 1 + Cn 1
= xn 2Bn 2Bn 1 + Cn 2Bn 1 + Cn 1
= xn 3Bn 3Bn 2Bn 1 + Cn 3Bn 2Bn 1 + Cn 2Bn 1 + Cn 1
= x0⇧n 1i= Bi + n 2Xj=0 Cj⇧n 2k=j+1Bk + Cn 1
Then the MSE of the nth estimate can be written as
MSE = E[(xn   x⇤)2] = E[x2n]
= E[(x0⇧n 1i=0 Bi + n 2Xj=0 Cj⇧n 2k=j+1Bk + Cn 1)2]
= x20⇧
n 1i=0 B2i + E[(n 2Xj=0 Cj⇧n 2k=j+1Bk)2] + E[C2n 1]
+ 2x0⇧n 1i=0 BiE[n 2Xj=0 Cj⇧n 2k=j+1Bk] + 2x0⇧n 1i=0 BiE[Cn 1]
+ 2E[Cn 1 n 2Xj=0 Cj⇧n 2k=j+1Bk]
= x20⇧
n 1i=0 B2i + n 2Xj=0 E[C2j ]⇧n 2k=j+1B2k + E[C2n 1],
62
where
E[C2k ] = 8>>>>>>><>>>>>>>: a2k 2g for RM ; 2fa2k2c2k for KW ;a2k
2
✓
↵2k 2f
c2k + (1  ↵k)2 2g◆ for STAR,
which concludes the proof. ⇤Proposition 3.2.6. If ↵2k 2f <  2gc2k for all k, then MSESTAR < MSERM . If
2 2gc2k <  2f for all k, then MSERM < MSEKW . If 2 2gc2k <  2f and ↵2k 2f <  2gc2k for
all k, then MSESTAR < MSEKW .Proof. From Proposition 3.2.5,
MSESTAR < MSERM () a2k2 ✓↵2k 2fc2k + (1  ↵k)2 2g◆ <  2ga2k
()
↵2k 2f
c2k + (1  ↵k)2 2g < 2 2g
() ↵2k 2f + c2k(1  ↵k)2 2g < 2 2gc2k
() ↵2k 2f + (1  ↵k)2 2gc2k < 2 2gc2k
() ↵2k 2f < (2  (1  ↵k)2) 2gc2k
() ↵2k 2f <  2gc2k,
MSERM < MSEKW ()  2ga2k <  2fa2k2c2k
() 2 2gc2k <  2f ,
for all k. If ↵2k 2f <  2gc2k and 2 2gc2k <  2f , then MSESTAR < MSEKW .⇤
The exact MSE of the estimators resulting from each of the algorithms has
three components. The first term is identical for all three algorithms and is the only
component that contains the initial value, so it does not a↵ect our comparison of
63
the MSE across algorithms for the unconstrained, quadratic case. From Proposition
3.2.6, if certain conditions are satisfied, the MSE resulting from STAR-SA for all
steepness levels is less than the MSE resulting from RM and KW for the same
number of iterations
3.2.3 Numerical Experiments
We conduct two sets of numerical experiments to compare the performance of
the STAR-SA algorithm against the classical RM and KW methods under various
combinations of noise levels  f and  g [11]. We implement all three algorithms
and generate the finite-time MSE of the estimate xN for quadratic functions of
the form f(x) =  ax2, a > 0, and ⇥ = [ 50, 50]. The gain sequence is chosen as
an = ✓a(n+1) 1 and the finite di↵erence perturbation sequence is cn = ✓c(n+1) 1/4.
To illustrate the performance of the proposed algorithm, we choose parameter values
✓a 2 {1, 10, 100}, ✓c 2 {0.1, 1.0}, and N 2 {100, 1000, 10000} as well as steepness
level a 2 {10k|k =  3, 2.5, . . . , 1.5, 2} and initial value x0 2 { 50 + 5k|k =
1, . . . , 10}. In a practical setting, the noise level of the function and its gradient could
di↵er, so we consider a variety of noise level combinations,  f 2 {10k|k =  3, . . . , 1}
and  g 2 {10k|k =  3, . . . , 1}.
Although we implemented RM, KW, and STAR-SA for all of the parameter
settings mentioned above, we will only discuss two representative subsets of our
results. For both sets of numerical experiments, we choose an = 10(n + 1) 1 and
cn = 0.1(n+ 1) 1/4, and N = 1000. The MSE of the 1000th iterate is computed for
64
the algorithms based on 2000 sample paths. In this section, MSE refers to the MSE
of the 1000th iterate for STAR-SA and KW, and 2000th iteration for RM.
3.2.3.1 Experiment 1: vary initial value
For the first set of numerical experiments, we fix the steepness level and
vary the initial value x0 for di↵erent combinations of noise levels,  f and  g. We
only present the results for f(x) =  0.1x2 for  f 2 {0.001, 0.1, 1.0} and  g 2
{0.001, 0.1, 1.0}, to illustrate the potential gains of STAR-SA. The STAR-SA algo-
rithm outperforms KW and RM in six out of the nine combinations for all initial
values, which can be seen in the last two rows of Figure 3.2 (i.e., 3.2d, 3.2e, 3.2f,
3.2g, 3.2h, and 3.2i). The commonality among these six cases is the higher function
noise, i.e.,  f = 0.1, 1.0. With the exception of Figure 3.2f, RM performs signif-
icantly better than KW, but STAR-SA has the lowest MSE among the three SA
algorithms. Figure 3.2f depicts a case where RM and KW both perform similarly
with intertwining lines, but combining secant and tangents averaged gradients in
STAR-SA results in a much lower MSE.
In two of the remaining three cases, STAR-SA does not necessarily outperform
RM nor KW, but it performs as well as the better of the two algorithms, as seen in
Figures 3.2b and 3.2c. The function noise level is very low and the gradient noise
is higher, i.e.,  f = 0.001 and  g >  f , which increases the accuracy of the indirect
gradient and has the opposite e↵ect on the direct gradient. Not surprisingly in the
two cases, KW outperforms RM, but the performance of STAR-SA is on par with
65
(a)  f = 0.001, g = 0.001 (b)  f = 0.001, g = 0.1 (c)  f = 0.001, g = 1.0(d)  f = 0.1, g = 0.001 (e)  f = 0.1, g = 0.1 (f)  f = 0.1, g = 1.0(g)  f = 1.0, g = 0.001 (h)  f = 1.0, g = 0.1 (i)  f = 1.0, g = 1.0Figure 3.2: f(x) =  0.1x2,⇥ = [ 50, 50], an = 10(n + 1) 1, cn =
0.1(n+ 1) 1/4, N = 1000.
66
KW. From the eight figures discussed, it appears that integrating both indirect
and direct gradients together will either improve the SA performance, which can
be significant, or in the worst case, it will not hurt the performance. The only
case where the MSE of the STAR-SA algorithm is not approximately less than or
equal to the MSE of KW and RM is in the case when both noise levels are very
low, i.e.,  f =  g = 0.001, which is shown in Figure 3.2a. In this instance, RM
performs better than STAR-SA with the exception of when the initial value is close
to the optimum x⇤ = 0. In fact, the MSE of STAR-SA decreases as initial value x0
approaches x⇤.
3.2.3.2 Experiment 2: vary steepness level
In practice, the geometry of the function is unknown, so our second set of nu-
merical experiments investigates the performance of the STAR-SA algorithm com-
pared to RM and KW as a function of the steepness level a for di↵erent combinations
of noise. We set the initial value to be far from the optimum and consider identical
parameters as in the first experiment, i.e., x0 =  30, an = 10(n+1) 1, cn = 0.1(n+
1) 1/4, and vary the steepness level a. We implement RM, KW, and STAR-SA and
compute the MSE for N = 1000 and all combinations of  f 2 {0.001, 0.1, 1.0} and
 g 2 {0.001, 0.1, 1.0}, illustrated in Figure 3.3. For a fixed step size {an}, the MSE
is much higher for both flat and steep quadratic functions across algorithms, which
is most likely because the selected step size {an} is too small for extremely flat and
too large for steep functions, respectively. If the step size is not chosen
67
(a)  f = 0.001, g = 0.001 (b)  f = 0.001, g = 0.1 (c)  f = 0.001, g = 1.0(d)  f = 0.1, g = 0.001 (e)  f = 0.1, g = 0.1 (f)  f = 0.1, g = 1.0(g)  f = 1.0, g = 0.001 (h)  f = 1.0, g = 0.1 (i)  f = 1.0, g = 1.0Figure 3.3: f(x) =  ax2,⇥ = [ 50, 50], an = 10(n + 1) 1, cn =
0.1(n+ 1) 1/4, N = 1000.
68
appropriately, all three algorithms perform equally poorly, with the exception of the
case when both noise levels are low, i.e.,  f =  g = 0.001, where KW results in a
lower MSE. The fixed step size and stopping time only appear to be appropriate
for a = 10 1, 10 1/2. For all noise levels under the these two steepness levels, the
MSE of STAR-SA is either less than or equal to RM and KW. The only instance
when combining direct and indirect gradients is inferior is when both noise levels
are low, i.e.,  f =  g = 0.001 for a = 10 1. Otherwise, STAR-SA shows promise
since it either outperforms or at least it never performs worse than either RM or
KW. When the variance of the function is high, i.e.,  f 2 {0.1, 1.0}, KW performs
poorly across steepness levels.
3.2.3.3 Results Summary
The following conclusions are specifically for the fixed parameter setting f(x) =
 ax2, an = 10n 1, cn = 0.1n 1/4.
• If the function is neither too steep nor too flat:
• All algorithms are insensitive to the initial start value.
• If  f is large, STAR-SA has a significantly lower MSE than that of KW.
• If  f and  g are high, the MSE of STAR-SA is lower than that of RM, but the
di↵erence is not as prominent as the di↵erence compared to that of KW.
• For very flat functions:
• If  f <  g, then MSESTAR ⇡ MSEKW and MSEKW < MSERM , and this gap
decreases as the function flattens.
69
• If  g <  f or if  f and  g are both high, then MSESTAR ⇡ MSERM and
MSERM < MSEKW , and this di↵erence shrinks as the steepness level de-
creases.
• For very steep functions, the performances of the algorithms are equally poor.
Overall, as a result from both sets of numerical tests, the STAR-SA algorithm either
performs significantly better than both KW and RM or the MSE is approximately
equal to that of the algorithm with the lower MSE. Therefore, the new STAR-SA
algorithm is competitive against the classic KW and RM methods.
3.3 STAR-SPSA
Now we extend STAR-SA to the multidimensional case. The direct extension
using finite di↵erences is computationally ine cient for high-dimensional problems,
so instead, we consider the SP gradient, which only requires the two performance
estimates f˜(xn + cn n) and f˜(xn  cn n) for each indirect gradient (Secant). The
available direct gradient estimates rf˜(xn + cn n) and rf˜(xn   cn n) can be
averaged to provide another gradient estimate (Tangents AveRaged).
For multivariate stochastic optimization problems, we present the Secant-Tangents AveRaged simultaneous perturbation stochastic approximation (STAR-
SPSA) gradient estimate, which is a convex combination of an SP gradient estimate
and an average of two associated direct gradients. For each dimension i 2 {1, . . . , d},
70
the ith component of the STAR-SPSA gradient can be written as
gSTAR,i(xn) = ↵n,i f˜(xn + cn n)  f˜(xn   cn n)2cn n,i
+ (1  ↵n,i)rf˜i(xn + cn n) +rf˜i(xn   cn n)2 , (3.17)
where  n = ( n,1, . . . , n,d), ↵n,i 2 [0, 1], and cn 2 R+ for n 2 N, and cn ! 0 as
n !1.
Let f˜(xn±cn n) = f(xn±cn n)+✏±n ,rf˜i(xn±cn n) = rfi(xn±cn n)+ ±n,i,
where ✏±n and  ±n,i are Gaussian noise with zero mean for i = 1, . . . , d. For notational
simplicity, let V ar(f˜(xn ± cn n)) =  2f , V ar(rf˜i(xn ± cn n)) =  2g,i, Corr(f˜(xn ±
cn n),rf˜i(xn ± cn n)) = ⇢i, V ar(gSPSA,i(xn)) = V ar ⇣ f˜(xn+cn n) f˜(xn cn n)2cn n,i ⌘ = 2f
2c2n 2n,i , V ar(gTAR,i(xn)) = V ar ⇣rf˜i(xn+cn n)+rf˜i(xn cn n)2 ⌘ =  2g,i2 , and V ar(gSTAR,i
(xn)) =  2n,i for i = 1, . . . , d.
Analogous to Lemma 3.2.1 and Proposition 3.2.2, the next two results provide a
STAR-SPSA gradient variance minimizing deterministic weight sequence that guar-
antees the variance of the STAR-SPSA gradient is strictly less than that of RM and
SPSA.
3.3.1 Optimal Deterministic WeightsLemma 3.3.1. For gSTAR defined in (3.17), assume ✏+n ? ✏ n and  +n,i ?   n,i for all
n and i = 1, . . . , d. Then V ar(gSTAR,i(xn)) is minimized when
↵⇤n,i =  2g,ic2n 2n,i 2f +  2g,ic2n 2n,i . (3.18)Proof. For notational simplicity, let f˜±n = f˜(xn ± cn n), g˜±n = rf˜(xn ±
cn n), and g˜±n = (g˜±n,1), . . . , g˜±n,d).
71
By the orthogonality assumption, the ith component of the variance can be
written as
 
2n,i = V ar ↵n,i ˜f+n   ˜f n
2cn n,i + (1  ↵n,i) g˜+n,i + g˜ n,i2 !
=V ar
✓
↵n,i
2cn n,i ˜f+n + 1  ↵n,i2 g˜+n,i◆+ V ar✓  ↵n,i2cn n,i ˜f n + 1  ↵n,i2 g˜ n,i◆
=
✓
↵n,i
2cn n,i◆2 V ar( ˜f+n ) + ✓1  ↵n,i2 ◆2 V ar(g˜+n,i) + 2 ↵n,i2cn n,i 1  ↵n,i2 Cov( ˜f+n , g˜+n,i)
+
✓
↵n,i
2cn n,i◆2 V ar( ˜f n ) + ✓1  ↵n,i2 ◆2 V ar(g˜ n,i)  2 ↵n,i2cn n,i 1  ↵n,i2 Cov( ˜f n , g˜ n,i)
=
✓
↵n,i
2cn n,i◆2  2f + ✓1  ↵n,i2 ◆2  2g,i + 2 ↵n,i2cn n,i 1  ↵n,i2 ⇢i f g,i
+
✓
↵n,i
2cn n,i◆2  2f + ✓1  ↵n,i2 ◆2  2g,i   2 ↵n,i2cn n,i 1  ↵n,i2 ⇢i f g,i
=
↵
2n,i
2c
2n 2n,i 2f + (1  ↵n,i)22  2g,i, (3.19)
where the second equality follows from the independence of the function/gradient
at di↵erent points. Di↵erentiate the variance (3.19) with respect to the weight ↵n,i
to obtain
@ 2n,i
@↵n,i = ↵n,ic2n 2n,i 2f   (1  ↵n,i) 2g,i. (3.20)
Solve for the zero of (3.20) to obtain (3.18), which is a global minimum since (3.20)
is monotonically increasing in ↵n,i. ⇤Proposition 3.3.2. Let brf(xn) be defined as in (3.17) and the convex weight be
defined in (3.18). If the conditions of Lemma 3.3.1 hold, then V ar(gSTAR,i(xn)) 
min {V ar(gSPSA,i(xn)), V ar(gTAR,i(xn))} for i = 1, . . . , d and all n, where the in-
equality is strict if 0 <  2f and 0 <  2g,ic2n 2n,i hold.Proof. By the conditions of Lemma 3.3.1, the ith component of the variance
72
can be written as
 2n,i = ↵2n,i2c2n 2n,i 2f + (1  ↵n,i)22  2g,i. (3.21)
Substitute ↵⇤n,i from (3.18) into the variance (3.21) to obtain
 2n,i = 12c2n 2n,i   2g,ic2n 2n,i 2f +  2g,ic2n 2n,i!2  2f + 12  1   2g,ic2n 2n,i 2f +  2g,ic2n 2n,i!2  2g,i
=
 2f 2g,i
2
 
 2f +  2g,ic2n 2n,i  .
To show  2n,i is strictly less than min{V ar(gSPSA,i(xn)), V ar(gTAR,i(xn))}, we con-
sider two cases: 1)  2n,i <  2f2c2n 2n,i and 2)  2n,i <  2g,i2 . However, since cn ! 0 as
n !1, then
 2f
2c2n 2n,i !1, so eventually, we will be in the latter case.Case 1: If  2n,i <  2f2c2n 2n,i , then  2f 2g,i2( 2f+ 2g,ic2n 2n,i) <  2f2c2n 2n,i , which holds when 0 <  2f .Case 2: If  2n,i <  2g,i2 , then  2f 2g,i2( 2f+ 2g,ic2n 2n,i) <  2g,i2 , which holds when 0 <  2g,ic2n 2n,i.
If 0 <  2f and 0 <  2g,ic2n 2n,i, then the strict inequality holds; however if either
 f = 0 or  2g,ic2n 2n,i = 0, then the same argument holds for the two cases with
“ < ” replaced with “ , ” resulting in a non-strict inequality. Therefore,  2n,i 
min
n  2f
2c2n 2n,i ,  2g,i2 o = min{V ar(gSPSA,i(xn)), V ar(gTAR,i(xn))}, where the inequality
is strict for 0 <  2f and 0 <  2g,ic2n 2n,i. ⇤
As in the STAR-SA result, the weight ↵⇤n,i for STAR-SPSA is also independent
of the correlation coe cient because of the homogeneous noise
73
3.3.2 Convergence
We define the error and bias of the STAR-SPSA gradient estimate respectively
as bn(xn) = E[gSTAR(xn) rf(xn)|xn],en(xn) = gSTAR(xn)  E[gSTAR(xn)|xn],
where bn(xn) = (bn,1(xn), . . . , bn,d(xn)) and en(xn) = (en,1(xn), . . . , en,d(xn)).
The following lemma is used in the convergence proof of Theorem 3.3.4.Lemma 3.3.3. Suppose {xn} is generated under recursion (1.2) with ⇥ = Rd using
the STAR-SPSA gradient estimate defined in (3.17). If { n,i}di=1 are i.i.d., symmet-
rically distributed with mean zero, uniformly bounded with finite inverse moments,
and f(·) is three-times continuously di↵erentiable and f (3)(·) is uniformly bounded,
then bn(xn) ! 0 as n !1.Proof. We use proof techniques similar to those in [55] by applying results
from [39].
By assumption for each n 2 N, there exist positive constants K1, K2, and K3
such that | n,l|  K1 a.s., E[|  1n,l |]  K2 for l = 1, . . . , d and |f (3)i,j,k(·)|  K3 8i, j, k.
bn,i(xn) = E[↵n,i · gSPSA,i(xn) + (1  ↵n,i) · gTAR,i(xn) rfi(xn)|xn]
= ↵n,iE[gSPSA,i(xn) rfi(xn)|xn] + (1  ↵n,i)E[gTAR,i(xn) rfi(xn)|xn].
Since E[✏+n   ✏ n |xn] = 0 a.s. and f (3)(·) exists and is continuous, using Taylor’s
74
series expansions, we have
f(xn ± cn n) = f(xn) ± cnhf 0(xn), ni+ c2nhf 00(xn) n, ni
±c3nf (3)(t±n )[ n ⌦ n ⌦ n], (3.22)
where t+n and t n are on the line segment between xn and xn ± cn n, respectively,
and ⌦ denotes the Kronecker product. Using (3.22), we have
E[gSPSA,i(xn) rfi(xn)|xn]
= E
"
f˜(xn + cn n)  f˜(xn   cn n)
2cn n,i    1n,ihf 0(xn), ni   xn#
= E

f(xn + cn n)  f(xn   cn n)
2cn n,i    1n,ihf 0(xn), ni   xn 
=
c2n
12
E
⇥
  1n,i(f (3)(t+n ) + f (3)(t n ))[ n ⌦ n ⌦ n]|xn⇤ . (3.23)
The magnitude of the right hand side of (3.23) is bounded by
c2nK3
6
dXi=1 dXj=1 dXk=1 E    n,i n,j n,k n,i    

c2nK3
6
· {[d3   (d  1)3]K21 + (d  1)
3K2K31}. (3.24)
Similarly, since E[ +n,i+  n,i|xn] = 0 a.s. for i = 1, . . . , d, by Taylor’s series expansions,
rfi(xn ± cn n) = rfi(xn) ± cnhr2fi(xn), ni+ c2nhr3fi(s±n ) n, ni, (3.25)
where s+n and s n are on the line segment between xn and xn ± cn n, respectively.
75
Using (3.25),
E[gTAR,i(xn) rfi(xn)|xn]
= E
"
rf˜i(xn + cn n) +rf˜i(xn   cn n)
2
 rfi(xn)   xn#
= E

rfi(xn + cn n) +rfi(xn   cn n)
2
 rfi(xn)   xn 
=
c2n
2
E
⇥ Tn (f (3)(s+n ) + f (3)(s n )) n|xn⇤ . (3.26)
The magnitude of the right hand side of (3.26) is bounded by
c2nK3 dXi=1 E| 2n,i|  c2nK3dK21 . (3.27)
Apply bounds (3.24) and (3.27) to obtain
bn,i(xn)  (1  ↵n,i)c2nK3d2K21
+
↵n,ic2nK3
6
· {[d3   (d  1)3]K21 + (d  1)
3K2K31}. (3.28)
The RHS of (3.28) converges to zero since cn ! 0 as n !1, and this holds for all
i, which concludes the proof. ⇤Theorem 3.3.4. Let {xn} be a sequence generated using recursion (1.2) with ⇥ =
Rd using the STAR-SPSA gradient estimate defined in (3.17). Assume the condi-
tions of Lemma 3.3.3 in addition to the following hold:
1. There exist positive sequences {an} and {cn} such that cn ! 0 as n ! 1,P1n=1 an = 1, P1n=1 a2nc 2n < 1.
2. There exist positive constants C1, C2, C3 such that E[rf˜i(xn ± cn n)2]  C1,
E[f˜(xn ± cn n)4]  C22 , and E[  4n,i]  C23 for i = 1, . . . , d.
76
3. For all n, ||xn|| < 1 a.s.
4. Let x⇤ be an asymptotically stable solution to the ODE @x(t)/@t =  rf(x).
5. There exists a compact set D ✓ D(x⇤) such that xn 2 D infinitely often, where
D(x⇤) is the domain of attraction (i.e., D(x⇤) = {x0| limt!1 x(t|x0) = x⇤}).
Then xn ! x⇤ a.s. as n !1.Proof. By Lemma 2.2.1 and Theorem 2.3.1 in [39], if conditions 1 - 5 are
satisfied, then xn ! x⇤ a.s. as n !1 if
(a) ||bn(xn)|| < 1 for all n and bn(xn) ! 0 as n !1 a.s.,
(b) limn!1 P  supm n ||Pmk=n akek(xk)||   ⌘  = 0 for all ⌘ > 0.
Since (a) follows directly from Lemma 3.3.3, it remains to establish (b). Apply
Doob’s inequality to the martingale sequence {
Pmi=k aiei(xi)}m k to obtain
P
 
supm n || mXk=n akek||   ⌘!  ⌘ 2E "|| mXk=n akek||2# = ⌘ 2 mXk=n a2kE||ek||2. (3.29)
Consider i 2 {1, . . . , d}. Then the variance of the ith gradient component can be
expressed as
E[g2STAR,i(xn)] = ↵2n,iE[g2SPSA,i(xn)] + 2↵n,i(1  ↵n,i)E[gTAR,i(xn) · gSPSA,i(xn)]
+ (1  ↵n,i)2E[g2TAR,i(xn)]. (3.30)
It remains to show that each of the three terms on the right hand side of (3.30) are
bounded. We apply |a+ b|r  2r 1(|a|r + |b|r) and Holder’s inequality to obtain
E[g2TAR,i(xn)] = E 24 rf˜i(xn + cn n) +rf˜i(xn   cn n)2 !235

1
2
E
h
(rf˜i(xn + cn n)2 + (rf˜i(xn   cn n)2i  C1 (3.31)
77
and
E[g2SPSA,i(xn)] = E 24 f˜(xn + cn n)  f˜(xn   cn n)2cn n,i !235

1
2c2nE " f˜(xn + cn n)2 + f˜(xn   cn n)2 2n,i #

1
2c2n E h|(f˜(xn + cn n)|4i1/2 + E h|(f˜(xn   cn n)|4i1/2  · E     1 n,i    4 1/2

C2C3
c2n . (3.32)
Then we apply Jensen’s inequality, (3.31), and (3.32) to obtain
E[gTAR,i(xn) · gSPSA,i(xn)]
= E
" 
r ˜fi(xn + cn n) +r ˜fi(xn   cn n)
2
!
·
 
˜
f(xn + cn n)  ˜f(xn   cn n)
2cn n,i !#
 E
"   r ˜fi(xn + cn n) +r ˜fi(xn   cn n)
2
   # · E "    ˜f(xn + cn n)  ˜f(xn   cn n)
2cn n,i    #

p
C1C2C3
cn . (3.33)
Applying (3.31), (3.32), and (3.33) yields
E[||en||2]  d✓C2C34c2n + pC1C2C32cn + C14 ◆ . (3.34)
By condition 1, (3.34), and (3.29), (b) holds. ⇤
3.3.3 Numerical Experiments
3.3.3.1 9-station Closed Jackson Network
For the multidimensional case, we implement RM, SPSA, and STAR-SPSA on
a 9-station closed Jackson queueing network problem from [35], depicted in Figure
3.4 with associated transition probabilities. In a closed Jackson queueing network,
78
1" 4"
3"
2" 5"
6"
7"
8"
9"
.3" .3"
.3"
.7" .7" .7"
.4" .4" .4"
.6" .6"
.6"
.5"
.5"
.5" .5"
.5"
.5"Figure 3.4: 9-station closed Jackson queueing network.
the number of customers in the system remains fixed, hence the absence of arrows
from outside.
The objective is to maximize the total throughput of the system TP (x) (to-
tal # of customers served/total time), given restrictions on the mean service time
x(i), i = 1, . . . , d, which can be summarized as
maxx2⇥ TP (x)
s.t.
dXi=1 x(i) = M,
x(i) > 0 for i = 1, . . . , d,
where M = 10, d = 9 (number of stations). The network consists of 9 first-come,
first-served stations, where the service time follows an exponential distribution with
mean service time x(i) for station i and with 10 customers in the system.
The (equality and positivity) constraints prevent straightforward implemen-
79
tations of the gradients and projection operator. For the direct gradient, we use
the IPA gradient estimate designed specifically for the total throughput of a closed
Jackson queueing network proposed in [34], with a slight modification by rewriting
one of the estimates as a function of the other components and using the chain rule
to take into account the equality constraint (i.e., x(r) = M  Pdi=1,i 6=r x(i), where r
is an arbitrary integer between 1 and d, inclusive). For the SP gradient estimator,
the components in the random vector  n must sum to zero to ensure the perturbed
components satisfy the equality condition (i.e.,  n,r = 0,  n,i = ±1 for i 6= r, whereP9i=1,i 6=r n,i = 0). In addition, each perturbed component must be positive, i.e.,
x(i)n > cn for all i. After each iteration, a projection must be applied to maintain
feasibility; however, an orthogonal projection onto the hyperplane could violate the
x(i)n > cn condition for some i, so we make a slight adjustment by projecting the
current iterate back to the previous iterate if the condition is violated.
In the SA algorithms, we use the parameters an = ✓a(n + 1) 1, ↵n,i =
c2n/(1 + c2n), uniform start values x(i)0 = 10/9 for i = 1, . . . , 9, cn = ✓c(n + 1) 1/4,
20 macroreplications, a stopping time of N = 50 for SPSA and STAR-SPSA, and
N = 100 for RM. For simplicity, we drop the second subscript i from the convex
weight ↵n,i, since they are identical for all i. The update occurs for only eight es-
timates since the rth component is a result of the binding constraint. However, in
our preliminary experiments, the SA algorithms often performed poorly when r was
fixed for each iteration, so instead, we let r be a uniform random integer between 1
and 9, inclusive. For each simulation run, we generate the throughput and gradient
estimates using 300 customers, which is large enough for a valid non-zero direct
80
10 20 30 40 501
1.5
2
2.5
3
3.5
exp(M
SE of
 X)
θa
 
 RM2RMSPSASTAR
10 20 30 40 500
2
4
6
8
10
12
exp(M
SE of
 TP)
θa
 
 RM2RMSPSASTAR
10 20 30 40 500
1
2
3
4
5
6
7
TP
θa
 
 
RM2RMSPSASTAR(a) cn = 0.1(n + 1) 1/4
10 20 30 40 501
1.2
1.4
1.6
1.8
2
2.2
2.4
exp(M
SE of
 X)
θa
 
 RM2RMSPSASTAR
10 20 30 40 501
2
3
4
5
6
7
8
9
10
exp(M
SE of
 TP)
θa
 
 RM2RMSPSASTAR
10 20 30 40 500
1
2
3
4
5
6
7
TP
θa
 
 
RM2RMSPSASTAR(b) cn = 0.3(n + 1) 1/4Figure 3.5: 9-station closed Jackson queueing network, P9i=1 x(i) =
10, x(i) > 0, an = ✓a/(n + 1), ↵n = c2n/(1 + c2n), N = 50, customers
serviced = 300, macroreplications = 20.
gradient. If the direct IPA gradient is generated with an insu cient number of cus-
tomers, the the gradient could be exactly zero or invalid since the denominator is
zero.
We compute the total MSE of the estimates x(i)N for i = 1, . . . , 9 and MSE of
the throughput TP (xN) using x⇤ = (1.66, .68, .983, 1.66, .68, .983, 1.66, .68, .983) and
TP (x⇤) = 4.82, as well as the mean throughput with a 95% confidence band using
20 macroreplications for a range of step sizes and two di↵erent perturbation sizes,
81
i.e., ✓a 2 {5k|k = 1, . . . , 10} and ✓c 2 {0.1, 0.3}. Figures 3.5a and 3.5b illustrate the
performances as a function of ✓a for two perturbations ✓c = 0.1, 0.3, respectively. We
vary the step size and perturbation size to investigate the sensitivity of STAR-SPSA.
The leftmost graph in both figures show that STAR-SPSA significantly outperforms
SPSA for all choices of an, in terms of both MSE of the estimates as well as the
throughput. Similarly, STAR-SPSA performs better than RM in both metrics with
the exception of the case ✓a = 15, when RM results in a slightly lower MSE of xN and
a higher MSE of TP (xN), but the results are fairly close. The standard errors are
within ±0.1. In both Figures 3.5a and 3.5b, the STAR-SPSA results do not appear to
be very sensitive to the step size, especially not the throughput. In fact, combining
the direct and indirect gradients often improves the performance, and sometimes
the di↵erence can be significant, as in the case ✓a = 50, cn = 0.1(n + 1) 1/4.
Furthermore, the STAR-SPSA mean throughput is always greater than SPSA, and
either greater than RM very close to it. In addition, the confidence bands are much
tighter for STAR-SPSA for both ✓c = 0.1, 0.3. From these results, we conclude that
incorporating the SP gradient will either improve the finite time performance or in
the worst case, it will not hurt the performance significantly.Orthogonal Projection Algorithm
• Step 0. Input: Xn, M , cn
• Step 1. Set Xn+1 = Xn.
• Step 2. For i = 1, . . . , d, if Xn,i < cn, then set Xn+1,i = cn.
82
• Step 3. Let
– P0 = M/d ⇤ 1
– N = 1/pd
– T = Xn+1  P0
– t =< T,N >, where < ·, · > denotes dot product
– Xn+1 = Xn+1   tN
• Step 4. If Xn+1,i   cn for all i, return Xn+1. Otherwise, return Xn.Random Perturbation  n
• Step 0. Input: d
• Step 1. If d is odd, generate r ⇠ Unif{1, d}, and set  n,r = 0. If d is even,
let r = 0.
• Step 2. Generate ri ⇠ Unif(0, 1) for i = 1, . . . , d, i 6= r.
Let m = {i|ri = median(r1, . . . , rr 1, rr+1, . . . , rd) }.
• Step 3. For i = 1, . . . , d, i 6= r, if ri < m, set  n,i =  1 and  n,i = 1
otherwise.
• Step 4. Return  n.IPA Algorithm: Closed Jackson Queueing Network
83
• Step 0. Set A = 0, where 0 is a dxd zero matrix, n = 1, and N = number of
customers serviced.
• Step 1. Suppose the next customer to finish service is currently located at
node i. Increment Ai,i by the service time si.
• Step 2. Suppose the customer leaving node i moves to node j. If node j is
empty upon arrival, set A:,j = A:,i.
• Step 3. Set n = n+ 1. If n < N , go back to Step 1. If n = N , set T= current
time, go to Step 4.
• Step 4. Calculate minimum and maximum values of each row and set toA  = (A 1 , . . . , A d ) and A+ = (A+1 , . . . , A+d ), respectively.
• Step 5. Compute
Sk
TP
·
@TP
@Sk =   A kT   A+k + A k , where Sk is the mean service
time of server k and TP = N/(T   A+k + A k ).IPA Algorithm: Closed Jackson Queueing Network w/Equality Con-straint
• Step 1. Apply IPA gradient method for the unconstrained closed Jackson
queueing network for throughput.
• Step 2. Generate r ⇠ Unif{1, d}.
• Step 3. For k = 1, . . . , d, set
@TPc
@Xk = @TP@Xk   @TP@Xr .
84
3.4 Summary and Future Work
We have introduced the first set of stochastic approximation algorithms that
integrate both direct and indirect gradient estimates for single and multidimensional
problems. Our new hybrid gradient estimates use a symmetric finite di↵erence-type
gradient estimate for the indirect gradient and an average of the associated direct
gradients for the direct gradient. The crux of the algorithm lies in the convex
weight derived to minimize the variance of the hybrid gradient estimates. Under
mild conditions, we have analytically shown the improvement over individual direct
and indirect gradient estimates in terms of variance, which in turn can improve the
SA performance. We have demonstrated the promise of STAR-SA numerically on
a one-dimensional stylized problem as well as STAR-SPSA on a multidimensional
queueing network. Previous work had only considered SA with either direct or
indirect gradients, but we exploit the information contained in the indirect gradient
estimates, which we have shown to be theoretically and practically beneficial.
Our hybrid technique can be generalized to include other gradient estimates
within the two general gradient categories, e.g., IPA with LR/SF. In addition, we
can adaptively choose both the weight sequence and the perturbation sequence.
Currently, the variance minimizing weights (3.2) for the homogeneous case
are deterministic and depend on both the gradient and function noise, which are
unknown. As cn ! 0, the influence of the finite di↵erence gradient estimate di-
minishes and the weight on the Tangents AveRaged gradient estimate approaches
1. Ideally, independent of the noise level, the weights will put more emphasis on
85
the gradient with higher accuracy. Since we are in the setting where both direct
and indirect gradients available, we have access to various gradient approximations,
i.e., f 0(xn ± cn), f 0(xn+cn)+f 0(xn cn)2 , and f(xn+cn) f(xn cn)2cn . At each iteration, these
values can be used to determine which gradient is more accurate, gS or gTAR. If
the sample performances are noisy and the perturbation size is very small, then the
finite di↵erence gradient could be very inaccurate, so ideally, less emphasis is placed
on gS, but cn should not decrease since small perturbation sizes can lead to noisy
gradients. Instead, we propose to decrease ↵n in two ways: 1) decrease cn and 2)
increase the constant ( f/ g)2. We can explore how to make the adjustments.
86
Chapter 4
Step Size Selection in Stochastic Approximation
4.1 Sensitivity of Finite-time Performance to Step Size
The asymptotic theory of stochastic approximation algorithms guarantees al-
most sure convergence under certain conditions. The most common requirements
restrict the step size sequence {an}, but even a smaller subset of step size options
still allow for an uncountable number of choices. It is impossible to find a universally
optimal deterministic step size unless more information is known about the geomet-
ric structure of the function. Given a step size, one can always find a function where
it performs poorly. For example in the one-dimensional case, if we consider a large
step size applied to a very steep function, to ensure the iterates have an opportu-
nity to move around the feasible region, then the iterates might oscillate back and
forth, moving further and further away from the optimum for an extended period
before making any progress towards the true solution, as seen in Figure 4.1a. The
opposite could also occur, where the function is extremely flat and the step size is
relatively small in comparison, so as the number of iterations increases, the iterates
barely make any progress, and once the stopping time is reached, the algorithm
returns a poor estimate in the region where it is currently “stuck,” illustrated in
Figure 4.1b. The finite-time performance of stochastic approximation algorithms is
sensitive to the step size, and there are clear disadvantages for using an arbitrarily
87
An Overview of Stochastic Approximation
The Evolution of Stochastic Approximation
Marie Chau and Michael C. Fu
Abstract This chapter...
x1x⇤x2 x3x4 x5x6
0
0.5
1
1.5
2
·104
(a) an is “too large” relative to the gradient
x1x⇤
0
20
40
60
80
100
(b) an is “too small” relative to the gradient
Fig. 1: Sensitivity of SA to step size an.
Marie Chau
University of Maryland, College Park, College Park, MD 20742, e-mail: mchau@math.umd.edu
Michael C. Fu
University of Maryland, College Park, College Park, MD 20742 e-mail: mfu@isr.umd.edu
1
Figure 4.1: Illustration of Sensitivity of SA to {an}.
chosen deterministic step size. To circumvent this issue, adaptive step sizes have
been proposed to adjust based on the ongoing performance of the algorithm. Ideally,
an adaptive step size algorithm is able to recognize/detect the current estimates’
proximity to the true solution, path behavior/trend, and geometric structure near
the current iterate, while adjusting the step size accordingly. Before we propose our
new adaptive step sizes, we conduct preliminary tests on existing finite-time theory
on MSE bounds and two adaptive step sizes, Kesten’s rule and SSKW.
4.1.1 KW and its Variants
For our preliminary numerical experiments, we focus on the one-dimensional
KW algorithm, which generates brf(Xn) in (1.2) using finite di↵erences. Although
theoretical convergence can be guaranteed by satisfying certain requirements, practi-
cal performance depends on the choice of tuning sequences. In addition to selecting
88
a gain sequence {an} in (1.2), the KW algorithm requires an additional task of
choosing a finite di↵erence perturbation sequence {cn} for the gradient. The finite-
time performance of KW depends on both sequences {an} and {cn}. Because of the
sensitivity of the KW algorithm to the tuning sequences, it is essential to choose
an appropriate pair. In practice, KW could have the following shortcomings: long
oscillatory period if the gain sequence {an} is “too large,” degraded convergence
rate if {an} is “too small,” and poor gradient estimates if the perturbation sequence
{cn} is “too small.”
We conduct an empirical investigation of the sensitivity of KW and two of its
adaptive variants, namely Kesten’s rule and the scaled-and-shifted KW (or SSKW)
algorithm of [7]. Our goal is to identify problem characteristics that exert a strong
impact on algorithm performance, even in the presence of theoretical guarantees.
For example, in the numerical results reported in [7], SSKW outperforms the KW
algorithm in terms of both MSE and oscillatory behavior in finite time; however,
this result is obtained using what seem to be nearly worst-case parameter settings
for KW. We replicate these results, but we also find that the performance of KW can
be significantly improved over a fairly wide choice of parameter settings. Although
the worst-case performance of SSKW is much better than that of KW, it is also the
case that KW provides the best performance in a significant proportion of problem
instances. In addition, we find that Kesten’s rule performs similar to KW, and
sometimes better, when both algorithms begin with the same initial start value. We
also investigate the finite-time MSE bound in [7] and characterize instances where
this bound is tight. These results underscore the well-known di culty of tuning,
89
even for adaptive versions of KW.
4.2 Finite-time MSE Bound
Asymptotic convergence properties of the KW algorithm and its variations
have been a major research focus in SA. Convergence proofs in MSE appear in
[18], [16], [19], [58], and [49] for various assumptions and modifications of the KW
algorithm. However, in practice, where the run-time is finite, a good finite-time
bound for the MSE is useful.
By applying similar technique as in [18], [7] derived a finite-time bound for
the MSE of the KW algorithm. The MSE bound depends on certain problem-
dependent constants, which are typically di cult to calculate in practice. However,
in the special case where f is quadratic, the bound can be computed in closed form,
allowing us to observe its tightness. We briefly summarize the bound as follows.
First, we make the following assumptions on the function f(x):
1. There exist positive constants K0, K1, and C0 such that for every c 2 [0, C0],
 K1(x  x⇤)2 
f(x+ c)  f(x  c)
c
(x  x⇤)   K0(x  x⇤)2.
2. f 0(x)(x  x⇤) < 0 for all x 2 R\{x⇤}.
We also assume that the tuning sequences satisfy:
1. an/c2n  (an+1/c2n+1)(1 + Aan+1) for all n   1,
2. an ! 0 as n !1,
90
with 0 < A < 2K0. Then,
E(Xn+1   x⇤)2  Can/c2n for all n   1, (4.1)
where C is a constant explicitly defined as
C = max
⇢
 2
⇠
, max
1nn0 ⇢ c2nanBn+1   ,
and
Dn = K21Aa2n + (K21   2AK0)an   2K0   A,
n0 = 1 if Dn < 0 for all n   1 and
sup
 
n   1 : (K21   2AK0)an +K21Aa2n   2K0   A + 1 otherwise,
⇠ =   sup{A  2K0 + (K21   2AK0)an +K21Aa2n : n   n0},
Bn = X21 nYi=1 pi + n 1Xi=2 qi nYj=i+1 pj + qn,
pi = 1  2aiK0 +K21a2i , for i = 1, 2, . . . , n,
qi = a2ic2i  2, for i = 1, 2, . . . , n,
 2 = supx2⇥Var[f˜(Xn + cn)  f˜(Xn   cn)|Xn = x].
Equation (4.1) does not guarantee convergence in MSE, but rather establishes
a finite bound for each iteration. The bound is thus more useful when it is tight.
In Section 4.3.1, we investigate the tightness of (4.1) by comparing the bound with
the exact MSE of simple quadratic functions of the form f(x) = ↵x2 where ↵ < 0
and the optimal x⇤ = 0. The exact MSE can be computed as follows:
E(Xn+1   x⇤)2 = X21⇧ni=1(1 + 2↵ai)2 +  22 n 1Xk=1 a2kc2k⇧nj=k+1(1 + 2↵aj)2 + a2n 22c2n . (4.2)
91
4.3 Numerical Experiments
4.3.1 Tightness of the Finite-time MSE Bound for Quadratics
We generated the MSE bound in (4.1) and the exact MSE in (4.2) for quadratic
functions with various noise levels and initial starting values for three di↵erent cases:
1) f(x) =  0.001x2, cn = 1/n1/2, f(x) =  0.15x2, cn = 1/n1/4 and f(x) =
 0.15x2, cn = 1/n1/2. The MSE bound is a function of constants that are not
unique, satisfying S1, S3, A1, and A2. We picked the largest K0 and smallest K1
satisfying A1 and A slightly less than 2K0. Table 4.2 lists the constants used in our
calculations for the MSE bound in (4.1), and the exact MSE and MSE bound are
listed in Table 4.1. The exact MSE (4.2) is a sum of three components. The first
term on the right hand side (RHS) of (4.2) is independent of   and is dominated by
the initial starting value, X1. The second and third terms in (4.2) are dominated by
 . When   2 {0.001, 0.01, 0.1, 1.0}, both terms are small (< 1), but when   = 10,
the RHS is dominated by the second term. Therefore, the exact MSE increases with
X1 and  . Using the parameters in Table 5.1, the constant C in the MSE bound
can be expressed as
C=max
⇢
 2
⇠
,
c21
a1
✓
X1(1  2a1K0 +K21a
2
1)(1  2a2K0 +K
2
1a
2
2) +
a22
c22
 2
◆ 
. (4.3)
The first term in (4.3) dominates when   is large since ⇠ = 0.001, 0.1 for
f(x) =  0.001x2, 0.15x2, respectively. Therefore, the MSE bound and di↵erence
between the exact MSE and MSE bound increases significantly when   increases
from 1.0 to 10.0. Otherwise, the MSE bound is equal to the second term, which
92
Table 4.1: Finite-time MSE bound and exact MSE for KW with
n = 10000, an = 1/n
f(x) =  0.001x2 f(x) =  0.15x2, f(x) =  0.15x2
cn = 1/n1/2 cn = 1/n1/4 cn = 1/n1/2
  X1 Exact Bound Exact Bound Exact Bound
0.001
0 0.00 0.00 0.00 0.00 0.00 0.00
-5 24.04 24.85 0.06 8.85 0.06 0.09
-10 96.16 99.40 0.24 35.40 0.24 0.35
-20 384.64 397.61 0.95 141.61 0.95 1.42
-40 1538.56 1590.42 3.78 566.44 3.78 5.66
0.01
0 0.00 0.10 0.00 0.00 0.00 0.00
-5 24.04 24.85 0.06 8.85 0.06 0.09
-10 96.16 99.40 0.24 35.40 0.24 0.35
-20 384.64 397.61 0.95 141.61 0.95 1.42
-40 1538.56 1590.42 3.78 566.44 3.78 5.66
0.1
0 0.05 10.0 0.01 0.10 0.00 0.00
-5 24.09 24.86 0.07 8.86 0.06 0.09
-10 96.21 99.41 0.24 35.41 0.24 0.35
-20 384.69 397.61 0.95 141.62 0.95 1.42
-40 1538.61 1590.43 3.79 566.45 3.78 5.66
1.0
0 4.8 999.99 0.83 10.00 0.03 0.10
-5 28.84 999.99 0.89 10.00 0.09 0.10
-10 100.96 999.99 1.07 35.90 0.27 0.36
-20 389.44 999.99 1.78 142.11 0.98 1.42
-40 1543.36 1590.92 4.61 566.94 3.81 5.67
10.0
0 480.08 99999.20 83.23 999.79 3.20 9.99
-5 504.12 99999.20 83.29 999.79 3.26 9.99
-10 576.24 99999.20 83.47 999.79 3.43 9.99
-20 864.72 99999.20 84.18 999.79 4.14 9.99
-40 2018.64 99999.20 87.01 999.79 6.98 9.99
93
Table 4.2: Finite-time MSE Bound Parameters for KW
f(x) an cn K0 K1 A n0 ⇠
 0.001x2 1/n 1/n1/2 0.002 0002 0.003 1 0.001
 0.15x2 1/n 1/n1/2 0.3 0.3 0.5 1 0.1
 0.15x2 1/n 1/n1/4 0.3 0.3 0.5 1 0.1
increases with X1 and  . Table 4.1 contains the exact MSE and MSE bound for
three di↵erent parameter settings and for   2 {0.001, 0.01, 0.1, 1.0, 10.0}. The first
column in Table 4.1 presents results for f(x) =  0.001x2, cn = 1/n1/2. In the
presence of more noise, i.e.   = 10.0, the MSE bound is 99, 999.20, which is the first
term in (4.3) for each initial starting value. The di↵erence between this bound and
the exact MSE is significant with a di↵erence greater than 97, 500. For   = 1.0, the
MSE bound only takes the second term in (4.3), when the starting position is farther
from the optimum, i.e. X1 =  40 and is tight. However, when the initial starting
value is closer to the optimum, i.e. X1 = 0, 5, 10, 20, the MSE bound is equal
to 999.99, which is the first term in (4.3), and thus the MSE bound is significantly
greater than the exact MSE. The MSE bound is very tight for rest of the cases with
the exception of when   = 0.1 and X1 = 0. For the second column with f(x) =
 0.15x2, cn = 1/n1/2, the MSE bound is significantly greater than the exact MSE
across the board. The third column reports results for f(x) =  0.15x2, cn = 1/n1/2
the MSE bound is tight for all cases with the exception of the case with   = 10.0.
It would seem that the bound is a useful guideline for problems with low variance,
but becomes less tight as the noise level increases.
94
4.3.2 Sensitivity of KW and its Variants
We compare the MSE performance between KW and two of its variants de-
scribed in Section 2.2.3, Kesten’s rule and SSKW. All experiments were implemented
with an = ✓a/n, cn = ✓c/ns where s 2 {1/4, 1/2}, ✓a > 0, ✓c > 0, 10000 iterations,
and 1000 sample paths.
Not surprisingly, the performance of SSKW relative to KW heavily depends
on the chosen parameters such as truncated interval length, initial starting value,
and tuning sequences. Our analysis replicates the results of [7], where SSKW per-
forms significantly better than KW in terms of MSE and oscillatory period, but
we find that the chosen parameters for this experiment are among the worst pos-
sible parameters for KW as illustrated in Figure 4.2 with KW and SSKW under
✓a = ✓c = 1. By choosing a di↵erent initial starting position, the performance of
KW can be significantly improved, as demonstrated in Table 4.3 for two functions
f(x) =  0.001x2 and f(x) = 100e 0.006x2 . To o↵er a contrast with the quadratic
function, the second function considered is very steep and has flat tails.
[7] compared the SSKW performance with that of KW whose MSE is highly
reliant on the tuning sequences and initial start value. The MSE performance results
for f(x) =  0.001x2 using KW in [7] were poor because the initial position was
chosen to be far from the optimum and the gain size an was too small to make any
noticeable progress towards it after 10000 iterations, so the iterates hover around
the initial position. In our numerical experiments, we also consider an = ✓an and
cn = ✓cn1/4 for ✓a, ✓c > 0. If ✓a = ✓c = 1 as in [7], but the initial start value is 0.01
95
-40 -20 0 20 40
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
X
1
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
0.0
00
0.0
05
0.0
10
0.0
15
0.0
20
MS
E
KW θ
a
= 1, θ
c
= 1
KW θ
a
= 500, θ
c
= 4
KW θ
a
= 90, θ
c
= 5
Kesten θ
a
= 1, θ
c
= 1
Kesten θ
a
= 10, θ
c
= 5
Kesten θ
a
= 100, θ
c
= 1
SSKW θ
a
= 1, θ
c
= 1Figure 4.2: MSE of the 10000th iterate of KW and Kesten for three
parameter settings and SSKW for f(x) =  0.001x2,   = 0.001, an =
✓a/n, cn = ✓c/n1/4.
instead of 30, then the MSE from KW is significantly lower compared to SSKW. The
first column in Table 4.3 compares the MSE all three algorithms with X1 = 0.01,
and clearly, KW outperforms SSKW in almost all cases. Of course, a practitioner
would have no way of knowing whether or not the starting iterate was close to the
true optimum, so these results do not indicate that KW will always perform well.
They do indicate, however, that KW exhibits substantial variation in performance.
We also conduct a sensitivity analysis for f(x) =  0.001x2 with various start-
ing positions X1 and multiplicative constants, ✓a, and ✓c and implement SSKW and
KW using Kesten’s rule. For the sensitivity analysis, we considered a wide selec-
tion of parameters: 19 initial starting values uniformly spaced within the truncated
96
f(x) =  0.001x2 [-50, 50] f(x) = 100e 0.006x2 [-50, 50]
X1 = 0.01 X1 = 30
  Alg. 100 1000 10000 100 1000 10000
.001
SSKW 5.10x10
 2
1.70x10
 2
5.00x10
 3
5.07x10
 2
1.68x10
 2
4.84x10
 3
KW 10
 4
10
 4
10
 4
763.8 653.3 431.4
Kesten 1.12x10
 4
1.08x10
 4
1.04x10
 4
10
 7
3x10
 8
10
 8
.01
SSKW 5.10x10
 2
1.70x10
 2
5.00x10
 3
5.07 1.68 4.90x10
 1
KW 10
 4
10
 4
10
 4
763.8 653.3 431.2
Kesten 2.10x10
 3
2.11x10
 3
2.05x10
 3
9.54x10
 6
2.76x10
 6
8.41x10
 7
.1
SSKW 5.10x10
 2
1.70x10
 2
5.00x10
 3
165.8 57.4 16.0
KW 10
 4
10
 4
10
 4
763.4 651.4 418.2
Kesten 2.01x10
 1
2.03x10
 1
1.97x10
 1
5.65x10
 2
2.76x10
 4
8.41x10
 5
1.0
SSKW 5.10x10
 2
1.70x10
 2
5.00x10
 3
187.2 57.8 18.7
KW 10
 4
10
 4
10
 4
722.5 562.5 415.7
Kesten 20.1 20.3 19.7 456.9 315.1 239.7Table 4.3: MSE of the 100th, 1000th, and 10000th iteration for KW
and its variates with an = 1/n, cn = 1/n1/4.
interval X1 2 { 50 + 5k | k = 1, 2, ...19}, 45 di↵erent ✓a values parametrized by
✓a 2 {10sk | k = 1, 2, . . . , 9, s = 0, 1, . . . , 4} and 10 di↵erent ✓c values parametrized
by ✓c 2 {10sk | k = 1, 2, ..., 5, s = 0, 1}. In total, there are 8550 possible combina-
tions of parameters.
The results show that KW and Kesten’s rule are sensitive to the parameter
choice, but near-optimal performance can be obtained with tuning. Figure 4.2 plots
the MSE of KW for f(x) =  0.001x2,   = 0.001 against the initial starting values
X1 for di↵erent sets of parameter choices. These cases serve as a good representation
of the majority of the MSE behaviors among the entire set of results. The case with
✓a = ✓c = 1 is among the worst for KW and Kesten’s rule. The MSE is represented
by a nearly vertical line for both algorithms. For this parameter setting, SSKW beats
97
KW and Kesten’s rule significantly for all initial values with the exception of X1 = 0.
For the case where ✓a = 90, ✓c = 5, KW outperforms SSKW in a neighborhood
around the optimum. However, there are cases such as ✓a = 500, ✓c = 4 for KW
and ✓a = 100, ✓c = 1 for Kesten’s rule that outperform SSKW for all initial start
values. Of the 8550 combinations varying all parameters and 450 combinations with
X1 = 30, KW performs better than SSKW in 4275 and 215 cases, respectively,
suggesting that KW requires some tuning to perform well, but that there is a fairly
wide range of tunable parameters that yield good performance. If KW performs
better than KW, the di↵erence is not as pronounced as when SSKW outperforms
KW, but careful tuning can partially mitigate KW’s sensitivity to parameters such
as the initial iterate.
0 2 4 6 8 10
0
40
0
80
0
σ=0.01
ln(θ
a
)
MS
E
0
40
0
80
0
MS
E
0
40
0
80
0
MS
E θc = 1θ
c
= 10
θ
c
= 40
0 2 4 6 8 10
0
40
0
80
0
σ=0.1
ln(θ
a
)
MS
E
0
40
0
80
0
MS
E
0
40
0
80
0
MS
E
0 2 4 6 8 10
0
40
0
80
0
σ=1.0
ln(θ
a
)
MS
E
0
40
0
80
0
MS
E
0
40
0
80
0
MS
E
0 2 4 6 8 10
0
40
0
80
0
σ=10.0
ln(θ
a
)
MS
E
0
40
0
80
0
MS
E
0
40
0
80
0
MS
EFigure 4.3: Sensitivity of KW to ✓a for f(x) =  0.001x2, an =
✓a/n, cn = ✓c/n1/4, n = 10000.
98
2 4 6 8 10
-5
0
5
10
σ = 0.01
φ
a
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
2 4 6 8 10
-5
0
5
10
σ = 0.1
φ
a
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
2 4 6 8 10
-5
0
5
10
σ = 1.0
φ
a
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
2 4 6 8 10
-5
0
5
10
σ = 10.0
φ
a
ln(
MS
E)
-5
0
5
10
ln(
MS
E)
-5
0
5
10
ln(
MS
E)Figure 4.4: Sensitivity of SSKW for f(x) =  0.001x2 as a function
of  a, n = 10000.
Figure 4.3 plots the MSE f(x) =  0.001x2, an = ✓a/n, cn = ✓c/n1/4of the
10000th iterate as a function of log ✓a given ✓c = 1, 10, 40. The case where   = 0.001
is omitted, because the results are similar to those for   = 0.01. For log ✓a < 4,
the MSE decreases for each given value of ✓c. However, for log ✓a   4, the MSE
behaves di↵erently for all noise levels. But, the overall behavior as a function of ✓c
is similar across noise levels. The MSE decreases for all ✓a as ✓c increases, so in the
case where ✓c = 40, there is a wide range of ✓a values where the MSE of KW is lower
than that of SSKW. But the MSE of KW could also be extremely high if the tuning
sequences are not chosen well. Moreover, we investigate the sensitivity of SSKW to
 a, which is the upper bound of the scale up factor for {an} as depicted in Figure
4.4. The MSE decreases until  a, the maximum scale up factor for {an}, is equal to
99
-40 -20 0 20 40
-15
-5
0
5
σ = 0.01
X
1
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-40 -20 0 20 40
-15
-5
0
5
σ = 0.1
X
1
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-40 -20 0 20 40
-15
-5
0
5
σ = 1.0
X
1
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-40 -20 0 20 40
-15
-5
0
5
σ = 10.0
X
1
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
-15
-5
0
5
ln(
MS
E)
KW
Kesten
SSKWFigure 4.5: MSE Comparison of KW, Kesten, and SSKW for f(x) =
100e 0.006x2 , an = 1/n, cn = 1/n1/4, n = 10000.
4 and increases for   = 0.01, 0.1 while it levels o↵ for   2 {1.0, 10.0} thereafter. It
seems that for lower noise levels,i.e.   2 {0.01, 0.1},  a = 4 is a better choice, while
 a = 10 leads to a lower MSE for   2 {1.0, 10.0}.
In addition, we implement KW and its variants using the same parameters
(i.e., an = 1/n, cn = 1/n1/4, X1 = 30) as in [7] on f(x) = 100e 0.006x2 to test
the algorithms under the same setting for a di↵erent function. Figure 4.5 plots
the MSE of the 10000th iterate as a function of the initial start value. KW and
Kesten’s rule outperform SSKW within certain intervals around the optimum for
  2 {0.001, 0.01, 0.1, 1.0} and Kesten’s better performance intervals overlap the
intervals of KW. However, the KW using the deterministic step-size 1/n performs
100
better than using Kesten’s step-size where the intervals overlap, which can be seen
in Figure 4.5. Unfortunately, outside of those intervals, both algorithms have a
tendency to perform poorly. However, for the other four noise levels, the intervals
where KW and Kesten’s rule outperform SSKW are larger. However, there is a
tradeo↵, since if by chance the initial start value is closer to the boundary, the
di↵erence in performance can be drastic.
4.4 PROX-step
The idea behind our adaptive step size can be easily conveyed in the two
examples illustrated in Figures 4.6a and 4.6b. For a well-behaved function without
drastic changes in its gradient in neighborhoods, as depicted in Figure 4.6b, there is
a possibility of selecting a step size that leads to good finite-time performance, where
the iterates can approach the minimum at a good pace without heavy oscillations.
However, the function in Figure 4.6a has a steep valley region around the optimum,
whereas the rest of the function is extremely flat. This type of function arises in
the likelihood function typically used to find maximum likelihood estimators. If we
use a single step size sequence, the performance will often be poor since the two
extreme regions require two drastically di↵erent step sizes. For instance, a larger
step size is appropriate in the flat area to avoid getting stuck at a poor estimate;
however, once the iterate reaches the valley region, the larger step size is no longer
appropriate, and the recursion will overshoot and launch the estimate back into a
flat region. This cycle will continue until the step size is su ciently small so that
101
−5 −4 −3 −2 −1 0 1 2 3 4 5−80
−60
−40
−20
0
20
40 (a) Contrasting regions −5 −4 −3 −2 −1 0 1 2 3 4 50510152025 (b) Well-behavedFigure 4.6: Illustration of PROX-Step Motivational
the iterates make progress towards the minimum. This is only possible if the step
size was small before entering the valley region, which means many iterations were
necessary to reach that point.
We propose an adaptive step size, inspired by both Kesten’s rule and SSKW,
that adjusts based on sample performances, gradient estimates, and parameter es-
timates, taking advantage of past data. Kesten’s rule is Markovian-like, since the
next step size depends on the previous gradient (or estimate) as well as the current
gradient (or estimate). On the other hand, SSKW’s adaptive mechanisms depend
largely on the feasible search space/region and the two most recent parameter es-
timates. While the aforementioned adaptive step sizes have significantly advanced
SA algorithms, they have their pitfalls. Kesten’s rule starts with an arbitrary non-
increasing step size, which can still fall into shortcomings as in the deterministic
case if an is too large. It does however, prevent the step size from decreasing if an is
too small (assuming consecutive gradients are the same sign). An extension by [53]
102
allows the step size to increase, but it can be slow. SSKW certainly addresses the
issue of implementing SA with an inappropriately small or large an, but di↵erent
regions in the feasible search space call for di↵erent step sizes. For instance, if the
current iterate is in a very steep portion of the function, then under SSKW, it will
reduce the step size faster. However in the next iteration after the step size reduc-
tion, the estimate could be located in a flat region, which requires a larger an to
accelerate the convergence to a neighborhood of the optimum. In the worst case, the
estimate gets stuck in a region because of the newly reduced an is too small, since
aside from the initial phase, SSKW can only reduce an and cannot enlarge/increase
it.
Although the underlying notion of our adaptive step size is motivated by
Kesten and SSKW, we approach adaptivity from a di↵erent angle. Our method
adopts a memory-based approach, where the step size is adjusted based on the
perceived proximity to the optimum as in Kesten and is increased/decreased when
deemed necessary as in SSKW, but the adjustments are determined by the sample
performance(s) in addition to the gradient(s) and parameter estimate(s) used in
other methods. Each simulation run used to obtain the sample performance and
gradient estimate (if available) is expensive and the values generated contain infor-
mation that can be used in future iterations to improve the SA performance. We
e ciently exploit the data collected in previous iterations to aid in the selection of
an appropriate step size to be used in the current update.
For illustrative purposes, consider the simple one-dimensional KW algorithm.
At each iteration, two sample performances are generated to compute the finite
103
di↵erence gradient estimate for the recursive update. After collecting a su cient
number of observations, we can roughly guess how close the current estimate is from
the optimum based on the moving average of sample performances in comparison
to the current observation. If the current iterate is considered close, then a smaller
step size could prevent the next iterate from completely overshooting. In contrast,
if the estimate is believed to be further away, then a larger step size could decrease
the amount of e↵ort required to put the iterate in closer range of the optimum. For
a minimization problem, the estimate is perceived to be closer to the optimum if
the current objective value is less than the mean of the past objective values, and
further from the optimum if it is greater than the mean, i.e.,
• fn > f¯n (farther),
• fn < f¯n (closer),
where fn is the average of sample performances generated at the nth iteration and
f¯n is the running average of all sample performances (i.e., for KW with symmetric
di↵erences, f+n = f(xn+cn), f n = f(xn cn), fn = (f+n +f n )/2, and f¯n = 1n Pni=1 fi.
Our adaptive step size begins with two positive fixed sequences {a+n } and {a n },
where a n < a+n for all n. We first tailor the adaptive step size to SA algorithms
that involve a symmetric finite di↵erence gradient (e.g., KW or STAR-SA), and
then generalize it. We now present PROX-step for one-dimensional SA algorithms
involving a symmetric finite di↵erence, followed by a generalization.
104
PROX-step for KW or STAR-SA
• Step 0. Input: x1 2 ⇥, {a+n }, {a n }, {cn}, N .
• Step 1. Initialize f¯0 = 0, n = 1.
• Step 2. Generate f+n and f n to compute brf(xn).
• Step 3. Update mean objective value f¯n = n 1n f¯n 1 + 1nfn.
• Step 4. Set
an = 8>><>>: a n if fn < f¯n,a+n if fn > f¯n.
• Step 5. Update xn+1 = ⇧⇥ ⇣xn   an brf(xn)⌘.
• Step 6. If n < N , set n = n+ 1, go to Step 2.
• Step 7. Output: x⇤N = xNGeneralized PROX-step for SA
• Step 0. Input: x1 2 ⇥, {a+n }, {a n }, N .
• Step 1. Initialize f¯0 = 0, n = 1.
• Step 2. Generate nd sample performances f (i)n for i = 1, . . . , nd to computebrf(xn).
105
• Step 3. Update mean objective value f¯n = n 1n f¯n 1 + 1nfn, where fn =
1nd Pndi=1 f (i)n .
• Step 4. Set
an = 8>><>>: a n if fn < f¯n,a+n if fn > f¯n.
• Step 5. Update xn+1 = ⇧⇥ ⇣xn   an brf(xn)⌘.
• Step 6. If n < N , set n = n+ 1, go to Step 2.
• Step 7. Output: x⇤N = xN
4.5 Adaptive PROX-step
Although the PROX-step is adaptive, the two sequences used to generate it
are both deterministic, so in the end, it still faces some of the same shortcomings
of implementing a deterministic step size. Each sequence {a+n } and {a n } could be
inappropriately selected for its corresponding feasible region; therefore, we propose a
method to adjust the sequences based on past events. In particular, throughout the
recursion, in addition to the necessary information used to implement the PROX-
step, we track the step size used at each iteration to determine whether or not the
smaller sequence {a n } appropriate. If the step sizes used are alternating between
{a+n } and {a n } consistently, then it is an indication that either we are close to the
optimum or the smaller step size is too big. In either case, it would be ideal to
106
decrease the smaller step size sequence. Furthermore, in applying Kesten’s rule, we
do not have a separate step sizes for each dimension, so instead, we look at the the
total number of sign changes in the gradient and if more than half of the dimensions
keep the same sign, then we keep the same two step size sequences for the next
iteration (i.e., {a±n } = {a±n 1}). Lastly, we look at the magnitude of each violation
||xn+1   ⇧ (xn+1) ||2 and of each iteration change ||xn   ⇧ (xn+1) ||2 and reduce or
increase the step sir sequence accordingly while maintaining a n < a+n still holds.
Step size adjustments can be done in the following ways:
• Scaling
• Kesten’s rule
Scaling will be used for decreasing and increasing the step size by dividing and
multiplying by the scaling factor, respectively, and Kesten’s rule will keep the step
size constant.
Additional input parameters must be chosen:
• ↵ > 1 is the scale factor
• {dn} is the projection violation threshold
• {en} is the no-movement threshold
• nL is the number of allowable scales
The asymptotic theory still applies, since the scaling and shifting will only
occur finitely many times.
107
Adaptive PROX-step
• Step 0. Input: x1 2 ⇥, {a+n }, {a n }, {dn}, ↵, nL, N .
• Step 1. Initialize f¯0 = 0, n = 1, nL = 0.
• Step 2. Generate nd sample performances f (i)n for i = 1, . . . , nd to computebrf(xn).
• Step 3. Update mean objective value f¯n = n 1n f¯n 1 + 1nfn.
• Step 4. Set
an = 8>><>>: a n if fn < f¯n,a+n if fn > f¯n.
• Step 5. Set
a[n] =
8>><>>:  1 if fn < f¯n,1 if fn > f¯n.
• Step 6. Update xn+1 = ⇧⇥ ⇣xn   an brf(xn)⌘.
• Step 7. Modify step size sequences.
– Scaling down.
⇤ If nS < nL and ||xn+1   ⇧ (xn+1) ||2 > dn, and
· if an = a+n , then set {a+n } = {a+n /↵}, nS = nS + 1.
· if an = a n , then set {a n } = {a n /↵}, nS = nS + 1.
108
⇤ If n > 3, nS < nL, a[n]a[n   1] < 0, a[n   1]a[n   2] < 0, and
a[n  2]a[n  3] < 0, then set {a n } = {a n /↵} and nS = nS + 1.
– Scaling up.
⇤ If nS < nL and 0 < ||xn   ⇧ (xn+1) ||2 < en, and
· if an = a+n , then set {a+n } = {↵a+n }, nS = nS + 1.
· if an = a n , then set {a n } = {↵a n }, nS = nS + 1.
– Kesten.
⇤ If
Pdi=1 I{brfi(xn)·brfi(xn+1)>0} > d/2, then {a+n } = {a+n 1} and {a n } =
{a n 1}.
• Step 8. If n < N , set n = n+ 1, go to Step 2.
• Step 9. Output: x⇤N = xN
Unfortunately, the input parameters must be chosen by the users, which can
have a significant e↵ect on the performance. In our numerical experiments, we
choose the parameters conservatively so that the original two sequences do not
change dramatically from one iteration to the next.
4.6 Numerical Experiments
We investigate the performance of PROX-step and aPROX-step on two con-
trasting functions with added noise as well as the queueing network example de-
scribed in Section 3.4.
109
4.6.1 Deterministic Problem with Added Noise
For the one-dimensional case, we consider two deterministic functions: 1) ex-
ponential that is very steep near the optimum with very flat tails and 2) a simple
quadratic. We compare the standard decreasing step size, Kesten’s rule, and the
PROX-step and Adaptive PROX-step applied to RM. For the fixed decreasing step
sizes, we look at three di↵erent step size parameters (i.e., from the PROX-step, Big:
{a+n }, Small: {a n }, Average: {(a+n +a n )/2}). We apply Kesten’s rule to the average
step size sequence {(a+n + a n )/2} and use PROX and aPROX with the parameter
dn = 10 for all n. The initial value x1 = 10 and feasible region ⇥ = [ 30, 70]
were arbitrarily chosen. We consider N = 50 to be the stopping time, 20 macro
replications, and two step size settings for a±n .
To demonstrate the robustness of PROX and aPROX, we fix the smaller se-
quence {a n }, vary the larger sequence {a+n }, and plot the MSE of the estimate xN
and f(xN) for both f(x) = 100e 0.006x2 and f(x) =  x2. Figure 4.7 presents results
for the exponential case with a n = 10/n and a+n = 5(k+1)/n, where k is the x-axis,
where Figure 4.7a is the full plot and Figure 4.7b is a magnified version. Clearly,
Kesten’s rule applied to the average step size {(a+n + a n )/2} and the larger step size
{a+n } perform poorly as a+n increases, since their MSEs of the estimate xN are much
higher than the rest in the left figure, and the function value f(xN) is much lower
in the right graph (higher is better since we are maximizing). The performance of
using the fixed small step size {a n } is represented by the horizontal lines since it
is independent of a+n , performs better than Big and KestenAvg, but both PROX
110
1 2 3 4 5 6 7 8 9 100
500
1000
1500
2000
2500
3000
MSE
 of X
N
k
 
 
1 2 3 4 5 6 7 8 9 1020
30
40
50
60
70
80
90
100
100e
xp(−
0.00
6x N2 )
k
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXKestenAvgSmallBigAdaptPROX(a) MSE of xN and f(xN )
1 2 3 4 5 6 7 8 9 100
100
200
300
400
500
MSE
 of X
N
k
 
 
1 2 3 4 5 6 7 8 9 1075
80
85
90
100e
xp(−
0.00
6x N2 )
k
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXKestenAvgSmallBigAdaptPROX(b) Magnified versionFigure 4.7: f(x) = 100e 0.006x2 , ✓ = [ 30, 70], x1 = 10, a n = 10/n,
a+n = 5(k + 1)/n,  f =  g = 0.1, ↵ = 1.025, nL = 20, N = 50,
macroreplications = 20.
111
1 2 3 4 5 6 7 8 9 100
50
100
150
200
250
300
MSE
 of X
N
k
 
 
1 2 3 4 5 6 7 8 9 1050
55
60
65
70
75
80
85
90
95
100
100e
xp(−
0.00
6x N2 )
k
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXKestenAvgSmallBigAdaptPROX(a) MSE of xN and f(xN )
1 2 3 4 5 6 7 8 9 100.01
0.0105
0.011
0.0115
0.012
MSE
 of X
N
k
 
 
1 2 3 4 5 6 7 8 9 1099.9925
99.993
99.9935
99.994
99.9945
100e
xp(−
0.00
6x N2 )
k
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXKestenAvgSmallBigAdaptPROX(b) Magnified versionFigure 4.8: f(x) = 100e 0.006x2 , ✓ = [ 30, 70], x1 = 10, a n =
1/n, a+n = 2k/n,  f =  g = 0.1, ↵ = 1.025, nL = 20, N = 50,
macroreplications = 20.
112
0 5 10 15 20 25 30 35 40 45 500
200
400
600
800
1000
MSE
 of X
N
Number of Iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
20
40
60
80
100
100
exp
(−0.
006
x2 )
Number of Iterations
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXAvgKestenSmallBigAdaptPROX(a) a n = 1/n, a+n = 10/n
0 5 10 15 20 25 30 35 40 45 500
500
1000
1500
2000
2500
MS
E o
f X N
Number of Iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
20
40
60
80
100
100
exp
(−0.
006
x2 )
Number of Iterations
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXAvgKestenSmallBigAdaptPROX(b) a n = 10/n, a+n = 50/nFigure 4.9: MSE of xN and f(xN), f(x) = 100 exp 0.006x2 , ↵ = 1.5,
✓ = [ 30, 70], x1 = 10,  f =  g = 0.1,, ↵ = 1.5, nL = 50, N = 50,
macroreplications = 20.
and aPROX appear to outperform in almost all cases. Figure 4.7b zooms in on
the better performances for a clearer comparison. Figure 4.8 presents results for a
similar setting with the exception of step sizes, where, a n = 1/n and a+n = 2k/n.
The reduction in step sizes result in the same two step sizes (Big and KestenAvg)
performing poorly, but the average step size sequence also joins the group. From
Figure 4.8b, aPROX outperforms all algorithms for the range of step sizes {a+n },
and the PROX results are not far behind.
We also plot the MSE of xN and the function f(xN) as a function of the
113
0 5 10 15 20 25 30 35 40 45 500
200
400
600
800
1000
1200
1400
1600
1800
2000
MSE
 of X
N
Number of Iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
10
20
30
40
50
60
70
80
90
100
100e
xp(−
0.00
6x2 )
Number of Iterations
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXAvgKestenSmallBigAdaptPROX(a) a n = .1/n, a+n = 50/n
0 5 10 15 20 25 30 35 40 45 500
500
1000
1500
2000
2500
MSE
 of X
N
Number of Iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
10
20
30
40
50
60
70
80
90
100
100e
xp(−
0.00
6x2 )
Number of Iterations
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXAvgKestenSmallBigAdaptPROX(b) a n = 1/n, a+n = 50/n
0 5 10 15 20 25 30 35 40 45 500
500
1000
1500
2000
2500
MSE
 of X
N
Number of Iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
10
20
30
40
50
60
70
80
90
100
100e
xp(−
0.00
6x2 )
Number of Iterations
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXAvgKestenSmallBigAdaptPROX(c) a n = 10/n, a+n = 50/nFigure 4.10: MSE of xN and f(xN), f(x) = 100 exp 0.006x2 , ↵ = 2
✓ = [ 30, 70], x1 = 10,  f =  g = 0.1, nL = 20, N = 50, macrorepli-
cations = 20.
114
1 2 3 4 5 6 7 8 9 100
1000
2000
3000
4000
5000
MSE
 of X
N
k
 
 
1 2 3 4 5 6 7 8 9 10−5000
−4000
−3000
−2000
−1000
0
100e
xp(−
0.00
6x N2 )
k
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXKestenAvgSmallBigAdaptPROX(a) N = 10, a n = 1/n, a+n = 2k/n
1 2 3 4 5 6 7 8 9 100
200
400
600
800
1000
MSE
 of X
N
k
 
 
1 2 3 4 5 6 7 8 9 10−1000
−800
−600
−400
−200
0
100e
xp(−
0.00
6x N2 )
k
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXKestenAvgSmallBigAdaptPROX(b) N = 50, a n = 10/n, a+n = 5(k + 1)/nFigure 4.11: MSE of xN and f(xN), f(x) =  x2, ✓ = [ 30, 70],
x1 = 10,  f = 0.1,  g = 0.1, nL = 20, macroreplications = 20.
iteration in Figure 4.9 for two di↵erent step size settings. Figure 4.9a and 4.9a
illustrate the results for the exponential function with smaller and larger step sizes,
respectively. For the smaller step sizes, (i.e., a n = 1/n and a+n = 10/n), all of
the algorithms with the exception of Big seem to be approaching the optimum,
but PROX, aPROX, and Small do so earlier on, followed by KestenAvg and then
Avg. With a larger step size, KestenAvg and Big perform very poorly, while the
others make progress as the number of iterations increase. From Figure 4.9b, PROX
appears to outperform all algorithms followed by aPROX. Moreover, Figure 4.10 also
plots the MSE of xN and f(xN) for a similar setting, but we increase the scale factor
↵ from 1.5 to 2. For all three subfigures a+n = 50/n, but the smaller step size a n
increases from 0.1/n to 1/n to 10/n. In Figures 4.10a and 4.10c, aPROX performs
115
0 5 10 15 20 25 30 35 40 45 500
1000
2000
3000
4000
5000
MSE
 of X
N
Number of Iterations
 
 
0 5 10 15 20 25 30 35 40 45 50−5000
−4000
−3000
−2000
−1000
0
100e
xp(−
0.00
6x2 )
Number of Iterations
 
 AvgPROXKestenAvgSmallBigAdaptPROX
AvgPROXAvgKestenSmallBigAdaptPROXFigure 4.12: MSE of xN and f(xN), f(x) =  x2, N = 50, x1 = 10,
 f = 0.1,  g = 1.0, nL = 50, a n = 1/n, a+n = 20/n
the best followed by PROX, and both algorithms outperform the rest. These figures
show that PROX has promise and aPROX can improve the performance of PROX
in certain cases. The performance for PROX and aPROX in Figure 4.10b are almost
identical.
We also test the adaptive step sizes on the simple quadratic function f(x) =
 x2 for the same two step size settings. The results are similar to those in the
exponential case, which can be seen in Figures 4.11a, 4.11b, and 4.12. This results
shows that PROX and aPROX can be applied to both well-behaved and ill-behaved
functions. In all of the cases, PROX and aPROX perform similarly, which is not a
surprise since we used conservative parameter settings.
4.6.2 9-station Closed Jackson Queueing Network
We empirically test the PROX-step and the aPROX-step on the close queueing
network problem from Section 3.4 with identical settings with the exception of the
step size parameter. Figures 4.13a, 4.13b, 4.13c, 4.14a, 4.14b, and 4.14c illustrate
the results of the MSE of the xN as well as the throughput TP (xN) for various step
116
sizes for six di↵erent step size parameters. In all of the cases, PROX and aPROX
are either in the top three algorithms or the top two at the stopping time. Figure
4.14c depicts an extreme case where the step sizes are either extremely small or
large, but outperform the other algorithms.
4.7 Summary and Future Work
We investigated the sensitivity of SA algorithms to step size sequences as well
as the tightness of a finite-time MSE bound. In addition, we introduced the first
adaptive step size algorithm based on two deterministic sequences from which the
current step size is selected from, which is determined by the current and past sample
performances. We also propose an adaptive method to adjust the two deterministic
sequences based on current and past events. We empirically show the promise on
stylized problems and a queueing network.
Unfortunately, the adaptive method in modifying the two initial sequences
rely on user chosen parameters, so in the future, we would like to find ways to
automatically select these parameters. In addition, we could apply this step size to
other algorithms and numerical examples.
117
0 5 10 15 20 25 30 35 40 45 500.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
MSE of Estimate xN
MSE
 of x
N
# of iterations
 
 
0 5 10 15 20 25 30 35 40 45 501.5
2
2.5
3
3.5
4
4.5
5
5.5
6 Throughput
TP(
x N)
# of iterations
 
 AvgPROXSmBigaPROX
AvgPROXSmBigaPROX(a) a n = 0.1/n, a+n = 10/n, ↵ = 1.5, dn = 5
0 5 10 15 20 25 30 35 40 45 500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
MSE of Estimate xN
MSE
 of x
N
# of iterations
 
 
0 5 10 15 20 25 30 35 40 45 501.5
2
2.5
3
3.5
4
4.5
5
5.5
6 Throughput
TP(
x N)
# of iterations
 
 AvgPROXSmBigaPROX
AvgPROXSmBigaPROX(b) a n = 0.01/n, a+n = 20/n, ↵ = 1.5, dn = 5
0 5 10 15 20 25 30 35 40 45 500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
MSE of Estimate xN
MSE
 of x
N
# of iterations
 
 
0 5 10 15 20 25 30 35 40 45 501
2
3
4
5
6
7 Throughput
TP(
x N)
# of iterations
 
 AvgPROXSmBigaPROX
AvgPROXSmBigaPROX(c) a n = 10/n, a+n = 20/n, ↵ = 1.5, dn = 5Figure 4.13
118
0 5 10 15 20 25 30 35 40 45 500
0.2
0.4
0.6
0.8
1
1.2
1.4
MSE of Estimate xN
MSE
 of x
N
# of iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
1
2
3
4
5
6
7 Throughput
TP(
x N)
# of iterations
 
 AvgPROXSmBigaPROX
AvgPROXSmBigaPROX(a) a n = 10/n, a+n = 50/n, ↵ = 1.5, dn = 5
0 5 10 15 20 25 30 35 40 45 500.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
MSE of Estimate xN
MSE
 of x
N
# of iterations
 
 
0 5 10 15 20 25 30 35 40 45 501.5
2
2.5
3
3.5
4
4.5
5
5.5
6 Throughput
TP(
x N)
# of iterations
 
 AvgPROXSmBigaPROX
AvgPROXSmBigaPROX(b) a n = 0.1/n, a+n = 10/n, ↵ = 2, nL = 20
0 5 10 15 20 25 30 35 40 45 500
0.2
0.4
0.6
0.8
1
1.2
1.4
MSE of Estimate xN
MSE
 of x
N
# of iterations
 
 
0 5 10 15 20 25 30 35 40 45 500
1
2
3
4
5
6
7 Throughput
TP(
x N)
# of iterations
 
 AvgPROXSmBigaPROX
AvgPROXSmBigaPROX(c) a n = 0.001/n, a+n = 50/n, ↵ = 2, nL = 20Figure 4.14
119
Chapter 5
Greek Kernel Estimators
5.1 Introduction
Financial derivatives have become an integral part of risk management. There-
fore, accessing and quantifying the risk of these assets have sparked great interest
in the finance industry. The Greeks are derivatives of (financial) derivative prices
with respect to an underlying market parameter. Each Greek measures a di↵erent
dimension of market risk. For instance, “delta,” “vega,” and “theta” are first-order
derivatives of the expected payo↵ with respect to price, volatility, and time, respec-
tively, and “gamma” is the second-order derivative with respect to price.
One type of financial derivative used for hedging is an option, which gives the
security holder the right to exercise the option once certain conditions are satisfied,
which is dependent on the option type. For example, the exercise condition for an
Asian digital option depends on depends on the average price over a pre-specified
period, and it pays a lump sum if the option is “in the money” and zero otherwise.
Up-and-out barrier call options can be exercised once the price reaches the strike
price but does not cross the upper barrier. Both of these options have discontinuous
payo↵ functions, and the option price can be written as
E[g(S) · I{h(S) 0}], (5.1)
120
where the expectation is taken under the risk-neutral measure, S is the price modeled
by a stochastic process, I{·} is the indicator function, and g(S) is the discounted
payo↵ when the exercise condition, h(S)   0, is satisfied. The price S depends on
the market parameters ✓ 2 ⇥, where ⇥ is an open set. For simplicity, assume ✓
is one-dimensional, since we can focus on one market parameter and hold all else
constant. The derivative of the payo↵ function (5.1) with respect to ✓ is a first-order
Greek, and the derivative of the first-order Greek with respect to ✓ is a second-order
Greek, both of which we wish to estimate. Some well-known derivative estimation
methods include finite di↵erence approximation, perturbation analysis (also known
as the pathwise method), the score function or likelihood ratio (SF/LR) method, and
the weak derivative (WD) method. Each method has its advantages and drawbacks.
The most straightforward approach to estimate Greeks is the finite di↵erence
method. Although this method is easy to implement, the resulting estimators are
biased and generally have a high mean-squared error (MSE) [33]. In addition, a per-
turbation sequence { ✓} must be selected, which a↵ects the bias/variance tradeo↵,
although symmetric di↵erences can be used to reduce the order of bias [24]. In
addition, smaller perturbations could lead to more noise because of the stochastic-
ity. The pathwise method, generally known as infinitesimal perturbation analysis
(IPA) in the simulation community, is easy to implement and often performs well in
comparison with other methods, but unfortunately, it is not always applicable. For
instance, it is not suitable for discontinuous functions nor second-order Greeks [43].
Smoothed perturbation analysis (SPA) is able to overcome the discontinuity issue,
and [28] suggest using SPA to smooth out the discontinuity by conditioning on ap-
121
propriate random variables (e.g., [61]). The key, which is the most di cult part of
this method, is choosing the random variable on which to condition. On the con-
trary, the weak derivative (WD) method is applicable to discontinuous functions,
but may require a high number of simulations [24]. The likelihood ratio method
can also approximate the gradient of discontinuous functions, but usually leads to
high variance. For a comprehensive review of gradient estimation methods, refer
to [33], [23], and [4].
[43] introduces a modified pathwise method to estimate first- and second-
order Greeks for options with discontinuous payo↵s. It is a generalization of the
classical pathwise method, which extends to options with discontinuous payo↵s (5.1)
by rewriting the Greek as a sum of an expectation and a derivative(s) with respect
to an auxiliary parameter. The general pathwise method and kernel method are
applied to the expectation and derivative terms, respectively. The accuracy of the
kernel estimator depends/hinges on the chosen kernel function, which relies on a
bandwidth/smoothing parameter, described in more detail in Section 5.3.3. A pilot
simulation proposed in [43] generates an “optimal” bandwidth parameter prior to
applying the modified pathwise method. The pilot simulation involves various input
parameters, which a↵ect the bandwidth output. The modified pathwise method has
been applied to estimate delta, vega, theta and gamma of the Asian digital option
and up-and-out barrier call option.
In this chapter, we examine the accuracy of the Greeks estimated using a modi-
fied pathwise gradient estimation method for an Asian digital option and up-and-out
barrier call option by conducting two sets of numerical experiments. The first set
122
investigates the sensitivity of the Greek estimators to the bandwidth parameter(s),
and the second explores the sensitivity of a proposed method used to generate “op-
timal” bandwidths for the modified pathwise method to various input parameters.
Our numerical results show that the Greek estimators are quite sensitive to the
bandwidth, so the performance of the pilot simulation is critical.
5.2 Problem Setting
We focus on options with a discontinuous payo↵ function that can be written in
the form in (5.1), where g(·) and h(·) are di↵erentiable functions, and S is a vector of
discretized prices. Let Sti denote the price of the security at time ti   0, where ti =
iT/k for i = 0, 1, 2, . . . , k, which are evenly spaced time points between 0 and T. For
simplicity, denote Sti by Si, and S = (S0, S1, . . . , Sk)0 . Let p(✓) = E[f(S)], where
p
0
(✓) = @p(✓)/@✓ is the first-order Greek with respect to ✓ and p
00
(✓) = @2p(✓)/@✓2
is the second-order Greek with respect to ✓. If ✓ = S0, ✓ =  , and ✓ = ⌧ , then
p
0
(✓) is called delta, vega, and theta, respectively, and if ✓ = S0, then p
00
(✓) is called
gamma.
We explore Asian digital options and up-and-out barrier call options, both
of which have discontinuous payo↵s. For the Asian digital option, the discounted
payo↵ function g(S) = e rT , where r is the risk-free rate and T is the duration of
time considered, and the exercise criterion is based on an average of k discretized
prices S = 1k Pki=1 Si exceeding a pre-specified threshold K over a predetermined
period of time, i.e., whether or not h(S) , S  K   0. The up-and-out barrier call
123
Table 5.1: Parameters.
r   b µ S0 K T U k
Asian digital option 0.05 0.30 0.2 98 100 100 1 - 10, 20, 50
up-and-out barrier call option 0.05 0.20 - - 100 100 1 120 20, 50
option can be exercised if the final price Sk is greater than or equal to a threshold
K, but the maximum price Smax = max{S1, . . . , Sk} never exceeds an upper barrier
U , i.e., whether or not h(S) = min{Sk   K,U   Smax}   0. The payo↵ function
is the amount by which the final price Sk exceeds the strike price K discounted by
e rT , i.e., g(S) = e rT (Sk  K).
In the numerical experiments, the prices considered for the Asian digital option
and the up-and-out barrier call option follow an Ornstein-Uhlenbeck process and a
geometric Brownian motion generated, respectively, using
Si = Si 1e b⌧ + µ(1  e b⌧ ) +  p(1  e 2b⌧ )/(2b)Zi for i = 1, 2, ..., k,
St = S0e(r  2/2)t+ Bt for t = T/k, 2T/k, ..., T,
where b,  , µ, r, k, and ⌧ are the mean reversion rate, volatility, mean return, risk-
free interest rate, number of discretized intervals, and incremental time step (T/k),
respectively. The variables Zi and Bt are independent standard normal random
variables and standard Brownian motion, respectively. The values considered are
listed in Table 5.1.
124
5.3 Generalized Pathwise Method
The pathwise method is in general not applicable to second-order Greeks nor
discontinuous functions, but the modified kernel version is one method that circum-
vents this issue.
5.3.1 First-Order Greeks
The first-order Greek estimator based on the modified pathwise method is
based on the following theoretical result from [43].Assumption 1. For any ✓ 2 ⇥, g(S) and h(S) are di↵erentiable with respect to ✓
with probability 1 (w.p.1), and there exist random variables Kg and Kh with finite
second moments that may depend on ✓, such that |g(S(✓+ ✓)) g(S(✓))|  Kg| ✓|
and |h(S(✓ + ✓))  h(S(✓))|  Kh| ✓| when | ✓| is su ciently small.Assumption 2. For any ✓ 2 ⇥, @✓ (✓, y) exists and is continuous at (✓, y) with
 (✓, y) = E[g(S) · 1{h(S) y}].Theorem 5.3.1. Suppose g(·) and h(·) satisfy Assumptions 1 and 2, and E[|g(S)|2] <
+1 and E[|h(S)|2] < +1. Then
p
0
(✓) = E[@✓g(S) · 1{h(S) 0}]  @yE[g(S)@✓h(S) · 1{h(S) y}]|y=0 (5.2)
The two terms on the right hand side of (2) can be estimated easily, because
the first term is an ordinary expectation, where g(S) is di↵erentiable with respect
to ✓, and the second term is a derivative with respect to y, which is independent
125
of g(S) and h(S). Thus, the first term can be estimated using the law of large
numbers, and finite di↵erences can be used to estimate the second term based on
 @yE ⇥g(S)@✓h(S) · 1{h(S) y}⇤ |y=0
=   lim !0 1 {E ⇥g(S)@✓h(S) · 1{h(S)  /2}⇤  E ⇥g(S)@✓h(S) · 1{h(S)   /2}⇤}
= lim !0 1 E[g(S)@✓h(S) · 1{  /2h(S) /2}]
= lim !0 1 E g(S)@✓h(S) · Z ✓h(S)  ◆  ,
where Z(u) = 1{ 1/2u1/2}, which is known as the uniform or naive kernel because
u ⇠ U( 1/2, 1/2). It can be replaced with any other kernel, which we elaborate
on in Section 5.3.3. In practice, Assumptions 1 and 2 are typically satisfied. Refer
to [43] for the proof and a discussion of the assumptions.
5.3.2 Second-Order Greeks
The second-order Greeks are estimated using a theorem from [43] based on
the following two assumptions:Assumption 3. For any (✓1, ✓2) 2 ⇥, g(S) and h(S) are di↵erentiable with respect
to ✓1 with probability 1 (w.p.1), and @✓1g(S) and @✓1h(S) are di↵erentiable with re-
spect to ✓2 with probability 1 (w.p.1), and there exist variables Kg, Kh, Lg, and Lc
with finite fourth moments, which may depend on (✓1, ✓2), such that |g(S(✓1 +
 ✓1, ✓2))   g(S(✓1, ✓2))|  Kg| ✓1|, |g(S(✓1, ✓2 + ✓2, ))   g(S(✓1, ✓2))|  Kg| ✓2|,
|h(S(✓1 + ✓1, ✓2))  h(S(✓1, ✓2))|  Kh| ✓1|, |h(S(✓1, ✓2 + ✓2, ))  h(S(✓1, ✓2))| 
Kh| ✓2|, |@✓1g(S(✓1 +  ✓1, ✓2))   @✓1g(S(✓1, ✓2))|  Lg| ✓2|, and |g(S(✓1, ✓2 +
126
 ✓2, ))@✓1h(S(✓1, ✓2 +  ✓2, ))   g(S(✓1, ✓2))@✓1h(S(✓1, ✓2))|  Lc| ✓2|, when | ✓1|
and | ✓2| are su ciently small.Assumption 4. For any ✓ 2 ⇥, @✓1@✓2 (✓1, ✓2, y) exists and is continuous at
(✓1, ✓2, 0), where  (✓1, ✓2, y) = E[g(S) · 1{h(S) y}].Theorem 5.3.2. Suppose g(·) and h(·) satisfy Assumptions 3 and 4, and E[|g(S)|4] <
+1 and E[|h(S)|4] < +1. Then
@✓1@✓2p(✓1, ✓2) = E[@✓1@✓2g(S) · 1{h(S) 0}]
 @yE[(g(S)@✓1@✓2h(S) + @✓1g(S)@✓2h(S) + @✓2g(S)@✓1h(S)) · 1{h(S) y}]|y=0
+@2yE[g(S)@✓1h(S)@✓2h(S)1{h(S) y}]|y=0.
We focus on the second-order Greek gamma, where ✓ = S0; therefore,
@2✓p(✓) = E[@2✓g(S) · 1{h(S) 0}]
 @yE[(g(S)@2✓h(S) + 2 · @✓g(S)@✓h(S)) · 1{h(S) y}]|y=0
+@2yE[g(S)@✓h(S)@✓h(S)1{h(S) y}]|y=0.
Again, the first term on the right hand side can be estimated using using a sample
mean. The finite di↵erences derivation for the second term is similar to (5.3), and
for the finite di↵erence representation of the third term, observe that
 @yE[g(S)(@✓h(S))2 · 1{h(S) y}]|y=0
= lim !0 1 E g(S)(@✓h(S))2 · Z ✓h(S)  y  ◆     y=0,
127
and
@2yE ⇥g(S)(@✓h(S))2 · 1{h(S) y}⇤ |y=0
= lim !0 1 2E g(S)(@✓h(S))2 · Z 0✓h(S)  ◆  , (5.3)
where Z(·) is the kernel function. The approximation of the second-order auxiliary
term (5.3) along with the first-order term (5.3) can be seen in the estimator (5.5).
5.3.3 Kernel
A d-dimensional kernel, K : Rd ! R, is a bounded symmetric density with
respect to the the Lebesgue measure with limkukd!1 k u k K(u) = 0 and RRd k u k2
K(u)du < 1 where k · k is any norm on Rd [6]. A kernel K has the form
K(u) =
1
h
K
⇣u
h
⌘
, u 2 Rd,
where h is the bandwidth, also known as the smoothing parameter [6]. For this
paper, we focus on the case for d = 1. According to [6], a reasonable kernel does not
a↵ect the asymptotic behavior of the estimator; however, the bandwidth significantly
influences the accuracy of the estimator. Large bandwidths decrease the variance
but increase the bias, whereas small bandwidths lead to the exact opposite; hence,
there is a tradeo↵ between variance and bias. Therefore, careful attention should
be given to the selection process. Although there are other methods to quantify the
accuracy of the kernel estimator, [43] measure it based on the MSE. Consequently,
they attempt to select the bandwidth that maximizes the accuracy using an iterative
pilot simulation.
128
5.3.4 First- and Second-Order Greek Estimators
Based on Theorem 5.3.1 and 5.3.2, the following estimators are used to ap-
proximate the first- and second-order Greeks, respectively:
Gn = 1n nXl=1 g0l · 1[hl 0] + 1n n nXl=1 gl · h0l · Z hl n!, (5.4)
Hn = 1n nXl=1 g00l · 1[hl0] + 1n n nXl=1 {gl · h00l + 2 · g0l · h0l} · Z hl n!
+
1
n 2n nXl=1 gl · (h0l)2 · Z 0 hl n!, (5.5)
where (gl, hl, g0l, h0l, g00l , h00l ) is the lth observation of (g(S), h(S), @✓g(S), @✓h(S), @2✓g(S),
@2✓h(S), Z(·) is a kernel, and  n and  n are constant bandwidth parameters gener-
ated using a pilot simulation. The second-order estimator (5.5) is a special case
when both derivatives are taken with respect to the same parameter.
5.4 Pilot Simulation
Ideally, the optimal bandwidth parameter would minimize the error between
the Greek estimator and its true value, and the pilot simulation proposed in [43]
employs an iterative method that generates an “optimal” bandwidth that minimize
the asymptotic mean-squared error (MSE). The MSE of the first-order estimator
can be expressed as
MSE(Gn) = [E(Gn)  @✓E[g(S) · 1{h(S) 0}]]2 + Var(Gn)
=  4n  00(0)2 Z 1 1 u2Z(u)du+ ✏n 2 + 1n n ( 21 + ⇠n),
129
where  (u) = fh(u)·E[g(S)·@✓h(S)|h(S) = u],   (u) = fh(u)·E[|g(S)·@✓h(S)| |h(S) =
u], and  21 =  2(0)
R1
 1 Z
2(u)du. Since ✏n ! 0 and ⇠n ! 0 as n ! 1, then the
optimal bandwidth choice is c · n 1/5, where
c =
"
 21
( 00(0)
R1
 1 u
2Z(u)du)2
#1/5
.
This constant c is estimated by
cˆ =
"
V n
( b 00(0) R1 1 u2Z(u)du)2#1/5 ,
where V n is the sample variance of Gn and b 00(0) is the finite di↵erence approxima-
tion of  
00
(0): b 00(0) = Gn(s) +Gn( s)  2Gn(0)
s2
,
where s is a su ciently small step size and
Gn(u) = 1n n nXl=1 gl · h0l · Z ✓h(S)  u n ◆ .
Therefore, the optimal bandwidth is approximated by  ?n = cˆ · n 1/5. The pilot
simulation begins with cˆ = 1, and is updated after each iteration using the same
sample. More specifically, at each iteration, V n and b 00(0) are simulated using the
updated  ?n with n = 500. This process continues for a pre-specified number of
iterations, and the bandwidth selected is the one generated after 30 pilot iterations,
as in [43]. The pilot simulation algorithm does not specify the variance sample size
nv to generate V n, step size s to estimate  00(0), and the number of iterations for the
pilot simulation np, which are all predetermined. Refer to the e-companion of [43]
for the algorithm details and the pilot simulation for the second-order Greeks.
130
5.5 Numerical Experiments
We conduct two sets of numerical experiments to test the robustness of a
modified pathwise method and a pilot simulation used to generate an input pa-
rameter for the modified method [8]. The first is a sensitivity analysis of Greek
estimators for both the Asian digital option and up-and-out barrier call option to
bandwidth parameters, and the second explores the e↵ect of the input parameters to
the bandwidth generated from the pilot simulation. We considered identical settings
as in [43], which are shown in Table 5.1 and described in Section 5.2. However, the
sample size considered here is 100 as opposed to 1000 in [43]. From our preliminary
results, the increase in sample size only tightens the confidence band and smooths
out the curves, but the overall behavior is similar across estimators, so we opt to use
a smaller sample size. Furthermore, we also consider the standard normal density
as the kernel to increase the robustness of the resulting estimator.
5.5.1 Sensitivity to Bandwidth
In our experiment for each Greek, we generate 100 estimators for incremen-
tally increasing bandwidths and observe the e↵ect on the relative root mean-squared
error (RRMSE) and the 95% confidence band (CB). Figures 5.1, 5.2, 5.3, and 5.4
plot the 95% confidence band for the Greek estimator generated using 1000 sample
paths for Asian Greeks, 10000 sample paths for barrier Greeks, and sample size of
100 for incrementally increasing cˆ. The black curve in between is the mean, and the
horizontal dashed line represents the exact Greek value. The three vertical lines in
131
Figures 5.1, 5.2, and 5.3 represent the 95% confidence interval for the cˆ generated
using the pilot simulation with np = 30, nv = 100, which we will elaborate on in the
next section. In the same figures, the second row of graphs plots the RRMSE of the
100 generated Greeks for incrementally increasing cˆ, where the additional vertical
dashed line denotes the c⇤ that minimizes the RRMSE. The only di↵erence when
n = 1000 increases to n = 10000 for the Asian Greek estimators is the increased
smoothness of the curves and tighter confidence bands, but again, the overall be-
havior is the same. [43] reports results for n = 103, 104, 105 for the Asian Greeks,
but we only test the case for n = 103.
Not surprisingly, the bandwidth a↵ects the accuracy and precision of the Greek
estimator. Generally, as the bandwidth decreases, the variance increases, which can
be seen in the widening of the 95% confidence band as cˆ ! 0 and is observable in all
of the estimators. Similarly, the 95% confidence band decreases as the bandwidth
increases, but eventually the confidence band no longer contains the exact Greek
value. Figures 5.2, 5.3, and 5.4 for the Asian vega, Asian gamma, and barrier theta
estimators illustrate this behavior, respectively. In addition, the minimum RRMSE
of these estimates occur in one particular region for all discretization levels. Figure
5.2 shows that the Asian vega estimator is rather close to the exact value for a
range of cˆ values less than 0.1. The barrier theta estimator behaves similarly but
for a di↵erent range of cˆ values as seen in Figure 5.4. In contrast, the Asian gamma
estimator is extremely sensitive to cˆ, and the RRMSE increases significantly with
even slight deviations from the optimal bandwidth. Moreover, the 95% confidence
interval (CI) for the bandwidth parameter is very narrow and does not contain the
132
Table 5.2: Asian delta.
 n for s
n k c
⇤
RRMSE lower upper range  
⇤n 0.1 0.01 0.001
500 10 0.64 0.028 0.915 1.021 0.106 0.185 0.14 0.06 0.06
20 0.46 0.038 0.975 1.132 0.156 0.133 0.15 0.05 0.05
50 0.54 0.033 0.998 1.130 0.132 0.156 0.14 0.06 0.06
1000 10 0.74 0.020 0.931 1.005 0.074 0.186 0.14 0.05 0.04
20 0.54 0.028 0.996 1.104 0.108 0.136 0.13 0.05 0.05
50 0.60 0.024 1.019 1.117 0.099 0.151 0.13 0.05 0.05
2000 10 0.84 0.014 0.943 0.996 0.053 0.184 0.13 0.03 0.03
20 0.56 0.021 1.012 1.097 0.084 0.122 0.13 0.03 0.02
50 0.68 0.017 1.034 1.105 0.071 0.149 0.13 0.03 0.02
5000 10 1.00 0.009 0.954 0.988 0.035 0.182 0.15 0.06 0.06
20 0.66 0.014 1.027 1.083 0.055 0.120 0.16 0.07 0.07
50 0.80 0.011 1.050 1.094 0.044 0.146 0.16 0.06 0.06
10000 10 1.14 0.007 0.960 0.985 0.020 0.181 0.15 0.03 0.02
20 0.72 0.011 1.036 1.076 0.040 0.114 0.12 0.03 0.02
50 0.92 0.008 1.056 1.088 0.032 0.146 0.13 0.03 0.02
exact Asian gamma value.
The numerical results for the Asian delta are quite di↵erent as illustrated in
Figure 5.1. The mean of the delta estimator intersects with the exact/analytical
value at least twice for k = 10 and k = 50. For the case k = 10, the RRMSE
has a minimum and a nearly flat region. The minimum of the RRMSE for each
discretization case occurs at di↵erent points for a given number of sample paths
n, unlike the Asian vega, Asian gamma, and barrier theta. The exact cˆ values at
which the minimums occurs, the resulting RRMSE, and 95% confidence interval of
the Asian delta estimators generated using the cˆ are listed in Table 5.2.
Another interesting case is the gamma estimator for the up-and-out barrier
call option, which relies on two bandwidth parameters  n and  n. The Asian gamma
133
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.0
1.2
k=10
c
^
De
lta
0.8
1.0
1.2
0.8
1.0
1.2 95% CB
mean delta
exact delta
95% CI
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.0
1.2
k=20
c
^
De
lta
0.8
1.0
1.2
0.8
1.0
1.2
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.0
1.2
k=50
c
^
De
lta
0.8
1.0
1.2
0.8
1.0
1.2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0
0.1
0
c
^
RR
MS
E
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0
0.1
0
c
^
RR
MS
E
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0
0.1
0
c
^
RR
MS
E
95% CI for c
^
optimal c
^Figure 5.1: 95% confidence band for Asian delta (solid curves) for
n = 1000 with w/95% confidence interval for bandwidth (vertical
dashed lines) generated using s = 0.1 and RRMSE for 100 sample
paths.
0.0 0.1 0.2 0.3 0.4
0.5
0.7
0.9
1.1
k=10
c
^
Ve
ga
0.5
0.7
0.9
1.1
0.5
0.7
0.9
1.1
95% CB
mean vega
exact vega
95% CI
0.0 0.1 0.2 0.3 0.4
0.5
0.7
0.9
1.1
k=20
c
^
Ve
ga
0.5
0.7
0.9
1.1
0.5
0.7
0.9
1.1
0.0 0.1 0.2 0.3 0.4
0.5
0.7
0.9
1.1
k=50
c
^
Ve
ga
0.5
0.7
0.9
1.1
0.5
0.7
0.9
1.1
0.0 0.1 0.2 0.3 0.4
0.0
0
0.1
5
0.3
0
c
^
RR
MS
E
0.0 0.1 0.2 0.3 0.4
0.0
0
0.1
5
0.3
0
c
^
RR
MS
E
0.0 0.1 0.2 0.3 0.4
0.0
0
0.1
5
0.3
0
c
^
RR
MS
E
95% CI for c
^
optimal c
^Figure 5.2: 95% confidence band for Asian vega (solid curves) for
n = 1000 w/95% confidence interval for bandwidth (vertical dashed
lines) generated using s = 0.001 and RRMSE for 100 sample paths.
134
0.0 0.1 0.2 0.3 0.4 0.5 0.6
4
6
8
10
k=10
c
^
Ga
mm
a
4
6
8
10
4
6
8
10
95% CB
mean gamma
exact gamma
95% CI
0.0 0.1 0.2 0.3 0.4 0.5 0.6
4
6
8
10
k=20
c
^
Ga
mm
a
4
6
8
10
4
6
8
10
0.0 0.1 0.2 0.3 0.4 0.5 0.6
4
6
8
10
k=50
c
^
Ga
mm
a
4
6
8
10
4
6
8
10
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0
1.0
2.0
3.0
c
^
RR
MS
E
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0
1.0
2.0
3.0
c
^
RR
MS
E
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0
1.0
2.0
3.0
c
^
RR
MS
E
95% CI for c
^
optimal c
^Figure 5.3: 95% confidence band for Asian gamma (solid curves) for
n = 1000 w/95% confidence interval for bandwidth (vertical dashed
lines) generated using s = 0.01 and RRMSE for 100 sample paths.
is also a second-order Greek, but the second term in (5.4) containing  n disappears
with the given settings. Figure 5.5 plots the barrier gamma as a function of the
bandwidth parameters  n (visible horizontal axis) and  n (not visible), where the
vertical planes represent the approximately exact gamma for k = 20, 50, respectively.
It is clear that gamma is insensitive to  n, but extremely sensitive to  n, especially
when  n < .03. Unfortunately, this is the region in which the estimator attains the
value closest to the true value. Since the analytical expression is not available for
comparison, the true values are approximated using finite di↵erences with a large
sample of 109. In this case, the steepness of the gamma estimator near the “exact”
value prompts the need for a very accurate and precise  n; otherwise, the estimate
could be drastically di↵erent from its true value. As expected, the margin of error
decreases as the number of sample paths n increases. For each discretization level
k, cˆ increases with n, but the resulting  n values are similar for each k regardless of
135
0 5 10 15 20 25 30
1.0
1.4
1.8
k=20
c
^
The
ta
1.0
1.4
1.8
1.0
1.4
1.8
0 5 10 15 20 25 30
1.0
1.4
1.8
k=50
c
^
The
ta
1.0
1.4
1.8
1.0
1.4
1.8
0 5 10 15 20 25 30
0.0
0
0.1
0
c
^
RR
MS
E
0 5 10 15 20 25 30
0.0
0
0.1
0
c
^
RR
MS
EFigure 5.4: 95% confidence band for barrier theta for n = 10000 and
RRMSE for 100 sample paths.
n.
The conclusions in this section are representative of all of the estimators tested.
For instance, the Asian delta is the only estimator whose mean matches the exact
Greek value for two di↵erent bandwidth values and the barrier gamma is the only
estimator that requires the input of two bandwidth parameters. The Asian vega,
in general, has a decreasing 95% confidence band whose mean also decreases with
cˆ and matches the exact Greek value for one particular cˆ, and there is one global
minimum RRMSE, which represents the results of the remaining Greek estimators.
5.5.2 Sensitivity of Bandwidth Generator to Input Parameters
Our sensitivity analysis of the modified pathwise method applied to Greeks
concludes that it is, in general, very sensitive to the bandwidth parameter(s), so the
key to a good estimator is in the bandwidth selection process. Therefore, we inves-
136
(a) k = 20
(b) k = 50Figure 5.5: Barrier gamma estimator for n = 10000, 100 sample
paths.
137
tigate the sensitivity of the pilot simulation to the various input parameters such as
variance sample size nv, perturbation size s, number of sample paths n, and number
of pilot iterations np, as described in Section 5.4. We considered the following set-
tings: nv 2 {50, 100, 500, 1000}, s 2 {0.001, 0.01, 0.1}, n 2 {500, 1000, 2000, 5000},
and np 2 {1, . . . , 100}. For each setting, we compute the 95% confidence inter-
val for the bandwidth output and compare it against the optimal bandwidth that
minimizes the RRMSE in Table 5.2. The pilot simulation experiments consider a
variance sample size of 100 (i.e., nv = 100) and 30 pilot iterations (i.e, np = 30),
unless stated otherwise. Over an extensive set of parameter settings, the results are
quite similar, so we limit our discussion to representative cases.
First, we discuss the results of the pilot simulation for the Asian delta. For
any fixed perturbation size, the average cˆ under the Asian delta estimator is quite
similar across sample paths and discretization level k, with a maximum range of
0.04. As n increases, the average cˆ decreases ever so slightly. One would think that
variance of cˆ decreases with n, i.e., the confidence band would decrease with the n;
however, the perturbation size s appears to be the determining factor from Figure
5.6. In this case, the smaller perturbation size s = 0.1 leads to wider confidence
intervals, which is interesting, because smaller perturbations tend to generate noisier
estimates. Therefore, it could be advantageous to consider an average or a window
average, as opposed to selecting the cˆ generated after a certain number of iterations.
For instance, the average cˆ fluctuates, especially in the case of s = 0.1, so taking
a further average of multiple pilot iterations at the later stage, could generate a
parameter that is more stable.
138
0 10 20 30 40 50
0.0
0.2
0.4
n=500, s=0.1
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=1000, s=0.1
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=2000, s=0.1
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=5000, s=0.1
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=500, s=0.01
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=1000, s=0.01
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=2000, s=0.01
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=5000, s=0.01
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=500, s=0.001
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=1000, s=0.001
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=2000, s=0.001
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4
0 10 20 30 40 50
0.0
0.2
0.4
n=5000, s=0.001
Number of Pilot Iterations
c^
0.0
0.2
0.4
0.0
0.2
0.4Figure 5.6: Asian delta pilot 95% confidence interval for cˆ for k = 10.
Another input parameter that was not specified is the sample size nv used to
generate the sample variance V n. So we investigate the sensitivity of the bandwidth
parameter to nv for each of the perturbation sizes s. Again, we focus on the Asian
delta case for the discretization case k = 10, sample path n = 1000, vary nv 2
{50, 100, 500, 1000}, and s 2 {0.001, 0.01, 0.1}. Figure 5.7 plots the generated cˆ
against the number of pilot simulations np for a sample size of 100. The results
of the sensitivity to nv are inconclusive, since the behavior is inconsistent across
di↵erent parameter settings. For this particular example, the larger perturbation
size s = 0.1 leads to a higher bandwidth parameter, which can also be seen in Figure
5.6.
Clearly, the input parameters have a significant impact on the bandwidth
139
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 10 s = 0.1
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 10 s = 0.01
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 10 s = 0.001
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 20 s = 0.1
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 20 s = 0.01
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 20 s = 0.001
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 50 s = 0.1
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 50 s = 0.01
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000
0 20 40 60 80 100
0.0
6
0.1
0
0.1
4
0.1
8
k = 50 s = 0.001
n
p
c^
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
0.0
6
0.1
0
0.1
4
0.1
8
vn=50
vn=100
vn=500
vn=1000Figure 5.7: Asian delta pilot, n = 500, sample size = 100.
parameter generated from the pilot simulation, and we investigate further to deter-
mine whether or not the confidence interval of the bandwidth parameter captures
the bandwidth that minimizes the RRMSE for parameters n = 1000, nv = 100,
np = 30, sample size 100, s = 0.1 for the delta, s = 0.01 for the gamma, and
s = 0.001 for both vega and theta Asian Greek estimators. Values for the perturba-
tion parameters were not specified in [43], but were obtained from the authors via
personal communication. The sizes were chosen relative to the parameter of interest.
Our results vary significantly, from successfully capturing the optimal bandwidth to
completely missing it. Figures 5.1, 5.2, and 5.3 contain vertical dotted lines rep-
resenting the 95% confidence interval for the optimal c⇤ with the above specified
parameters, the dotted line denotes the mean, and the dashed line is the c⇤ that
140
minimizes the RRMSE from our first set of numerical experiments. None of the 95%
confidence intervals of cˆ capture the minimum RRMSE for the Asian delta Greek
estimator. Instead, the CI contains mean cˆ⇤. The confidence interval for the cˆ gen-
erated from the pilot simulation for the Asian vega, covers the minimum RRMSE
bandwidth, which can be seen in Figure 5.2.
5.6 Summary and Future Work
In this chapter, we analyzed the sensitivity of the Greek kernel estimators
of [43] to the bandwidth parameters and the sensitivity and performance of the
pilot simulation in selecting the “optimal” bandwidth parameters to input parame-
ters. The simulated experiments show that the kernel estimators are quite sensitive
to the bandwidth choice, and the pilot simulation is sensitive to the input param-
eters. Of the input parameters, the perturbation step size s significantly impacts
the generated bandwidth. If s is too small relative to the parameter of interest,
then the pilot simulation could generate bandwidths that lead to poor gradient es-
timates. However, even if the s is chosen proportional to the parameter of interest,
the 95% confidence interval for the cˆ might not capture the c⇤ that minimizes the
RRMSE. The input parameters play a vital role in the performance of the estimator,
so the input selection process of the pilot simulation and other bandwidth selection
methods are important future research areas.
141
Bibliography
[1] S. Andrado´ttir. A new algorithm for stochastic approximation. Proceedings of
the 1990 Winter Simulation Conference, pages 364–366, 1990.
[2] S. Andrado´ttir. A stochastic approximation algorithm with varying bounds.
Operations Research, 43(6):1037–1048, 1995.
[3] S. Andrado´ttir. A scaled stochastic approximation algorithm. Management
Science, 42(4):475–498, 1996.
[4] S. Asmussen and P. Glynn. Stochastic Simulation: Algorithms and Analysis.
Springer, New York, 2007.
[5] J. Blum. Multidimensional stochastic approximation methods. The Annals of
Mathematical Statistics, 25(4):737–744–200, 1954.
[6] D. Bosq. Nonparametric Statistics for Stochastic Processes. Springer, second
edition, 1998.
[7] M. Broadie, D. Cicek, and A. Zeevi. General bounds and finite-time improve-
ment for the Kiefer-Wolfowitz stochastic approximation algorithm. Operations
Research, 59(5):1211–1224, 2011.
[8] M. Chau and M. Fu. On the sensitivity of Greek kernel estimators to band-
width parameters. Proceedings of the 2014 Winter Simulation Conference,
pages 3845–3856, 2014.
[9] M. Chau and M. Fu. An overview of stochastic approximation. In M. Fu, editor,
Handbook of Simulation Optimization, chapter 6, pages 149–178. Springer, 2014.
[10] M. Chau, H. Qu, and M. Fu. A new hybrid stochastic approximation algorithm.
In Proceedings of the 2014 Workshop on Discrete Event Systems, volume 12,
pages 241–246, 2014.
[11] M. Chau, H. Qu, and M. F. I. Ryzhov. An empirical sensitivity analysis of the
Kiefer-Wolfowitz algorithm and its variants. Proceedings of the 2013 Winter
Simulation Conference, pages 945–965, 2013.
[12] H. Chen, T. Duncan, and B. Pasik-Duncan. A Kiefer-Wolfowitz algorithm with
randomized di↵erences. IEEE Transactions on Automatic Control, 44(3):442–
453, 1999.
[13] H. Chen and Y. Zhu. Stochastic approximation procedure with randomly vary-
ing truncations. Scientia Sinica Series A, 29:914–926, 1986.
[14] K. Chung. On a stochastic approximation method. The Annals of Mathematical
Statistics, 25(3):463–483, 1954.
142
[15] B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Jour-
nal on Optimization, 3(4):868–881, 1993.
[16] C. Derman. An application of Chung’s lemma to the Kiefer-Wolfowitz stochas-
tic approximation procedure. The Annals of Mathematical Statistics, 27(2):532–
536, 1956.
[17] J. Dippon and J. Renz. Weighted means in stochastic approximation of minima.
SIAM Journal on Control Optimization, 35(5):1811–1827, 1997.
[18] V. Dupac. On the Kiefer-Wolfowitz approximation method. Cˇasopis pro
pe´stova´n´ı Matematiky, 082(1):47–75, 1957.
[19] V. Fabian. Stochastic approximation of minima with improved asymptotic
speed. The Annals of Mathematical Statistics, 38(1):191–200, 1967.
[20] V. Fabian. On asymptotic normally in stochastic approximation. The Annals
of Mathematical Statistics, 39(4):1107–1380, 1968.
[21] V. Fabian. Stochastic approximation. In J. Rustagi, editor, Optimizing Methods
in Statistics, pages 439–470. Academic Press, New York, 1971.
[22] E. Fogel. A fundamental approach to the convergence analysis of least squares
algorithms. IEEE Transactions on Automatic Control, 26(3):646 – 655, 1981.
[23] M. Fu. Gradient estimation. In S. Henderson and B. Nelson, editors, Hand-
books in Operations Research and Management Science: Simulation, chapter 19,
pages 575–616. Elsevier, Amsterdam, 2006.
[24] M. Fu. What you should know about simulation and derivatives. Naval Research
Logistics, 55(8):723 – 736, 2008.
[25] M. Fu. Stochastic gradient estimation. In M. Fu, editor, Handbook of Simulation
Optimization, chapter 5, pages 105–147. Springer, 2014.
[26] M. Fu and S. Hill. Optimization of discrete event systems via simultaneous
perturbation stochastic approximation. IIE Transactions, 29(3):233–243, 1997.
[27] M. Fu and Y. Ho. Using perturbation analysis for gradient estimation, averaging
and updating in a stochastic approximation algorithm. Proceedings of the 1988
Winter Simulation Conference, pages 509–517, 1988.
[28] M. Fu and J. Hu. Conditional Monte Carlo: Gradient Estimation and Opti-
mization Applications. Kluwer, Boston, MA, 1997.
[29] A. George and W. Powell. Adaptive stepsizes for recursive estimation
with applications in approximate dynamic programming. Machine Learning,
65(1):167–198, Oct. 2006.
143
[30] L. Gerencse´r. Convergence rates of moments in stochastic approximation with
simultaneous perturbation gradient approximation and resetting. IEEE Trans-
actions on Automatic Control, 44(5):894–905, 1998.
[31] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for
strongly convex stochastic composite optimization I: A generic algorithmic
framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[32] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for
strongly convex stochastic composite optimization ii: shrinking procedures and
optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
[33] P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer, New
York, NY, 2004.
[34] Y. Ho and X. Cao. Perturbation analysis and optimization of queueing net-
works. Journal of Optimization Theory and Applications, 40(4):550–582, 1983.
[35] Y. Ho, L. Shi, L. Dai, and W.-B. Gong. Optimizing discrete event dynamic
systems via the gradient surface method. Discrete Event Dynamic Systems:
Theory and Applications, 2(2):99–120, 1992.
[36] H. Kesten. Accelerated stochastic approximation. The Annals of Mathematical
Statistics, 29(1):41–59, 1958.
[37] K. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regres-
sion function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[38] N. Kleinman, J. Spall, and D. Naiman. Simulation-based optimization with
stochastic approximation using common random numbers. Management Sci-
ence, 45(11):1570–1578, 1999.
[39] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained
and Unconstrained Systems. Springer, New York, NY, 1987.
[40] H. Kushner and J. Yang. Stochastic approximation with averaging of the it-
erates: Optimal asymptotic rates of convergence for general processes. SIAM
Journal on Control and Optimization, 31:1045–1062, 1993.
[41] H. Kushner and J. Yang. Stochastic approximation with averaging and feed-
back: Rapidly convergent ‘on-line’ algorithms. IEEE Transactions on Auto-
matic Control, 40:24–34, 1995.
[42] H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms
and Applications. Springer, New York, NY, 2003.
[43] G. Liu and L. Hong. Kernel estimation of the Greeks for options with discon-
tinuous payo↵s. Operations Research, 59(1):96–108, 2011.
144
[44] J. Maryak. Some guidelines for using iterate averaging in stochastic approx-
imation. Proceedings of the 36th IEEE Conference on Decision and Control,
3:2287–2290, 1997.
[45] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approx-
imation approach to stochastic programming. SIAM Journal on Optimization,
19(4):1574–1609, 2009.
[46] A. Nemirovski and D. Yudin. Problem Complexity and Method E ciency in
Optimization. John Wiley, New York, NY, 1983.
[47] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathe-
matical Programming, 120(1):221–259, 2009.
[48] B. Polyak. New stochastic approximation type procedures. Automat. i Tele-
mekh, 51:98–105, 1990.
[49] B. Polyak and A. Juditsky. Acceleration of stochastic approximation by aver-
aging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[50] H. Robbins and S. Monro. A stochastic approximation method. The Annals of
Mathematical Statistics, 22:400–407, 1951.
[51] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro
procedure. The Annals of Statistics, 13(1):236–245, 1985.
[52] D. Ruppert. E cient estimations from a slowly convergent Robbins-Monro pro-
cess. Technical report, Cornell University Operations Research and Industrial
Engineering, Ithaca, NY, February 1988.
[53] J. Sacks. Asymptotic distribution of stochastic approximation procedures. The
Annals of Mathematical Statistics, 29(2):351–634, 1958.
[54] P. Sadegh. Constrained optimization via stochastic approximation with a si-
multaneous perturbation gradient approximation. Automatica, 33(5):889–892,
1997.
[55] J. Spall. Multivariate stochastic approximation using a simultaneous pertur-
bation gradient approximation. IEEE Transactions on Automatic Control,
37(3):332–341, 1992.
[56] J. Spall. Adaptive stochastic approximation by the simultaneous perturbation
method. IEEE Transactions on Automatic Control, 45(10):1839–1853, 2000.
[57] J. Spall. Introduction to Stochastic Search and Optimization: Estimation, Sim-
ulation, and Control. John Wiley, Hoboken, NJ, 2003.
[58] A. Tsybakov and B. Polyak. Optimal order of accuracy of search algorithms in
stochastic optimization. Problemy Peredachi Informatsii, 26(2):45–63, 1990.
145
[59] J. Venter. An extension of the Robbins-Monro procedure. The Annals of
Mathematical Statistics, 38(1):181–190, 1967.
[60] I. Wang and E. Chong. A deterministic analysis of stochastic approxima-
tion with randomized di↵erences. IEEE Transactions on Automatic Control,
43(12):1745–1749, 1998.
[61] Y. Wang, M. Fu, and S. Marcus. Sensitivity analysis for barrier options. In
M. Rossetti, R. Hill, B. Johansson, and R. Ingalls, editors, Proceedings of the
2009 Winter Simulation Conference, pages 1272 – 1282, Piscataway, New Jer-
sey, 2009. Institute of Electrical and Electronics Engineers, Inc.
[62] M. Wasan. Stochastic Approximation. Cambridge Tracts Mathematics and
Mathematical Physics. Cambridge University Press, London, 1969.
[63] L. Xiao. Dual averaging methods for regularized stochastic learning and online
optimization. Journal of Machine Learning, 11:2543–2596, 2010.
[64] G. Yin and Y. Zhu. Almost sure convergence of stochastic approximation algo-
rithms with nonadditive noise. International Journal of Control, 49(4):1361–
1376, 1989.
[65] F. Yousefian, A. Ned´ıc, and U. Shanbhag. On stochastic gradient and subgra-
dient methods with adaptive steplength sequences. Automatica, 48(1):56–67,
2011.
146