ABSTRACT
Title of dissertation: SIMULATION OPTIMIZATION: NEW
APPROACHES AND AN APPLICATION
Huashuai Qu, Doctor of Philosophy, 2014
Dissertation directed by: Professor Michael C. Fu
Department of Decision, Operations,
and Information Technologies
Simulation models are commonly used to provide analysis and prediction of
the behavior of complex stochastic systems. Simulation optimization integrates op-
timization techniques into simulation analysis to capture response surface, to choose
optimal decision variables and to perform sensitivity analysis. Objective functions
usually cannot be computed in closed form and are computationally expensive to
evaluate. Many methods are proposed by researchers for problems with continuous
and discrete variables, respectively. The dissertation is comprised of both optimiza-
tion methods and a real-world application. In particular, our goal is to develop new
methods based on direct gradient estimates and variational Bayesian techniques.
The first part of the thesis considers the setting where additional direct gra-
dient information is available and introduces different approaches for enhancing
regression models and stochastic kriging with this additional gradient information,
respectively. For regression models, we propose Direct Gradient Augmented Regres-
sion (DiGAR) models to incorporate direct gradient estimators. We characterize
the variance of the estimated parameters in DiGAR and compare them analyti-
cally with the standard regression model for some special settings. For stochastic
kriging, we propose Gradient Extrapolated Stochastic Kriging (GESK) to incorpo-
rate direct gradient estimates by extrapolating additional responses. We show that
GESK reduces mean squared error (MSE) compared to stochastic kriging under
certain conditions on step sizes. We also propose maximizing penalized likelihood
and minimizing integrated mean squared error to determine the step sizes.
The second part of the thesis focuses on the problem of learning unknown cor-
relation structures in ranking and selection (R&S) problems. We proposes a com-
putationally tractable Bayesian statistical model for learning unknown correlation
structures in fully sequential simulation selection. We derive a Bayesian procedure
that allocates simulations based on the value of information, thus anticipating fu-
ture changes to our beliefs about the correlations. The proposed approach is able
to simultaneously learn unknown mean performance values and unknown correla-
tions, whereas existing approaches in the literature assume independence or known
correlations to learn unknown mean performance values only.
Finally we consider an application in business-to-business (B2B) pricing. We
propose an approximate Bayesian statistical model for predicting the win/loss prob-
ability for a given price and an approach for recommending target prices based on
the approximate Bayesian model.
SIMULATION OPTIMIZATION: NEW METHODS AND AN
APPLICATION
by
Huashuai Qu
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2014
Advisory Committee:
Professor Michael C. Fu, Chair/Advisor
Professor Ilya O. Ryzhov
Professor Paul Smith
Professor Leonid Koralov
Professor Steve Marcus
c© Copyright by
Huashuai Qu
2014
Dedication
I dedicate this dissertation to my loving and supportive wife, Xuan Liu.
ii
Acknowledgments
First, I would like to express my most sincere gratitude to my advisors Pro-
fessor Michael C. Fu and Professor Ilya O. Ryzhov, for their vision, advice and
patience to help me proceed through my graduate studies. The completion of this
dissertation would not have been possible without their consistent guidance and
support.
Special thanks to my other dissertation committee members, Professor Paul
Smith, Professor Steve Marcus and Professor Leonid Koralov, for taking the time
to read the thesis and attend my defense. I also would like to thank Professor
Grace Yang for teaching me the course on stochastic process and helping in my job
searching process.
I would like to thank everyone within the department of Mathematics in gen-
eral. I would particularly like to thank: Marie Chau, Zhixin Lu, Ran Ji, Changhui
Tan and Zi Ding for their friendship and feedback.
My family has provided their love and untiring support during the process.
Most importantly, doing the research and writing the thesis would be impossible
without the support and understanding form my wife, Xuan. I thank my parents,
Wenbin Qu and Ping Zhao for their encouragement and their belief in the value of
learning. Nothing in a simple paragraph can express the love I have for all of you.
iii
Table of Contents
List of Tables vii
List of Figures ix
1 Introduction 1
1.1 Simulation Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Direct Gradient Enhanced Metamodels . . . . . . . . . . . . . . . . . 3
1.3 Variational Bayesian Inference in Simulation Optimization . . . . . . 6
1.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Direct Gradient Augmented Regression 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Standard Linear Regression Model . . . . . . . . . . . . . . . 14
2.2.2 Direct Gradient Augmented Regression . . . . . . . . . . . . . 15
2.2.3 Choice of Weights in α-DiGAR . . . . . . . . . . . . . . . . . 18
2.2.4 Theoretical Comparisons Between Estimators for Special Cases 19
2.2.5 Multi-dimensional Linear Models . . . . . . . . . . . . . . . . 25
2.2.6 DiGAR Model for Gradient Estimates Correlated with Per-
formance Outputs . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.1 Example: M/M/1 Queue . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Regression using a Quadratic Function . . . . . . . . . . . . . 40
2.3.3 Multi-dimensional Example: U/U/1 Queue . . . . . . . . . . . 45
2.3.4 Multi-dimensional Example: Sphere Function . . . . . . . . . 47
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3 Gradient Extrapolated Stochastic Kriging 56
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.1 Stochastic Kriging . . . . . . . . . . . . . . . . . . . . . . . . 58
3.1.2 Stochastic Kriging With Gradient Estimators . . . . . . . . . 60
3.2 Gradient Extrapolated Stochastic Kriging . . . . . . . . . . . . . . . 62
iv
3.2.1 A Two-Point Problem with Single Extrapolated Point . . . . . 68
3.2.2 A k -Point Problem . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Implementations of GESK . . . . . . . . . . . . . . . . . . . . . . . . 79
3.3.1 Penalized Maximum Likelihood Estimation . . . . . . . . . . . 80
3.3.2 Minimizing Integrated MSE . . . . . . . . . . . . . . . . . . . 81
3.3.3 Choosing Regularization Parameter . . . . . . . . . . . . . . . 82
3.3.4 Choosing Gradient Estimators . . . . . . . . . . . . . . . . . . 84
3.4 NUMERICAL EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . 85
3.4.1 Experiment on Step Sizes in GESK . . . . . . . . . . . . . . . 86
3.4.2 Comparisons among SK, SKG and GESK . . . . . . . . . . . 91
3.4.2.1 A stochastic simulation example . . . . . . . . . . . 91
3.4.2.2 A stylized example with added noise . . . . . . . . . 93
3.4.2.3 A multidimensional example . . . . . . . . . . . . . . 96
3.5 CONCLUSIONS AND FUTURE RESEARCH . . . . . . . . . . . . . 97
4 Simulation Selection with Unknown Correlation Structure 104
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2 Learning Unknown Correlation Structures . . . . . . . . . . . . . . . 108
4.2.1 Learning from complete observations . . . . . . . . . . . . . . 109
4.2.2 Learning model for scalar observations . . . . . . . . . . . . . 112
4.2.3 Predictive distribution of the next observation . . . . . . . . . 122
4.2.4 Information loss due to approximate Bayesian inference . . . . 125
4.3 The Value Of Information . . . . . . . . . . . . . . . . . . . . . . . . 133
4.3.1 Computation of the Value of Information . . . . . . . . . . . . 135
4.3.2 Monotonicity of the Value of Information . . . . . . . . . . . . 136
4.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.4.1 Wind Farm Placement Example . . . . . . . . . . . . . . . . . 139
4.4.2 A Single-Server Queue Selection Problem . . . . . . . . . . . . 146
4.4.3 3-Station Jackson Network . . . . . . . . . . . . . . . . . . . . 148
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5 Bayesian Learning on Logistic Demand Curves 151
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.2 Problem Formulation and Learning Model . . . . . . . . . . . . . . . 154
5.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 154
5.2.2 Bayesian Model For Dynamic Pricing . . . . . . . . . . . . . . 155
5.2.3 Variational Bayesian Approximation . . . . . . . . . . . . . . 157
5.2.4 Minimizing the Kullback-Leibler Divergence . . . . . . . . . . 160
5.3 Dynamic Pricing Policy . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3.1 Computation of the Bayes-Greedy Policy . . . . . . . . . . . . 162
5.3.2 Analysis of the Bayes-Greedy Policy . . . . . . . . . . . . . . 163
5.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
v
6 Conclusion 172
A.1 Analytical Results for M/M/1 and U/U/1 Queues . . . . . . . . . . . 176
A.2 Gradient Estimation for G/G/1 Queue . . . . . . . . . . . . . . . . . 177
Bibliography 180
vi
List of Tables
2.1 Parameter estimates and performance metrics for M/M/1 queue
(boxed entries indicate incorrect sign of slope) . . . . . . . . . . . . . 37
2.2 M/M/1 queue quadratic function optimal value x∗ obtained as a
function of # replications (c ≈ 27),
where boxed entries indicate a maximum rather than a minimum . . 44
2.3 Parameter estimates and performance metrics for U/U/1 queue (10
replications per design point),
based on 10 macroreplications (boxed entries indicate incorrect sign
of slope) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4 Parameter estimates and performance metrics for U/U/1 queue y(2)
w.r.t. # replications/design point . . . . . . . . . . . . . . . . . . . . 50
2.5 Parameter estimates and performance metrics for the sphere function
for two-level centered full factorial design around (1,−0.6, 0.8,−0.5);
10 replication per design point, gridsize 0.5, 1000 macroreplications. . 54
2.6 Parameter estimates and performance metrics for the sphere function
for two-level centered full factorial design around (1,−0.6, 0.8,−0.5);
10 replication per design point, gridsize 0.05, 1000 macroreplications. 55
3.1 Averaged EIMSE from 100 macroreplications for GESK models under
six designs (# of design points, # of reps) to predict expected waiting
time in the M/M/1 queue example. Three fixed step sizes with those
determined by PMLE and IMSE are compared. Standard errors are
shown in parenthesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2 Averaged EIMSE from 100 macroreplications for SK, SKG and GESK
with six different designs on estimating the expected steady-state
waiting time in an M/M/1 queue problem. The design (6, 50) means
6 design points with 50 replications at each design point. Standard
errors are shown in parentheses. . . . . . . . . . . . . . . . . . . . . . 92
3.3 Averaged EIMSE from 100 macroreplications for SK, SKG and GESK
with five different designs on y(x) = exp(−1.4x) cos(7pix/2)+. Stan-
dard errors are shown in parentheses. . . . . . . . . . . . . . . . . . . 94
4.1 Final opportunity cost and standard errors for the experiments . . . . 141
vii
4.2 Final opportunity cost and standard errors for the queue selection
and network selection problems . . . . . . . . . . . . . . . . . . . . . 145
viii
List of Figures
2.1 M/M/1 queue: true model, simulation data and several fitted mod-
els, where each data point is the sample mean of 10 independent
replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2 M/M/1 queue: using correlation matrix estimated based on 100,000
“offline” simulation replications. . . . . . . . . . . . . . . . . . . . . . 40
2.3 M/M/1 queue quadratic fit (c ≈ 27, 10 replications). . . . . . . . . . 42
2.4 U/U/1 queue box plots of four estimated slopes for y(k) based on 10
macroreplications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Boxplots of EIMSE from 100 macroreplications for the GESK models
under six designs (# of design points, # of reps) to predict expected
steady-state waiting time in the M/M/1 queue example. . . . . . . . 88
3.2 Boxplots for step sizes determined by PMLE and IMSE based on 100
macroreplications under six designs (# of design points, # of reps) to
predict the expected steady-state waiting time in the M/M/1 queue
example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.3 Boxplots of EIMSE from 100 macroreplications for SK, SKG and
GESK with six different designs on estimating the expected steady-
state waiting time in an M/M/1 queue problem, corresponding to
results in Table 3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.4 Boxplots of EIMSE from 100 macroreplications for SK, SKG and
GESK with five different designs on y(x) = exp(−1.4x) cos(7pix/2) +
, corresponding to results in Table 3.3. . . . . . . . . . . . . . . . . . 101
3.5 Boxplots for step sizes determined by PMLE and IMSE based on100
macroreplication in the stylized example with added noise. . . . . . . 102
3.6 Boxplots of EIMSE from 100 macroreplications for SK and GESK
with two different Latin-hypercube designs on a four-dimensional
function with added noise. . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1 Averaged opportunity cost as the number of samples increases, where
the dashed lines are the mean plus or minus two standard errors . . . 142
4.2 Value of information and maximum error as the number of samples
increases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
ix
4.3 Contour map of different policies after 200 measurements . . . . . . . 144
4.4 Comparing averaged opportunity cost in M/G/1 queue selection prob-
lem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.5 Numerical experiment on a three-station Jackson network . . . . . . 149
5.1 Probability of success and corresponding profit curve as a function of
the price under three different scenarios . . . . . . . . . . . . . . . . . 168
5.2 Plots of bidding prices, single-period profit and cumulative profits
over time under the low-truth scenario . . . . . . . . . . . . . . . . . 168
5.3 Plots of bidding prices, single-period profit and cumulative profits
over time under the medium-truth scenario . . . . . . . . . . . . . . . 169
5.4 Plots of bidding prices, single-period profit and cumulative profits
over time under the high-truth scenario . . . . . . . . . . . . . . . . . 169
x
Chapter 1: Introduction
1.1 Simulation Optimization
We consider optimization problems where the objective is to minimize an ex-
pected value that cannot be computed in closed form. Instead, the expectation must
be estimated via simulation. Therefore, deterministic optimization algorithms are
not applicable and simulation optimization algorithms are needed.
The general formulation of the simulation optimization problem is as follows:
min
θ∈Θ
J(θ) = min
θ∈Θ
E [L(θ, ω)] , (1.1)
where θ ∈ Θ is a p-dimensional vector of the decision variables and Θ is the feasible
region. It assumes that little knowledge (linearity or convexity) of the objective
function J(θ) is known, and moreover J(θ) is the expectation of another quantity
L(θ, ω), so it cannot be obtained directly. L(θ, ω) is the performance measure of
interest and ω represents a simulation replication, which comprises the uncertainty
of the system. The optimal decision variable is defined as
θ∗ = arg min
θ∈Θ
J(θ). (1.2)
Optimization problems are generally classified into continuous and discrete
problems depending on the types of values the decision variables θ can take. An
1
alternative classification involves the size of the feasible region: finite versus infinite
for discrete problems, bounded versus unbounded for continuous problems. Exten-
sive review on the topic of simulation optimization can be found in [1–5]. Three
different approaches are briefly discussed since they are closely related to the new
methods proposed in the thesis.
Stochastic approximation is a stochastic optimization technique analogous to
gradient methods in deterministic optimization problems. Stochastic approximation
uses the following recursion to update the solution:
xk+1 = xk + ak∇f˜(xk),
where ∇f˜(xk) is the gradient estimate and ak is the step size. Two classical meth-
ods, Robbins-Monro (RM) and Kiefer-Wolfowitz (KW), estimate the true gradient
∇f(xn) using direct gradient estimates and finite difference gradient estimates, re-
spectively. In the stochastic simulation context, direct gradient estimation methods
include perturbation analysis (PA) [6–8] and likelihood ratio/score function methods
(LR/SF) [9, 10]; see [11] for more details.
Metamodels, also known as Response Surface Methodology (RSM), provide
a functional relationship between the performance measurements and parameters
of interest. Metamodel-based methods decouple optimization from simulation, as
metamodels approximate stochastic responses through an algebraic function and
deterministic optimization procedures are applied to the metamodel. Polynomial
models are one of the commonly used metamodels since they usually have compact
forms and are easy to construct and evaluate. However, due to their lack of flexibility,
2
kriging, splines, neural networks and radial basis functions are more adequate to
capture global characteristics of a response surface.
Both stochastic approximation and metamodeling are generally designed for
solving stochastic optimization problems with continuous variables. Statistical rank-
ing and selection (R&S) addresses stochastic optimization problem with discrete
variables. Three R&S procedures are indifference-zone (IZ) methods, value of infor-
mation procedures and optimal computing budget allocation (OCBA). IZ methods
guarantee asymptotic lower bounds for the probability of correct selection (PCS), as
long as the true underlying performance values are sufficiently far apart. Bayesian
models for R&S consider the tradeoff between estimates of the performance values
and uncertainty about those estimates, using the concept of “value of information.”
OCBA is designed with the flexibility to adapt to both frequentist and Bayesian
models.
1.2 Direct Gradient Enhanced Metamodels
A metamodel is commonly used in simulation optimization to provide an aux-
iliary functional relationship between the input and output of a simulation model.
Conducting simulations to collect experimental data is necessary to build metamod-
els, where the simulated data collected are usually performance measurements for
parameters of interest. However, direct derivative information may also be avail-
able in stochastic simulation settings, where the output responses include not only
the performance measurement, but also values of the gradient of performance mea-
3
surement with respect to the parameters. Perturbation analysis (PA) [6–8] and
likelihood ratio/score function methods (LR/SF) [9, 10] are techniques that aim at
estimating the gradient the performance measure. Applications of direct gradient
estimates have been studied extensively, including queueing, inventory and finance
applications [12,13].
In general, there are two types of metamodeling strategies: iterated local meta-
models and global metamodels. An overview of local and global metamodel-based
optimization is given in [14] and [15].
Iterated local metamodels, also known as sequential response surface method-
ology, rely on low-order polynomial regression. A first-order polynomial is usually
used to fit local response surface in a small region to determine the search direction.
Following a line search, new regions for the parameters of interest are exploited re-
peatedly until the region of most interest is determined. At the final step, a quadratic
approximation is chosen and deterministic optimization methods are applied to lo-
cate the optimum. Regression techniques and experiment design are critical in this
procedure; see [16] for details.
In global metamodels, high-order polynomial regression or nonlinear regression
techniques based on existing knowledge about the response surface are appropriate;
see [17] for an example. To capture global characteristics of a response surface, more
flexibility in the models is required. Therefore, kriging, splines, neural networks and
radial basis functions are more appropriate for fitting global metamodels. Among
all these, kriging has received a lot of attention in the stochastic simulation com-
munity over the past decade [18–20]. Recently, [21] proposed stochastic kriging as
4
an extension of kriging, which explicitly takes the uncertainties in simulation noise
into consideration. Stochastic kriging is considered to be flexible and promising in
fitting global response surfaces, especially in stochastic simulation settings.
In the stochastic simulation setting, direct derivative information may be
available, i.e., the simulation output may include not only the performance mea-
surements, but also estimates of the gradients of performance measurement with re-
spect to the parameters. Techniques for estimating gradients, including perturbation
analysis (PA) and likelihood ratio/score function methods (LR/SF), are discussed
in [7], [10] and [13]; see also references therein.
The availability of additional gradient information suggests the potential for
improving the quality of metamodels. Combining gradient information has been
investigated for building metamodels under deterministic computer simulation set-
tings; see [22] and [23] for approaches to approximate response surface with artificial
neutral networks and kriging. In stochastic simulation settings, researchers have also
made attempts to incorporate gradient estimates into metamodeling approaches. [24]
proposed a gradient surface method (GSM) that uses the gradient estimates only
to iteratively fit lower-order polynomial models. [25] introduced stochastic kriging
with gradient estimators (SKG) to exploit gradient estimates in stochastic kriging,
showing that the new approach provides better prediction with smaller mean squared
error (MSE). This approach is similar to cokriging proposed in deterministic simula-
tions [26], and requires differentiability of the correlation functions, since derivatives
of random processes or random fields are used to model gradient estimates.
5
1.3 Variational Bayesian Inference in Simulation Optimization
Bayesian statistical models can be used to represent the beliefs of a decision-
maker about an uncertain environment. For example, in revenue management, a
seller formulates beliefs about customers’ willingness to pay; in energy, we may have
a belief about the suitability of a candidate location for a new wind farm.
In R&S, Bayesian models consider the tradeoff between estimates of the per-
formance values and uncertainty about those estimates. This is known as the “value
of information” (VIP) approach, going back to [27] and extended in later work. The-
oretical properties of the policy were studied in [28]. VIP-based policies considering
unknown measurement noise were developed in [29] and [30]. See [31] and [5] for
an extensive up-to-date survey of Bayesian learning techniques. [32] compares sev-
eral sequential procedures and concludes that Bayesian procedures are more efficient
when the number of alternatives increases.
In the context of dynamic pricing, Bayesian statistics have been used to model
environmental uncertainty [33, 34], and different pricing strategies have been pro-
posed to optimize the balance between revenue and information. For example, [35]
proposes a one-step look-ahead strategy for problems with logistic revenue curves,
while [36] presents an approach based on multi-armed bandit theory. A recent
stream of work, represented by [37], [38], [39], and [40], has focused on establish-
ing long-run convergence rates for policies that are mostly myopic, with occasional
periods of exploration spaced increasingly further apart. However, in the specific
context of B2B pricing, individual transactions typically have high volume (for ex-
6
ample, the seller may be negotiating the price of a year’s supply of raw materials)
and incur high costs (e.g. the time and money spent during negotiations), making
it important to obtain good performance quickly.
Most Bayesian procedures rely on conjugate prior distributions on the un-
known model parameters in order to maintain computational tractability. Conju-
gate priors model the evolution of these beliefs over time as new information is
collected, either from stochastic simulation or field experiments. However, there
are relatively few of these conjugate models, and they simply do not exist in many
problems of interest. Variational Bayesian inference can be used to create com-
putationally tractable, “nearly conjugate” models that optimally approximate the
actual belief distributions and enable the use of anticipatory information collection
and optimization policies.
1.4 Outline of the Thesis
The thesis centers around simulation optimization, including several new meth-
ods based on direct gradient estimates and optimal learning approaches. We now
outline the thesis and summarize the contents of each chapter.
Chapter 2 investigates potential modeling improvements that can be achieved
by exploiting additional gradient information in the regression setting. Using least
squares and maximum likelihood estimation, we propose various Direct Gradient
Augmented Regression (DiGAR) models that incorporate direct gradient estimators,
starting with a one-dimensional independent variable and then extending to multi-
7
dimensional input. For some special settings, we are able to characterize the variance
of the estimated parameters in DiGAR and compare them analytically with the
standard regression model. For a more typical stochastic simulation setting, we
investigate the potential effectiveness of the augmented model by comparing it with
standard regression in fitting a functional relationship for a simple queueing model,
including both a one-dimensional and a four-dimensional example. The preliminary
empirical results are quite encouraging, as they indicate how DiGAR can capture
trends that the standard model would miss. Even in queueing examples where there
is high correlation between the output and the gradient estimators, the basic DiGAR
model that does not explicitly account for these correlations performs significantly
better than the standard regression model.
Chapter 3 introduces an approach for enhancing stochastic kriging in the set-
ting where additional direct gradient information is available, e.g., provided by
techniques such as perturbation analysis or the likelihood ratio method. The new
approach, called Gradient Extrapolated Stochastic Kriging (GESK), incorporates
direct gradient estimates by extrapolating additional responses. For two simpli-
fied settings, we show that GESK reduces mean squared error (MSE) compared to
stochastic kriging under certain conditions on step sizes. Since extrapolation step
sizes are crucial to the performance of the GESK model, we propose two different
approaches to determine the step sizes: maximizing penalized likelihood and min-
imizing integrated mean squared error. Numerical experiments are conducted to
illustrate the performance of the GESK model and to compare it with alternative
approaches.
8
Chapter 4 proposes the first computationally tractable Bayesian statistical
model for learning unknown correlation structures in fully sequential simulation
selection. Correlations represent similarities or differences between various design
alternatives, and can be exploited to extract much more information from each
individual simulation. However, in most applications, the correlation structure is
unknown, thus creating the additional challenge of simultaneously learning unknown
mean performance values and unknown correlations. Based on our new statistical
model, we derive a Bayesian procedure that allocates simulations based on the value
of information, thus anticipating future changes to our beliefs about the correlations.
Our approach outperforms existing methods for known correlation structures in
numerical experiments, including one motivated by the problem of optimal wind
farm placement, where real data are used to calibrate the simulation model.
Chapter 5 proposes an approximate Bayesian statistical model for predicting
the win/loss probability for a given price in business-to-business (B2B) pricing. This
model allows us to learn parameters in logistic regression based on binary (win/loss)
data and can be quickly updated after each new win/loss observation. We also
consider an approach for recommending target prices based on the approximate
Bayesian model, thus integrating uncertainty into decision-making. We test the
statistical model and the target price recommendation strategy with synthetic data,
and observe encouraging empirical results.
9
Chapter 2: Direct Gradient Augmented Regression
2.1 Introduction
Because regression analysis arose from physically observed processes, it as-
sumes that the only data points generated are measurements of the value of the
dependent variable for each combination of values for the independent variable.
However, in the stochastic simulation setting that is our primary focus, research
over the previous four decades has led to the availability of direct derivative infor-
mation – meaning not from finite-difference approximations, e.g., for each design
point or value of the independent variable, the output responses generated include
not only a value of the dependent variable, but also a value of the gradient of the
dependent variable with respect to the independent variable(s). Settings where
such direct gradient estimators are available include queueing, inventory, and fi-
nance [6–8,10,12,13].
Clearly, in this enhanced data setting, the availability of gradient information
should lead to an improvement in the estimated functional relationship between the
dependent and independent variables. In this chapter, we investigate such improve-
ments in the regression setting. Specifically, we consider a simple modification of the
standard linear regression model to incorporate the additional measurements. We
10
call the new method Direct Gradient Augmented Regression (DiGAR, pronounced
“digger”). Using the least squares approach and maximum likelihood estimation, we
derive the resulting parameter estimates for the proposed DiGAR models and pro-
vide a theoretical analysis of the improvements that can be achieved over a standard
model. We also conduct some preliminary numerical experiments to empirically in-
vestigate the improvements that can be achieved. In particular, we consider a simple
queueing system for which there are analytical results available, which can be used
to judge the effectiveness of both standard regression and DiGAR using both a linear
and quadratic regression function.
The main motivation for our work is two-fold:
• characterizing a global response surface, a traditional application of regression;
• approximating a local response surface with a view towards local improvements
in carrying out simulation-based optimization.
Broadly speaking, our primary objective is to provide improved means for estimat-
ing functional relationships using direct higher-order information. As a first step, we
consider regression because of its wide application in many fields and because of the
central role it plays in the sequential response surface methodology (RSM) approach
to optimization, which “is a metamodel-based optimization method that builds lin-
ear or quadratic local approximations” to the response function [14]. Polynomial
regression is generally used to fit the response surface, as a series of experiments
is used along the steepest descent direction to obtain additional improvement [41];
see also [16] for a detailed discussion on the relevant statistical techniques. In Fig-
11
ure 18.1 of [14], simulation optimization strategies are classified according to the
nature of the controllable variables and the response function. For the case where
the response function is assumed differentiable, there is a dichotomy between direct
gradient methods and metamodel methods, of which RSM is the primary technique.
DiGAR serves as an example of merging these two categories. By integrating direct
gradient estimates into the regression model, the hope is that the fitted curve better
approximates the true model, which should also lead to faster convergence when
used in sequential procedures.
DiGAR provides a paradigm shift, though it is clearly only applicable in a
special setting in which direct gradient estimates are available, which is often the
case in stochastic simulation using techniques such as perturbation analysis (PA)
and the likelihood ratio/score function (LR/SF) method [6, 7]. These procedures
provide estimates in which no resimulation is required, i.e., whenever an estimate of
an output performance measure is obtained, an estimate of the derivative(s) with
respect to parameters of interest are also obtained at that particular setting of the
parameters. This is referred as direct gradient estimates in the thesis, to contrast
with indirect estimates obtained by actually changing the value of the parameters
and running additional simulations; see [13,42] for a recent survey and tutorial with
references.
In the setting where direct gradient estimates are available, another approach
for incorporating gradient estimates into regression models proposed by [24] for
a sequential simulation optimization procedure fits the gradient response surface
directly using the gradient estimates only; the function estimates themselves are
12
not used in the procedure. In contrast, our approach uses all sets of measurements
that are available in augmenting existing functional estimation procedures such as
regression. Thus, DiGAR augments standard regression rather than replacing it.
To illustrate the approach in a straightforward way, DiGAR uses least squares
and maximum likelihood estimation (for the normally distributed setting), but other
criteria such as minimum sum of absolute errors (MSAE) and minimization of the
maximum absolute errors (MMAE) could also have been used. Least squares is
among the most popular criterion in regression [43, 44], due to its simplicity, but it
has the drawback of being sensitive to outliers, as is also evident in our numerical
examples reported in Section 2.3. There has been much work in robust regression to
overcome the drawback of least square regression [45], but our preliminary numerical
results on a queueing example indicate that combining the gradient estimations
with the least squares approach can lead to noticeable qualitative correction that
mitigates sensitivity to outliers, i.e., when some of the observed dependent variable
output values exhibit large fluctuations from their means, DiGAR is able to correct
the shape of the fitted curve – the slope for a linear fit and the curvature for a
quadratic fit. Not surprisingly, the observed variance of the parameter estimates in
the DiGAR is also considerably lower than that of the standard regression model.
2.2 Models
In this section, we consider the simplest setting, beginning with a review of the
most basic standard linear regression model before introducing the DiGAR models
13
where direct gradient estimators are assumed available.
The following assumptions are used in the section for various theoretical results
and calculations.
Assumption 2.1. E[i] = 0 ∀i.
Assumption 2.2. E[′i] = 0 ∀i.
Assumption 2.3. (i) Cov(i, j) = 0, i 6= j, and (ii) Var(i) = σ2 ∀i with known
σ2.
Assumption 2.4. (i) Cov(′i, ′j) = 0 ∀ i 6= j, Cov(i, ′j) = 0 ∀ i, j, and (ii)
Var(′i) = σ2g ∀ i with known σ2g .
Assumption 2.5. i ∼ N(0, σ2), ′i ∼ N(0, σ2g) ∀i with known σ2 and σ2g .
2.2.1 Standard Linear Regression Model
Consider the usual regression setting with independent variable x and de-
pendent variable y, where n > 1 data points (x1, y1), ..., (xn, yn) are given. Both
independent and dependent variables take values from a continuous domain. The
standard linear regression model is the following:
yi = β0 + β1xi + i, i = 1, 2, · · · , n, (2.1)
where assumptions on the “noise” terms {i} determine properties of resulting esti-
mators.
14
The least-squares approach minimizes the sum of squared residuals given by
n∑
i=1
(yi − β0 − β1xi)2, leading to the following estimators for β0 and β1:
βˆ0 = y¯ − βˆ1x¯, (2.2)
βˆ1 =
n∑
i=1
(xi − x¯)(yi − y¯)
n∑
i=1
(xi − x¯)2
=
1
n
n∑
i=1
(xi − x¯)(yi − y¯)
1
n
n∑
i=1
(xi − x¯)2
, (2.3)
where x¯ and y¯ are the sample means of {xi} and {yi}, respectively. These estimators
are unbiased assuming the noise terms have zero mean.
In the traditional regression model, it is well known that the maximum like-
lihood estimators (MLEs) for normally distributed residuals coincide with the OLS
estimators given by (2.2) and (2.3).
2.2.2 Direct Gradient Augmented Regression
Now consider the enhanced setting where the n data points are (x1, y1, g1), ..., (xn, yn, gn),
with gi representing a direct estimate of the gradient of yi at xi. The basic Direct
Gradient Augmented Regression (DiGAR) model is the following:
yi = β0 + β1xi + i, (2.4)
gi = β1 + ′i, (2.5)
where gi, i = 1, 2, · · · , n are the gradient estimates with residuals {′i}.
Again using the OLS approach, the function to be minimized is the sum of
the squared deviations in both yi and gi,
L =
n∑
i=1
(yi − β0 − β1xi)2 +
n∑
i=1
(gi − β1)2. (2.6)
15
The resulting estimators that minimize (2.6) are
βˆ0 = y¯ − βˆ1x¯, βˆ1 =
n∑
i=1
xiyi − nx¯y¯ + ng¯
n∑
i=1
x2i − nx¯2 + n
=
1
n
n∑
i=1
(xi − x¯)(yi − y¯) + g¯
1
n
n∑
i=1
(xi − x¯)2 + 1
, (2.7)
where x¯, y¯ and g¯ are the corresponding sample means of xi, yi and gi. Note that
in the basic DiGAR model, the form of βˆ0 remains unchanged, whereas βˆ1 has the
additional terms g¯ and 1 in the numerator and denominator, respectively, reflecting
the added gradient information.
An alternative approach is to derive estimators using yi and gi separately and
combine them. However, since we would like to combine two sources of data as
opposed to two estimators, we choose to use the objective function in (2.6). Later
in the chapter, we will build the connections between these two approaches.
Under the Assumptions 2.3 - 2.5, the parameter estimators given by (2.7) are
unbiased, since
E[βˆ1] =
n∑
i=1
xiE[yi]− nx¯E[y¯] + nE[g¯]
n∑
i=1
x2i − nx¯2 + n
= β1,
E[βˆ0] = E[y¯]− E[βˆ1]x¯ = β0,
16
and the variances of βˆ1 is calculated as follows:
Var(βˆ1) =
n∑
i=1
x2i Var(yi)− n2x¯2 Var(y¯) + n2 Var(g¯)
(
n∑
i=1
x2i − nx¯2 + n
)2
=
n∑
i=1
x2iσ2 − n2x¯2σ2/n+ n2σ2g/n
(
n∑
i=1
x2i − nx¯2 + n
)2
=
(
n∑
i=1
x2i − nx¯2
)
σ2 + nσ2g
(
n∑
i=1
x2i − nx¯2 + n
)2
The DiGAR estimator βˆ1 can also be viewed as a linear combination of the
standard linear regression estimator in (2.3) and another unbiased estimator g¯ with
weights proportional n and
n∑
i=1
(xi − x¯)2, respectively. We are particularly inter-
ested in investigating the variance of the estimators, and a known results suggest
that the optimal weights for combining two unbiased estimators should be inversely
proportional to their variances.
A more general form of the least-squares function to be minimized allows
relative weighting (convex combination) of the two contributions rather than the
equal weighting used in (2.6), i.e.,
L = α
n∑
i=1
(yi − β0 − β1xi)2 + (1− α)
n∑
i=1
(gi − β1)2, (2.8)
where α ∈ [0, 1]. α = 1 corresponds to standard regression, α = 0.5 corresponds
to the OLS DiGAR model introduced earlier, and α = 0 uses only the gradient
17
information. Differentiating with respect to β0 and β1,
∂L
∂β0
= −2α
n∑
i=1
(yi − β0 − β1xi),
∂L
∂β1
= −2α
n∑
i=1
(yi − β0 − β1xi)xi − 2(1− α)
n∑
i=1
(gi − β1).
(2.9)
Setting both equal to 0 and solving yields the following estimators:
βˆ0 = y¯ − βˆ1x¯, βˆ1 =
1
n
n∑
i=1
(xi − x¯)(yi − y¯) + 1−αα g¯
1
n
n∑
i=1
(xi − x¯)2 + 1−αα
. (2.10)
If the estimators for the gradients are also unbiased, i.e., E[′i] = 0, then the
α−DiGAR estimators are also unbiased.
Remark 2.1. The weight α in (2.8) can be viewed as cost to the sum of squared
errors. We will look into the weighted objective function in (2.8) and try to build
connections between the DiGAR estimators and the estimator obtained by combining
two unbiased estimators. In the following sections, we will show that the optimal
α provides the same estimator as the optimal estimator obtained by combining two
unbiased estimators.
Proposition 2.2. Under Assumptions 2.1 & 2.2, the α−DiGAR estimators given
by (2.10) are unbiased.
2.2.3 Choice of Weights in α-DiGAR
Instead of thinking α as a fixed cost, we consider the weight α can be chosen
in practice. So how should one choose the weight parameter α? One intuitive
choice is to choose the relative weights inversely proportional to the corresponding
18
variances, but the variances are unknown and possibly nonhomogeneous across the
independent variable range.
Another option is to minimize the prediction variance. The predicted value at
xi can be written as
yˆi = βˆ0 + βˆ1xi = y¯ − βˆ1(x¯− xi),
so that minimizing the prediction variance is equivalent to minimizing the variance
of βˆ1. However, again an analytical expression for this variance is unavailable in
most settings.
In the next section we revisit the question of weight selection in some spe-
cial settings of homogeneous variances where “optimal” weights can be determined
explicitly.
2.2.4 Theoretical Comparisons Between Estimators for Special Cases
Both traditional regression and least-squares DiGAR models can be applied
in a very general setting, but it is difficult to obtain any analytical results without
further assumptions.
For the traditional regression model, if the noise terms have zero mean and in
addition are uncorrelated with common variance, the theoretical variances can also
be computed explicitly.
Proposition 2.3. Under Assumptions 2.1 and 2.3, the variances of the estimators
19
(2.2) and (2.3) are given by
Var(βˆ0) = σ2




1
n +
x¯2
n∑
i=1
(xi − x¯)2



 , (2.11)
Var(βˆ1) =
σ2
n∑
i=1
(xi − x¯)2
. (2.12)
Similarly, in the DiGAR model, if all of the residuals are uncorrelated with
common variance, it is not difficult to compute the theoretical variances of the
α-DiGAR estimators.
Proposition 2.4. Under Assumptions 2.1-2.4, the variances of the estimators (2.10)
are given by
Var(βˆ0) =
σ2
n





1 +
x¯2
n
n∑
i=1
(xi − x¯)2
(
1
n
n∑
i=1
(xi − x¯)2 + 1−αα
)2





+
σ2g
n





x¯2(1−αα )2
(
1
n
n∑
i=1
(xi − x¯)2 + 1−αα
)2





,(2.13)
Var(βˆ1) =
σ2
n





1
n
n∑
i=1
(xi − x¯)2 + (1−αα )2
σ2g
σ2
(
1
n
n∑
i=1
(xi − x¯)2 + 1−αα
)2





. (2.14)
Revisiting Weight Selection in α-DiGAR
Under the assumption of homogenous variances for both the output and its
gradient, per Assumptions 2.3(ii) and 2.4(ii), selection of weights proportional to
variance implies
α
1− α =
σ2g
σ2 ,
which leads to
α =
σ2g
σ2g + σ2
. (2.15)
20
For the “optimal” choice of weights, which in the previous subsection was said to be
equivalent to minimizing the variance of βˆ1, differentiating (2.14) with respect to α
and setting equal to 0 gives the same proportional weights as (2.15).
For the uncorrelated setting, Proposition 2.4 can be used to find necessary
and sufficient conditions for the variance of the slope estimator to be lower for the
α-DiGAR estimators than for the standard slope estimator given by (2.3).
Proposition 2.5. Under Assumptions 2.1-2.4,
σ2g
σ2 ≤
2α
1− α +
1
1
n
n∑
i=1
(xi − x¯)2
⇐⇒ Var(βˆDiGAR1 ) ≤ Var(βˆstandard1 ),
where βDiGAR1 and βstandard1 denote the α-DiGAR slope estimator given by (2.10)
and the standard slope estimator given by (2.3), respectively.
Proof. Comparing the variances given by (2.12) and (2.14), the inequality holds if
and only if
σ2
n
1
n
n∑
i=1
(xi − x¯)2 + (1−αα )2
σ2g
σ2
(
1
n
n∑
i=1
(xi − x¯)2 + 1−αα
)2 ≤
σ2
n
1
1
n
n∑
i=1
(xi − x¯)2
⇐⇒
(σg
σ
)2
≤
( α
1− α
)2





(
1
n
n∑
i=1
(xi − x¯)2 + 1−αα
)2
1
n
n∑
i=1
(xi − x¯)2
−
1
n
n∑
i=1
(xi − x¯)2





= 2
( α
1− α
)
+ 1
1
n
n∑
i=1
(xi − x¯)2
Intuitively, Proposition 2.5 indicates that if σg is large relative to σ, then it
makes sense to decrease the relative weight for the gradient; also when the simulation
21
budget is limited and thus only a relatively small number of design points are possible
and experiment conditions make them close to each other (more likely in sequential
RSM applications), the benefit of DiGAR should be more noticeable. The following
corollary of Proposition 2.5 provides more explicit bounds for the weighting α. Two
special cases include the equal weighting and proportional weighting scenarios, again
in the uncorrelated setting.
Corollary 2.6. Under Assumptions 2.1-2.4, Var(βˆ1) for the α-DiGAR OLS slope
estimator given by (2.10) is strictly lower than that for the standard slope estimator
given by (2.3) if
α ≥
σ2g
σ2g + 2σ2
,
which includes the following special cases:
(i) α = σ
2
g
σ2g+σ2
.
(ii) α = 0.5 if σ2g ≤ 2σ2.
Proof. Main result follows directly from
n∑
i=1
(xi−x¯)2 > 0, with the two cases satisfying
the requisite condition, or directly from Proposition 2.5, noting that (i) α1−α =
σ2g
σ2 ,
(ii) α1−α = 1.
Maximum Likelihood Estimation DiGAR
If we further assume that the residuals are normally distributed and uncorre-
lated (independent) with known variances, we can derive the MLEs for β0 and β1,
which will not coincide with the α-DiGAR estimators in general.
22
Note that Assumption 2.5 subsumes Assumptions 2.1 and 2.2, and includes
the known variance condition.
Proposition 2.7. Under Assumptions 2.3, 2.4, and 2.5, the MLEs for the DiGAR
model specified by (2.4) and (2.5) are given by
β˜0 = y¯ − β˜1x¯, β˜1 =
1
n
n∑
i=1
(xi − x¯)(yi − y¯) + σ
2
σ2g
g¯
1
n
n∑
i=1
(xi − x¯)2 + σ2σ2g
. (2.16)
Proof. Under Assumptions 2.3 and 2.4, yi and gi are independent due to the resid-
uals being uncorrelated. Under Assumption 2.5, both σ2 and σ2g are known. The
likelihood function is given by
L(β0, β1) = (2pi)−n(σσg)−n exp
{
−
1
2σ2
n∑
i=1
(yi − β0 − β1xi)2 −
1
2σ2g
n∑
i=1
(gi − β1)2
}
.
Differentiating the log-likelihood function with respect to the parameters,
∂ log(L)
∂β1
= 1σ2
n∑
i=1
(yi − β0 − β1xi)xi +
1
σ2g
n∑
i=1
(gi − β1),
∂ log(L)
∂β0
= 1σ2
n∑
i=1
(yi − β0 − β1xi),
which after setting equal to 0, and some algebra, leads to (2.16).
Remark 2.8. In the actual implementation, σ2 and σ2g are usually unknown. The
MLE’s of β0, β1, σ2 and σ2g are the solutions of a nonlinear system of equations
and are not available in closed form. They would need to be solved using Newton-
Raphson or some other iterative numerical methods. Approximate MLE’s for β0 and
β1 can be obtained by replacing σ2 and σ2g by the unbiased and consistent estimators
given as follows:
23
σˆ2 = 1n− 2
n∑
i=1
(yi − yˆi)2, σˆ2g =
1
n− 1
n∑
i=1
(gi − g¯)2,
where yˆi = βˆ0 + βˆ1xi with the estimators βˆ0 and βˆ1 provided in (2.7).
Proposition 2.9. Under Assumptions 2.3, 2.4, and 2.5, the variance of the MLE
for β1 given in (2.16) is
Var(β˜1) =
σ2/n
1
n
n∑
i=1
(xi − x¯)2 + σ2σ2g
. (2.17)
As expected, as σg →∞, where the additional gradient estimators provide no
additional useful information, the MLE DiGAR slope estimator variance given by
(2.17) approaches the traditional slope estimator variance given by (2.12), whereas
as σg → 0, Var(β˜1)→ 0 for MLE DiGAR.
Comparing the variances given by (2.17) and (2.12), we have the following
result.
Proposition 2.10. Under Assumptions 2.3, 2.4 and 2.5, and σ/σg > 0, Var(β˜1) for
MLE DiGAR given in (2.16) is strictly lower than that for the standard estimator
given by (2.3).
Thus, in the uncorrelated common variance setting, the MLE DiGAR slope
estimator is guaranteed to have lower variance than the standard regression model
slope estimator. However, it should also be noted that the MLEs given by (2.16)
contain the variances σ2 and σ2g , which are unknown in practice and thus need to
be estimated from the data, whereas either of the two sets of DiGAR least-squares
estimators given by (2.7) or (2.10) do not contain these terms. However, choosing
24
the “optimal” weights in the α−DiGAR estimator (2.10) as specified by (2.15) leads
to the same requirement, since α depends on both σ and σg. Furthermore, this choice
actually leads back to the MLEs, as summarized in the following.
Proposition 2.11. The α-DiGAR estimators given by (2.10) using the weight α
given by (2.15) coincide with the DiGAR MLEs given by (2.16).
Remark 2.12. It is mentioned earlier in the chapter that combing the two unbiased
estimators, namely βˆ1 in (2.3) and g¯, is an alternative approach. The optimal
weighted estimator obtained in this fashion is
1
n∑
i=1
(xi−x¯)2
σ2 + nσ2g




n∑
i=1
(xi − x¯)2
σ2
n∑
i=1
(xi − x¯)(yi − y¯)
n∑
i=1
(xi − x¯)2
+ nσ2g
g¯



 .
Simplifying this leads to the same estimator α-DiGAR estimators given by (2.10)
using the weight α given by (2.15) and the DiGAR MLEs given by (2.16). This
suggests that when we combing two sources of data using the loss function in (2.8),
we are able to obtain the same estimator as combining two unbiased estimators,
namely βˆ1 in (2.3) and g¯.
2.2.5 Multi-dimensional Linear Models
Now we consider the d-dimensional multiple regression linear model setting.
For standard regression,
yi = β0 +
d∑
j=1
βjxij + i, i = 1, 2, · · · , n. (2.18)
25
The least-squares approach minimizes the sum of squared residuals given by
n∑
i=1
(yi−
β0 −
∑d
j=1 βjxij)2, leading to the following estimators for β0 and βj:
βˆ0 = y¯ −
d∑
j=1
βˆjx¯j, (2.19)
βˆj =
n∑
i=1
(xij − x¯j)(yi − y¯)−
∑
k 6=j βˆk
n∑
i=1
(xik − x¯k)(xij − x¯j)
n∑
i=1
(xij − x¯j)2
, (2.20)
where x¯j is the sample mean of {xij}.
The analogous multi-dimensional DiGAR model is given by
yi = β0 +
d∑
j=1
βjxij + i, (2.21)
gij = βj + ′ij, (2.22)
where gij, j = 1, 2, · · · , k, i = 1, 2, · · · , n are the gradient estimates with residuals
{′ij}. The corresponding least-squares function to be minimized given by
α0
n∑
i=1
(yi − β0 −
d∑
j=1
βjxij)2 +
d∑
j=1
αj
n∑
i=1
(gij − βj)2,
where
∑d
i=0 αi = 1, αi ≥ 0. Taking the partial derivatives and setting equal to zero
yields again (2.19) and
βˆj =
n∑
i=1
(xij − x¯j)(yi − y¯)−
∑
k 6=j βk
n∑
i=1
(xik − x¯k)(xij − x¯j) + nαjα0 g¯j
n∑
i=1
(xij − x¯j)2 + nαjα0
, j > 0,
(2.23)
which reduces to the previous expression with α0 = α, α1 = 1 − α, when there is
just a single input. We will also refer to these as the α-DiGAR estimators.
26
2.2.6 DiGAR Model for Gradient Estimates Correlated with Perfor-
mance Outputs
The least-squares estimators for the basic α-DiGAR model, as given by (2.10)
and (2.23), were derived without consideration of the correlation structure, although
explicit computation of the theoretical variance as given by Proposition 2.4 required
further assumptions. Here we consider the generalized least squares (GLS) setting
where the outputs yi and gradient estimates gi are correlated as follows:
y = Xβ + , (2.24)
where for the univariate case
y =
















y1
g1
...
yn
gn
















, X =
















1 x1
0 1
... ...
1 xn
0 1
















, β =




β0
β1



 , (2.25)
and  would contain the noise terms for the corresponding output and gradient
estimates. The OLS estimator is given by
βˆ = (X′X)−1X′y, (2.26)
which matches (2.23) in the uncorrelated case with equal weights (αi = 1/n).
The weighted least-squares (WLS) solution is given by
βˆW = (X′WX)−1X′Wy, (2.27)
27
where W is a diagonal matrix such that diag(W) = [α1, α2, . . . , α2n].
Now we explicitly consider the setting where the covariance matrix of  is
given by V, which is non-diagonal due to the correlations between yi and gi. This
can also include the case of common random numbers, which would induce corre-
lations within the {yi} and {gi}. Again assume the residuals have zero mean, i.e.,
E[] = 0 (Assumptions 2.1 and 2.2), so E[y] = Xβ. In the general correlated,
non-homogeneous setting of (2.24)/(2.25), (2.26) is not the best (i.e., variance min-
imizing) linear unbiased estimator (BLUE) for β, but the Gauss-Markov Theorem
for the uncorrelated homogeneous variance setting can be used to prove that the
BLUE for the general setting is given by the following GLS estimator:
βˆG = (X′V−1X)−1X′V−1y, (2.28)
and the covariance matrix for βˆG is Cov(βˆG) = (X′V−1X)−1.
Proposition 2.13. Under Assumptions 2.1 and 2.2, the BLUE for the model given
by (2.24) and (2.25) is the GLS estimator given by (2.28). Furthermore, if the
residuals are assumed to be normally distributed, the MLE of β is the same as the
GLS estimator.
Proof. The proof uses the Gauss-Markov Theorem.
Theorem 2.14 (Gauss-Markov Theorem). For the regression model (2.24) with
E[] = 0 and Cov() = σ2I, the OLS estimators have minimum variance among all
linear unbiased estimators.
Since V is positive definite, there exists an 2n×2n nonsingular matrix P such
28
that V = PP′. Multiplying y = xβ +  by P−1, we obtain
P−1y = P−1Xβ + P−1,
where E[P−1] = P−1E[] = 0 and
Cov(P−1) = P−1 Cov((P−1)′ = P−1σ2V(P−1)′ = σ2P−1PP′(P′)−1 = σ2I.
Therefore, the assumptions in the Gauss-Markov Theorem are satisfied, and the
least-squares estimator
βˆ =
[
(P−1X)′(P−1X)
]−1 (P−1X)′P−1y,
is the best linear unbiased estimator (BLUE). The estimator βˆ can be written as
βˆ =
[
(P−1X)′(P−1X)
]−1 (P−1X)′P−1y
= [X′(P′)−1P−1X]−1X′(P′)−1P−1y
= [X′(PP′)−1X]−1X′(PP′)−1y
= (X′V−1X)−1X′V−1y.
We now analyze the variance of the slope portion of the GLS estimator by
considering a very special case.
Assumption 2.6. V is a positive definite matrix such that yi is correlated with gj
only when i = j, with correlation ρ.
Proposition 2.15. Under Assumptions 2.1, 2.2, 2.3 (ii), 2.4 (ii), and 2.6, for the
model given by (2.24)/(2.25), the variance of βˆ1 in (2.28) is given by
Var(βˆ1) =
σ2/n
1
1−ρ2
1
n
n∑
i=1
(xi − x¯)2 + σ2σ2g
, (2.29)
29
which vanishes if either σ or σg goes to zero.
Proof. The lower right element of the covariance matrix (X′V−1X)−1 gives Var(βˆ1),
so writing the covariance matrix of the residuals as
V =
















σ2 ρσσg
ρσσg σ2g
. . .
. . . . . . . . .
. . . σ2 ρσσg
ρσσg σ2g
















,
If the residuals are normally distributed, i.e.,  ∼ N (0,V), then y ∼ N (Xβ,V),
and the likelihood function is
L(β) = 1(2pi)n|V|1/2 exp
{
−(y −Xβ)′(V)−1(y −Xβ)/2
}
,
so the log-likelihood function is
lnL(β) = −n ln(2pi)− 12 ln(|V|)−
1
2(y −Xβ)
′V−1(y −Xβ).
Differentiating with respect to β,
∂ lnL
∂β = −
(
X′VXβ −X′V−1y
)
,
setting equal to zero and solving for β gives the estimator
βˆ = (X′V−1y)−1X′V−1y,
which is the same as the best linear unbiased estimator (BLUE).
30
In the special correlated setting of Proposition 2.15, it is interesting to note
that the variance of the GLS DiGAR slope estimator will go to zero if either of
the variances of the underlying noises vanish, whereas for the basic OLS DiGAR
estimator, both variances need to be zero in the uncorrelated setting (see Proposition
2.4). Also, by comparing (2.29) with the variance of the standard slope estimator
given by (2.12), the following is easily established.
Proposition 2.16. Under Assumptions 2.1, 2.2, 2.3 (ii), 2.4 (ii), and 2.6, for the
model given by (2.24)/(2.25), if 0 < σ2 <∞ and 0 < σ2g <∞, then for −1 < ρ < 1,
the DiGAR slope estimator given by (2.28) has lower variance than the standard
slope estimator.
Note that the DiGAR estimators given by (2.28) for the uncorrelated case will
differ from the DiGAR estimators derived earlier using OLS. Generally, ρ is unknown
and must be estimated based on the data. As noted in [16], estimating the corre-
lation matrix changes the linear estimator (2.28) into a nonlinear estimator. Thus,
although the theoretical analysis indicates potential for variance reduction from
estimating the correlation, the extra computational budget spent on estimating ρ
(and the resulting nonlinearities) must be traded off with any potential performance
gains, an issue that is also investigated empirically in the numerical examples in the
next section.
31
2.3 Numerical Examples
Queueing systems are one of the main application areas for stochastic simu-
lation, and the earliest application of direct gradient estimation in simulation was
queueing; thus, we chose a simple queueing model as the setting on which to empir-
ically investigate the performance of the DiGAR estimators, comparing them with
standard regression in a setting where direct gradient estimates are available but
where one or more of the assumptions of the theoretical results are generally not sat-
isfied. In particular, although the least-squares α-DiGAR models are applicable in
the general correlated setting, it is difficult to compute the theoretical variances ex-
plicitly for such queueing settings, so simulation experiments are used to empirically
compare their performance with standard regression.
Perhaps the simplest queueing system is the first-come, first-served (FCFS),
single-server queue considered here. Specifically, we consider the mean total time
in system for a customer (queue or delay time plus service time) as a function of
the parameters of the (common) service time or interarrival time distribution of
customers. Two settings are considered: a univariate setting (single input) and
a multivariate setting. In both settings, four outputs, i.e., four different y’s, are
considered. In queueing theory, the interest is frequently steady-state performance,
but since the focus here is on improving regression models rather than on queueing
theory, per se, we consider the 2nd, 3rd, 4th, and 5th customers, for which we
can easily obtain analytical results that can be used to compare the quality of
the standard linear regression and DiGAR models without having to worry about
32
whether the simulation has reached steady state. Letting Tk denote the system time
of the kth customer, the outputs of interest are
y(k) = E[Tk], k = 2, 3, 4, 5.
The gradient estimators for all of the examples are provided in the Electronic Com-
panion Section A.2. The system time performance and its gradient estimate are
clearly highly correlated, and the variance of both the performance and its gradient
are not homogeneous across the range of independent values considered (although
since the queue will be far from the heavy traffic intensity regime, the violation
may not be particularly severe). As a result, Assumptions 2.3 and 2.4 are violated
and the conclusions of Propositions 2.3 through 2.10 cannot be applied, although
Proposition 2.2 holds, as the gradient estimators are unbiased. However, it should
be noted that the α-DiGAR estimators are not derived under Assumptions 2.3 and
2.4, which are only sufficient conditions to establish the theoretical results.
2.3.1 Example: M/M/1 Queue
The univariate example takes the arrival process to be Poisson with fixed rate
and service times to be i.i.d. exponentially distributed, i.e., an M/M/1 queue,
where the independent variable x is the mean service time. It is straightforward
to compute the true theoretical dependence of the expected system time y on the
mean service time x (cf. [46]), which can then be used to judge the quality of the
fitted curves obtained by the various regression models.
For all experiments in the univariate setting of the M/M/1 queue, the arrival
33
rate is fixed at 0.2, and the number of design points is n = 10, where output y
(mean system time) is obtained at the following values of the independent variable
x (mean service time):
x1 = 3.6, x2 = 3.7, x3 = 3.8, x4 = 3.9, x5 = 4.0, x6 = 4.1, x7 = 4.2, x8 = 4.3, x9 = 4.4, x10 = 4.5.
At each of these independent variable values, 10 independent simulation replications
(runs) were generated to obtain the system times (and gradients for DiGAR) of the
customers; nine additional macroreplications were carried out only for the purposes
of estimating the slope variances as described below. The relatively small number
of replications highlights the challenges that the standard regression approach faces,
since it leads to output estimates with relatively high variances.
Two metrics were used to evaluate the quality of fit for each data set:
(i) the sample mean-squared error from the true model over the independent
variable range of interest defined by
L2 =
xmax∫
xmin
(ŷ(x)− y(x))2dx,
where ŷ(x) and y(x) denotes the fitted and true models, respectively; and
(ii) the theoretical (from the uncorrelated model) variance of the slope estimator
denoted by V̂ar(βˆ1) and estimated using (2.12) and (2.14) for the standard and
DiGAR models, respectively, with the pooled (over all input values) sample
variance estimators s2 and s2g used in place of the respective output variances.
For the M/M/1 queue example, xmin = 3.6 and xmax = 4.5. To provide
a reference for the fitted linear models, a “true” linear model was also computed
34
based on fitting a linear model that minimizes the L2 error from the true M/M/1
queue model given in Electronic Companion Section A.1 (since the true function y(x)
is not linear). The V̂ar(βˆ1) metric estimated is based on a formula that assumes
conditions not satisfied in these queueing examples. It is calculated to see if the
relative behavior resembles the actual variances of the slope estimators, estimated
using N = 10 macroreplications via
s2(βˆ1) =
1
N − 1
N∑
j=1
(βˆji −
¯ˆβi)2, where ¯ˆβ1 =
1
N
N∑
j=1
βˆj1.
The first set of experiments performed used the sample mean for the output
variable yi (and gi for DiGAR) at each of the 10 values of the independent vari-
able xi to carry out the fitting using standard regression and the various DIGAR
models: OLS DiGAR with parameters given by (2.7) – denoted simply by DiGAR
throughout; α-DiGAR models with parameters given by (2.10); MLE DiGAR with
parameters given by (2.16) – denoted by DiGARn throughout; and the correlated
DIGAR models with parameters given by (2.28) – denoted by DiGAR* and Di-
GAR** for the cases where the correlation matrix is estimated by using the given
number of replications (initially 10) and 100,000 (off-line) replications, respectively.
The results are given in Table 2.1, which provide the values of the estimated
parameters, along with the calculation of the various metrics. Overall, the DiGAR
models all outperform the standard model on all metrics, especially for the variance
of the slope estimators, with a superiority of one to two orders of magnitude. The
numbers in the last two columns are also encouraging, in that they indicate that the
estimated theoretical variances for all of the estimators are in the same ballpark as
35
the sample variances, and more importantly the ratio of improvement is reasonably
consistent, so that this metric appears to provide a pretty accurate estimate of
relative performance between the various models. The performance of the DiGAR
models is also fairly insensitive to the choice of the weight parameter, with the OLS
DiGAR model performing adequately, although the “optimal” choice of weights
(approximately 0.072, 0.087, 0.101, 0.114 for y(2), y(3), y(4), y(5), respectively) does
show improvement and is usually the best performance overall, indistinguishable
from DIGAR**, to be discussed shortly.
A clearer visual comparison is provided by the graphs in Figure 2.1, which plot
the simulation data, true model, and three fitted models, where the circles are the
data points (sample mean of 10 simulated values); the solid line is the true model;
the dashed line is the fitted model from standard regression; the line with dots is
the fitted OLS DiGAR model; and the dotted line is correlated DiGAR model. The
graphs indicate that both the standard and DiGAR models fit the data reasonably
well for y(2) and y(3), but there are dramatic differences for y(4), and y(5), in which
the DiGAR models correctly capture the orientation of the curve, whereas the slope
of the standard linear regression model has the incorrect sign. For all four cases,
the normal estimators are indistinguishable from the basic OLS DiGAR model in
the graphs, and hence were omitted.
Surprisingly, the correlated model clearly performs worse than the OLS model,
which suggests that the misspecification caused by estimating the correlation based
on 10 points outweighs the potential gains of a better model. The purpose of Di-
GAR** – which used 100,000 separate replications to estimate the covariance matrix
36
Table 2.1: Parameter estimates and performance metrics for M/M/1 queue
(boxed entries indicate incorrect sign of slope)
i y(i) model βˆ1 βˆ0 L2 V̂ar(βˆ1) s
2(βˆ1)
2
standard 1.68 -1.66 0.48 3.75 5.68
DiGAR 1.56 -1.19 0.48 0.057 0.047
DiGAR (α = 0.25) 1.58 -1.26 0.48 0.046 0.029
DiGAR (α = 0.75) 1.56 -1.17 0.48 0.173 0.217
DiGAR (α ∝ 1/σ2) 1.55 -1.14 0.45 0.045 0.030
DiGARn 1.55 -1.14 0.48 0.057 0.029
DiGAR* 1.15 -0.11 1.62 1.99 0.036
DiGAR** 1.53 -1.10 0.52 0.046 0.030
“true” linear 1.69 -1.00 4E-6
3
standard 0.68 4.04 0.27 8.82 7.17
DiGAR 2.04 -1.50 0.11 0.134 0.066
DiGAR (α = 0.25) 1.86 -0.77 0.12 0.092 0.032
DiGAR (α = 0.75) 2.12 -1.80 0.11 0.369 0.300
DiGARn 2.14 -1.91 0.11 0.123 0.029
DiGAR (α ∝ 1/σ2) 2.14 -1.81 0.051 0.090 0.029
DiGAR* 1.53 -0.55 2.00 4.35 0.098
DiGAR** 2.14 -1.85 0.088 0.091 0.028
“true” linear 2.30 -2.18 2E-5
4
standard -2.97 20.2 2.06 3.10 5.35
DiGAR 2.33 -1.26 0.019 0.237 0.074
DiGAR (α = 0.25) 1.63 1.58 0.093 0.046 0.041
DiGAR (α = 0.75) 2.61 -2.41 0.006 0.177 0.265
DiGAR (α ∝ 1/σ2) 2.71 -2.53 0.10 0.046 0.035
DiGARn 2.70 -2.77 0.004 0.081 0.038
DiGAR* 1.98 -1.56 2.52 0.122 0.110
DiGAR** 2.73 -2.78 0.027 0.046 0.036
“true” linear 2.85 -3.43 7E-5
5
standard -1.76 15.2 2.29 1.30 9.18
DiGAR 2.89 -3.63 0.71 0.191 0.114
DiGAR (α = 0.25) 2.27 -1.14 0.77 0.053 0.068
DiGAR (α = 0.75) 3.14 -4.64 0.70 0.094 0.419
DiGAR (α ∝ 1/σ2) 3.22 -4.73 0.35 0.054 0.064
DiGARn 3.19 -4.88 0.70 0.094 0.072
DiGAR* 2.45 -2.98 3.66 0.228 0.096
DiGAR** 3.24 -4.98 0.55 0.055 0.064
“true” linear 3.37 -4.70 2E-4
37
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(2)
data points
standard model
DiGAR model
DiGAR* model
DiGARn model
true model
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(3)
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(4)
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(5)
Figure 2.1: M/M/1 queue: true model, simulation data and several fitted models,
where each data point is the sample mean of 10 independent replications.
38
to serve as a proxy for the true covariance matrix – was to investigate the conjecture
that the error is in fact due to the poor estimation of the correlation matrix. Tables
2.1 indicate that DiGAR** does in fact outperform OLS DiGAR, though the differ-
ence is not substantial. Interestingly, as noted earlier, the performance of DIGAR**
is nearly identical to that of α-DiGAR with the optimal choice of α. However, it
is clear that at least for this example, it is preferable to use OLS DiGAR rather
than DiGAR* if only a small number of replications are available to estimate the
covariance matrix, which is typically the situation in most of the simulation settings.
Since the variances are much smaller in the DiGAR models, it would seem
that a reasonable fit could be obtained using DiGAR with fewer replications at
each design point. Thus, a followup experiment used only a single output point to
measure yi (and gi for DiGAR) at each design point. Since there were 10 simulation
runs, this experiment can be performed 10 independent times each. As expected,
the variance of the slope estimator is substantially lower for DiGAR, and the ratios
of the estimated variances range from 11 to 111, with an overall mean over 60, which
represents a substantial improvement.
We also have an example showing the incorrect slope sign estimated from the
standard regression model happens frequently if only a small number are conducted
at each design point, specifically considering the fitted models for just y(2) in each
of the 10 individual runs of data set 1. In half (5 out of 10) of the sample, the
standard model gives the incorrect sign for the slope. The better match in the
slope of the curves is critical in simulation-based optimization procedures such as
sequential RSM. Similar results not reported here were also observed for y(3), y(4),
39
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(2)
data points
standard model
DiGAR model
DiGAR** model
DiGARn model
true model
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(3)
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(4)
3.0 3.5 4.0 4.5 5.0x
0
2
4
6
8
10
12
y(5)
Figure 2.2: M/M/1 queue: using correlation matrix estimated based on 100,000
“offline” simulation replications.
and y(5).
2.3.2 Regression using a Quadratic Function
Sequential RSM generally employs a quadratic function to fit the data when
the algorithm approaches the optimal value, i.e, the linear fit has slope close to
0. Using the same single-server queue as in Example 1, we consider the following
objective function (used previously in many simulation optimization settings, e.g.,
[47, 48]):
y(k)(x) = E[Tk] + c/x, k = 2, 3, 4, 5, (2.30)
40
where c is a given constant. The additional term c/x in (2.30), which can be viewed
as a cost on server speed, makes the function convex with a unique minimizer x∗
that can be found analytically for comparison purposes.
The standard regression model using a quadratic function (in x) is given by
yi = β0+β1xi+β2x2i+i, and the DiGAR model is given by adding gi = β1+2β2xi+′i.
The resulting explicit parameter estimators for the OLS DiGAR model are provided
below.
Consider the loss function
L = 12
n∑
i=1
(yi − β2x2i − β1xi − β0)2 +
1
2
n∑
i=1
(gi − 2β2xi − β1)2
Differentiating with respect to β0, β1, β2,
∂L
∂β0
= −
n∑
i=1
(yi − β2x2i − β1xi − β0)
∂L
∂β1
= −
n∑
i=1
(yi − β2x2i − β1xi − β0)xi −
n∑
i=1
(gi − 2β2xi − β1)
∂L
∂β2
= −
n∑
i=1
(yi − β2x2i − β1xi − β0)x2i −
n∑
i=1
(gi − 2β2xi − β1)2xi
(2.31)
Setting them equal to 0 and solving yields the following estimators, β = Ay, where
β ≡ [β0, β1, β2]T , y =
[
n∑
i=1
x2i yi + 2
n∑
i=1
xigi,
n∑
i=1
xiyi +
n∑
i=1
gi,
n∑
i=1
yi
]T
,
and
A = 1abc− af 2 − be2 − cd2 + 2def








bc− f 2 ef − cd df − be
ef − cd ac− e2 de− af
df − be de− af ab− d2








,
with a =
n∑
i=1
x4i+4
n∑
i=1
x2i , b =
n∑
i=1
x2i , c = n, d =
n∑
i=1
x3i+2
n∑
i=1
xi, e =
n∑
i=1
x2i , f =
n∑
i=1
xi.
41
3.0 3.5 4.0 4.5 5.0x
6
8
10
12
14
16
18
20
22
24
y(2)
data points
standard model
DiGAR model
DiGAR* model
true model
3.0 3.5 4.0 4.5 5.0x
6
8
10
12
14
16
18
20
22
24
y(3)
3.0 3.5 4.0 4.5 5.0x
6
8
10
12
14
16
18
20
22
24
y(4)
3.0 3.5 4.0 4.5 5.0x
6
8
10
12
14
16
18
20
22
24
y(5)
Figure 2.3: M/M/1 queue quadratic fit (c ≈ 27, 10 replications).
42
Again using 10 replications per design point, the true and fitted models for c ≈
27 (actual value of c was chosen to make x∗ = 4 for y(2)), are plotted in Figure 2.3,
including the correlated model indicated by DiGAR*, where the correlation matrix is
also estimated using the small number of data points. The differences between both
DiGAR models and standard regression are substantial, as the convexity/concavity
curvature is often incorrect in the standard regression fitted model, e.g., for y(4)
and y(5) in several cases. Once again the OLS DiGAR model outperforms the
correlated version, again indicating the inadequacy of 10 replications for estimating
the correlation matrix.
Two additional cases, c = 1 (where x∗ lies to the left of the interval) and
c = 100 (where x∗ lies to the right of the interval), showing the fit of each of the
models over a much wider range outside the set of design points, where the incorrect
curvature of standard regression for several cases is more evident. Visually, it is clear
that DiGAR fits the curve much better than standard regression in all three cases.
We also investigated the performance gains when more replications are car-
ried out at each design point: 100, 1000, and 10000. indicate that the differences
between both DiGAR models and standard regression can be substantial, even up
to 1000 replications at each design point, where the standard model still has the
incorrect curvature. Furthermore, even with 100 replications at each design point,
the DiGAR* models still appear to be inferior to the OLS DiGAR models. Only
when 1000 or 10000 replications are used at each design point do the DiGAR* mod-
els show better performance than the OLS DiGAR model, but the superiority is
relatively slight (and not visually obvious in the figures), again indicating that the
43
Table 2.2: M/M/1 queue quadratic function optimal value x∗ obtained as a function
of # replications (c ≈ 27),
where boxed entries indicate a maximum rather than a minimum
10 reps 100 reps
true standard DiGAR DiGAR* standard DiGAR DiGAR*
y(2) 4.0 4.8 4.0 4.2 4.0 4.1 4.1
y(3) 3.5 9.9 3.7 3.9 4.0 3.1 3.5
y(4) 3.4 4.4 3.3 3.3 4.0 1.8 3.1
y(5) 3.1 9.8 3.4 3.6 4.1 1.0 2.8
1000 reps 10000 reps
true standard DiGAR DiGAR* standard DiGAR DiGAR*
y(2) 4.0 3.9 4.0 4.0 4.3 4.0 4.0
y(3) 3.5 4.2 3.4 3.4 0.3 3.4 3.4
y(4) 3.4 4.4 2.9 3.1 -25.2 3.0 3.0
y(5) 3.1 4.4 2.5 2.8 -6.4 2.6 2.7
OLS DiGAR model may be sufficiently robust for applications such as sequential
RSM.
Finally, Table 2.2 provides the estimates of x∗ implied by the fitted functions
as a function of the number of replications, where it is evident that the quadratic
fit provided by standard regression is problematic for optimization purposes even
with 10000 replications at each point; with less than 10000 replications per point,
the curvature is always in the incorrect direction for the c ≈ 27 case.
44
2.3.3 Multi-dimensional Example: U/U/1 Queue
Now we consider a multi-dimensional case using the same single-server queue-
ing example. Specifically, we take both the interarrival times and service times to
be i.i.d. following uniform distributions, i.e., a U/U/1 queue with respective dis-
tribution parameterizations U(θ1 − δ1, θ1 + δ1) and U(θ2 − δ2, θ2 + δ2), so that the
input variable is the four-dimensional vector x = (θ1, θ2, δ1, δ2). The output func-
tions considered are the same four as in the previous example, i.e., the mean system
time of the 2nd, 3rd, 4th, and 5th customers. Again using the Lindley equation,
the true theoretical dependence of the expected system time on the input distribu-
tional parameters can be calculated analytically and are included in the Electronic
Companion Section A.1.
The standard regression model is yi = β0 + β1θ1 + β2θ2 + β3δ1 + β4δ2 + i, and
the DiGAR model adds gji = βj + ji , where gji , j = 1, 2, 3, 4 represents the deriva-
tive of yi with respect to θ1, θ2, δ1, δ2, respectively. We simulated a two-level cen-
tered full-factorial design (thus, 17 design points), with center point (θ1, θ2, δ1, δ2) =
(10, 8, 8, 7), corresponding to U(2, 18) and U(1, 15) interarrival and service time dis-
tributions, respectively, with a spacing of ±0.1 (for all four dimensions) for the
design points. The metrics used are analogous to the ones used previously: L2 error
and mean squared error (MSE), where the latter also takes into account the bias,
i.e., MSE = Var(βˆi) + (βˆi − β∗i )2, where β∗i is the true value obtained using the
analytical formula, and the sample mean and sample variance are calculated for
each βˆji , i = 1, 2, 3, 4, j = 1, 2, · · · , N, based on N macroreplications with an equal
45
number of replications at each design point:
¯ˆβi =
1
N
N∑
j=1
βˆji , s2i = Var(βˆi) =
1
N − 1
N∑
j=1
(βˆji −
¯ˆβi)2.
For simplicity, only two DiGAR models are used: the basic one where no
weights need to be determined, i.e., (2.23) with αj = α0 ∀j; and α∗-DiGAR, for
which the “optimal” (proportional to variance) weights are estimated offline using
10,000 replications. As before, the output at each design point is based on the
mean of 10 replications, and the number of macroreplications is also N = 10. The
results shown in Table 2.3 indicate that both DiGAR estimators have smaller MSE
than the slope estimators from standard linear regression models by about two
orders of magnitude, where α∗-DiGAR models reduce MSEs as well as variances
further compared to the basic DiGAR models. In terms of L2 errors, both DiGAR
models are substantially better than standard linear regression model, but here the
improvement achieved by using optimal weights is small. Note that as in the 1-D
M/M/1 queue example, there are cases where standard linear regression model gives
the incorrect sign for an estimated slope.
To visualize the results in terms of the slopes, boxplots for estimators βˆi, i =
1, 2, 3, 4, for the 2nd, 3rd, 4th and 5th customers are shown in Figure 2.4. Each box-
plot is labelled as “parameter:model”. For instance, “θ1:Linear Reg” represents
the boxplot for βˆ1 estimators using the standard linear regression model. The true
gradient values calculated from the analytical formulas are indicated in the boxplots
by stars. The boxplots further illustrate that the variances of the estimators ob-
tained from standard linear regression are significantly larger than variances of the
46
estimators using the DiGAR models, differing by about two orders of magnitude.
The differences between DiGAR and α∗-DiGAR models are not clear from the box-
plots, but the sample standard deviations suggest that α∗-DiGAR models are able
to reduce variances of the estimators by choosing optimal weights.
DiGAR and α∗-DiGAR models also provide better estimators in terms of ab-
solute errors. The median values are indicated by red segments in all boxplots.
Visually, the median values from the DiGAR models are indistinguishable from the
true gradient values indicated by red stars, while median values from standard linear
regression models are far away from the true gradient values in most cases. Com-
pared to standard linear regression model, the sample means ¯ˆβi from DiGAR and
α∗-DiGAR models are also much closer to the true gradient values.
The effect of increasing the number of replications at each design point is
shown in Table 2.4, which provides experiment results for all slope estimators for
y(2). As expected, all of the models show improvement, but the relative advantages
of the DiGAR models – well over two orders of magnitude improvement in MSE
and between one to two orders of magnitude improvement in L2 – is retained.
2.3.4 Multi-dimensional Example: Sphere Function
Lastly, we consider a more stylized example to test the robustness of DiGAR
models when the gradient estimates have much larger variances and for different
levels of correlations between the response and gradient estimates. We consider the
sphere function defined by f(x) = ∑ni=1(xi)2, with partial derivative ∂f(x)/∂xi =
47
Table 2.3: Parameter estimates and performance metrics for U/U/1 queue (10 repli-
cations per design point),
based on 10 macroreplications (boxed entries indicate incorrect sign of slope)
Linear Regression DiGAR α∗-DiGAR
y(2)
true value value MSE value MSE value MSE
βˆ1 -0.375 -2.498 26.1 -0.397 0.0024 -0.378 0.0009
βˆ2 1.375 -1.031 26.2 1.355 0.0030 1.377 0.0009
βˆ3 0.171 4.486 25.9 0.215 0.0036 0.175 0.0008
βˆ4 0.146 0.835 19.9 0.155 0.0038 0.149 0.0021
L2 8.31 0.141 0.139
y(3)
βˆ1 -0.720 -2.140 22.7 -0.742 0.0048 -0.729 0.0068
βˆ2 1.720 -1.292 22.5 1.701 0.0078 1.728 0.0069
βˆ3 0.279 3.052 18.3 0.337 0.0060 0.312 0.0032
βˆ4 0.255 -0.164 18.7 0.233 0.0036 0.237 0.0014
L2 6.95 0.088 0.086
y(4)
βˆ1 -1.065 -3.243 22.0 -1.029 0.0169 -1.009 0.0141
βˆ2 2.065 0.538 31.2 1.995 0.0203 2.008 0.0142
βˆ3 0.362 5.199 50.1 0.430 0.0108 0.386 0.0029
βˆ4 0.360 -0.319 26.7 0.305 0.0114 0.311 0.0069
L2 11.0 0.220 0.216
y(5)
βˆ1 -1.427 -1.544 13.1 -1.309 0.0367 -1.307 0.0327
βˆ2 2.427 1.111 20.4 2.296 0.0396 2.306 0.0328
βˆ3 0.431 3.542 45.0 0.483 0.0059 0.455 0.0024
βˆ4 0.467 1.469 39.4 0.413 0.0099 0.403 0.0064
L2 10.0 0.174 0.171
48
β 1: 
Line
ar R
eg
β 1: 
DiG
AR
θ 1: 
α∗ -
DiG
AR
β 2: 
Line
ar R
eg
β 2: 
DiG
AR
θ 2: 
α∗ -
DiG
AR
β 3:L
inea
r Re
g
β 3: 
DiG
AR
β 3: 
α∗ -
DiG
AR
β 4:L
inea
r Re
g
β 4: 
DiG
AR
β 4: 
α∗ -
DiG
AR
10
5
0
5
10
Es
tim
ate
d P
ara
me
ter
 Va
lue
s
β 1: 
Line
ar R
eg
β 1: 
DiG
AR
θ 1: 
α∗ -
DiG
AR
β 2: 
Line
ar R
eg
β 2: 
DiG
AR
θ 2: 
α∗ -
DiG
AR
β 3:L
inea
r Re
g
β 3: 
DiG
AR
β 3: 
α∗ -
DiG
AR
β 4:L
inea
r Re
g
β 4: 
DiG
AR
β 4: 
α∗ -
DiG
AR
10
8
6
4
2
0
2
4
6
8
Es
tim
ate
d P
ara
me
ter
 Va
lue
s
β 1: 
Line
ar R
eg
β 1: 
DiG
AR
θ 1: 
α∗ -
DiG
AR
β 2: 
Line
ar R
eg
β 2: 
DiG
AR
θ 2: 
α∗ -
DiG
AR
β 3:L
inea
r Re
g
β 3: 
DiG
AR
β 3: 
α∗ -
DiG
AR
β 4:L
inea
r Re
g
β 4: 
DiG
AR
β 4: 
α∗ -
DiG
AR
10
5
0
5
10
15
20
Es
tim
ate
d P
ara
me
ter
 Va
lue
s
β 1: 
Line
ar R
eg
β 1: 
DiG
AR
θ 1: 
α∗ -
DiG
AR
β 2: 
Line
ar R
eg
β 2: 
DiG
AR
θ 2: 
α∗ -
DiG
AR
β 3:L
inea
r Re
g
β 3: 
DiG
AR
β 3: 
α∗ -
DiG
AR
β 4:L
inea
r Re
g
β 4: 
DiG
AR
β 4: 
α∗ -
DiG
AR
10
5
0
5
10
15
20
Es
tim
ate
d P
ara
me
ter
 Va
lue
s
Figure 2.4: U/U/1 queue box plots of four estimated slopes for y(k) based on 10
macroreplications
49
Table 2.4: Parameter estimates and performance metrics for U/U/1 queue y(2) w.r.t.
# replications/design point
Linear Regression DiGAR α∗-DiGAR
# reps true value value MSE value MSE value MSE
1
βˆ1 -0.375 -3.168 155 -0.397 0.025 -0.371 0.009
βˆ2 1.375 -9.977 270 1.265 0.035 1.370 0.009
βˆ3 0.171 6.012 141 0.236 0.019 0.182 0.007
βˆ4 0.146 -6.365 67.2 0.057 0.036 0.116 0.026
L2 54.0 1.227 1.213
10
βˆ1 -0.375 -2.498 26.1 -0.397 0.0024 -0.378 0.0009
βˆ2 1.375 -1.031 26.2 1.355 0.0030 1.377 0.0009
βˆ3 0.171 4.486 25.9 0.215 0.0036 0.175 0.0008
βˆ4 0.146 0.835 19.9 0.155 0.0038 0.149 0.0021
L2 8.31 0.141 0.139
100
βˆ1 -1.065 -0.914 3.08 -1.004 0.00434 -1.005 0.00422
βˆ2 2.065 2.074 2.39 2.006 0.00482 2.005 0.00424
βˆ3 0.362 0.996 4.84 0.379 0.00084 0.373 0.00026
βˆ4 0.360 -0.484 1.61 0.309 0.00278 0.316 0.00227
L2 1.03 0.036 0.037
1000
βˆ1 -0.375 -0.338 0.230 -0.377 0.00005 -0.378 0.00003
βˆ2 1.375 1.673 0.162 1.380 0.00005 1.378 0.00003
βˆ3 0.171 0.294 0.105 0.174 0.00002 0.173 0.00001
βˆ4 0.146 0.181 0.124 0.146 0.00002 0.145 0.00001
L2 0.0525 0.0009 0.0008
50
2xi. We used n = 4 in our numerical experiments, so the standard and DiGAR
models are the same as in the previous example. To control the variances and
correlations, the noises i and (1i , 2i , . . . , 4i ) form a random vector in R5 following
a multivariate normal distribution with mean 0 and covariance matrix Σ. Four
different levels of correlations are considered: the independent case (ρ = 0), and
relatively low, medium, and high levels ρ = 0.2, 0.5, 0.8. The noise variances were set
at 10, 20, 30, 40, 50 for i, 1i , 2i , 3i , 4i , respectively. A two-level centered full factorial
design with center point (x1, x2, x3, x4) = (1.0,−0.6, 0.8,−0.5) was considered. Two
sets of experiments were performed at two different grid sizes for the two-level design,
namely 0.5 and 0.05, which correspond to cases where design points are spread out
or fairly close. The mean at each design point was based on 10 replications, and
1000 macroreplications were used to estimate the MSE, with the results summarized
in Tables 2.5 and 2.6, where DiGAR refers to the OLS DiGAR and α∗-DiGAR uses
“optimal” (proportional to variance) weights.
In Table 2.5, results across different levels of correlations are consistent. The
α∗-DiGAR model is the best among three models. Since the variances increase from
1i to 4i , this leads to increases in MSE from βˆ1 to βˆ4 in both DiGAR models, but
both DiGAR models still outperform the standard linear regression models and the
α∗-DiGAR model reduces the MSE further from the OLS DiGAR model.
The results in Table 2.6 show higher MSEs, due to the more tightly spaced
design points, which leads to larger variance in the estimators. For the standard
model, the MSEs are about 100 times higher, whereas the DiGAR models are only
two to three times higher, i.e., the relative advantage of the DiGAR models in-
51
creases significantly in this setting, consistent with Proposition 2.5 and the remarks
following it.
As expected, increasing the variances of gradient estimators leads to inflation of
MSE in the DiGAR models; however, the relative advantage of the DiGAR models
is retained. The effects of changing the level of correlation does not affect the
performances of both DiGAR models under these two experiment designs as shown
in Table 2.5 and 2.6. We also conducted experiments at three different levels of
negative correlations. In this case, the MSE increases noticeably for all models with
increasing magnitude of the level of correlation, but the relative performance of the
models is similar to those for positive correlation, with the advantage of the DiGAR
models even greater for the larger magnitudes.
2.4 Conclusion
In this chapter we proposed an augmented regression method that exploits the
availability of direct gradient estimators in certain simulation settings. In some basic
settings, we analytically characterized the improvement obtained over the standard
model by calculating the variance of the estimated parameters and showing under
which conditions guaranteed performance improvement can be expected. A simple
queue was then used to numerically investigate the improvements, and the numer-
ical results indicated great promise for the approach, with the general observation
that the DiGARs models are qualitatively able to capture trends that the standard
model might miss, e.g., there were several cases where the standard model gave the
52
incorrect sign of the slope or was oriented in the opposite direction for the quadratic
fit. Not surprisingly, the DiGAR slope estimators had much smaller variance than
the standard estimator in all of the experiments conducted, retaining the relative
advantage even in the higher-dimensional numerical examples. Although only a
small portion of results are provided here, these observations hold for many other
experiments for this simple queueing model.
Of particular note is the observation that in the queueing example where the
gradient estimators are highly correlated with the dependent output data, the α-
DiGAR models clearly outperform the standard regression models and for all prac-
tical purposes do as well as the correlated DiGAR models, and significantly better
when the number of simulation replications at each design point is relatively small.
Since the basic DiGAR estimators are quite easy to implement, this has practical
implications for immediate use in those settings in which gradient estimators are
available. Thus, we recommend using an α-DiGAR estimator in settings where the
number of replications is relatively low and the gradient estimate is reasonably ac-
curate, which is generally the case if IPA is applicable. However, more investigation
into the effects of misspecification of the correlation structure is warranted before
more conclusive statements can be made for the DiGAR* GLS estimators, and the
effects are likely to be highly dependent on the application setting. Furthermore,
using design of experiments, e.g., along the lines of [49], in choosing the design points
could possibly ameliorate the misspecification problem.
53
Table 2.5: Parameter estimates and performance metrics for the sphere function for
two-level centered full factorial design around (1,−0.6, 0.8,−0.5); 10 replication per
design point, gridsize 0.5, 1000 macroreplications.
Linear Regression DiGAR α∗-DiGAR
correlation true value value MSE value MSE value MSE
zero
ρ = 0
βˆ1 2.0 2.001 0.249 2.007 0.093 2.006 0.088
βˆ2 -1.2 -1.238 0.259 -1.206 0.124 -1.215 0.109
βˆ3 1.6 1.589 0.265 1.615 0.174 1.605 0.127
βˆ4 -1.0 -1.016 0.265 -0.976 0.202 -0.993 0.136
low
ρ =
0.2
βˆ1 2.0 2.009 0.241 1.999 0.091 2.000 0.083
βˆ2 -1.2 -1.235 0.250 -1.200 0.132 -1.209 0.106
βˆ3 1.6 1.571 0.239 1.602 0.159 1.591 0.113
βˆ4 -1.0 -1.013 0.258 -1.002 0.207 -1.007 0.140
medium
ρ =
0.5
βˆ1 2.0 2.002 0.252 2.019 0.084 2.017 0.076
βˆ2 -1.2 -1.203 0.266 -1.192 0.127 -1.195 0.108
βˆ3 1.6 1.597 0.252 1.605 0.165 1.602 0.125
βˆ4 -1.0 -1.002 0.265 -1.000 0.202 -1.001 0.137
high
ρ =
0.8
βˆ1 2.0 2.016 0.246 1.980 0.093 1.985 0.087
βˆ2 -1.2 -1.191 0.249 -1.220 0.123 -1.212 0.104
βˆ3 1.6 1.612 0.230 1.584 0.163 1.594 0.120
βˆ4 -1.0 -1.017 0.262 -1.033 0.200 -1.026 0.128
54
Table 2.6: Parameter estimates and performance metrics for the sphere function for
two-level centered full factorial design around (1,−0.6, 0.8,−0.5); 10 replication per
design point, gridsize 0.05, 1000 macroreplications.
Linear Regression DiGAR α∗-DiGAR
correlation true value value MSE value MSE value MSE
zero
ρ = 0
βˆ1 2.0 2.070 23.660 2.007 0.116 2.007 0.115
βˆ2 -1.2 -1.395 24.683 -1.194 0.169 -1.195 0.169
βˆ3 1.6 1.584 24.645 1.607 0.248 1.607 0.247
βˆ4 -1.0 -1.099 24.780 -1.002 0.301 -1.003 0.300
low
ρ =
0.2
βˆ1 2.0 2.087 24.469 1.991 0.124 1.992 0.124
βˆ2 -1.2 -1.481 26.743 -1.201 0.182 -1.202 0.182
βˆ3 1.6 1.503 25.016 1.608 0.228 1.608 0.226
βˆ4 -1.0 -0.919 26.534 -0.991 0.311 -0.990 0.310
medium
ρ =
0.5
βˆ1 2.0 2.003 25.409 1.980 0.124 1.980 0.125
βˆ2 -1.2 -1.206 25.095 -1.232 0.177 -1.232 0.177
βˆ3 1.6 1.672 26.327 1.590 0.222 1.590 0.221
βˆ4 -1.0 -1.144 24.180 -1.011 0.290 -1.012 0.288
large
ρ =
0.8
βˆ1 2.0 1.839 24.427 1.979 0.116 1.979 0.116
βˆ2 -1.2 -1.367 25.998 -1.221 0.171 -1.222 0.170
βˆ3 1.6 1.494 25.830 1.563 0.229 1.563 0.227
βˆ4 -1.0 -1.052 24.366 -1.042 0.281 -1.043 0.278
55
Chapter 3: Gradient Extrapolated Stochastic Kriging
3.1 Introduction
Consider the same stochastic simulation setting as in Chapter 2, where direct
gradient information are available. DiGAR model proposed in Chapter 2 incor-
porate gradient estimates into regression models by building a separate model for
the gradient estimates. Aside from regression, kriging has also been studied exten-
sively in the deterministic simulation community (see, for example, [23] and [50]).
Stochastic kriging was introduced by [21] as an extension of kriging in the stochas-
tic simulation setting. Stochastic kriging provides flexible metamodels of simulation
output performance measurements while taking simulation noise into consideration.
[25] introduced stochastic kriging with gradient estimators (SKG) to exploit
gradient estimates in stochastic kriging, showing that the new approach provides
better prediction with smaller mean squared error (MSE). This approach is similar
to cokriging proposed in deterministic simulations [26], and requires differentiability
of the correlation functions, since derivatives of random processes or random fields
are used to model gradient estimates.
We take a different approach to incorporate gradient estimates into stochastic
kriging and investigate the potential improvements. A new approach called Gradient
56
Extrapolated Stochastic Kriging (GESK) is proposed, which extrapolates additional
responses in the neighborhood of each design point using the original responses and
gradient estimates. These additional responses, which might be biased, lead to
better predictions than stochastic kriging if step sizes for extrapolations are chosen
carefully. The main idea is to further explore the response surface with simulation
responses and gradient estimates, so that a metamodel with better overall accuracy
can be constructed. This suggests that GESK models are superior when there are
limited number of design points or a response surface contains multiple extreme
values.
To investigate the performance of GESK, we analyze the possible reduction
in MSE of the GESK model over the standard stochastic kriging model, under two
simplified and tractable settings. Conditions that guarantee reduction in MSE are
provided as well. We also conduct numerical experiments to illustrate the effective-
ness of the GESK model. Numerical results show that GESK performs comparably
well or outperforms competing approaches such as stochastic kriging and SKG.
Moreover, in certain cases, GESK captures fluctuations of the response surface that
are usually missed by the other two approaches.
An important part of implementing the GESK model is the choice of step
size. Large step sizes usually lead to large approximation errors and deteriorate
prediction accuracy; small step sizes gain little information from extrapolations
and might lead to numerical stability issues. We formalize two different strategies,
penalized maximum likelihood estimation (PMLE) and minimizing integrated mean
squared error (IMSE), to determine optimal step sizes. A cross validation method is
57
presented to determine the regularization parameters required by each of the PMLE
and IMSE approaches. We discuss pros and cons for each approach and compare
them empirically with numerical examples.
In this section, we review stochastic kriging introduced in [21] and stochastic
kriging with gradient estimators (SKG) introduced in [25], and then present the
GESK approach.
3.1.1 Stochastic Kriging
Stochastic kriging was introduced in [21], focusing on modeling unknown
response surfaces in stochastic simulation settings. Given an experiment design
{(xi, ni)}, i = 1, 2, · · · , k, ni simulation replications are run at each design point
xi. Let Yj(xi) be the simulation output from replication j at design point xi,
j = 1, . . . , ni, and xi = (xi1, xi2, . . . , xid)ᵀ ∈ Rd. Stochastic kriging models the
output as
Yj(xi) = f(xi)ᵀβ + M(xi) + j(xi), (3.1)
where f(xi) ∈ Rp is a vector with known functions of xi, β ∈ Rp is a vector with
unknown parameters to be estimated. Components in f(xi) can be viewed as basis
functions and a polynomial basis is usually adopted in the literature. The term
f(xi)ᵀβ represents the trend of the overall response surface. It is assumed that M
is a realization of a zero-mean stationary random process (or random field) of the
second order. This assumption is inherited from the deterministic kriging literature,
where the stochastic nature of M is imposed on the problem so that statistical
58
inference can be applied. For this reason, M is sometimes referred to as extrinsic
uncertainty. This is contrasted with the term j(xi), which is the simulation noise
for replication j taken at xi. The uncertainty in j(xi) comes from the nature of
stochastic simulation, and it is sometimes referred to as intrinsic uncertainty.
Given the simulation responses {Yj(xi)}nij=1, i = 1, 2, . . . , k, the sample mean
of response output and simulation noise values at xi are denoted by
Y¯(xi) =
1
ni
ni∑
j=1
Yj(xi), ¯(xi) =
1
ni
ni∑
j=1
j(xi). (3.2)
The averaged responses Y¯(xi) at xi is modeled as
Y¯(xi) = f(xi)ᵀβ + M(xi) + ¯(xi).
Suppose that we would like to predict the response Y(x0) at any point x0. Let
Y¯ =
(
Y¯(x1), Y¯(x2), . . . , Y¯(xk)
)ᵀ. Let ΣM be the k× k covariance matrix implied by
the random field M and Σ be the k×k covariance matrix implied by the simulation
noise across all design points {x1,x2, · · · ,xk}. Let ΣM(x0, ·) be the k × 1 vector
(Cov(M(x0),M(x1)), . . . ,Cov(M(x0),M(xk)))ᵀ, which represents spatial covariances
between a prediction point x0 and all design points. Also, define the k × p design
matrix F as F = (f(x1), f(x2), . . . , f(xk))ᵀ. Suppose that ΣM, Σ and β are known.
Then the MSE-optimal predictor at x0 is of the form
Ŷ(x0) = f(x0)ᵀβ + ΣM(x0, ·)ᵀ[ΣM + Σ]−1(Y¯ − Fβ), (3.3)
with corresponding MSE
MSE
(
Ŷ(x0)
)
= ΣM(x0,x0)−ΣM(x0, ·)ᵀ [ΣM + Σ]−1 ΣM(x0, ·), (3.4)
59
where ΣM(x0,x0) is the spatial variance of the random field at x0. To build a
stochastic kriging metamodel in practice requires imposing some structure on the
spatial covariance matrix ΣM(·, ·). It is usually assumed that the spatial covariance
between M(xi) and M(xj) is
ΣM(xi,xj) = Cov [M(xi),M(xj)] = τ 2RM(xi,xj; θ), (3.5)
where τ 2 is the spatial variance of the random field and RM is a correlation function
with parameter θ. The assumption that M is second-order stationary allows us
to write RM(xi,xj; θ) = RM(|xi − xj|; θ), i.e., the correlation depends only on
the distance between xi and xj. Common candidates for the correlation function
include the triangular correlation function, the Gaussian correlation function and
the Mate´rn correlation function, etc. See [51] for a detailed discussion on effects of
using different correlation functions in stochastic kriging.
3.1.2 Stochastic Kriging With Gradient Estimators
We review the framework of stochastic kriging with gradient estimators (SKG)
introduced by [25]. SKG builds stochastic kriging models for gradient estimators
upon the stochastic kriging model for simulation responses. These two types of
models are estimated together and applied to approximate response surfaces.
Suppose that we observe not only the simulation responses Yj(xi), but also
unbiased gradient estimates Gj(xi) ∈ Rd for the jth simulation replication at design
point xi. Given an experimental design {(xi, ni)}ki=1, let the gradient estimate from
replication j at design point xi be Gj(xi) =
(
G1j (xi), . . . ,Gdj (xi)
)ᵀ. In the SKG
60
framework, each response Yj(xi) is modeled the same as in stochastic kriging and
each gradient estimator Grj (xi), r = 1, . . . , d, is modeled as
Grj (xi) =
(∂f(xi)
∂xir
)ᵀ
β + ∂M(xi)∂xir
+ δrj (xi). (3.6)
This is valid under the following conditions:
• The function f(xi) is differentiable with respect to xi.
• The second-order mixed derivative of the correlation function RM in (3.5)
exists and is continuous.
Let G¯r(xi) and δ¯r(xi), r = 1, . . . , d, be the sample average of the gradient
estimates and simulation noise, respectively, associated with xi:
G¯r(xi) =
1
ni
ni∑
j=1
Grj (xi), δ¯r(xi) =
1
ni
ni∑
j=1
δrj (xi).
The SKG framework models the averaged simulation responses and gradient esti-
mates as follows:
Y¯(xi) = f(xi)ᵀβ + M(xi) + ¯(xi),
G¯r(xi) =
(∂f(xi)
∂xir
)ᵀ
β + ∂M(xi)∂xir
+ δ¯r(xi).
To satisfy the conditions required for (3.6) to hold, a common choice for the cor-
relation function is the Gaussian correlation function. Let ΣM+ be the variance-
covariance matrix including spatial covariances induced by M, spatial covariances
induced by derivatives of M and those between M and its partial derivatives. Let
ΣM+(x0, ·) be the vector analogous to ΣM(x0, ·) in stochastic kriging. We assume
replications across design points are independent. In addition, simulation noise j
61
and δj are assumed to be independent from M. The covariance matrix Σ+ induced
by simulation noise can be estimated by the sample covariances in practice.
Let Y¯+ be the vector containing sample averages of response estimates and
gradient estimates at all design points:
Y¯+ =
(
Y¯(x1), . . . , Y¯(xk), G¯1(x1), . . . , G¯1(xk), . . . , G¯d(x1), . . . , G¯d(xk)
)ᵀ .
The design matrix F in Section 2.1 now becomes F+, which can be written as
F+ =
(
f(x1), . . . , f(xk),
(∂f(x1)
∂x11
)
, . . . ,
(∂f(xk)
∂xk1
)
, . . . ,
(∂f(x1)
∂x1k
)
, . . . ,
(∂f(xk)
∂xkk
))ᵀ
.
When β is known, the SKG predictor and the corresponding MSE can be obtained
by substituting Y¯+,F+,ΣM+ ,ΣM+(x0, ·) and Σ+ for Y¯ ,F,ΣM,ΣM(x0, ·) and Σ
in (3.3) and (3.4), respectively. Under some simplified settings, [25] shows that
SKG can reduce MSE by incorporating gradient estimates. Numerical experiments
also demonstrate the advantage of SKG in improving prediction performance over
stochastic kriging.
3.2 Gradient Extrapolated Stochastic Kriging
We propose a different approach for incorporating the gradient estimates called
Gradient Extrapolated Stochastic Kriging (GESK). Again, let Gj(xi) ∈ Rd be the
gradient estimator at xi from replication j. Instead of modeling gradient esti-
mates Gj(xi) as partial derivatives of the response surface, the gradient estimates
are simply viewed as noisy measurements of the true gradient G(xi) ∈ Rd, i.e.,
Gj(xi) = G(xi) + δj(xi), where {δj(xi)}nij=1 represent the zero-mean independent
62
identically distributed noise across different replications at the design point xi. De-
note the sample mean of gradient estimates at xi by
G¯(xi) =
1
ni
ni∑
j=1
Gj(xi).
Notice that the response estimate Yj(xi) and the gradient estimate Gj(xi) within
the same replication j are generally correlated.
To incorporate gradient estimates into stochastic kriging, we extrapolate in
the neighborhood of the original design points xi, i = 1, 2, · · · , k. Specifically, linear
extrapolation is used to obtain additional responses as follows:
x+i = xi + ∆xi, Yj(x+i ) = Yj(xi) + Gj(xi)ᵀ∆xi, (3.7)
where ∆xi = (∆xi1,∆xi2, . . . ,∆xid)ᵀ, and the step size ∆xi needs to be small rela-
tive to the spacing of xi. For simplicity, we assume that only one additional point
is added in the neighborhood of xi and that the same step size is used for all design
points, i.e., ∆xi = ∆x, i = 1, 2, . . . , k. Extensions include using more sophisti-
cated extrapolation techniques and extrapolating multiple additional responses in
the neighborhood of xi.
Let Y¯(x+i ) be the sample average of these extrapolated response outputs, which
is defined similarly as Y¯(xi) in (3.2). For ease of notation, let Y¯i = Y¯(xi) and
Y¯+i = Y¯(x+i ). Let Y¯+ be the 2k × 1 vector containing both the original responses
and the additional responses:
Y¯+ =
(
Y¯1, Y¯2, · · · , Y¯k, Y¯+1 , Y¯+2 , · · · , Y¯+k
)ᵀ .
63
Similarly, x+ is defined as
x+ = (x1,x2, · · · ,xk,x+1 ,x+2 , · · · ,x+k )ᵀ.
The sample average of the additional responses Y¯+i are modeled similarly to the
original responses Y¯i, i.e.,
Y¯+i = Y¯(x+i ) = f(x+i )ᵀβ + M(x+i ) + ¯(x+i ).
It is worth mentioning that this approach of incorporating gradient information is
not restricted to stochastic kriging, but it is a general approach that can be applied
to other metamodel approaches. The following assumptions are made:
Assumption 3.1. 1. Simulations across design points are conducted indepen-
dently, i.e., the use of common random numbers (CRN) is not considered.
2. For any design point xi, the noise j(xi) are independent across replications.
3. The random field M is independent of all noise j(xi) and j(x+i ), for each
design point xi and replication j.
4. The simulation noise ¯(xl) is independent of ¯(x+h ) for h 6= l.
[52] finds that using CRN in stochastic kriging inflates mean squared errors
generally. Assuming independence across replications and independence between
M and simulation noise is inherited from the stochastic kriging literature. The
last assumption says that the original simulation response is correlated with its
corresponding extrapolated response, but not other extrapolated responses.
64
Let Σ+M be the 2k×2k variance-covariance matrix implied by the extrinsic spa-
tial correlation model with 2k design points, including extrapolated design points:
Σ+M =






















Cov[M(x1),M(x1)] · · · Cov[M(x1),M(xk)] Cov[M(x1),M(x
+
1 )] · · · Cov[M(x1),M(x
+
k )]
...
...
...
...
...
...
Cov[M(xk),M(x1)] · · · Cov[M(xk),M(xk)] Cov[M(xk),M(x
+
1 )] · · · Cov[M(xk),M(x
+
k )]
Cov[M(x+1 ),M(x1)] · · · Cov[M(x
+
1 ),M(xk)] Cov[M(x
+
1 ),M(x
+
1 )] · · · Cov[M(x
+
1 ),M(x
+
k )]
...
...
...
...
...
...
Cov[M(x+k ),M(x1)] · · · Cov[M(x
+
k ),M(xk)] Cov[M(x
+
k ),M(x
+
1 )] · · · Cov[M(x
+
k ),M(x
+
k )]






















,
where each entry in Σ+M can be computed by (3.5) with a given correlation function
RM and spatial variance τ 2.
Let ¯+ ∈ R2k be the augmented vector of mean simulation noise:
¯+ =
(
¯(x1), . . . , ¯(xk), ¯(x+1 ), . . . ¯(x+k )
)ᵀ .
Under Assumption 3.1, let Σ+ be the 2k × 2k variance-covariance matrix induced
by ¯+, which can be expressed as
Σ+ =




















Var[¯(x1)] 0 . . . 0 Cov[¯(x1), ¯(x+1 )] 0 . . . 0
0 Var[¯(x2)] . . . 0 0
. . . . . . 0
...
...
. . .
...
...
...
. . .
...
0 0 . . . Var[¯(xk)] 0 0 . . . Cov[¯(xk), ¯(x+k )]
Cov[¯(x+1 ), ¯(x1)] 0 . . . 0 Var[¯(x
+
1 )] 0 . . . 0
0
. . . . . . 0 0 Var[¯(x+1 )] . . . 0
...
...
. . .
...
...
...
. . .
...
0 0 . . . Cov[¯(x+k ), ¯(xk)] 0 0 . . . Var[¯(x
+
k )]




















.
Let x0 be a prediction point and Σ+M(x0, ·) be a 2k × 1 vector
Σ+M(x0, ·) =
(
Cov[M(x0),M(x1)], . . . ,Cov[M(x0),M(x+k )]
)ᵀ ,
65
which represents spatial covariances between x0 and design points, including those
extrapolated design points. The augmented design matrix F+ can be expressed as
F+ =
(
f(x1), . . . , f(xk), f(x+1 ), . . . , f(x+k )
)ᵀ .
When β, Σ+M and Σ+ are known, the MSE-optimal predictor from the GESK model
and its corresponding MSE can be constructed by substituting Y¯+, F+, Σ+M(x0, ·),
Σ+M and Σ+ for Y¯ , F, ΣM(x0, ·), ΣM and Σ in (3.3) and (3.4), respectively.
In practice, β, Σ+M and Σ+ are unknown and need to be estimated. The
augmented matrix Σ+M is characterized by the spatial variance τ 2 and correla-
tion function with parameters θ. We assume that the simulation noise vectors
+j =
(
j(x1), . . . , j(xk), j(x+1 ), . . . j(x+k )
)ᵀ are multivariate normally distributed
with mean zero and covariance matrix Σ+ . Given the assumption, we first esti-
mate Σ+ . Our approach to estimate Var[¯(xi)], Var[¯(x+i )] and Cov[¯(xi), ¯(x+i )],
i = 1, 2, . . . , k will be described in the following. Estimation of Var[¯(xi)] is
V̂ar[¯(xi)] =
1
ni
[
1
ni − 1
ni∑
j=1
(
Yj(xi)− Y¯(xi)
)2
]
.
Estimation for Var[¯(x+i )] can be done in a similar fashion by replacing Yj(xi) by
Yj(x+i ). The covariance Cov[¯(xi), ¯(x+i )] is estimated by the sample covariance as
Ĉov[¯(xi), ¯(x+i )] =
1
ni
[
1
ni − 1
ni∑
j=1
(
Yj(xi)− Y¯(xi)
) (
Yj(x+i )− Y¯(x+i )
)
]
.
This provides us an estimate Σ̂+ for Σ+ . Combining this with normality assump-
tions, we can estimate the set of parameters (β, τ 2,θ) together using maximum
likelihood estimators (MLEs) as described in [21].
66
Key to implementing the GESK model is the choice of step sizes for the ex-
trapolated points, which depends on analyzing the potential improvements in per-
formance from the GESK model as well as the approximation errors introduced
by extrapolation. A good GESK model should take this bias-variance type tradeoff
into consideration. We consider two tractable models: a two-point problem and a k-
point problem with known model parameters. Under these two settings, we analyze
the potential improvement in MSE by the GESK model over the stochastic kriging
model, and provide conditions under which such improvement can be guaranteed.
In addition, we also analyze the effects of step sizes on MSE following the
discussions of the two-point problem and the k-point problem in Sections 3.2.1 and
3.2.2. Understanding the effects of step sizes will provide insights for determining
step sizes, which will be discussed later in detail in Section 3.3. In the following
discussion, we continue to assume that the same step size is used for extrapola-
tion at each design point and only one additional response is extrapolated in the
neighborhood of each original design point.
The step size ∆x determines the MSE of the GESK predictor through several
factors: the biases ζi in the extrapolated responses; the correlation ρi between the
simulation noise of original responses and extrapolated responses; and the variances
σ2i+ of the simulation noise in extrapolated responses. Since linear extrapolation
is employed in the GESK model, the bias ζi in the extrapolated response Y¯+i is
bounded by K||∆x||2 for some K > 0. This suggests that the upper bound of biases
can be controlled by choosing different step sizes. The correlation ρi depends on
both the step size and the covariance between the simulation noise of the responses
67
and those of the gradient estimators. A larger step size or smaller covariance leads to
smaller correlation factor ρi, whereas σ2i+ changes as the step size changes, but also
depends on the sign of the correlation ρi. Effects of these factors will be discussed
in detail in the following using the two-point problem and k-point problem as in
Sections 3.2.1 and 3.2.2.
3.2.1 A Two-Point Problem with Single Extrapolated Point
Consider a one-dimensional problem (d = 1) of two design points x1 and x2
with numbers of replications n1 and n2, respectively. Without loss of generality, let
x1 < x2 and the prediction point be x0 ∈ [x1, x2]. The simulation outputs include
responses {Yj(xi)}nij=1 for i = 1, 2 at both design points and gradient estimators
{Gj(x1)}n1j=1 at x1 only. A constant trend is used to represent the overall surface
mean, i.e., f(xi)ᵀβ = β0. All parameters (β0, τ 2, θ) are assumed to be known.
Let the spatial variance τ 2 > 0 and ril be the correlation between M(xi) and
M(xl), i, l = 0, 1, . . . , k. The correlation ril can be calculated from the correla-
tion function RM(xi, xl; θ), but no specific correlation function is specified in this
discussion. Let the variance of the simulation noise at xi from replication j be
Var[j(xi)] = σ2i .
Let Y¯ = (Y¯1, Y¯2)ᵀ be the vector containing the sample means of responses at
x1 and x2. The stochastic kriging predictor at x0 is given as
Yˆ(x0) = β0 +τ 2
(r1(τ 2 + σ
2
2
n2
)− r2τ 2r12)(Y¯1 − β0) + (r2(τ 2 + σ
2
1
n1
)− r1τ 2r12)(Y¯2 − β0)
(τ 2 + σ
2
1
n1
)(τ 2 + σ
2
2
n2
)− τ 4r212
,
(3.8)
68
with corresponding MSE
MSE
(
Yˆ(x0)
)
= τ 2
(
1− τ 2
(r201 + r202)τ 2 +
r201σ
2
2
n2
+ r
2
02σ
2
1
n1
− 2r01r02r12τ 2
(τ 2 + σ
2
1
n1
)(τ 2 + σ
2
2
n2
)− τ 4r212
)
. (3.9)
With a pre-specified step size ∆x, a new design point x+1 = x1 + ∆x in the interval
[x1, x2] is added and GESK extrapolates its response as Yj(x+1 ) = Yj(x1)+∆xGj(x1).
This additional response output is modeled as Yj(x+1 ) = β0 + M(x+1 ) + j(x+1 ). To
address approximation error introduced by extrapolation, we assume that j(x+1 ) is
normally distributed with mean ζi = ζ(xi) and variance σ21+ ; thus, the extrapolated
responses Yj(x+1 ) at x+1 are biased unless ζi = 0.
Let Y¯+1 be the sample mean of responses at x+1 and the vector Y¯+ = (Y¯1, Y¯2, Y¯+1 )ᵀ.
Let ρ1 be the correlation between ¯(x1) and ¯(x+1 ) and ri1+ be the correlation between
M(xi) and M(x+1 ) for i = 0, 1, 2. The variance-covariance matrix Σ+ = Σ+M + Σ+
takes the form
Σ+ = τ 2








1 r12 r11+
r12 1 r21+
r11+ r21+ 1








+








σ21
n1
0 ρ1
σ1σ1+
n1
0 σ
2
2
n2
0
ρ1
σ1σ1+
n1
0 σ
2
1+
n1








=




Σ b
bᵀ c



 ,
where Σ is the 2 × 2 covariance matrix of the vector (Y¯1, Y¯2)ᵀ, b is a 2 × 1 vec-
tor and c = τ 2 + σ21+/n1. The vector containing covariances between M(x0) and
(
M(x1),M(x2),M(x+1 )
)ᵀ is
Σ+M(x0, ·) = τ 2








r01
r02
r01+








=




ΣM(x0, ·)
τ 2r01+



 .
69
The new predictor at x0 from the GESK model is
Ŷ+(x0) = Ŷ(x0) +
1
v
[
bᵀΣ−1(Y¯ − β012)− (Y¯+1 − β0)
] [
ΣM(x0, ·)ᵀΣ−1b− τ 2r01+
]
,
(3.10)
where Ŷ(x0) is defined in (3.8) and v = c− bᵀΣ−1b.
The following theorem provides an expression for MSE(Ŷ+(x0)) and conditions
under which the GESK predictor in (3.10) has smaller MSE than that in (3.8).
Theorem 3.1. The MSE of the predictor in (3.10) can be expressed as
MSE
(
Ŷ+(x0)
)
= MSE
(
Ŷ(x0)
)
+
(ζ21
v2 −
1
v
)
[
ΣM(x0, ·)ᵀΣ−1b− τ 2r01+
]2 , (3.11)
and the GESK predictor has a smaller MSE if ζ21 < v.
Proof. The MSE of GESK predictor MSE
(
Yˆ+(x0)
)
follows from straightforward
calculation and details are provided in the Appendix. Both Σ+ and Σ are variance-
covariance matrices, and
det
(
Σ+
)
= det (Σ) det
(
c− bᵀΣ−1b
)
= v det (Σ) ,
and it follows that v > 0 since both det(Σ+) and det(Σ) are positive. Since v > 0,
the condition ζ21 < v is well defined. Under this condition, the GESK predictor has
a smaller MSE than the stochastic kriging predictor.
Theorem 3.1 provided the change in MSE at a prediction point x0:
∆MSE =
(ζ21
v2 −
1
v
)
[
ΣM(x0, ·)ᵀΣ−1b− τ 2r01+
]2 .
We summarize our findings in this setting as follows:
70
1. The bias ζ1 must satisfy ζ21 < v as shown in Theorem 3.1 to guarantee reduction
in MSE. Since ζ1 is proportional to (∆x)2, intuitively the step size should be
relatively small.
2. The greater the correlation ρ1, the greater the reduction in MSE. The param-
eter ρ1 also depends on the correlation between j(x1) and δj(x1), namely, the
simulation noise of output responses and gradient estimators. The parameter
ρ1 increases as the correlation between j(x1) and δj(x1) increases.
3. The parameter σ21+ represents the noise in an extrapolated response Yj(x
+
1 ).
The reduction in MSE is greater if σ21+ is smaller.
All conditions seem to prefer using smaller step sizes. However, other diffi-
culties arise if the step sizes are too small: first, because the quantity v becomes
smaller as ∆x becomes smaller, the condition ζ21 < v may not hold; second, as ∆x
approaches zero, the correlation ρ1 approaches 1 and this may make the matrix Σ+
ill-conditioned, which leads to numerical issues.
3.2.2 A k -Point Problem
In this section, we consider a tractable problem with k design points, where
xi ∈ Rd, under the following assumptions:
1. Along with the response outputs Yj(xi), gradient estimators Gj(xi) are also
collected from simulations at design points {xi}ki=1.
2. Only one additional response is extrapolated in the neighborhood of each
71
design point.
3. The trend f(xi)ᵀβ = β0 and all parameters (β0, τ 2,θ) are known.
Within the region of interest, an additional response Y+j (xi) is extrapolated using
each pair of observations (Yj(xi),Gj(xi)). All extrapolated design points should be
in the interior of the design region; therefore, extrapolations from design points at
the boundary should be done cautiously. Let Y¯+ be the 2k × 1 vector that consists
of sample means of all response outputs
Y¯+ = (Y¯1, Y¯2, . . . , Y¯k, Y¯+1 , Y¯+2 , . . . , Y¯+k )ᵀ,
where Y¯i = Y¯(xi) and Y¯+i = Y¯(x+i ). The sample mean of original responses at
xi are modeled as in Section 3.2. The sample mean of extrapolated responses are
modeled similarly, i.e., Y¯(x+i ) = β0 + M(x+i ) + ¯(x+i ).
Let ρi denote the correlation between ¯(xi) and ¯(x+i ). The spatial correlations
between original design points and extrapolated design points are denoted as ril =
Corr[M(xi),M(xl)], ril+ = Corr[M(xi),M(x+l )] and ri+l+ = Corr[M(x+i ),M(x+l )] for
i, l = 1, 2, . . . , k. The 2k × 2k variance-covariance matrix Σ+ = Σ+M + Σ+ can be
expressed in a block form
Σ+ =




Σ B
Bᵀ C



 , (3.12)
72
where
Σ =












τ 2 + σ21/n1 r12 · · · r1k
r12 τ 2 + σ22/n2 · · · r2k
... ... . . . ...
r1k r2k · · · τ 2 + σ2k/nk












,
B =












τ 2r11+ + ρ1
σ1σ1+
n1
τ 2r12+ · · · τ 2r1k+
τ 2r12+ τ 2r22+ + ρ2
σ2σ2+
n2
· · · τ 2r2k+
... ... . . . ...
τ 2r1k+ τ 2r2k+ · · · τ 2rkk+ + ρk
σkσk+
nk












,
C =












τ 2 + σ21+/n1 τ 2r1+2+ · · · τ 2r1+k+
τ 2r1+2+ τ 2 + σ22+/n2 · · · τ 2r2+k+
... ... . . . ...
τ 2r1+k+ τ 2r2+k+ · · · τ 2 + σ2k+/nk












.
Given a prediction point x0, let Σ+M(x0, ·) be a 2k × 1 vector that consists of the
spatial covariances between x0 and all design points,
Σ+M(x0, ·) =
(
ΣM(x0,x1), . . . ,ΣM(x0,xk),ΣM(x0,x+1 ), . . . ,ΣM(x0,x+k )
)ᵀ
=
(
ΣᵀM(x0, ·) Σ
ᵀ
M+(x0, ·)
)ᵀ ,
where the both ΣM+(x0, ·) and ΣM+(x0, ·) are k × 1 vectors.
As in the analysis of the two-point problem, an important issue to address
is the approximation error introduced by extrapolation. Let the noise terms (x+i )
73
at x+i follow normal distributions with means ζi = ζ(xi), which implies that the
additional response outputs Yj(x+i ) are biased if ζi 6= 0. We will analyze the effects
of incorporating them in the following. Let the vector ζ ∈ R2k be
ζ = (0, 0, . . . , 0, ζ1, ζ2, . . . , ζk)ᵀ = (0ᵀk ζ
ᵀ
k)
ᵀ ,
which represents the expectation of the 2k × 1 noise vector ¯+.
Let Ŷ+(x0) be the GESK predictor at x0. The MSE of the GESK predictor
for this k-point problem is
MSE
(
Ŷ+(x0)
)
= Σ+M(x0,x0)−Σ+M(x0, ·)ᵀ
[
Σ+M + Σ+
]−1
Σ+M(x0, ·)
+
(
Σ+M(x0, ·)ᵀ
[
Σ+M + Σ+
]−1
ζ
)2
= MSE
(
Ŷ2k(x0)
)
+
(
Σ+M(x0, ·)ᵀ
[
Σ+M + Σ+
]−1
ζ
)2
.
(3.13)
The first term MSE
(
Ŷ2k(x0)
)
is the MSE of prediction that one would obtain if
unbiased responses are collected at 2k design points, namely, running simulations
at x+i to collect response estimates rather than extrapolating additional response
estimates. The second term is the inflation of MSE caused by approximation errors
ζ in the additional extrapolated responses.
Let Ŷ(x0) be the stochastic kriging predictor with k design points. Our interest
is to compare the MSE of the GESK predictor Ŷ+(x0) with that of Ŷ(x0). To achieve
this, we begin by looking into the MSE of Ŷ2k(x0).
Using the Woodbury matrix identity and block inverse formula in linear alge-
bra, the MSE of Ŷ2k(x0) can be expressed as
MSE
(
Ŷ2k(x0)
)
= Σ+M(x0,x0)−Σ+M(x0, ·)ᵀ
(
Σ+
)−1
Σ+M(x0, ·)
= MSE
(
Ŷ(x0)
)
− ωᵀVω,
(3.14)
74
where ω = BᵀΣ−1ΣM(x0, ·)−ΣM+(x0, ·) and V = (C−BᵀΣ−1B)−1.
Lemma 3.2. The matrix V = (C−BᵀΣ−1B)−1 is positive definite.
Proof. Consider the 2k × 2k covariance matrix Σ+,
Σ+ =




Σ B
Bᵀ C



 .
First it is easy to see that Σ+ is positive definite, so
[
u v
]




Σ B
Bᵀ C








u
v



 > 0,
for any k × 1 vector u,v ∈ Rk. This leads to
uᵀΣu + 2vᵀBᵀu + vᵀCv > 0.
For a fixed vector v, consider f(u) = uᵀΣu + 2vᵀBᵀu + vᵀCv as a function of u.
The first-order condition shows that the minimum of f(u) is
min
u
f(u) = vᵀ(C−BᵀΣ−1B)v,
which has to be positive for any v ∈ Rk. Therefore the matrix C − BᵀΣ−1B is
positive definite and its inverse V = (C−BᵀΣ−1B)−1 is also positive definite.
Since the matrix V is positive definite, it follows immediately that
MSE
(
Ŷ2k(x0)
)
= MSE
(
Ŷ(x0)
)
− ωᵀVω ≤ MSE
(
Ŷ(x0)
)
,
where equality only holds if and only if ω = 0. Thus, not surprisingly, the MSE is
reduced if the k additional response outputs are unbiased.
75
Next we investigate the effect of the extrapolated bias on the overall MSE.
Combining (3.13) and (3.14) gives
MSE
(
Ŷ+(x0)
)
= MSE
(
Ŷ2k(x0)
)
+
(
Σ+M(x0, ·)ᵀ
[
Σ+M + Σ+
]−1
ζ
)2
= MSE
(
Ŷ(x0)
)
− ωᵀVω +
(
Σ+M(x0, ·)ᵀ
[
Σ+M + Σ+
]−1
ζ
)2
= MSE
(
Ŷ(x0)
)
− ωᵀVω +
((
ΣᵀM+(x0, ·)V −Σ
ᵀ
M(x0, ·)Σ−1BV
)
ζk
)2
= MSE
(
Ŷ(x0)
)
− ωᵀVω + (ωᵀVζk)2
= MSE
(
Ŷ(x0)
)
+ ωᵀVζkωᵀVζk − ωᵀVω
= MSE
(
Ŷ(x0)
)
+ ωᵀVζkζᵀkVω − ωᵀVω
= MSE
(
Ŷ(x0)
)
+ ωᵀ (VζkζᵀkV −V)ω.
(3.15)
The next theorem provides a sufficient condition under which MSE
(
Ŷ+(x0)
)
is
smaller than MSE
(
Ŷ(x0)
)
.
Theorem 3.3. Let λi(A) denote the ith largest eigenvalue of matrix A, i = 1, 2, . . . , k.
The symmetric matrix W = VζkζᵀkV −V is negative definite if
ζᵀkζk ≤
λk(V)
[λ1(V)]2
. (3.16)
Proof. Using Weyl’s inequality in matrix theory and Corollary 11 in [53], the largest
eigenvalue λ1(W) of W satisfies
λ1(W) = λ1 (VζkζᵀkV −V)
≤ λ1 (VζkζᵀkV) + λ1(−V)
= λ1 (VζkζᵀkV)− λk(V)
≤ [λ1(V)]2 λ1(ζkζᵀk)− λk(V).
76
The k × k matrix ζkζᵀk is known as a dyad, which has one positive eigenvalue ζ
ᵀ
kζk
and k−1 zero eigenvalues provided that ζk 6= 0. It follows that the largest eigenvalue
of ζkζᵀk is λ1(ζkζ
ᵀ
k) = ζ
ᵀ
kζk. Applying condition (3.16), we have λ1(W) < 0, namely,
the largest eigenvalue of W is negative, so all eigenvalues of W are negative, and
therefore the matrix W is negative definite.
When the matrix W is negative definite, the quantity ωᵀWω is always neg-
ative unless ω = 0, so the GESK model reduces MSE for the k-point problem if
(3.16) holds.
Remark 3.4. The condition in (3.16) is well defined, as the matrix V is shown
to be positive definite in Lemma 3.2. Theorem 3.3 shows that when the biases are
relatively small, the reduction in MSE from including the additional extrapolated
points still exceeds the inflation in MSE introduced from the bias of the extrapolated
points.
In addition to assumptions in Section 3.2.2, we also assume that the k design
points are widely spread such that the spatial correlation between design points
is approximately 0. A similar assumption is used in [25] to isolate the impacts of
incorporating gradient estimators from spatial covariances. This implies that the
matrix Σ in (3.12) is a diagonal matrix. As the step size ∆x is usually small, we
assume the same property holds for B and C in (3.12) also. The change in MSE of
the GESK predictor is
∆MSE = ωᵀ (VζkζᵀkV −V)ω,
77
where ω = BᵀΣ−1ΣM(x0, ·) − ΣM+(x0, ·) and V = (C−BᵀΣ−1B)−1. The effects
of ∆x, ρi and σ2i+ are summarized as follows:
1. Theorem 3.3 suggests that the quantity ζᵀkζk needs to be small enough to
guarantee that GESK can reduce MSE. This condition requires the step size
∆xj in each dimension to be small.
2. Regarding the correlation ρi, the preferable sign of the correlation actually
depends on the location of the prediction point x0. A condition between ∆x
and xi−x0 determines the preferable sign of ρi. A specific analytical form for
the condition depends on the type of correlation function, larger |ρi| is better
in each favorable case.
3. If the correlation ρi ≈ 0, smaller σ2i+ is preferable, since it suggests that there
is less noise in the extrapolated responses. The same conclusion holds when
the correlation ρi is positive. However, if the correlation ρi < 0, smaller σ2i+
is not necessarily better, as there exists an optimal σ2i+ that reduces MSE the
most.
Analyzing the effects of step size on MSE in a general setting is more difficult,
especially in multidimensional problems. For example, step size used in a multidi-
mensional problem may be different along different directions. Choosing good step
sizes is crucial for building the GESK models. In the next section, we propose two
different approaches to determine the optimal step size.
78
3.3 Implementations of GESK
In this section, we focus on two important questions in the implementation of
the GESK model: choosing step sizes and choosing gradient estimators. We pro-
vide two different techniques for determining steps sizes and discuss their pros and
cons. We also make recommendations between the infinitesimal perturbation anal-
ysis (IPA) and the likelihood ratio/score function (LR/SF) techniques for gradient
estimation.
As discussed in the previous section, central to building a GESK model is
the determination of appropriate step sizes. A good choice of step size is crucial
to the performance of the GESK model. Different step sizes, even with the same
data set, may lead to dramatical performance differences of GESK models. The
linear extrapolation used in GESK is only appropriate in a small neighborhood of
the design points, so the step size cannot be too large. If the step size is too small,
the additional points obtained from linear extrapolations provide little information
and it may cause numerical stability issues.
Two natural choices for determining step sizes are maximum likelihood esti-
mation (MLE) and minimizing integrated mean squared error (IMSE). However,
the MLE approach is not suitable for determining step sizes in this context. The
MLE approach leads to step sizes as small as possible, which results in numerical
stability issues when building the GESK model. One unique characteristic of the
GESK model is that biases are introduced during extrapolations. Although the
biases are unknown, they should be taken into consideration during parameter esti-
79
mations. To accomplish this, penalty terms are introduced in MLE and IMSE. We
formalize two approaches for determining step sizes: penalized maximum likelihood
estimation and minimizing integrated mean squared error.
3.3.1 Penalized Maximum Likelihood Estimation
One natural choice for determining the step size ∆x is to treat it as a new
parameter in addition to the other parameters (β, τ 2,θ). Under Assumption 1 in
[21], we can write down the likelihood function. However, as mentioned earlier, naive
MLE is not suitable for choosing step sizes in this case. Assuming the correlation
function in (3.5) is used, we propose a penalized maximum likelihood method where
the penalized likelihood function takes the following general form:
Q(β, τ 2,θ,∆x) = − ln
[
(2pi)k
]
−
1
2 ln
[∣
∣Σ+M + Σ+
∣
∣
]
−
1
2(Y¯
+ − F+β)ᵀ
[
Σ+M + Σ+
]−1 (Y¯+ − F+β)− pλ(∆x),
where pλ(·) is a given nonnegative penalty function with a regularization param-
eter λ. Common choices of penalty functions include L1 penalty, L2 penalty and
smoothly clipped absolute deviation (SCAD).
The proposed penalty function is
pλ(∆x) = λ||∆x||−2,
where ∆x = (∆x1,∆x2, . . . ,∆xd) and ||∆x||−2 :=
d∑
i=1
(∆xi)−2. Therefore the pro-
80
posed penalized likelihood function is
Q(β, τ 2,θ,∆x) = − ln
[
(2pi)k
]
−
1
2 ln
[∣
∣Σ+M + Σ+
∣
∣
]
−
1
2(Y¯
+ − F+β)ᵀ
[
Σ+M + Σ+
]−1 (Y¯+ − F+β)− λ||∆x||−2. (3.17)
Penalized maximum likelihood estimation (PMLE) has been used to do vari-
able selection [54] and overcome flat likelihood function issues [55] for kriging. One
key difference is that previous PMLE approaches try to improve the quality of esti-
mates for (β, τ 2,θ), while we propose to use PMLE for choosing step sizes.
3.3.2 Minimizing Integrated MSE
Another view is to connect the problem of finding step sizes with design of
experiments (DOE). Choosing step sizes is similar to adding new design points in
DOE. In deterministic and stochastic kriging literature, many criteria have been
proposed to find the “best” experiment design, most of which are based on MSE.
Using integrated mean squared error (IMSE) as the objective function, the problem
can be formulated as
Minimize
∆x
IMSE =
∫
x0∈Ω
MSE+
(
Ŷ(x0; ∆x)
)
dx0, (3.18)
where Ω is the region of interest.
Lower IMSE suggests smaller deviation associated with the approximation
over the region of interest. In practice, a penalty term involving step sizes is added
81
to MSE+,
MSE+(Ŷ(x0; ∆x)) = Σ+M(x0,x0)−Σ+M(x0, ·)ᵀ[Σ+M + Σ+ ]−1Σ+M(x0, ·) + λ||∆x||2,
(3.19)
where the Euclidean norm of ∆x is used. Adding a penalty term that is proportional
to ||∆x||2 in MSE+ follows the discussion earlier.
The PMLE approach estimates all the parameters (β, τ 2,θ) with ∆x simulta-
neously. However, the IMSE approach requires τ 2 and θ to be known in advance.
In practice, a two-stage strategy is proposed to address this issue:
1. In Stage 1, use the original dataset
{
xi, Y¯(xi)
}k
i=1 to obtain MLEs for (βˆ, τˆ
2, θˆ).
2. Calculate MSE+
(
Ŷ(x0; ∆x)
)
in (3.18) with the estimated (βˆ, τˆ 2, θˆ) and a
predetermined penalty constant λ.
3. In Stage 2, minimize the IMSE in (3.19) to find the optimal step size.
In our implementation, we use the optimization routine fmincon in Matlab.
3.3.3 Choosing Regularization Parameter
The question of selecting the regularization parameter λ in both approaches
remains to be answered. We propose to use cross validation (CV), which is widely
used in statistics and machine learning community, to choose the regularization
parameters. Cross validation allows us to assess the performance with different
regularization parameters without running additional simulations. When a J-fold
cross validation is applied for a given regularization parameter λ, a corresponding
82
score CV(λ) can be calculated. Given the design points {xi}ki=1 with the averaged
simulation output {Y¯(xi), G¯(xi)}ki=1, the CV score is calculated as follows:
1. Split the dataset D = {xi, Y¯(xi), G¯(xi)}ki=1 into J subsets D1,D2, . . . ,DJ . All
the design points located on the boundary are handled separately so that
extrapolated design points are still in the space of interest.
2. For j = 1, 2, . . . , J , choose all the design points in Dj as prediction points and
build a GESK model using D\Dj. Predict the response Ŷ(x) for every x ∈ Dj.
3. Compute the CV score for a given parameter λ as a sum of squared errors
between the prediction Ŷ(xi) and the averaged output Y¯(xi) on Dj,
CV(λ) =
J∑
j=1
∑
(xi,Y¯i)∈Dj
(
Y¯i − Ŷ(xi)
)2
.
To start the cross validation, we need to choose a set of regularization parameters
Λ = {λ1, λ2, . . . , λL}. We compute the CV score for each λl ∈ Λ and choose the
best regularization parameter λ∗ as
λ∗ = arg min
λl∈Λ
CV(λl).
Regarding the set of parameters Λ, if Λ contains a sufficiently large number
of points and computational time is not an issue, cross validation will choose the
best regularization parameter for building a GESK model. In practice, one way
to choose the candidate parameter linearly on a logarithmic scale, for example.
10−1, 100, . . . , 103. This is essentially an optimization problem. When the size of
the set Λ is small, an exhaustive search algorithm can find λ∗ easily. If the size of
83
Λ get large, randomized search algorithm can be applied as well. In our numerical
experiments, CV score is computed for each parameter in Λ to find λ∗.
The performance of the two proposed approaches, PMLE and IMSE, will be
investigated via numerical examples in Section 3.4 using different test problems. We
summarize their main features and differences as follows:
• Both methods need a regularization parameter to be calibrated, as unknown
biases are taken into consideration in both methods. We propose using cross
validation methods to determine regularization parameters.
• PMLE uses a penalty function to overcome small step size issues in using
naive MLE. The PMLE approach takes biases into consideration, but does not
guarantee good performance in MSE or IMSE. However, in high-dimensional
problems, maximizing a penalized likelihood is usually computationally faster
than integrating MSE.
• IMSE minimizes the IMSE over the design region, but, in high-dimensional
problems, numerical integrations become computationally expensive, requiring
Monte Carlo methods with long computation time.
3.3.4 Choosing Gradient Estimators
In this chapter, we only consider direct gradient estimators. Specifically, we
focus on the infinitesimal perturbation analysis (IPA) and likelihood ratio/score
function (LR/SF) methods. Under mild conditions, both techniques are able to
provide unbiased gradient estimators, but we would like to know which technique is
84
preferable in building a GESK model, provided that both methods are applicable.
Observations made in [56] and [25] suggest the following: (i) at a given point,
correlations between the responses and the corresponding gradient estimates are
higher when IPA is applied, (ii) IPA gradient estimators usually have smaller vari-
ances than LR/SF estimators, (iii) IPA gradient estimators have better performance
when applied in stochastic kriging with gradient estimators (SKG).
Discussions in Sections 3.2.1 and 3.2.2 suggests that the GESK model prefers
gradient estimators that are highly correlated with response estimates and have
smaller variances. Therefore, under most settings, IPA gradient estimators are
preferable to build a GESK model. IPA gradient estimators are employed in the
M/M/1 example conducted in Section 3.4.
3.4 NUMERICAL EXAMPLES
In this section, several numerical experiments are conducted to illustrate the
proposed GESK model. Our goal in this section is three-fold: To demonstrate the
effects of different step sizes on the performance of the GESK model; to empiri-
cally compare the effectiveness of the PMLE and IMSE approaches in determining
step sizes; to examine the performance of the GESK model in different settings and
compare it with stochastic kriging [21] and stochastic kriging with gradient estima-
tors (SKG) [25]. Implementation of SKG and GESK are built upon software for
stochastic kriging downloaded from http://www.stochastickriging.net.
Across all experiments, we assume little information is known about the re-
85
sponse surface and choose constant trends for all models, i.e., f(x)ᵀβ = β0. A
Gaussian correlation function RM(x,x′) = exp{−θ‖x − x′‖2} is used for all the
experiments, since it satisfies the conditions required by SKG.
We implemented both PMLE and IMSE approaches discussed in Section ??
to determine step sizes, together with the cross validation method to choose regu-
larization parameters. The corresponding GESK models are named GESK-PMLE and
GESK-IMSE. The measure of performance we chose is the Empirical IMSE (EIMSE),
as used in [57] and other kriging literature:
EIMSE = 1N
N∑
i=1
(
Ŷ(xi)− Y(xi)
)2
, (3.20)
where N is the number of predictions, Ŷ(xi) is the predicted response at xi and
Y(xi) is the true value at xi.
3.4.1 Experiment on Step Sizes in GESK
We investigate the effects of using different step sizes in the GESK models
using an M/M/1 queue example [58]. The M/M/1 queue has arrival rate 1 and
service rate x ∈ [1.1, 2]. We are interested in the steady-state expected waiting time
y(x), which has an analytical solution y(x) = 1/(x(x− 1)). In our simulation, each
sample path was initialized in steady state and simulated for 5000 customers. The
outputs collected were the average waiting time and its derivative with respect to
the service rate x.
Six different experiment designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200),
(20, 200), were used in the experiment, where the first element in each pair gives the
86
Design GESK-1 GESK-2 GESK-3 GESK-PMLE GESK-IMSE
(6, 50) 0.085 (0.0036) 0.167 (0.0035) 0.618 (0.0149) 0.042 (0.0020) 0.027 (0.0017)
(6, 200) 0.094 (0.0023) 0.181 (0.0019) 0.731 (0.0064) 0.034 (0.0011) 0.024 (0.0010)
(6, 1000) 0.092 (0.0011) 0.180 (0.0010) 0.747 (0.0037) 0.038 (0.0006) 0.021 (0.0004)
(8, 200) 0.006 (0.0004) 0.016 (0.0006) 0.194 (0.0023) 0.005 (0.0003) 0.002 (0.0002)
(10, 200) 0.006 (0.0005) 0.007 (0.0011) 0.017 (0.0012) 0.007 (0.0007) 0.004 (0.0003)
(20, 200) 0.002 (0.0008) 0.006 (0.0003) 0.048 (0.0020) 0.003 (0.0003) 0.001 (0.0001)
Table 3.1: Averaged EIMSE from 100 macroreplications for GESK models under
six designs (# of design points, # of reps) to predict expected waiting time in the
M/M/1 queue example. Three fixed step sizes with those determined by PMLE and
IMSE are compared. Standard errors are shown in parenthesis.
number of design points and the second element represents the number of replica-
tions at each design point.
With equally spaced design points, three predetermined step sizes were chosen
for each design, which correspond to 1/10, 1/5 and 1/2 of the length of the subin-
terval. GESK models built with these step sizes are labelled as GESK-1, GESK-2 and
GESK-3, respectively. We ran the experiments for 100 macroreplications. Within
each macroreplication, we chose N = 1000 to estimate the EIMSE in (3.20). Table
3.1 shows the sample mean and standard errors of EIMSE, and Figure 3.1 contains
boxplots for the EIMSE.
Our findings are summarized as follows:
• Predetermined step sizes vs. Optimal step sizes. Performances of
87
GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE
0.0
0.2
0.4
0.6
0.8
1.0
1.2
E
I
M
S
E
Design(6, 50) Design(6, 200) Design(6, 1000)
(a) Design (6, 50), (6, 200), (6, 1000)
GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE
0.00
0.05
0.10
0.15
0.20
0.25
E
I
M
S
E
Design(8, 200) Design(10, 200) Design(20, 200)
(b) Design (8, 200), (10, 200), (20, 200)
Figure 3.1: Boxplots of EIMSE from 100 macroreplications for the GESK models
under six designs (# of design points, # of reps) to predict expected steady-state
waiting time in the M/M/1 queue example.
88
the two optimal step sizes are better than those of predetermined step sizes,
especially when the number of design points is small. This is expected, as
the choice of step sizes should adapt to the experiment design and simulation
output.
• PMLE vs. IMSE. The performance of IMSE is better than that of PMLE
under most experiment designs, in terms of having smaller averaged EIMSE,
smaller variances of EIMSE and smaller number of outliers. Figure 3.2 shows
boxplots for step sizes determined by PMLE and IMSE under all six designs.
• Effect of number of design points. When the number of design points is
small, for example k = 6, improvements in EIMSE are more significant. How-
ever, when there are already enough design points, improvements are hardly
noticeable. In addition, for both PMLE and IMSE, the relative step size (ra-
tio to the size of the subinterval) generally increases as the number of design
points increases.
• Effect of number of replications. As the number of replications increases,
the variances of EIMSE become smaller as shown in Table 3.1 and Figure 3.1.
However, changes in the averaged IMSE are not significant. Variances of the
chosen step sizes seem to decrease as well, as shown in Figure 3.2.
89
0.004 0.006 0.008 0.010 0.012 0.014 0.016
Step sizes
Design (6, 50)
Design (6, 200) 
Design (6, 1000)
Design (8, 200) 
Design (10, 200)
Design (20, 200)
(a) Step sizes determined by PMLE
0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008
Step sizes
Design (6, 50)
Design (6, 200) 
Design (6, 1000)
Design (8, 200) 
Design (10, 200)
Design (20, 200)
(b) Step sizes determined by IMSE
Figure 3.2: Boxplots for step sizes determined by PMLE and IMSE based on 100
macroreplications under six designs (# of design points, # of reps) to predict the
expected steady-state waiting time in the M/M/1 queue example.
90
3.4.2 Comparisons among SK, SKG and GESK
In this section, we will compare the performances of three different meta-
models: stochastic kriging, stochastic kriging with gradient estimators (SKG) and
gradient extrapolated stochastic kriging (GESK) in three different experiments. The
theoretical analysis of GESK in Section ?? assumes that all parameters are known,
whereas the comparison here is empirical when all parameters must be estimated.
3.4.2.1 A stochastic simulation example
We used the same M/M/1 queue example as in Section 3.4.1. Six different ex-
periment designs were adopted as well. We ran the experiments for 100 macrorepli-
cations. Within each macroreplication, we chose N = 1000 to estimate the EIMSE
in (3.20). Results are shown in Table 3.2 and Figure 3.3. These 100 macroreplica-
tions used the same random numbers as those used in Section 3.4.1, so numbers for
the two GESK models in Table 3.3 and corresponding boxplots in Figure 3.4 are
the same as those in Section 3.4.1.
Our findings are summarized as follows:
• SK vs. SKG vs. GESK. Not surprisingly, SKG and GESK perform
better than SK, as incorporating gradient estimators provides more informa-
tion about the response surface. GESK-IMSE and SKG perform better than
GESK-PMLE in most cases, since objective adopted by PMLE is not directly
related to MSE. Both GESK models perform comparably well or better than
the SKG model. It is not easy to distinguish GESK-IMSE and SKG, since
91
Design SK SKG GESK-PMLE GESK-IMSE
(6, 50) 0.313 (0.0134) 0.031 (0.0036) 0.042 (0.0020) 0.027 (0.0017)
(6, 200) 0.324 (0.0062) 0.016 (0.0007) 0.034 (0.0011) 0.024 (0.0010)
(6, 1000) 0.328 (0.0027) 0.016 (0.0003) 0.038 (0.0006) 0.021 (0.0004)
(8, 200) 0.054 (0.0019) 0.002 (0.0003) 0.005 (0.0003) 0.002 (0.0002)
(10, 200) 0.009 (0.0004) 0.004 (0.0014) 0.007 (0.0007) 0.004 (0.0003)
(20, 200) 0.004 (0.0002) 0.004 (0.0004) 0.003 (0.0003) 0.001 (0.0001)
Table 3.2: Averaged EIMSE from 100 macroreplications for SK, SKG and GESK
with six different designs on estimating the expected steady-state waiting time in
an M/M/1 queue problem. The design (6, 50) means 6 design points with 50 repli-
cations at each design point. Standard errors are shown in parentheses.
92
their performances are really close.
• Number of design points. Incorporating gradient estimators improves per-
formance considerably when the design points are sparse. For example, both
SKG and GESK have more significant improvement over stochastic kriging
when k = 6. As the number of design points increases, performance of most
models improves.
• Number of replications. As the number of replications increases with a
fixed number of design points, the variance of EIMSE decreases for all three
methods, as shown in Figures 3.3(a). However, the averaged EIMSE does not
improve significantly.
3.4.2.2 A stylized example with added noise
We consider a one-dimensional example from [23], where the true response
surface is Y(x) = exp(−1.4x) cos(7pix/2) with x ∈ [−2, 0]. The presence of mul-
tiple local extreme values on the response surface makes building a good meta-
model difficult. The simulation response output at x from replication j is Yj(x) =
exp(−1.4x) cos(7pix/2) + j(x), with j(x) ∼ N (0, 1). Direct gradient estimates are
assumed of the form Gj(x) = Y′(x) + δj(x) as the gradient estimate at x from sim-
ulation replication j, with δj(x) ∼ N (0, 25). We let δj(x) have a larger variance in
order to empirically investigate the performances of SKG and GESK when gradient
estimates are noisier.
We ran the experiments for 100 macroreplication. Within each macroreplica-
93
Design SK SKG GESK-PMLE GESK-IMSE
(6, 50) 39.616 (0.0374) 2.044 (0.0106) 1.909 (0.0181) 1.828 (0.0111)
(6, 200) 39.586 (0.0192) 2.023 (0.0049) 1.830 (0.0091) 1.757 (0.0050)
(6, 1000) 39.581 (0.0084) 7.652 (0.8823) 1.829 (0.0033) 1.758 (0.0023)
(8, 200) 2.793 (0.0039) 0.069 (0.0009) 0.063 (0.0026) 0.068 (0.0025)
(10, 200) 0.949 (0.0026) 1.243 (0.4204) 0.178 (0.0008) 0.012 (0.0007)
(20, 200) 0.008 (0.0002) 0.001 (0.0001) 0.046 (0.0004) 0.004 (0.0004)
Table 3.3: Averaged EIMSE from 100 macroreplications for SK, SKG and GESK
with five different designs on y(x) = exp(−1.4x) cos(7pix/2)+. Standard errors are
shown in parentheses.
tion, we chose N = 1000 to estimate the EIMSE in (3.20). Six different experiment
designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200) and (20, 200) were adopted,
with results shown in Table 3.3 and Figure 3.4. Notice that ln(EIMSE) values are
shown in Figure 3.4, as EIMSE results from the three models differ substantially.
• SK vs. SKG vs. GESK. As shown in Figure 3.4, both the SKG and the
GESK models are better than SK when there is a limited number of design
points. The GESK models perform better than SKG when k = 6. The expla-
nation is that the response surface has several fluctuation and extrapolation
allows GESK models to explore and approximate the response surface better
than the others. SKG performs better when there are enough design point, for
example, k = 20. SKG experiences numerical issues under designs (6, 1000)
and (10, 200).
94
• Number of replications. When the number of replications increases, the
variance of EIMSE decreases in general, as shown in Table 3.3 and Fig-
ure 3.4(a), except for SKG in design (6, 1000). However, the averaged EIMSE
does not change much as the number of replications increases, similar to the
M/M/1 queue example.
• Number of design points. We fixed the number of replications at 200
and increases the number of design points up to 20. Boxplots are shown in
Figures 3.4(b). EIMSE results for all models improve as the number of design
points increases, with the exception of SKG and GESK-PMLE with design
(10, 200).
• Step sizes. Step sizes determined by the PMLE and IMSE approaches are
shown in Figure 3.5. The plots suggest relationships between experiment de-
signs and step sizes: (i) relative step size (ratio to the size of the subinterval)
increases generally when the number of design points increases, (ii) the vari-
ability of step sizes decreases as the number of replications increases.
• Remark. Performance of SKG with design (10, 200) shown in Table 3.3 and
Figure 3.4(b) doesn’t seem to match each other. The reason is that several
outliers outside of the range shown in Figure 3.4(b) are omitted.
95
3.4.2.3 A multidimensional example
Lastly, we consider a stylized multidimensional example to test the perfor-
mance of GESK models, especially when gradient estimates have much larger vari-
ances. We consider the sphere function defined by Y(x) =
2∑
i=1
x2i +
4∑
i=3
10x2i . We
chose the experimental design space as [−1, 1]4. The simulated response at x =
(x1, x2, x3, x4) from replication j is Yj(x) = Y(x) + j(x) with the noise j(x) ∼
N (0, 1). The gradient estimate with respect to xr at x from replication j is given by
Grj (x) = ∂Y(x)∂xr + δ
r
j (x) with δrj (x) ∼ N (0, 25). The added noise terms are mutually
independent.
We chose two different experiment designs: (20, 500) and (40, 500), which cor-
respond to 20-point and 40-point Latin-hypercube designs with 500 independent
replications at each design point, respectively. We collected simulation responses
Yj(x) and gradient estimates Grj (x) for r = 1, 2, 3, 4, j = 1, 2, . . . , 500 to build
metamodels.
We ran the experiments for 100 macroreplications. Within each replication, we
chose N = 1000 to estimate the EIMSE in (3.20). Figure 3.6 contains boxplots for
the EIMSE calculated from the 100 macroreplications. Our findings are summarized
as follows:
• SKG and both GESK models perform better than SK. As the number of
design points increases, the performances of all models improve. Under design
(20500), GESK-IMSE seems to be the best; under design (40, 500), SKG is
preferred due to its low average and low variance in EIMSE.
96
• Between the two GESK models, PMLE scales better than IMSE for high-
dimensional problems. The IMSE approach requires multidimensional integra-
tions to determine step sizes, which is expensive and depends on the accuracy
of integration approximation in high-dimensional problems.
• Step sizes determined by PMLE are generally much larger than those deter-
mined by IMSE. Along each dimension, step sizes determined by PMLE and
IMSE have similar behavior. If the step size chosen by PMLE is relatively
smaller on one dimension, so is the step size chosen by IMSE. Step sizes cho-
sen for a dimension with higher gradient values are not necessarily smaller
than others.
3.5 CONCLUSIONS AND FUTURE RESEARCH
In this paper we investigated gradient extrapolated stochastic kriging (GESK),
which exploits the availability of direct gradient estimates in stochastic simulation
settings. The performance of the GESK models was analyzed theoretically and nu-
merically, with a focus on analyzing the approximation errors introduced by extrap-
olation. Since step sizes are crucial to GESK models, two methods for determining
step sizes were proposed and tested in numerical examples, which indicated substan-
tial gains in performance over SK in all of the experiments. Between the proposed
PMLE and IMSE approaches for determining step sizes, IMSE demonstrated better
performance in numerical experiments, but it becomes computationally expensive
for high-dimensional problems. The numerical experiments showed comparable per-
97
formance for GESK and SKG, except when the number of design points is very
small, where GESK shows some advantage.
From our analysis and numerical experiments, we offer the following overall
conclusions:
• GESK can be especially effective when the number of design points is relatively
small, e.g., in the setting where simulation is expensive.
• GESK offers additional flexibility in choosing the correlation function; in par-
ticular, differentiability is not a constraint.
• For high-dimensional problems, GESK using the PMLE is recommended, since
its computation does not increase with dimension, whereas the computational
burden increases exponentially for IMSE minimization and at least quadrati-
cally for SKG.
Our work points to several other directions for future research. The first direction
is to focus on the extrapolation strategy in GESK. For this paper, we use linear
extrapolation with the same step size and assume that only one additional point
is extrapolated from each design point. More sophisticated techniques could use
the local response surface information and adaptively determine the extrapolation
strategy. This is especially important in higher-dimensional problems with multiple
extreme values.
Another direction is to further investigate a comparison of the SKG and GESK
models. Improvements from incorporating gradient estimates can be expected from
both models. However, it would be valuable to be able to characterize when one
98
model is likely to be more effective. A theoretical analysis of various properties
comparing the two models can lead to useful guidelines for practitioners.
99
SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE
0.0
0.2
0.4
0.6
0.8
1.0
E
I
M
S
E
Design(6, 50) Design(6, 200) Design(6, 1000)
(a) Design (6, 50), (6, 200), (6, 1000)
SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE
0.00
0.02
0.04
0.06
0.08
0.10
0.12
0.14
E
I
M
S
E
Design(8, 200) Design(10, 200) Design(20, 200)
(b) Design (8, 200), (10, 200), (20, 200)
Figure 3.3: Boxplots of EIMSE from 100 macroreplications for SK, SKG and GESK
with six different designs on estimating the expected steady-state waiting time in
an M/M/1 queue problem, corresponding to results in Table 3.2.
100
SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
l
o
g
(
E
I
M
S
E
)
Design(6, 50) Design(6, 200) Design(6, 1000)
(a) Design (6, 50), (6, 200), (6, 1000)
SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE
−8
−6
−4
−2
0
2
l
o
g
(
E
I
M
S
E
)
Design(8, 200) Design(10, 200) Design(20, 200)
(b) Design (8, 200), (10, 200), (20, 200)
Figure 3.4: Boxplots of EIMSE from 100 macroreplications for SK, SKG and GESK
with five different designs on y(x) = exp(−1.4x) cos(7pix/2) + , corresponding to
results in Table 3.3.
101
0.000 0.005 0.010 0.015 0.020
Step sizes
Design (6, 50)
Design (6, 200) 
Design (6, 1000)
Design (8, 200) 
Design (10, 200)
Design (20, 200)
(a) Step sizes determined by PMLE
0.000 0.002 0.004 0.006 0.008 0.010
Step sizes
Design (6, 50)
Design (6, 200) 
Design (6, 1000)
Design (8, 200) 
Design (10, 200)
Design (20, 200)
(b) Step sizes determined by IMSE
Figure 3.5: Boxplots for step sizes determined by PMLE and IMSE based on100
macroreplication in the stylized example with added noise.
102
SK SKG PMLE IMSE SK SKG PMLE IMSE
−4
−2
0
2
4
6
8
l
o
g
(
E
I
M
S
E
)
Design(20, 500) Design(40, 500)
Figure 3.6: Boxplots of EIMSE from 100 macroreplications for SK and GESK with
two different Latin-hypercube designs on a four-dimensional function with added
noise.
103
Chapter 4: Simulation Selection with Unknown Correlation Struc-
ture
4.1 Introduction
Consider a decision-maker who must identify the best among a finite set of
design alternatives with unknown performance values. Stochastic simulation is
used to estimate the performance of an alternative. More simulation experiments
will produce better estimates; however, these experiments are expensive and time-
consuming, limiting the simulation budget. We must use this budget efficiently to
maximize the quality of the final selection decision.
In many applications, the simulation budget is comparable to, or smaller than,
the number of design alternatives. However, there may be correlations between the
underlying mean performance values. Correlations can potentially allow us to handle
much larger problems: a single piece of information about one alternative can now
be used to learn about other alternatives with “similar” values. However, these
similarities can be difficult to quantify or guess heuristically. Consider the following
examples:
1. Wind farm placement. Given a set of candidate locations for a new wind farm
104
installation, we wish to select the one with the highest average power output.
However, power output depends on volatile wind speeds and other physical
factors like pressure gradient, frictional forces, wind currents, and topographi-
cal features. These factors are difficult to quantify, but simulation can be used
to estimate the net result [59]. Physical and topographical similarities induce
complex correlations between locations.
2. Logistics management. In a vehicle routing problem with service choice [60],
customers can request privilege for early delivery through bidding. The service
provider will accept a set of requests if the total bid price exceeds the additional
cost incurred by deviating from the optimal route. The number of acceptable
sets of requests grows combinatorially. To solve this problem, we have to use
the routing cost computed for one set of requests to infer the costs of other
sets that contain one or more of the same customers.
3. Call center control. A call center administrator assigns agents in shifts to
minimize average call waiting time. The administrator is uncertain about
employee efficiency [61], making it difficult to determine the best assignment.
Simulation can be used to test performances from different assignments. The
performance of two different assignments will be correlated if the assignments
involve the same agents.
Simulation selection procedures consist of a statistical model of the decision-
maker’s estimates of the performance values, and an optimization algorithm for
choosing an alternative to simulate based on the current statistical estimates. In
105
the classical literature on ranking and selection (R&S), estimates are constructed
using frequentist statistics, and decisions are made using the indifference-zone (IZ)
approach pioneered by [62]; see also [63] for an overview of classical results. IZ
methods guarantee asymptotic lower bounds for the probability of correct selec-
tion (PCS), as long as the true underlying performance values are sufficiently far
apart. The best-performing IZ methods include those by [64,65] and [66]. Numerous
reviews and surveys are available, including [67], [68], [69] and [70].
Bayesian models for R&S consider the tradeoff between our estimates of the
performance values and our uncertainty about those estimates. This is known as
the “value of information” approach, going back to [27] and extended in later work.
See [31] or [5] for an overview of value of information procedures (VIP). The optimal
computing budget allocation (OCBA) methodology (see e.g. [71–73]), designed to
maximize a Bayesian version of PCS, can also be included in the Bayesian category.
Both frequentist [66] and Bayesian [29,30] methods are able to handle problems
where the simulation output has unknown variance. However, most work on R&S
typically makes independence assumptions on the estimates of performance values:
under this assumption, a single experiment only provides information about a single
alternative, making it difficult to handle large problems with a small simulation
budget. Correlations have largely been studied in the context of common random
numbers inside simulators; see [74] and [75] for IZ methods in this setting. [76]
considers this problem from the perspective of OCBA.
The present chapter, however, uses the term “correlation” in a broader sense.
In the Bayesian setting, correlations can be used inside a distribution of belief as
106
a measure of the inherent similarities or differences between alternatives (e.g., the
geographical similarities between two wind farm locations). As we show in this
chapter, correlated beliefs can significantly improve the performances of simulation
selection, even when the simulation output is completely independent. Recently,
[77] studied Bayesian R&S with correlated beliefs ( [78] extends this analysis to
include correlated simulation output), but under the restrictive assumption that the
correlation structure was correctly specified by the decision-maker. By contrast,
we develop a model where the correlation structure is unknown, and has to be
learned together with the performance values. Our model has the ability to correct
inaccurate prior beliefs as new information arrives. If the simulation output is
correlated, we can also learn that correlation structures provided that some prior
information about it is available. To our knowledge, [79] is the only work on Bayesian
R&S to consider unknown correlation structures.
Bayesian R&S procedures rely on conjugate prior distributions on the unknown
model parameters in order to maintain computational tractability. The Wishart
distribution is a well-known conjugate prior for an unknown covariance matrix, as-
suming that we can simultaneously observe the performance of every alternative.
See e.g. [80] or [81] for applications of the Wishart distribution in simulation meta-
modeling and input uncertainty. However, in fully sequential R&S, we only sam-
ple from one alternative at a time. There is no standard conjugate prior for this
problem, although the statistics community has made several attempts to create
one; see [82], [83] or [84] for examples. Unfortunately, these models either present
computational difficulties in an R&S setting or cannot extract information about
107
multiple alternatives from a single scalar observation. [79] resolves this problem by
imposing restrictions on the sampling procedure: the simulation budget is allocated
equally among a certain subset of alternatives.
We propose a different approach, where we have the flexibility to simulate
any one alternative at any time. Although our prior is not exactly conjugate, we
create an optimal approximation of conjugacy by minimizing the Kullback-Leibler
divergence between the true posterior and the normal-Wishart distribution, lead-
ing to a computationally efficient learning model. The approximate model enables
us to derive a new VIP that generalizes previous procedures on R&S with known
correlations. We establish intuitive analogies between the new model and classical
statistical results on unknown sampling variance. We also show that, all else be-
ing equal, information has greater value when the correlation structure is unknown,
making it important to consider this uncertainty when allocating the next simulation
experiment.
4.2 Learning Unknown Correlation Structures
Let {1, 2, · · · , K} be a set of alternatives. Let Ŷ be a multivariate normal
random vector in RK with mean µ = (µ1, · · · , µK) and covariance matrix Σ. Our
goal is to discover the alternative x with the largest underlying mean µx. Assuming
that Σ is invertible, we define R = Σ−1 to be the precision matrix of Ŷ. For ease
of computation and presentation, we will work with the precision matrix instead of
the covariance matrix throughout this chapter.
108
The vector Ŷ describes the behavior of K alternatives, all observed concur-
rently. We suppose that both µ and R are unknown. Let yˆx represents the simula-
tion output of the behavior of the xth alternative. The sampling distribution of yˆx
given µ and R is univariate normal with the following probability density function
p(yˆx|µ,R) ∝
1
|e′xR−1ex|
1
2
exp
{
−
(yˆx − µx)2
2e′xR−1ex
}
, (4.1)
where ex = (0, . . . , 1, . . . , 0) is a K × 1 vector, with 1 at the xth component, and 0
at others. The prime denotes transpose.
We allow the precision matrix R to be non-diagonal, implying correlations
between components of Ŷ. As the rest of this section will show, when µ and
R are both unknown, a set of beliefs about R will induce correlations between
our beliefs about different components of µ, implying similarities and differences
between alternatives. We can expect that a single observation yˆx should also provide
some information about other alternatives that are correlated with x. However, the
nature of this information is not clear as the correlation structure is unknown.
4.2.1 Learning from complete observations
Taking the Bayesian viewpoint, we view the unknown mean vector µ and
the precision matrix R as a random vector and a random matrix, respectively. In
accordance with the Bayesian approach, we assume that our prior knowledge about
these unknown quantities is reflected by a prior distribution, which we write as
µ|R ∼ NK(θ0, q0R), R ∼ WK(b0,B0).
109
The precision matrix R is assumed to follow a Wishart distribution parametrized
by a scalar b0 and a K × K matrix B0. The conditional distribution of µ given
R is multivariate normal with mean vector θ0 and precision matrix q0R, where θ0
is a K vector and q0 is a scalar. The probability density function of the Wishart
distribution, see e.g. [85], is given as
p(R) = 1Z(b0,B0) |R|
b0−K−1
2 exp
{
−
1
2 tr(B
0R)
}
,
with a normalizing constant
Z(b0,B0) = piK(K−1)4
∣
∣
∣
∣
B0
2
∣
∣
∣
∣
− b
0
2
K∏
i=1
Γ
(b0 + 1− i
2
)
.
Therefore the joint prior distribution of µ and R is
p0(µ,R) = 1Z(b0,B0) |R|
b0−K−1
2 exp
{
−
1
2 tr(B
0R)
}( q0
2pi
)K
2
|R|
1
2 exp
{
−
q0
2 (µ− θ
0)′R(µ− θ0)
}
.
The Wishart distribution has the property that E(R) = b0(B0)−1, whence
E(Σ) = B0b0−K+1 . The matrix B0 can be viewed as a generalized “sum of squares”.
If the prior parameters are constructed from historical data (known as a “first-stage
sample” in [79]), the diagonal entries of B0 will be the sums of squared deviations
of the first-stage observations from their means. The scalar b0 is analogous to the
size of the first-stage sample, so that B0b0−K+1 is precisely the empirical covariance
matrix constructed from the first-stage data. The parameter q0 is also analogous to
a sample size; if first-stage sampling is used, R−1q0 will be the covariance matrix of
the sample mean µ.
If our distribution of belief at stage n is normal-Wishart, and our next ob-
servation is the entire vector Ŷn+1 ∼ NK(µ,R), standard results from Bayesian
110
analysis [86] tell us that the posterior density
pn+1(µ,R|Ŷn+1) = p(Ŷ
n+1|µ,R)pn(µ,R)
∫∫
p(Ŷn+1|µ,R)pn(µ,R)dµdR
is another normal-Wishart distribution with parameters
qn+1 = qn + 1, (4.2)
bn+1 = bn + 1, (4.3)
θn+1 = q
nθn + 1
qn + 1 , (4.4)
Bn+1 = Bn + q
n
qn + 1(θ
n − Ŷn+1)(θn − Ŷn+1)′. (4.5)
This is known as the conjugacy property of the normal-Wishart distribution. Con-
jugacy allows us to represent a distribution of belief with a finite, small number of
parameters, which can be easily updated after each new observation. It is conve-
nient to denote these parameters by the notational shorthand Sn = (qn, bn,θn,Bn),
representing the state of our beliefs at time n.
If we are able to observe the entire vector Ŷn+1 (also known as a “complete
observation”), the decision-maker’s objective can be easily formulated as follows. Let
(Ω,F ,P) be an appropriate probability space, and define a filtration Fn, where Fn is
the σ-algebra generated by the first n observations Y1, ...,Yn. Then, θn = E(µ|Fn)
intuitively represents our time-n beliefs about µ. We wish to find
E(max
x
E(µx|FN)), (4.6)
where N is the simulation budget. The maximum inside the outer expectation
in (4.6) represents the decision-maker’s implementation decision: at time N , we
111
will select the alternative that appears to be the best. If complete observations
are available, (4.6) is simply a statistical estimation problem with no elements of
simulation optimization. However, we will now move to a setting where only a single
component of Ŷn+1 can be observed at a time, giving rise to the problem of choosing
the right component.
4.2.2 Learning model for scalar observations
As is common in fully sequential ranking and selection, we now suppose
that at stage n we sample from alternative x only, with an observation yˆn+1x ∼
N (µx, (e′xR−1ex)
−1). The distribution of yˆn+1x is simply the marginal distribution
of the xth component of Ŷn+1. Using Bayes’ rule, the joint posterior distribution of
µ and R, given the observation yˆn+1x , can be written as
pn+1(µ,R|yˆn+1x ) ∝
1
Z(bn,Bn) |R|
bn−K−1
2 exp
{
−
1
2 tr(B
nR)
}( qn
2pi
)K
2
|R|
1
2
· exp
{
−
qn
2 (µ− θ
n)′R(µ− θn)
} 1
(2piR−1)
1
2
xx
exp
{
−
(yˆn+1x − µx)2
2(R−1)xx
}
.
(4.7)
After decomposing the posterior distribution in (4.7) into the conditional posterior
distribution of µ given R and the marginal posterior distribution of R, we ob-
serve that the conditional distribution of µ given R is multivariate normal, but the
marginal distribution of R is no longer a Wishart distribution.
Computational difficulties arise from here. Equation (4.7) suggests that the
conjugacy property of the normal-Wishart distribution is lost if we can no longer
observe the entire vector Ŷn+1. Conjugacy of the prior distribution is necessary
112
in order to build a computationally tractable learning model and develop efficient
sequential decision procedures that make sampling choices based on a small set of
belief parameters. In what follows, we force conjugacy using the density projection
technique. To be precise, by minimizing the Kullback-Leibler (KL) divergence, we
find the best approximation for the posterior distribution in (4.7) from the normal-
Wishart family. The posterior distribution is then replaced by its normal-Wishart
approximation, and the decision-maker’s beliefs are assumed to be normal-Wishart
after each successive observation.
Let ξ(µ,R) be a distribution from the normal-Wishart family with parameters
(q, b,θ,B) such that
µ|R ∼ NK(θ, qR), R ∼ WK(b,B).
Define DnKL(ξ‖pn+1) to be the Kullback-Leibler (KL) divergence between ξ(µ,R)
and pn+1(µ,R|yˆn+1x ), which is given by
DnKL(ξ‖pn+1) = Eξ
(
log ξ(µ,R)pn+1(µ,R|yˆn+1x )
)
, (4.8)
where Eξ[·] is the expectation with respect to ξ(µ,R). This quantity, bounded
below by zero, is used to measure the “distance” between the distributions ξ and
pn+1. Lower KL divergence suggests that there is more similarity between the two
distributions. For simplicity of notation, we write DnKL(ξ‖p) instead of DnKL(ξ‖pn+1).
We wish to find
(qn+1, bn+1,θn+1,Bn+1) = arg min(q,b,θ,B)DnKL(ξ‖p), (4.9)
the set of parameters that projects (according to KL divergence) the normal-Wishart
113
distribution onto the true posterior in (4.7).
We first give a closed-form expression for (4.8), and then solve (4.9). We briefly
sketch the proof of the solution, but readers are referred to the Appendix for the
complete details.
Proposition 4.1. The KL divergence DnKL(ξ‖p) defined in (4.8) can be expressed
as
DnKL(ξ‖p) =
bn+1 − bn
2
(
− log
∣
∣
∣
∣
Bn+1
2
∣
∣
∣
∣+
K∑
i=1
ψ
(bn+1 − i+ 1
2
))
−
bn+1K
2
+ b
n+1
2 tr
(
Bn(Bn+1)−1
)
+ log Z(b
n,Bn)
Z(bn+1,Bn+1) +
1
2 logB
n+1
xx
+ 12
[
K log q
n+1
qn +K
qn
qn+1 −K + q
n(θn − θn+1)′bn+1(Bn+1)−1(θn − θn+1)
]
−
1
2ψ
(bn+1 −K + 1
2
)
+ 12qn+1 +
1
2(yˆ
n+1
x − θn+1x )2
bn+1 −K + 1
Bn+1xx
+ C,
where ψ(x) = d ln Γ(x)/dx is the digamma function and C is some constant that
does not depend on the parameters of ξ.
Proof. Proof: First notice that the posterior distribution in (4.7) can be written as
pn+1(µ,R|yˆn+1x ) =
pn+1(µ|R)pn+1(R)p(yˆn+1x |µ,R)
p(yˆn+1x )
.
114
Then the KL divergence is given as
DnKL(ξ|p) = Eξ
(
log ξ(µ,R)pn+1(µ,R|yˆn+1x )
)
=
∫∫
ξ(µ,R) log ξ(R)ξ(µ|R)p(yˆ
n+1
x )
pn+1(µ|R)pn+1(R)p(yˆn+1x |µ,R)
dµdR
=
∫∫
ξ(µ,R) log ξ(R)pn+1(R)dµdR (4.10)
+
∫∫
ξ(µ,R) log ξ(µ|R)pn+1(µ|R)dµdR (4.11)
−
∫∫
ξ(µ,R) log p(yˆn+1x |µ,R)dµdR (4.12)
+
∫∫
ξ(µ,R) log p(yˆn+1x )dµdR, (4.13)
where (4.10) can be computed as
∫∫
ξ(µ,R) log ξ(R)pn+1(R)dµdR
=
∫
ξ(R) log ξ(R)pn+1(R)dR
=
∫
ξ(R) log
{ Z(bn,Bn)
Z(bn+1,Bn+1) |R|
bn+1−bn
2 exp
{
−
1
2 tr(B
n+1R)− 12 tr(BR)
}}
dR
= b
n+1 − bn
2 Eξ (log |R|) + log
Z(bn,Bn)
Z(bn+1,Bn+1) −
1
2Eξ
[
tr((Bn+1 + Bn)R)
]
= b
n+1 − bn
2
(
− log
∣
∣
∣
∣
Bn+1
2
∣
∣
∣
∣+
K∑
i=1
ψ
(bn+1 − i+ 1
2
))
+ log Z(b
n,Bn)
Z(bn+1,Bn+1)
+ b
n+1
2 tr
(
Bn(Bn+1)−1
)
−
bn+1
2 K, (4.14)
115
the term in (4.11) can be computed as
∫∫
ξ(µ,R) log ξ(µ|R)pn+1(µ|R)dµdR
=
∫∫
ξ(µ,R) log
{
|qn+1R| 12 exp
{
−12(µ− θn+1)′qn+1R(µ− θn+1)
}
|qnR| 12 exp
{
−12(µ− θn)′qnR(µ− θn)
}
}
=
∫∫
ξ(µ,R)
[K
2 log
qn+1
qn −
1
2(µ− θ
n+1)qn+1R(µ− θn+1) + 12(µ− θ)
′qnR(µ− θ)
]
dµdR
= K2 log
qn+1
qn +
K
2
qn
qn+1 −
K
2 +
qn
2 (θ
n+1 − θn)′bn+1(Bn+1)−1(θn+1 − θn), (4.15)
and the term in (4.12) can be computed as
∫∫
ξ(µ,R) log p(yˆn+1|µ,R)dµdR
=
∫∫
ξ(µ,R) log
[
(2piR−1)−
1
2
xx exp
[
−
1
2(R
−1)−1xx (yˆn+1x − µx)2
]]
dµdR
=− 12Eξ
(
log(R−1)xx
)
−
1
2 log(2pi)−
1
2(yˆ
n+1
x − θn+1x )2E
(
(R−1)−1xx
)
−
1
2qn+1
=− 12
(
logBn+1xx − log 2− ψ
(bn+1 −K + 1
2
))
−
1
2 log(2pi)−
1
2qn+1 −
1
2(yˆ
n+1
x − θn+1x )2
bn+1 −K + 1
Bn+1xx
.
(4.16)
Notice that p(yˆn+1x ) is the marginal distribution of yˆn+1x , which is not a function of
µ and R, so (4.13) doesn’t depend on the parameters of ξ. The proof then follows
from equations (4.14), (4.15) and (4.16).
Theorem 4.2. There exists a finite value ∆bn s.t. the solution to (4.9) can be
116
expressed as
qn+1 = qn + 1K , (4.17)
bn+1 = bn + ∆bn, (4.18)
θn+1 =
(
qnbn+1(Bn+1)−1 + b
n+1 −K + 1
e′xBn+1ex
exe′x
)−1
·
(
qnbn+1(Bn+1)−1θn + b
n+1 −K + 1
e′xBn+1ex
yˆn+1x ex
)
, (4.19)
Bn+1 = b
n+1
bn B
n + b
n+1
bn + 1
(
qn(yˆn+1x − θnx)2
qnbn+1
bn+1−K+1 + 1
−
Bnxx
bn
)
Bnexe′xB
n
(Bnxx)2
. (4.20)
Proof. Proof: Taking derivatives of (4.8) with respect to the parameters and apply-
ing matrix calculus, we obtain
∂DnKL(ξ‖p)
∂qn+1 =
1
2
{ K
qn+1 −
qnK
(qn+1)2 −
1
(qn+1)2
}
, (4.21)
∂DnKL(ξ‖p)
∂bn+1 =
1
2
{(yˆn+1x − θnx)2
Bn+1xx
(qn)2bn+1(K − 1)
(qnbn+1 + bn+1 −K + 1)2 +
bnK + 1
bn+1 −K
+b
n+1 − bn
2
K∑
i=1
ψ′
(bn+1 − i+ 1
2
)
−
1
2ψ
′
(bn+1 −K + 1
2
)}
,
(4.22)
∂DnKL(ξ‖p)
∂θn+1 = q
nbn+1(Bn+1)−1(θn − θn+1) + b
n+1 −K + 1
Bn+1xx
(exe′x) (yˆn+1x ex − θn+1),
(4.23)
∂DnKL(ξ‖p)
∂Bn+1 = −
1
2q
nbn+1
(
Bn+1
)−1 (θn − θn+1)(θn − θn+1)′(Bn+1)−1 + b
n
2 (B
n+1)−1
−
bn+1
2 (B
n+1)−1Bn(Bn+1)−1 +
( 1
2Bn+1xx
−
bn+1 −K + 1
2
(yˆn+1x − θn+1x )2
(Bn+1xx )2
)
exe′x.
(4.24)
Setting (4.21) - (4.24) to zero and solving, we notice that (4.17) follows immediately
from (4.21). However, (4.22)-(4.24) are more difficult to solve because each equation
depends on multiple parameters. We denote the change in degrees of freedom by
117
∆bn ≡ bn+1− bn. Then we can derive (4.19) and (4.20) as functions of ∆bn. Finally,
we compute ∆bn itself (see Remark 4.4 for a discussion).
Setting (4.23) to zero, we obtain
(
qnbn+1(Bn+1)−1 + b
n+1 −K + 1
Bn+1xx
)
θn+1 = qnbn+1(Bn+1)−1θn+b
n+1 −K + 1
Bn+1xx
exe′xyˆn+1x ex,
solving for θn+1 and it yields (4.19).
Setting (4.24) to zero and multiplying Bn+1 from left and right, we obtain,
Bn+1 = b
n+1
bn B
n + q
nbn+1
bn (θ
n+1 − θn)(θn+1 − θn)′
−
1
bn
( 1
Bn+1xx
−
(bn+1 −K + 1)(yˆn+1x − θn+1x )2
(Bn+1xx )2
)
Bn+1exe′xB
n+1.
Since
(θn+1 − θn)(θn+1 − θn)′ = (yˆ
n+1
x − θnx)2
(γn)2
Bn+1exe′xB
n+1
(Bn+1xx )2
,
and
(yˆn+1x − θn+1x )2 =
(yˆn+1x − θnx)2(γn − 1)2
(γn)2 ,
where
γn = 1 + q
nbn+1
bn+1 −K + 1 ,
it follows that
Bn+1 = b
n+1
bn B
n +
(qnbn+1(yˆn+1x − θnx)2
bnγn −
Bn+1xx
bn
)
Bn+1exe′xB
n+1
(Bn+1xx )2
= b
n+1
bn B
n + b
n+1
bn
(
qn(yˆn+1x − θnx)2
qnbn+1
bn+1−K+1 + 1
−
Bn+1xx
bn+1
)
Bn+1exe′xB
n+1
(Bn+1xx )2
. (4.25)
The matrix Bn+1 shows up in both sides of (4.25). We will show how to derive
updating equations for all entries in the matrix Bn+1.
118
Consider Bn+1xx , it follows from (4.25) that
bnBn+1xx = bn+1Bnxx +
qnbn+1(yˆn+1x − θnx)2
γn −B
n+1
xx ,
then
Bn+1xx =
bn+1
bn + 1B
n
xx +
1
bn + 1
qnbn+1(yˆn+1x − θnx)2
γn .
It follows from symmetry of the matrix Bn that Bn+1ix = Bn+1xi .
Consider Bn+1xi and Bn+1ix for i 6= x,
bnBn+1xi = bn+1Bnxi +
(qnbn+1(yˆn+1x − θnx)2
γn −B
n+1
xx
) Bn+1xx Bn+1xi
(Bn+1xx )2
,
then it follows that
Bn+1xi =
bn+1
bn + 1B
n
xi +
1
bn + 1
qnbn+1(yˆn+1x − θnx)2
γn
Bnxi
Bnxx
. (4.26)
The following result from (4.26)is worth mentioning,
Bn+1xi
Bn+1xx
= B
n
xi
Bnxx
.
Consider Bn+1ij for i, j 6= x,
bnBn+1ij = bn+1Bnij +
(qnbn+1(yˆn+1x − θnx)2
γn −B
n+1
xx
) Bn+1xi Bn+1xj
(Bn+1xx )2,
and it follows that
Bn+1ij =
bn+1
bn B
n
ij +
1
bn + 1
(qnbn+1(yˆn+1x − θnx)2
γn −
bn+1
bn B
n
xx
) BnxiBnxj
(Bnxx)2
.
Using the Sherman-Morrison-Woodbury formula [?, see e.g.]]GoLo96, we can
rewrite (4.19) without using inverse matrices as
θn+1 = θn + yˆ
n+1
x − θnx
qnbn+1
bn+1−K+1Bnxx +Bnxx
Bnex. (4.27)
119
The most crucial aspect of (4.27) is that a single scalar observation is now
used to update the entire posterior mean vector through the matrix Bn. Similar
behavior occurs in the Kalman filter-like update used by [77] in the case of known
correlation structures. In that setting, the updating equation incorporates both the
variance of the current belief and the known variance of the observations. However,
when the correlation structure is unknown, the matrix Bn is used to estimate both
types of variances.
Equations (4.17)-(4.20) allow us to conveniently represent and update a joint
distribution of belief about µ and R using a finite number of parameters, which can
be compactly encoded in the belief state Sn. We can now connect the mechanism
of approximate Bayesian inference back to a formal objective function. Recall from
Section 4.2.1 that the sampling model is defined on a probability space (Ω,F ,P),
where P is the law of the process Sn. Essentially, our approximate Bayesian learn-
ing model replaces P by an alternate probability measure P¯ under which (µ,R) is
normal-Wishart, given Fn with parameters obtained through KL minimization.
We use the notation EP¯(·) for expectations under the probability measure P¯.
Given a measurement budget of N , the experimenter chooses a measurement pol-
icy, which is a function xpi mapping a belief state Sn to an alternative xpi(Sn) ∈
{1, . . . , K}. Under the probability space (Ω,F , P¯), the policy makes measurement
decisions sequentially. As before, Fn is the σ-algebra generated by all the decisions
made in the first n stages, as well as the observations collected. Let pi be a mea-
surement policy. The notation Epi indicates that the expectation is taken when the
measurement policy pi is applied. We also define Π as the set of measurement poli-
120
cies. The challenge is to choose a measurement policy pi for allocating the simulation
budget one measurement at a time, and our objective can be written as
sup
pi∈Π
EpiP¯
(
max
x
EpiP¯(µx|FN)
)
. (4.28)
As in (4.6), the maximum in (4.28) represents the decision-maker’s imple-
mentation decision to select the alternative that seems to be the best at time N .
However, unlike (4.6), equation (4.28) now contains the optimization problem of
choosing a policy pi, i.e., a sequence of measurement decisions.
We close our discussion of the learning model with several remarks on the in-
terpretation of the model parameters. The approximate updating equations (4.17)-
(4.20) are intuitive generalizations of the conjugate update in (4.2)-(4.5). For ex-
ample, in (4.5), the squared error matrix (θn− Yˆn+1)(θn− Yˆn+1)′ is used to update
Bn. In (4.20), the full matrix is not available, so the update uses the scalar squared
error (θnx − yˆn+1x )2 to update all covariances between x and other alternatives.
Remark 4.3. The parameter qn in the prior distribution is intended to be a reflec-
tion of prior precision relative to the sample size that is tunable by the researcher or
practitioner to reflect their prior confidence. Recall from (4.2) that, when we have
complete observations, this parameter is always increased by 1. By analogy, if we
only collect information about one out of K alternatives, qn is increased by 1/K.
Remark 4.4. Although one might expect bn to behave in the same way as qn, this
is not exactly the case. The parameter bn increases by 1 when we have complete ob-
servations. However, when we sample from only one alternative, the increment ∆bn
actually depends on (qn, bn, yˆn+1x , θnx , Bnxx). The quantity ∆bn does not have a closed-
121
form expression, but can easily be obtained numerically via a bisection procedure
or Newton’s method on the interval [0, 1]. We have also observed in our numeri-
cal experiments that the optimal values of ∆bn appear to be smaller than 1/K and
approach 1/K asymptotically over time.
Remark 4.5. Note that the computational complexity of the updates (4.17)-(4.20) is
O(K2), identical to that of the conjugate updates for R&S with known correlations;
see equations (2.22) and (2.23) in [5]. The number of iterations of the bisection
method needed to compute ∆bn within a fixed, pre-specified tolerance level does not
depend on K. However, the effort needed for a single iteration of the bisection
method is O(K), since the terms in (4.22) have to be recomputed when different
values of ∆bn are considered.
4.2.3 Predictive distribution of the next observation
Given the prior distribution on the unknown parameters, the distribution of
yˆn+1x represents the decision-maker’s beliefs about the next observation (assuming
that alternative x will be measured). For this reason, it is known as the predictive
distribution. In Section 4.3, we introduce a policy that uses the predictive distri-
bution to look ahead to the outcome of a simulation decision. In preparation for
this discussion, we now present the predictive distribution for the normal-Wishart
model under the approximate Bayesian learning scheme of Section 4.2.1. That is,
we assume that the decision-maker’s beliefs at time n are represented by a normal-
Wishart distribution whose parameters are contained in the state Sn, and use this
122
assumption to characterize yˆn+1x .
For completeness, we provide the definition of the multivariate t-distribution
[87].
Definition 4.1. A p-dimensional random vector X = (X1, . . . , Xp) is said to have
the p-variate t-distribution with ν degrees of freedom, mean vector µ, and correlation
matrix V if its joint pdf is given by
f(x) = Γ(
ν+p
2 )
(piν)p/2|V|1/2Γ(ν2 )
[
1 + 1ν (x− µ)
′V−1(x− µ)
](ν+p)/2
.
The predictive distribution of yˆn+1x follows from the following two lemmas.
Lemma 4.6. Suppose that (µ,R) follows a normal-Wishart distribution with pa-
rameters (qn, bn,θn,Bn). Then the predictive distribution of a complete observation
Ŷn+1 is a multivariate t-distribution with bn−K+1 degrees of freedom, mean vector
θn and correlation matrix (qn + 1)Bn/qn(bn −K + 1).
Proof. Proof: The predictive distribution of the vector Ŷn+1,
p(Ŷn+1) =
∫∫
p(Ŷn+1,µ,R)dµdR,
and
p(Ŷn+1,µ,R) ∝ |R| b
n−K+1
2 exp
{
−
1
2(Ŷ
n+1 − µ)′R(Ŷn+1 − µ)
}
· exp
{
−
qn
2 (µ− θ
n)′R(µ− θn)
}
exp
{
−
1
2 tr(B
nR)
}
.
It can be verified that
(Ŷn+1 − µ)′R(Ŷn+1 − µ) + qn(µ− θn)′R(µ− θn)
= (qn + 1)(µ− θ¯n)′R(µ− θ¯n) + q
n
qn + 1(θ − Ŷ
n+1)′R(θ − Ŷn+1),
123
with
θ¯n = q
nθn + Ŷn+1
qn + 1 .
It follows that
p(Ŷn+1,µ,R) ∝ |R| 12 exp
{
−
qn + 1
2 (µ− θ¯
n)′R(µ− θ¯n)
}
|R|
bn−K
2 exp
{
−
1
2 tr(B¯
nR)
}
,
where
B¯n = Bn + q
n
qn + 1(θ
n − Ŷn+1)(θn − Ŷn+1)′.
Integration with respect to µ and R yields
p(Ŷn+1) ∝
(
1 + q
n
qn + 1(Ŷ
n+1 − θn)′(Bn)−1(Ŷn+1 − θn)
)− b
n+1
2
.
This shows that Ŷn+1 has a multivariate t-distribution with degrees of freedom
bn −K + 1, mean vector θn and correlation matrix qn+1qn(bn−K+1)Bn.
Lemma 4.7. The predictive distribution of yˆn+1x is
yˆn+1x ∼ t
(
bn −K + 1, θnx ,
qn(bn −K + 1)
(qn + 1)Bnxx
)
, (4.29)
which is a univariate Student’s t-distribution with bn − K + 1 degrees of freedom,
mean θnx and variance q
n+1
qn(bn−K−1)Bnxx.
Proof. Proof: This result follows by combining Lemma 4.6 together with results in
Section 1.10 from [87].
Using the predictive distribution found in (4.29), we can derive another ex-
124
pression for the updating equation of θn in (4.27). Define
T n = yˆ
n+1
x − θnx
(
qn+1
qn(bn−K+1)Bnxx
)1/2 ,
s˜n(qn, bn,Bn, x) =
(
qn+1
qn(bn−K+1)
)1/2
(
qnbn+1
bn+1−K+1 + 1
)
(Bnxx)1/2
Bnex.
Then, (4.27) can be rewritten as
θn+1 = θn + s˜n(qn, bn,Bn, x)T n, (4.30)
where T n has a Student’s t-distribution with bn −K + 1 degrees of freedom, mean
0 and scale parameter 1.
At time n, the vector θn+1 of future beliefs is unknown. However, we see
from (4.30) that our uncertainty about this vector originates from a single scalar
random variable. This is in line with previous work on ranking and selection with
known correlation structures [77], where the scalar random variable is normally
distributed. When the correlations are unknown, we use Student’s t-distribution,
forming a precise analogy to classical frequentist statistics.
4.2.4 Information loss due to approximate Bayesian inference
The KL divergence DnKL(ξ ‖ p) can be thought of as the incremental infor-
mation loss incurred by forcing conjugacy after the (n + 1)st observation, under
the assumption that our beliefs at time n are accurately represented by a normal-
Wishart distribution. This section shows that, under the probability measure P¯,
the incremental information loss converges to zero in probability as n → ∞. That
125
is, if conjugacy is maintained up to time n, the error due to a single application
of approximate Bayesian inference at time n + 1 will become vanishingly small for
large enough n.
This result is intended to provide the intuition that, over time, the learning
model with scalar observations bears greater resemblance to a conjugate learning
model. As in Section 4.2.3, we assume that the decision-maker’s beliefs at time n are
represented by the normal-Wishart distribution. We begin by showing in Proposi-
tion 4.8 that the degrees of freedom parameter bn goes to infinity, eventually leading
to the result that the incremental loss from one additional observation vanishes to
zero.
Proposition 4.8. If b0 is sufficiently large, then ∆bn ∈ (0, 1) P¯-a.s. and bn → ∞
as n→∞.
Proof. Proof: Let ∆bn = 1 in (4.22) and from which we have
∂Dn(ξ|p)
∂bn+1
∣
∣
∣
bn+1=bn+1
≥
1
2
(bnK + 1
bn + 1 −K
)
+ 14
K∑
i=1
ψ′
(bn − i+ 2
2
)
−
1
4ψ
′
(bn −K + 2
2
)
= − K − 12(bn + 1) +
1
4
K−1∑
i=1
ψ′
(bn − i+ 2
2
)
≥ −
K − 1
2(bn + 1) +
1
4
K−1∑
i=1
2
bn − i+ 2
> − K − 12(bn + 1) +
K − 1
2(bn + 1) = 0.
Let ∆bn = 0 in (4.22), we can show that for any  > 0, there exists sufficiently large
bn such that the first term is less than . We also observe that
1
bn −
1
2ψ
′
(bn −K + 1
2
)
< 1bn −
1
bn −K + 1 < 0.
126
Since (4.22) is a continuous function of ∆bn on [0, 1], we know that ∆bn ∈ (0, 1),
whence bn has a limit by the monotone convergence theorem. In the following, we
will prove that bn goes to infinity by contradiction. Assume that there exists M <∞
such that bn converges to M . This suggests that ∆bn converges to zero. Taking the
limit of (4.22) as n goes to infinity yields,
lim
n→∞
(yˆn+1x − θnx)2
Bn+1xx
(qn)2bn+1(K − 1)
(qnbn+1 + bn+1 −K + 1)2 = ψ
′
(M −K + 1
2
)
−
1
M (4.31)
From Lemma 4.7, the predictive distribution of yˆn+1x is a Student’s t-distribution
with bn −K + 1 degrees of freedom, mean θnx and variance q
n+1
qn(bn−K+1Bnxx. It follows
that
(yˆn+1x − θnx)
(Bnxx)1/2
= T n
( qn + 1
qn(bn −K + 1)
)1/2
. (4.32)
As qn →∞,
lim
n→∞
(qn)2bn+1(K − 1)
(qnbn+1 + bn+1 −K + 1)2 =
1
M ,
whence (4.31) can be rewritten as
lim
n→∞
l(T n) =
[
ψ′
(M −K + 1
2
)
−
1
M
]
M2, (4.33)
where l(T n) is a function of the random variable T n. Since bn → M , the random
variable T n converges weakly to a Student’s t-distribution with M−K+1 degrees of
freedom. That means that (4.33) cannot hold almost surely. Therefore, we conclude
that the degrees of freedom bn goes to infinity as n→∞.
The fact that the degrees of freedom parameter bn goes to infinity is a key to
the other results in this section. We next show several preliminary results concerning
the updating equation for Bn.
127
Proposition 4.9. Let
Mnx =
( bn+1
bn + 1
)( qn(bn+1 −K + 1)
qnbn+1 + bn+1 −K + 1
(yˆn+1x − θnx)2
(Bnxx)2
−
1
bnBnxx
)
.
Then, MnxBnxx converges to zero in P¯-probability.
Proof. Proof: First note that
MnxBnxx =
( bn+1
bn + 1
)( qn(bn+1 −K + 1)
qnbn+1 + bn+1 −K + 1
(yˆn+1x − θnx)2
Bnxx
−
1
bn
)
,
therefore showing that MnxBnxx converges to zero in probability is equivalent to show-
ing that (yˆn+1x − θnx)2/Bnxx converges to zero in probability. For any  > 0, using
(4.32) and Chebyshev’s inequality, we know that
P¯
(∣
∣
∣
∣
(yˆn+1x − θnx)
(Bnxx)1/2
∣
∣
∣
∣ > 
)
≤
qn + 1
qn(bn −K + 1)
1
2 .
Then we have
lim
n→∞
P¯
(∣
∣
∣
∣
(yˆn+1x − θnx)
(Bnxx)1/2
∣
∣
∣
∣ > 
)
= 0,
as required.
If we view the updating equation (4.13) from the viewpoint of stochastic ap-
proximation, then the quantity MnxBnxx can be considered as the stepsize. Since the
stepsize converges to zero, this guarantees that the change in the matrix Bn will not
be too large.
We will provide two propositions related to the determinant and the trace of
the matrix Bn. Instead of checking the matrix Bn componentwise, these two results
provide the changes in the determinant and the trace analytically, which are useful
for studying the asymptotic behavior of the matrix Bn.
128
Proposition 4.10. The determinant of Bn is updated recursively through
det(Bn+1) = det(Bn)
(bn+1
bn
)K (
1 + b
n
bn+1M
n
xBnxx
)
. (4.34)
Proof. Proof: First rewrite (4.20) as
Bn+1 = Bn + Bn
(∆bn
bn IK +M
n
x exe
′
xB
n
)
It follows from the matrix determinant lemma [88] that
det(Bn+1) = det(Bn) det
(
IK +
∆bn
bn IK +M
n
x exe
′
xB
n
)
= det(Bn) det
(bn+1
bn IK +M
n
x exe
′
xB
n
)
= det(Bn) det
(bn+1
bn IK
)(
1 + b
n
bn+1M
n
x e
′
xB
nex
)
= det(Bn)
(bn+1
bn
)K (
1 + b
n
bn+1M
n
xBnxx
)
,
as desired.
Proposition 4.11.
tr
(
Bn(Bn+1)−1
)
= b
nK + 1
bn+1 −
qn
2 (θ
n+1−θn)′(Bn+1)−1(θn+1−θn)−12(yˆ
n+1
x −θn+1x )2
bn+1 −K + 1
Bn+1xx
.
Proof. Proof: Multiplying (4.24) by Bn+1 from the left yields
bn
2 IK −
bn+1
2 B
n(Bn+1)−1 − q
nbn+1
2 (θ
n+1 − θ)(θn+1 − θn)′(Bn+1)−1
+
( 1
2Bn+1xx
−
bn+1 −K + 1
2(Bn+1xx )2
(yˆn+1x − θn+1x )2
)
Bn+1exe′x = 0
Taking trace on both sides, it gives
bnK
2 −
bn+1
2 tr
(
Bn(Bn+1)−1
)
−
qnbn+1
2 (θ
n+1 − θn)′(Bn+1)−1(θn+1 − θn)
( 1
2Bn+1xx
−
bn+1 −K + 1
2(Bn+1xx )2
(yˆn+1x − θn+1x )2
)
Bn+1xx = 0 (4.35)
Solving for tr (Bn(Bn+1)−1) from (4.35) completes the proof.
129
The next lemma finds the limit of a sequence of expressions involving the
gamma and digamma functions. The limit will appear repeatedly in the proof of
our main results later.
Lemma 4.12. For any α, β, γ ∈ R,
lim
x→∞
log Γ(x+ α)Γ(x+ β) − (α− β)ψ(x+ γ) = 0. (4.36)
Proof. Proof: Notice that
log Γ(x+ α)Γ(x+ β) − (α− β)ψ(x+ γ) = log
(Γ(x+ α)
Γ(x+ β)e
−(α−β)ψ(x+γ)
)
,
then to prove (4.36) is equivalent to prove
lim
x→∞
Γ(x+ α)
Γ(x+ β)e
−(α−β)ψ(x+γ) = 1.
From [89], we have the asymptotic expansion
Γ(x+ α)
Γ(x+ β) = x
α−β
[
1 + (α− β)(α + β − 1)2x +O(x
−2)
]
. (4.37)
For x > 0, we have [90]
log x− 1x < ψ(x) < log x−
1
2x.
Without loss of generality, we assume that α > β. This gives
(x+ γ)−(α−β)e
α−β
2(x+γ) < e−(α−β)ψ(x+γ) < (x+ γ)−(α−β)e
α−β
x+γ .
Therefore, by (4.37), we find that
lim
x→∞
Γ(x+ α)
Γ(x+ β)e
−(α−β)ψ(x+γ) = 1.
130
We now state the key theorem. As the number of measurements goes to
infinity, the KL divergence converges to zero in probability. This suggests that the
incremental information loss incurred by forcing conjugacy vanishes.
Theorem 4.13. As n→∞, the KL divergence DnKL(ξn+1 ‖ pn+1) converges to zero
in P¯-probability.
Proof. Proof: The constant C omitted in Proposition (4.1) can be given explicitly
as
C = 12 log
qn
qn + 1 −
1
2 logB
n
xx + log
(
Γ
(bn −K + 2
2
))
− log
(
Γ
(bn −K + 1
2
))
−
bn −K + 2
2 log
(
1 + q
n
(qn + 1)Bnxx
(yˆn+1x − θx)2
)
,
131
whence the KL divergence DnKL(ξn+1 ‖ pn+1) can be expressed as
DnKL(ξn+1 ‖ pn+1) =
bn
2 log
det(Bn+1)
det(Bn) +
bn+1
2 tr
(
Bn(Bn+1)−1
)
−
bn+1
2 K (4.38)
+ q
nbn+1
2 (θ
n+1 − θn)′(Bn+1)−1(θn+1 − θn) + 12(yˆ
n+1
x − θn+1x )2
bn+1 −K + 1
Bn+1xx
(4.39)
−
bn −K + 2
2 log
(
1 + q
n
(qn + 1)Bnxx
(yˆn+1x − θnx)2
)
(4.40)
+ b
n+1 − bn
2
K∑
i=1
ψ
(bn+1 − i+ 1
2
)
+
K∑
i=1
log Γ
(bn − i+ 1
2
)
−
K∑
i=1
log Γ
(bn+1 − i+ 1
2
)
(4.41)
−
1
2ψ
(bn+1 −K + 1
2
)
+ log Γ
(bn −K + 2
2
)
− log Γ
(bn −K + 1
2
)
(4.42)
+ 12K log
(qn+1
qn
)
+ 12K
qn
qn+1 −
1
2K +
1
2qn+1 +
1
2 log
qn
qn + 1
(4.43)
+ 12 logB
n+1
xx −
1
2 logB
n
xx. (4.44)
Following Propositions 4.10 and 4.11, the terms in (4.38) and (4.39) can be simplified
as
bnK
2 log
bn+1
bn +
bn
2 log
(
1 + b
n
bn+1M
n
xBnxx
)
+ b
nK + 1
2 −
bn+1K
2 .
As bn →∞, it is easy to show that
lim
n→∞
bnK
2 log
bn+1
bn +
bnK
2 −
bn+1K
2 = 0.
Also, we can show that
lim
n→∞
bn
2 log
(
1 + b
n
bn+1M
n
xBnxx
)
−
bn −K + 2
2 log
(
1 + q
n
(qn + 1)Bnxx
(yˆn+1x − θnx)2
)
= −12 .
132
This suggests that the sum of (4.38), (4.39) and (4.40) approaches zero as n→∞.
Using Lemma 4.12, we can show that both (4.41) and (4.42) go to zero as n→∞.
It is easy to check that (4.43) approaches zero as n → ∞. It follows from (4.20)
that
Bn+1xx =
bn+1
bn + 1B
n
xx +
1
bn + 1
qnbn+1(yˆn+1x − θnx)2
1 + qnbn+1bn+1−K+1
.
Therefore,
log B
n+1
xx
Bnxx
= log
(
bn+1
bn + 1 +
1
bn + 1
qnbn+1
1 + qnbn+1bn+1−K+1
(yˆn+1x − θnx)2
Bnxx
)
,
which is easily shown to converge to zero. This completes the proof.
4.3 The Value Of Information
Value of information procedures allocate the simulation budget by evaluating
the potential of new observations to improve the current estimate of the best value
(see [31] for a survey). The information potential is defined in terms of the expected
difference in the estimated objective value before and after the next observation
occurs. We do not know exactly how an observation of alternative x will change
our beliefs about the best alternative, but we can compute an expectation over the
predictive distribution in (4.30). In this way, we can “look ahead” to the random
outcome of the next observation, attempting to anticipate the results before we see
them. If we sample from alternative x at time n and collect observation yˆn+1x , the
value of information is defined as
Vx(Sn) = En
[
max
i
θn+1i | xn = x
]
−max
i
θni ,
133
where En is the conditional expectation taken with respect to the decision-maker’s
distribution of belief at time n, and xn denotes the alternative measured at time n.
Note that the predictive distribution of θn+1 depends on qn, bn and Bn only
through the vector s˜n from (4.30). As a result, the expected value of information
can be rewritten as
V(θn, s˜(Sn, x),m) = En
[
max
i
θni + s˜(Sn, xn)Tm | xn = x
]
−max
i
θni , (4.45)
where V is defined by V(a,b,m) = E[max
i
ai + biTm] −max
i
ai, a and b are K × 1
vectors. The random variable Tm follows a univariate Student’s t-distribution with
degrees of freedom m, mean 0 and variance 1.
Once again, (4.45) assumes that the decision-maker’s beliefs are represented
by a normal-Wishart distribution at each time step. In practice, the normal-Wishart
distribution is an approximation of the true posterior beliefs, updated using (4.17)-
(4.20). By using this approximation to represent our uncertainty, we can solve (4.45)
in closed form, leading to a computationally efficient procedure.
We introduce a fully sequential policy called Projected Learning of Unknown
Correlations with Knowledge Gradients (PLUCK). The PLUCK policy chooses an
alternative by computing
xPLUCK(Sn) = arg max
x
V(θn, s˜(Sn, x),m). (4.46)
We now show how the value of information can be computed exactly under P¯.
134
4.3.1 Computation of the Value of Information
To compute the expected value of information, we start by defining a function
h : R 7→ {1, 2, . . . , K} as
h(t) := max(argmax
i
ai + bit).
The function h tells us which alternative is the best among {1, 2, · · · , K} in the
sense of having largest value of ai + bit given Tm = t. The largest index is chosen if
multiple alternatives tie. Instead of calculating V(a,b,m) directly, we notice that
max
i
ai + biTm = ah(Tm) + bh(Tm)Tm,
and rewrite ah(Tm) + bh(Tm)Tm as a telescoping sum,
ah(0)+bh(0)Tm+


h(Tm)−1∑
i=h(0)
(ai+1 − ai) + (bi+1 − bi)Tm

+


h(0)−1∑
i=h(Tm)
(ai − ai+1) + (bi − bi+1)Tm

 .
Using standard techniques (see Section 5.3 of [5]), we can find a non-decreasing
sequence {ci}Ki=0 such that h(z) = i if and only if z ∈ [ci−1, ci). It follows that
V(a,b,m) can be written as
V(a,b,m) =
K−1∑
i=1
(bi+1 − bi)E[(Tm − |ci|)+].
To continue the computational procedure, we need an analytical form for the tail
expectation of a univariate Student’s t-distribution. Denote the pdf and cdf of a
standard Student’s t-distribution with m degrees of freedom as gm(·) and Gm(·),
respectively. We can easily rewrite E [(Tm − |ci|)+] as
E
[
(Tm − |ci|)+
]
= E(Tm · 1{Tm>|ci|})− |ci|(1−Gm(|ci|)).
135
It also can be shown [30] that
E(Tm · 1{Tm>|ci|}) =
m+ c2i
m− 1 gm(|ci|).
With the analytical form for the tail expectation, the value of information can be
expressed as:
V(a,b,m) =
K−1∑
i=1
(bi+1 − bi)
(m+ c2i
m− 1 gm(|ci|)− |ci|(1−Gm(|ci|))
)
. (4.47)
We note that (4.46) has the same computational complexity as the analo-
gous VIP for R&S with known correlation structures [5, 77]. The breakpoints ci
can be computed in O(K logK) time. Repeating this for every alternative yields
O(K2 logK). Just as in the learning model of Section 4.2.1, we can account for
unknown correlations for the same computational cost.
4.3.2 Monotonicity of the Value of Information
The value of information calculated in (4.47) depends on the degrees of free-
dom m of the Student’s t-distribution. Lemma 4.7 shows that in the nth stage,
the predictive distribution of the new observation yˆx follows a univariate Student’s
t-distribution with degrees of freedom m = bn−K+1. The parameter bn is updated
through (4.18) and increases as the PLUCK policy collects information. The rela-
tionship between the value of information and the degrees of freedom is summarized
in the next theorem.
Theorem 4.14. For fixed K×1 vectors a and b, the value of information V(a,b,m)
is a decreasing function in the degrees of freedom parameter m.
136
Proof. Proof: Let V(a,b,m) and V(a,b, n) be the values of information for two
different values of the degrees of freedom parameter, with m ≥ n. Since a and b
are fixed, it is sufficient to consider E(Tm − |ci|)+ and E(Tn − |ci|)+.
Let gm(t) and gn(t) be the probability density function of Student’s t distribu-
tions with m and n degrees of freedom, respectively. There exists c∗ > 0 such that
gm(c∗) = gn(c∗) with gm(t) ≤ gn(t) on [c∗,∞) and gm(t) > gn(t) on [0, c∗). We will
consider two cases:
(i) If |ci| ≥ c∗, then gm(t) ≤ gn(t) for t ∈ [|ci|,∞).
E(Tm − |ci|)+ =
∞∫
|ci|
(t− |ci|)φm(t)dt ≤
∞∫
|ci|
(t− |ci|)φn(t)dt = E(Tn − |ci|)+.
(ii) If |ci| < c∗, then gm(t) > gn(t) for t ∈ [0, |ci|).
E(Tm − |ci|)+ − E(Tn − |ci|)+
=
c∗∫
|ci|
(t− |ci|)gm(t)dt+
∞∫
c∗
(t− |ci|)gm(t)dt−
c∗∫
|ci|
(t− |ci|)gn(t)dt−
∞∫
c∗
(t− |ci|)gn(t)dt
=
c∗∫
|ci|
(t− |ci|)gm(t)dt−
c∗∫
|ci|
(t− |ci|)gn(t)dt+
∞∫
c∗
(t− |ci|)gm(t)dt−
∞∫
c∗
(t− |ci|)gn(t)dt
=
c∗∫
|ci|
(t− |ci|)gm(t)dt−
c∗∫
|ci|
(t− |ci|)gn(t)dt+
c∗∫
0
(t− |ci|)gm(t)dt−
c∗∫
0
(t− |ci|)gn(t)dt
=
|ci|∫
0
(t− |ci|)gn(t)dt−
|ci|∫
0
(t− |ci|)gm(t)dt < 0
It follows from (i) and (ii) that for any |ci| > 0,
V(a,b,m) =
K−1∑
i=1
(bi+1− bi)E(Tm− |ci|)+ ≤
K−1∑
i=1
(bi+1− bi)E(Tn− |ci|)+ = V(a,b, n).
Therefore the value of information decreases in the number of degrees of freedom.
137
The theorem suggests that the value of information decreases as the degrees
of freedom increase, with all else being equal. In other words, the same information,
under the same estimated means and covariances, is less valuable when we have
already accumulated many other observations.
Theorem 4.14 leads to an interesting comparison with earlier work on R&S
with known correlations. It is a well-known result that Tm converges weakly to
a standard normal random variable as the degrees of freedom m goes to infinity.
Recall that, when the correlation structure is known, we calculate a version of (4.45)
using a standard normal random variable; see (5.16) in [5]. Theorem 4.14 suggests
that, given the same estimated means and conditional covariances, the value of
information is inherently higher when the correlations are unknown. That is, a
single measurement provides more information when we are learning both means
and covariances.
4.4 Numerical Experiments
We present experimental results demonstrating the value added by learning
unknown correlations using PLUCK. Throughout this section, we considered six
policies: the PLUCK policy, the correlated KG (CKG) policy in [77], a sequential
modified version of proportional to variance (PTV) policy, a sequential OCBA policy
designed for opportunity cost in [91], a Greedy policy and the LL policy with linear
loss in [79]. We briefly explain the distinctions between the remaining policies below.
Both PLUCK and CKG are designed to sample sequentially, with CKG assum-
138
ing a known covariance structure and using a conjugate Bayesian learning model.
This comparison allows us to see the value added by incorporating unknown corre-
lations into our decision-making. The PTV and greedy policy are also sequential,
and make simulation decisions at time n in the following ways: PTV policy chooses
the alternative with the highest variance; the greedy policy chooses the alternative
arg maxx θnx . For both of these methods, we use our approximate Bayesian learning
model in (4.17)-(4.20) to update our beliefs about the alternatives. This compari-
son allows us to see the value added by using the PLUCK policy to make decisions,
in addition to the value of learning unknown correlations. Lastly, the LL policy
of [79] first screens out a subset of alternatives, then allocates the simulation budget
equally among the rest. This structure allows a conjugate normal-Wishart prior to
be used, with the drawback that the policy often samples alternatives that do not
provide a lot of useful information. The LL policy can also be extended to allow
multiple screening stages; however, this approach works best with large sampling
budgets. In our experiments, we consider problems where the simulation budget is
comparable to the number of alternatives, making it difficult to run LL for more
than two stages.
4.4.1 Wind Farm Placement Example
Our study is based on the wind farm placement problem mentioned in the
introduction. For the purpose of these experiments, we use the wind speed at a
location as a stand-in for power output. In practice, wind speed data are collected
139
at each location simultaneously, rather than sequentially. However, this also allows
us to use the data to demonstrate the effectiveness of our policy by comparing its
performance to how well we could have done. Practical applications of sequential
simulation in wind farm placement use complex physics-based models incorporating
factors other than wind speed, thus necessitating the use of sequential simulation.
For our purposes, the public availability of wind-speed data allows us to create a
realistic test setting for the PLUCK algorithm.
We used hourly wind speed data [92] across the United States with latitude
and longitude resolution of 0.125 degrees. The data consist of two components: the
zonal component u (in the west-east direction) and the meridional component v (in
the north-south direction). Assuming for the purpose of this example that all the
wind turbines can be placed in the right direction, we focus on the magnitude of
the wind speed, which is defined as
√
u2 + v2. The objective is to select the location
with the highest wind speed over a set of 64 locations.
We considered data from four regions across the United States: Kansas, Wash-
ington, Iowa and Oklahoma. All regions have had a high percentage of wind power
generation, or a large amount of wind capacity installed, in recent years. For each of
the four regions, we selected 64 different locations sitting on an 8×8 grid (the areas
of these grids range from 3500 to 4500 square miles) within the region as alternatives
for wind farm placement.
We used 1800 days of data to estimate the mean and covariance matrix of
a multivariate normal distribution. These fitted parameters were taken to repre-
sent the “true” underlying sampling distribution. In our experiments, individual
140
Policies
Experiment Performance Measure PLUCK CKG Greedy OCBA PTV LL
Kansas
Opportunity Cost 0.0314 0.1270 0.3073 0.2760 0.0537 0.2848
Standard Errors 0.0026 0.0068 0.0045 0.0040 0.0036 0.0047
Washington
Opportunity Cost 0.0640 0.0917 0.2651 0.2021 0.1125 0.1647
Standard Errors 0.0051 0.0050 0.0018 0.0053 0.0060 0.0058
Iowa
Opportunity Cost 0.0512 0.0892 0.1917 0.1526 0.1229 0.1335
Standard Errors 0.0037 0.0040 0.0046 0.0036 0.0034 0.0040
Oklahoma
Opportunity Cost 0.0509 0.0911 0.2413 0.1401 0.1816 0.2031
Standard Errors 0.0040 0.0050 0.0013 0.0017 0.0039 0.0042
Table 4.1: Final opportunity cost and standard errors for the experiments
observations were generated by simulating from normal distributions with the true
parameter values. However, the policies were not allowed to see these true values
when making decisions.
It is critical to collect information efficiently when the decision-maker’s prior
beliefs are inaccurate or misleading. To show that the PLUCK policy is particularly
effective in such a situation, we used a small number of data points to create a prior
for which the location that appeared to be the best was quite different from the true
best location. We discuss this issue further below; for now, we note that each policy
was given a budget of N = 200 measurements to correct this initial error. Table 4.1
gives the final opportunity cost, which is defined as
Cpi = max
x
µx − µ(arg maxx θNx )
after N measurements for each policy pi, averaged over 500 sample paths. Lower
opportunity cost suggests that a policy selects an alternative closer to the best. The
141
0 20 40 60 80 100 120 140 160 180 2000
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy
(a) 64 locations from Kansas
0 20 40 60 80 100 120 140 160 180 2000
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy
(b) 64 locations from Washington
0 20 40 60 80 100 120 140 160 180 2000
0.05
0.1
0.15
0.2
0.25
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy
(c) 64 locations from Iowa
0 20 40 60 80 100 120 140 160 180 2000
0.05
0.1
0.15
0.2
0.25
0.3
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy
(d) 64 locations from Oklahoma
Figure 4.1: Averaged opportunity cost as the number of samples increases, where
the dashed lines are the mean plus or minus two standard errors
PLUCK policy outperforms all other competing policies by a statistically significant
amount based on Table 4.1, while the CKG policy has smaller final opportunity cost
than the other policies in three experiments. The PTV policy performs better than
the sequential OCBA policy and the greedy policy. The LL policy performs poorly
in all four experiments, possibly because the simulation budget is quite small relative
to the number of alternatives.
Figure 4.1 shows how this performance measure changes as the number of sam-
142
(a) Expected Improvement
0 20 40 60 80 100 120 140 160 180 2000.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0.65
Number of Samples
Max
imum
 Erro
r
Test4:  64 locations from Washington
 
 CKG PolicyPLUCK PolicyGreedy PolicyPTV Policy
(b) Maximum Error
Figure 4.2: Value of information and maximum error as the number of samples
increases
ples N varies from 1 to 100. The bands indicate the mean performance measures
plus or minus two standard errors. The LL policy is omitted from these figures
because it allocates simulations in batch rather than sequentially, by dividing them
uniformly across any alternatives that were not screened out. The opportunity cost
for PLUCK tends to decrease over time. For CKG, sometimes there is a degra-
dation in performance at the beginning. We conjecture that this behavior arises
because CKG assumes a known covariance structure. If the prior beliefs about the
correlations are inaccurate, this misdirects the way in which CKG incorporates new
information into the posterior. A small amount of information can thus make CKG
produce even worse performance than what could be obtained with just the prior.
The sequential OCBA policy and the greedy policy tend to work poorly on all cases,
and performance of the PTV policy differs dramatically among cases.
We also considered a different set of experiments in which results were averaged
143
across multiple priors constructed from a small sample of wind speed data. Overall,
we found that PLUCK still outperformed the competition, with the caveat that all
policies were more heavily affected by the initial degradation in performance (the
early iterations needed to get a handle on the true correlation structure).
We make two interesting observations from the experimental results. Fig-
ure 4.2(a) shows the logarithm of the value of information as computed by both
PLUCK and CKG (for a particular experiment), while Figure 4.2(b) shows the max-
imum absolute difference between the posterior and true means for various policies.
Figure 4.2(a) shows that the value of information is much higher when we consider
unknown correlations, as suggested by Theorem 4.14. Figure 4.2(b) shows that the
True Mean
2 4 6 8
2
4
6
8
Prior Mean
2 4 6 8
2
4
6
8
•
 39
•
 107 • 52 • 2
Post. Mean Using PLUCK
2 4 6 8
2
4
6
8 •
 43
•
 10 • 45 • 4 • 40
•
 14
•
 44
Post. Mean Using CKG
2 4 6 8
2
4
6
8
•
 49
•
 151
Post. Mean Using PTV
2 4 6 8
2
4
6
8
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
•
 7
Post. Mean Using LL
2 4 6 8
2
4
6
8
Figure 4.3: Contour map of different policies after 200 measurements
144
Policies
Experiment Performance Measure PLUCK CKG Greedy OCBA PTV LL
Queue (correlated)
Opportunity Cost 0.3181 0.4677 0.6979 1.0521 0.4862 2.1105
Standard Errors 0.0380 0.0423 0.0631 0.0645 0.0451 0.034
Network
Opportunity Cost 0.1938 0.2902 0.3339 0.3439 0.3420 0.2374
Standard Errors 0.0024 0.0063 0.0047 0.0055 0.0066 0.0045
Table 4.2: Final opportunity cost and standard errors for the queue selection and
network selection problems
PLUCK policy does a better overall job of estimating the true values.
Figure 4.3 gives contour maps of the true means, prior means and posterior
means after 100 measurements with four different policies. The sequential OCBA
policy and the greedy policy are omitted due to its poor performance. The num-
ber of times that each alternative is measured is also shown on the contour maps
(zeros are omitted). Red colors indicate higher values. We can see that the true
best alternative is in the upper-left corner, whereas the prior misdirects us toward
bottom-left. After 200 measurements, the PLUCK policy captures the general trend
of the true values, whereas CKG and PTV are still stuck on beliefs that resemble
the prior. Observe that both PLUCK and CKG measure the true best alternative
in the upper-left corner almost equally often. However, the statistical model used
by PLUCK provides more accurate posterior beliefs, leading to a better implemen-
tation decision. The LL policy performs poorly and the identified best is far away
from the true best. Also, its batch structure allocates many samples to alternatives
that do not provide a lot of useful information.
145
4.4.2 A Single-Server Queue Selection Problem
In simulation, correlations may arise due to common random numbers. How-
ever, it is important to keep in mind that correlated beliefs reflect inherent similar-
ities or differences between alternatives, even when the actual simulation output is
completely independent. The following example demonstrates that correlated beliefs
can enhance performance even when no correlations are present in the simulation
output.
Consider 20 first-come, first-served M/G/1 queues. The interarrival times
follow an exponential distribution with λ = 0.05 and the service times follow Pareto
distributions with mean service rates 23(0.1 + 0.05(i − 1)), i = 1, 2, ..., 20. Suppose
that the administrator of these queues wishes to reduce costs by closing the worst
server, i.e., the one with the largest expected waiting time. System 1 is the worst,
having the smallest service rate. However, this is unknown to the administrator.
Observe that, due to the structure of the problem, the performance of queues
i and j will exhibit greater similarity if |i − j| is smaller. Thus, even though these
queues function independently, our beliefs about their performance can be corre-
lated. Of course, we do not know the problem structure, but we can use an empirical
covariance matrix computed from a small sample of observations to initialize our
prior distribution, and use PLUCK to improve on this prior. Table 4.2 gives the
final opportunity cost. Figure 4.4(a) shows the performance of PLUCK over time
when the prior matrix parameter B0 is diagonal, while 4.4(b) shows performance
with B0 computed using small-sample empirical covariances. We see that, although
146
0 10 20 30 40 50 60 70 80 90 1000
0.5
1
1.5
2
2.5
3
3.5
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicyPTV PolicySOCBA Policy
(a) With independent belief
0 10 20 30 40 50 60 70 80 90 1000
0.5
1
1.5
2
2.5
3
3.5
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicyPTV PolicySOCBA Policy
(b) With correlated belief
Figure 4.4: Comparing averaged opportunity cost in M/G/1 queue selection problem
the queues function independently, PLUCK can leverage correlated beliefs to learn
much more quickly than the other policies.
As before, a small sample of 10 replications was used to create a prior for
the covariance matrix.We then compared PLUCK, CKG, PTV, LL, the sequen-
tial OCBA policy and the greedy policy by running 1000 macroreplications. The
Pollaczek-Khinchin formula can be applied to compute the true expected waiting
time. Figure 4.4(b) shows that PLUCK outperforms the competing policies, espe-
cially in early stages. The PTV policy and CKG policy are indistinguishable most
of the time. The greedy policy and the sequential OCBA policy work poorly in
this experiment. The performance of the CKG policy and PTV policy are behind
the PLUCK policy initially, but they eventually catch up and lag behind PLUCK
slightly. In summary, this experiment suggests that we are learning the similarities-
between alternatives, enabling us to discover the optimal solution more quickly even
when the actual simulation outputs are independent.
147
4.4.3 3-Station Jackson Network
Consider a classical 3-station open Jackson network shown in Figure 4.5(a),
where the interarrival times and service times follow exponential distributions. Let
λ be the total external arrival rate to the system, and let µj represent the service
rate at station j. Upon completing service at station i, a job leaves the network
with probability pi0 or is routed to station j with probability pij.
The goal of the administrator is to minimize the average time spent by cus-
tomers in the system subject to a constraint on the overall service rate. Suppose
that all the available agents can achieve an overall service rate of 3 for stations 2 and
3. Consider 10 different assignments where the service rate at station 2 is 1 + 0.1i,
i = 1, 2, · · · , 10. The performance of different assignments will exhibit correlation
due to similarities in the service rates. We chose λ = 0.5 and the routing probability
matrix P = [pij] as
P =








pi1 pi2 pi3 pi0
0 0.7 0.3 0
0.3 0 0 0.7
0.2 0 0 0.8








Again, a small sample of 10 replications was used to created priors for the
mean and covariance matrix. We compared PLUCK, CKG, PTV, LL, the sequential
OCBA policy and the greedy policy, where each policy is given a sampling budget
of 50. The true expected times in the system for different assignments are computed
analytically. The final opportunity costs averaged over 500 sample paths are shown
148
in Figure 4.5(b), and Table 4.2 gives the final opportunity cost. The PLUCK policy
again outperforms all the other policies.
(a) A three-station Jackson network
0 5 10 15 20 25 30 35 40 45 500.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
Number of Samples
Oppo
rtuni
ty Co
st
 
 CKG PolicyPLUCK PolicyGreedy PolicyPTV PolicyOCBA Policy
(b) Averaged opportunity cost
Figure 4.5: Numerical experiment on a three-station Jackson network
4.5 Conclusion
We have presented the first computationally tractable statistical learning model
for fully sequential ranking and selection with unknown correlation structures. The
model uses approximate Bayesian inference to represent and update our beliefs about
unknown performance means and unknown covariances using the normal-Wishart
distribution. We have also derived a value of information procedure that antici-
pates new information about both the true values and the true correlations when
allocating simulations. Previous work in this area has required known correlation
structures, an assumption that is likely to be violated in many applications. We re-
lax this assumption, but retain the ability to learn about multiple alternatives from
a single observation, for the same computational cost as the known-covariance case.
149
We believe that our work offers a useful way to tackle large learning problems with
difficult correlation structures, and opens up new applications for Bayesian optimal
learning.
150
Chapter 5: Bayesian Learning on Logistic Demand Curves
5.1 Introduction
The problem of business-to-business (B2B) pricing arises in high-volume com-
mercial transactions between businesses. For example, consider the problem faced
by a supplier of raw materials negotiating a long-term contract with a large manu-
facturing concern. After a period of negotiation, the seller quotes a price, which can
be accepted or rejected. If the pricing offer is rejected, the seller loses a substantial
amount of revenue, but it may not be clear exactly how much lower the offer should
have been. If the offer is accepted, the seller makes a profit, but is left wondering
whether a somewhat higher offer would still have been accepted. The seller’s goal is
to maximize total revenue from a sequence of contracts, in the face of uncertainty
about buyer behavior.
Dynamic pricing in general is subject to uncertainty. Classic models in revenue
management often assume stochastic demand for a product [93, 94], or uncertain
customer valuations of it [95]. Recent work, however, has considered the additional
dimension that the uncertainty may be environmental, that is, the seller does not
even know the distribution from which customer valuations are drawn. In practice,
this distribution must be estimated, and the estimate must be adjusted over time
151
as new transactions are observed. This gives rise to the problem of “learning and
earning,” in which the seller does not always prefer the decision that appears to be
optimal based on the current demand model (referred to as the “myopic” decision),
but rather may engage in more exploratory or experimental behavior. For example,
an online retailer may increase or decrease some prices for a period of time, simply
to observe the effect on sales. Although this behavior may result in lost revenue, it
provides new information that produces a more accurate demand model, enabling
better pricing decisions in the future.
The literature has used Bayesian statistics to model environmental uncer-
tainty [33, 34], and different pricing strategies have been proposed to optimize the
balance between revenue and information. For example, [35] proposes a one-step
look-ahead strategy for problems with logistic revenue curves, while [36] presents an
approach based on multi-armed bandit theory. A recent stream of work, represented
by [37], [38], [39], and [40], has focused on establishing long-run convergence rates
for policies that are mostly myopic, with occasional periods of exploration spaced in-
creasingly further apart. However, in the specific context of B2B pricing, individual
transactions typically have high volume (for example, the seller may be negotiating
the price of a year’s supply of raw materials) and incur high costs (e.g. the time and
money spent during negotiations), making it important to obtain good performance
quickly.
We consider an application where information arrives in the form of binary
win/loss observations, representing customers’ yes/no responses to the seller’s pric-
ing offers (or “bids”). A common demand model in this setting (used e.g. by [35])
152
assumes that these binary outcomes follow a logistic distribution, which also allows
us to relate the win probability to a set of regression features representing additional
information about the product or customer type. Although this is a fairly natural
choice of demand model (essentially just an instance of logistic regression), it is
quite challenging to connect to the Bayesian way of representing new information
and using it to update the seller’s beliefs. While, for linear regression, the multi-
variate normal distribution offers an intuitive and easy-to-use conjugate prior [96],
no such model is available for logistic regression, making it difficult to represent a
belief over a continuous space of logistic curves.
We approach this problem with approximate Bayesian inference, using the
technique of density projection to create a multivariate normal posterior distribution
that is “approximately conjugate,” in the sense of minimizing the Kullback-Leibler
divergence from the actual posterior. See e.g. [97] for an application of this technique
to the problem of learning unknown correlation structures in ranking and selection.
In the context of logistic regression, our approach is similar to the variational ap-
proximation by [98], but involves an additional optimization step using infinitesimal
perturbation analysis (see e.g. [11] or [13]) to further improve the quality of the
approximation. Using this statistical technique to efficiently update a multivariate
normal prior on the parameters of the logistic demand curve, we then apply a policy
that optimizes a myopic estimate of the expected revenue curve (see e.g. Ch. 11
of [5]). Our numerical experiments provide evidence in favor of both the approxi-
mate Bayesian learning model and the Bayes-greedy pricing policy. Although our
Bayesian model has numerous applications outside pricing, in this particular context
153
it enables the seller to compactly model a set of beliefs about win probabilities for a
wide range of customer and product segments, and then quickly update this belief
in real time.
5.2 Problem Formulation and Learning Model
Section 5.2.1 introduces the demand and revenue curves optimized by a seller
in the B2B pricing problem. In Section 5.2.2, we discuss the challenge of developing
a Bayesian model for learning the parameters of the demand curve. Then, Sections
5.2.3 and 5.2.4 outline our proposed approach for overcoming this challenge.
5.2.1 Problem Formulation
Consider a seller who must quote prices for a sequence of corporate clients.
The (n+ 1)st client will accept a price offer pn ≥ 0 with probability ρ, which may
also depend on additional properties of the client or product. The function ρ is called
the demand curve, and is not known exactly to the seller. However, the seller does
observe the client’s response, modeled as a binary variable Y n+1, where Y n+1 = 1
with probability ρ, representing a sale (or “win”), and Y n+1 = 0 represents a “loss.”
The seller’s expected revenue from the client is
R(pn) = pnρ, pn ≥ 0, (5.1)
where the demand curve ρ usually depends on the price pn. In most applications,
we need to consider the marginal cost c for the product, and work with the expected
154
profit
Π(pn) = (pn − c)ρ, pn ≥ c. (5.2)
We assume that Y n follows a logistic distribution, allowing us to write the demand
curve as
ρ(xn) = P(Y n+1 = 1) = 11 + e−µᵀxn , (5.3)
where xn is a vector of features, observed by the seller, providing relevant informa-
tion for the (n+ 1)st pricing decision. In the simplest possible model, the customers
are assumed to be homogeneous, xn = [1, pn]ᵀ, and the parameter vector µ consists
only of an intercept and a slope term. We use this simple model in our examples
throughout this chapter. However, our analysis is readily applicable to the general
case, where xn may also contain information about the product (type or volume)
and the client (region, industry, history with the seller).
In all of these cases, the parameter vector µ is unknown to the seller and must
be inferred using a combination of prior knowledge and incoming win/loss results.
The shape of the demand curve is extremely sensitive to the parameter values,
making it important to obtain accurate estimates of the parameters as quickly as
possible. We now propose a Bayesian framework for learning the demand curve.
5.2.2 Bayesian Model For Dynamic Pricing
We adopt the Bayesian view, and represent our uncertainty about the vector
µ using a multivariate normal prior distribution, that is,
µ ∼ N (θ,Σ). (5.4)
155
The multivariate normal distribution offers a compact and powerful way to model
correlations between our beliefs about different components of µ. Because the ob-
servation Y n+1 provides information about an entire vector xn, our beliefs about
different components of this vector should become correlated due to their depen-
dence on the same observations. A second convenience of the multivariate normal
distribution (important for computational purposes) is that the linear combination
µᵀxn follows a univariate normal distribution.
In linear regression, where a continuous response variable is related to a linear
combination of features, the multivariate normal prior possesses the property of
conjugacy. That is, if the residual errors are i.i.d. normal, the posterior distribution
of the regression parameters, conditional on a sequence of observations, will remain
normal [96]. This model makes the learning process highly efficient computationally,
as one only needs to recursively update the mean vector and covariance matrix of
the belief distribution after each observation. Unfortunately, in logistic regression,
there is no known prior distribution that is conjugate with logistic observations. To
see this, we first assume that µ ∼ N (θn,Σn), and write the likelihood function of
Y n+1 as
P (Y n+1) = g(Hn+1(µ)), (5.5)
where g(z) = (1 + e−z)−1 and Hn+1(µ) = (2Y n+1− 1)(µᵀxn). Equation (5.5) allows
us to represent the win/loss probability in a concise form. Applying Bayes’ rule, the
posterior distribution, given the bidding price pn and the observation Y n+1, can be
156
written as
p(µ|pn, Y n+1) ∝ g(Hn+1(µ))|Σn|−1/2 exp
{
−
1
2(µ− θ
n)ᵀ(Σn)−1(µ− θn)
}
, (5.6)
which is clearly non-normal.
We would like to retain the multivariate normal prior due to its power in
modeling correlated beliefs. However, we are now required to use the techniques of
approximate Bayesian inference to develop a multivariate normal posterior that is
“approximately conjugate.” Several such approaches have been proposed, includ-
ing approximation methods based on Laplace approximation [99] and variational
bounds [98]. We take a variational Bayesian approach to approximate the poste-
rior distribution by minimizing the Kullback-Leibler divergence between the true
posterior distribution and a multivariate normal distribution.
5.2.3 Variational Bayesian Approximation
Suppose that, after observing n responses, our beliefs about µ are multivari-
ate normal with parameters (θn,Σn). Let P (µ|pn, Y n,θn,Σn) be the likelihood
function of this distribution. The variational Bayesian approach approximates the
posterior distribution of µ, given Y n+1, with a normal distribution Q(µ|θn+1,Σn+1)
by minimizing the Kullback-Leibler (KL) divergence. The KL divergence between
P (µ|pn, Y n+1,θn,Σn) and Q(µ|θn+1,Σn+1) is defined as
D(Q q P ) := EQ
(
log Q(µ|θ
n+1,Σn+1)
P (µ|pn, Y n+1,θn,Σn)
)
, (5.7)
where the expectation is taken with respect to Q. This definition can be partially
simplified, as stated in the following result.
157
Proposition 5.1. The KL divergence can be written as
D(Q q P ) = EQ
[
log
(
1 + e−Hn+1(µ)
)]
+ h(θn,Σn,θn+1,Σn+1), (5.8)
with the second component specified as
h(θn,Σn,θn+1,Σn+1) = 12
[
tr
(
(Σn)−1Σn+1
)
+ (θn − θn+1)ᵀ(Σn)−1(θn − θn+1)− k − ln |Σ
n+1|
|Σn|
+ C
]
,
where C is a constant that does not depend on θn+1 and Σn+1.
To minimize the KL divergence, the first step is to take the gradient of D(Q q
P ) with respect to its parameter θn+1 and Σn+1. Unfortunately, a closed-form
expression for the gradient is not available, because the expectation in equation (5.8)
is intractable. However, if our goal is to minimize an expected value, a connection
to gradient-based stochastic search [100] comes naturally to mind. The work by
[101] uses such an approach, where a likelihood ratio estimate [102] of the gradient
is constructed. However, this approach leads to a noisy simulation optimization
problem, whose dimensionality is quadratic in the number of features, presenting
substantial computational difficulties.
Instead of optimizing with respect to (θn+1,Σn+1), we utilize a dimension
reduction technique and propose the following form for θn+1 and Σn+1:
θn+1 = Σn+1
(
(Σn)−1 θn +
(
Y n+1 − 12
)
xn
)
(5.9)
Σn+1 =
(
(Σn)−1 + λxn (xn)ᵀ
)−1 (5.10)
158
Applying the Sherman-Morrison formula to (5.9) and (5.10), we obtain
θn+1 = θn +
Y n+1−1/2
λ − (xn)ᵀθn
1
λ + (xn)ᵀΣnxn
Σnxn, (5.11)
Σn+1 = Σn − Σ
nxn(xn)ᵀΣn
1
λ + (xn)ᵀΣnxn
. (5.12)
In this form, there is only one parameter λ to be determined. We minimize the
KL divergence with respect to λ to find the optimal multivariate normal posterior
distribution from the parametrized family in (5.11)-(5.12). Aside from the compu-
tational convenience of reducing the size of the problem, we choose precisely this
form because it resembles the Kalman-filter-like equations used for Bayesian linear
regression; the parameter λ is analogous to the precision of the residuals, while
Y n+1−1/2
λ stands in for the continuous observation. In this way, our learning model
for logistic regression makes an intuitive connection to the well-understood linear
setting. Moreover, previous work on logistic regression, including [98] and [99], has
derived updating rules with very similar form, based on different approximation
techniques for the posterior likelihood function.
For additional convenience, we apply the transformation v = 1λ and find
v∗ = argmin
v
D(Q q P ). (5.13)
The parameter v is analogous to the variance of the residuals in a linear regression
model. Since no such explicit parameter is given in logistic regression, we simply
find the value that produces the most accurate approximation.
159
5.2.4 Minimizing the Kullback-Leibler Divergence
We now propose a stochastic approximation method to solve the minimization
problem in (5.13), which requires estimations of the gradient of D(Q q P ) with
respect to the single parameter v. This results in
∇vD(Q q P ) = ∇vEQ
[
log
(
1 + e−Hn+1(µ)
)]
+∇vh(θn,Σn,θn+1,Σn+1). (5.14)
Since we do not have a close-form expression for ∇vEQ
[
log
(
1 + e−Hn+1(µ)
)]
, we
propose to use infinitesimal perturbation analysis (IPA) to obtain noisy samples
of the gradient (see e.g. [100] or [13] for an introduction). First, we transform
the expectation in (5.14) into an integration with respect to a standard univariate
normal distribution, E[f¯(Z)], where Z ∼ N (0, 1) and
f¯ (z) = log
(
1 + exp
{
−(2Y n+1 − 1)
[(
xn(pn(µ))ᵀΣn+1xn(pn)
)1/2 z + (θn+1)ᵀxn(pn)
]})
.
The next result shows that the conditions for IPA [103] hold.
Proposition 5.2. ∇vE
[
f¯ (Z)
]
= E
[
∇vf¯ (Z)
]
.
The IPA estimator itself is given as
E
[
∇vf¯(Z)
]
≈
1
N
N∑
i=1
∇vf¯(Z(i)),
where Z(i) are independent samples from a standard normal distribution. We denote
the gradient estimator by ∇̂vEQ
[
log
(
1 + e−Hn+1(µ)
)]
and plug it into (5.14) for
∇vEQ
[
log
(
1 + e−Hn+1(µ)
)]
. This produces an estimator of ∇̂vD(Q q P ), and we
can apply the Robbins-Monro stochastic approximation algorithm
vn+1 = vn − an∇̂vD(Q q P ),
160
for some suitably chosen stepsize an, to find the optimal v∗ and thus the optimal
λ∗. Then we can apply the updating rules in (5.11) and (5.12) to determine the
approximate posterior distribution after collecting each observation Y n.
5.3 Dynamic Pricing Policy
We have shown a way in which the seller’s beliefs can be updated after observ-
ing customer response to a price. It remains to address how that price can be chosen
in the first place. In this section, we expand upon the notion of a “Bayes-greedy”
pricing policy introduced in Ch. 11 of [5]. Greedy and semi-greedy policies have been
widely studied in the literature on dynamic pricing under environmental uncertainty
(see e.g. [39]), and our policy may also be viewed as part of that realm. However, in
the setting of Bayesian logistic regression, the concept of “greedy” admits important
nuances.
Ideally, the seller would like to choose the price that maximizes the true revenue
curve,
p∗ = arg max
p
p
1 + e−(µᵀx(p)) , (5.15)
where we emphasize that x depends on p since p is typically one component of the
vector of features. A simple “greedy” policy will simply replace µ in (5.15) by the
current posterior mean vector θn. This is typically the approach used in frequentist
models (e.g. in [40]) where an MLE estimator is used in place of θn.
In the Bayesian setting, however, this approach will under-perform, because it
does not use all of the available information. In particular, it does not account for
161
the uncertainty in our beliefs, expressed by Σn. The covariance matrix is important
because it specifies a whole family of possible revenue curves, parametrized by µ ∼
N (θn,Σn). Thus, a Bayes-greedy policy will still myopically optimize the expected
single-period revenue, but the expectation will be over the entire space of revenue
curves. That is,
pn = argmax
p
E [R(p)] = argmax
p
E
[ p
1 + e−(µᵀxn(p))
]
, (5.16)
where the expectation is taken with respect to the (approximate) posterior joint
distribution of the parameters.
5.3.1 Computation of the Bayes-Greedy Policy
In order to use the Bayes-greedy policy, we require the ability to compute the
expectation in (5.16). The approximate Bayesian model suggests that the posterior
distribution is multivariate normal, which leads to another convenient dimension
reduction. If µ ∼ N (θn,Σn) after collecting n observations, then
µᵀx ∼ N ((θn)ᵀx,xᵀΣnx)
for arbitrary x. Therefore, let W = µᵀx, noticing that W is actually a function of
p, and rewrite (5.16) as
pn = argmax
p
E
[ p
1 + e−W
]
, (5.17)
where the expectation is now taken with respect to a univariate normal distribution
with appropriately chosen mean and variance.
162
The expectation in (5.17) is known as the logistic-normal integral [104], which
plays an important role in statistics. However, this integral is impossible to compute
analytically. It may be computed using Monte Carlo simulation, in particular using
IPA (it can be shown that the relevant conditions hold). However, [105] offers a
tractable approximation
E
[ 1
1 + e−W
]
≈
1
1 + e−
E(W )
γ
,
where
γ =
√
1 + pi8 Var(W ).
This leads to an approximate Bayes-greedy policy that can be written as
pn = argmax
p
p
1 + e−
(θn)ᵀxn(p)
γn(p)
, (5.18)
where
γn(p) =
√
1 + pi8 x
n (p)ᵀ Σnxn (p).
This approximation gives us a closed-form expression for the expected revenue func-
tion, so that making a pricing decision using (5.18) is computationally easier.
5.3.2 Analysis of the Bayes-Greedy Policy
It can be easily shown that the objective function optimized by the point-
estimate policy,
RPE(p) =
p
1 + e−(θn)ᵀxn(p) ,
is log-concave (but not concave). As a consequence, this function has a single
globally optimal price. We show that the Bayes-greedy objective function in (5.16)
163
possesses the same property, whence it follows that the idea of a “Bayes-greedy
price” is well-defined.
Theorem 5.3. The Bayes-greedy objective function
RBG(p) = E[R(p)] = E
[ p
1 + e−(µᵀxn(p))
]
is quasi-concave in p when p > 0.
An important consequence of Theorem 5.3 is that, if we apply IPA to optimize
RBG, we are guaranteed to converge to the optimal price. In general, IPA is only
guaranteed to find a local optimum. However, in this case, we can apply stochastic
approximation to solve the problem directly instead of using the approximation in
(5.18). However, we are still interested in understanding the approximate problem,
since it is easier to solve. One can observe that (5.18) resembles the point-estimate
objective, but with an additional factor γn (p) incorporating our uncertainty about
the regression coefficients. The following proposition summarizes structural proper-
ties of this factor.
Proposition 5.4. The factor γn(p) ≥ 1 is convex in p.
The variance factor can be viewed as the risk we have to take when choose
a price. In most problem instances that we have observed, the factor γn(p) is not
only a convex function, but also an increasing function within the domain of bidding
prices. This suggests that the risk is higher when we take a higher price, but the
possible reward is also higher. Moreover, the probability of success decreases when
we choose a higher price. However, since the factor γn(p) is greater than 1, the
164
Bayes-greedy policy tends to explore higher prices than the point-estimate policy,
leading to possible higher profit.
5.4 Numerical Experiments
In this section, we present numerical experiments using the approximate Bayesian
learning model with stochastic approximation proposed in Section 5.2 and the Bayes-
greedy policy proposed in Section 5.3. We compare this with several alternative
approaches, described as follows:
1. The standard frequentist logistic regression approach with the point-estimate
policy. In this approach, logistic regression is reapplied after collecting each
observation to estimate parameters in the demand function. Then, a pricing
decision is made using the point-estimate policy mentioned in Section 5.3.
2. The variational lower bound approach in [98], using the point-estimate policy
to make pricing decisions.
3. The variational lower bound approach in [98], with the proposed Bayes-greedy
policy. This is used to show the advantage of choosing the parameter λ opti-
mally using IPA.
Suppose that we are running a computer company, and one of our standard
desktop computer models has production cost c. When we set a price for selling, we
restrict the prices to be within the range [pl, pu], where the lower bound pl ≥ c. This
means that we never set a price that is lower than the cost, and it is very unlikely
165
that we will make a sale if the selling price is above pu. Our objective is to maximize
the profit function as in (5.2). In our experiment, we choose c = 300, pl = 300 and
pu = 500. We consider a finite number of possible bidding prices from 300 to 500 in
increments of 10.
For the purposes of this example, we use a two-parameter model, that is,
µ = (µ1, µ2) and x (p) = [1, p]ᵀ. We begin with a prior mean θ0 = [500,−1]ᵀ. The
corresponding demand and profit curves, based on this prior, are shown as the solid
green lines in Figure 5.1. The figure also shows three different realizations of µ
in which the maximum possible profits are “low,” “medium,” and “high.” Minor
changes in the regression parameters can significantly alter the shape of the profit
curve. From Figure 5.1(a), we see that the prior is essentially telling us that any
customer is highly likely to purchase the computer for prices between $300 and $500.
Considering the upper and lower bounds we have chosen for our price, this type of
prior can be understood as uninformative (since our belief suggests that a customer
will buy the product for almost any price in the range).
The prior covariance matrix we choose for one specific setting is given by
Σ0 =




100 0
0 0.01



 .
Instead of interpreting this as our uncertainty about the parameters in the prior
belief, it may be more meaningful to consider Σ0 as the uncertainty about possible
prices that buyers will pay. Notice that the magnitudes of the two variances are
quite different, again due to the extreme sensitivity of the profit curve to small
changes in the regression parameters, particularly the price sensitivity µ2. If the
166
variance of µ2 is too large, this essentially means that µ2 has a high probability
of being positive, which is quite unlikely to occur in practice. Furthermore, in a
practical application of the Bayes-greedy policy, we may have x (p) = [x˜, p]ᵀ, where
x˜ contains product and customer attributes unrelated to the price. For the purpose
of myopically optimizing the price, this model is equivalent to a two-parameter
model where multiple features are embedded into µ1, in which case the variance of
µ1 actually represents the variance of a sum of random variables, and should be
much larger than the variance of µ2.
We compare the performances, including pricing decisions, single-period profit
and cumulative profit, of the approach proposed in this chapter (referred to as “IPA-
Bayes”) and three alternative approaches. The results are reported for the first 20
iterations and averaged over 1000 sample paths, with the true value of µ fixed
according to the three scenarios shown in Figure 5.1. In all numerical experiments,
the Bayes-greedy policy refers to the approximate policy in equation (5.18). We
briefly discuss each of the three scenarios below.
Low-truth scenario. The parameters of the low truth setting are µ1 = 40 and
µ2 = −0.115. With pre-specified values for µ1 and µ2, the optimal bidding price
is 340. As shown in Figure 5.2, all four methods start with the same bidding
price initially and converge after 6 iterations, but the values they converge to are
different. All methods converge to prices below the optimal selling price, but the
prices from IPA-Bayes and the logistic regression method are closer to optimal, and
produce similar profits. IPA-Bayes adjusts more quickly to new information than the
167
Low truth
Med. truth
High truth
Prior
300 350 400 450 500
0.0
0.2
0.4
0.6
0.8
1.0
Prices
Pro
ba
bil
ity
of
Su
cce
ss
(a) Probability of success
Low truth
Med. truth
High truth
Prior
300 350 400 450 500
0
50
100
150
Prices
Pro
fit
(b) Profit curves
Figure 5.1: Probability of success and corresponding profit curve as a function of
the price under three different scenarios
(a) Bidding prices (b) Single-period-profit (c) Cumulative profit
Figure 5.2: Plots of bidding prices, single-period profit and cumulative profits over
time under the low-truth scenario
other three methods, without the volatile behavior observed for frequentist logistic
regression. Additionally, IPA-Bayes shows advantages in single-period profit during
the first 5 iterations, resulting in higher cumulative profit.
Medium-truth scenario. The true parameters are µ1 = 32.5 and µ2 = −0.08, and
the optimal bidding price is 380. Figure 5.3 shows that all four methods converge
to a price close to optimal. Both the single-period profit and the cumulative profit
from the IPA-Bayes method dominate those from the other three methods. Note
that, while the frequentist method explores higher prices than IPA-Bayes for some
168
(a) Bidding prices (b) Single-period-profit (c) Cumulative profit
Figure 5.3: Plots of bidding prices, single-period profit and cumulative profits over
time under the medium-truth scenario
(a) Bidding prices (b) Single-period-profit (c) Cumulative profit
Figure 5.4: Plots of bidding prices, single-period profit and cumulative profits over
time under the high-truth scenario
time, this behavior is actually too aggressive, and produces lower profits.
High-truth scenario. The parameters of the high-truth setting are µ1 = 55 and
µ2 = −0.125, with the optimal price being $420. As shown in Figure 5.4, both
IPA-Bayes and Bayes-greedy start from a lower bidding price than the other meth-
ods, due to the effect of the uncertainty factor γn (p). However, IPA-Bayes quickly
adjusts and increases the bidding price, eventually getting close to optimal, and
dominating the other methods in terms of single-period profit. After 10 iterations,
frequentist logistic regression catches up and produces similar single-period profit,
at the expense of volatile behavior and smaller profits in the early iterations.
169
Discussion. Frequentist logistic regression generally performs well after a few iter-
ations. However, in the sequential setting, we have to refit a new logistic regression
model after every observation, which becomes more time-consuming as the number
of observations increases. The Bayesian learning model, while less accurate (due to
the approximation of conjugacy), provides a quick and efficient way to update pa-
rameters, and generally produces a “smoother” sequence of prices; essentially, the
uncertainty encoded in the covariance matrix smooths the pricing decision, com-
pared to the volatile prices chosen by the frequentist method in the early iterations.
Among the methods using approximate Bayesian inference, IPA-Bayes is consis-
tently the best.
In these examples, some competing policies tend to perform very similarly to
IPA-Bayes after about 10 iterations. However, in the specific context of B2B pricing,
the early iterations are especially important because each individual contract tends
to have much higher value, and the opportunity cost of pricing suboptimally is more
severe. To give some perspective, if the value of every contract is on the order of
hundreds of thousands, or millions, of dollars, the overall planning horizon will be
shorter, and the first 5− 10 iterations will become very significant.
5.5 Conclusion
We have presented an approximate Bayesian approach for learning the pa-
rameters in a logistic regression model, with specific application to learning revenue
curves in B2B pricing problems. We use infinitesimal perturbation analysis and
170
stochastic search to improve the quality of the approximation. We also consider
a pricing policy that incorporates uncertainty about the parameters into the esti-
mated expected revenue curve, and chooses a price that optimizes this aggregated
function. The proposed model and pricing policy show encouraging results in our
empirical experiments. Future work will test the proposed approach on real-world
pricing data, where the underlying statistical model can be high-dimensional.
171
Chapter 6: Conclusion
In this thesis we have proposed two different metamodeling approaches that
incorporates direct gradient estimates for solving simulation optimization problems
with continuous variables and a knowledge-gradient method that employs variational
Bayesian technique for ranking and selection problems. We have also proposed
an approximate Bayesian statistical model and price recommendation strategy in
business-to-business (B2B) pricing context.
In Chapter 2, we analyzed regression models that explicitly incorporate di-
rect gradient estimators, and derived the corresponding parameter estimators. We
provided preliminary evidence for the potential gains from the DiGAR approach by
comparing with standard regression both theoretically via analytical calculations un-
der settings with more restrictive assumptions, and empirically via simple queueing
examples where the assumptions under which the theoretical results are established
do not hold. More generally, we investigated the idea of augmenting statistical mod-
els when direct gradient estimators are available, motivated by stochastic simulation
settings. We provided an alternative model for the local improvement step in the
sequential RSM approach used in experimental design for optimization.
In Chapter 3, we investigated the idea of incorporating gradient estimates into
172
stochastic kriging by extrapolating additional responses using the original responses
and gradient estimates. This approach is not restricted to stochastic kriging, but can
be applied to other metamodeling approaches as well. We analyzed the proposed
GESK model theoretically under simplified settings and showed that it provides
predictions with smaller MSE than stochastic kriging. We also conducted numerical
experiments and illustrated the performance of the GESK model. We presented two
different strategies, namely PMLE and IMSE, to determine extrapolation step sizes
used in GESK. Effectiveness of these two strategies were compared using numerical
examples.
In Chapter 4, we created a new Bayesian model for simultaneously learning
unknown means and unknown correlations in fully sequential R&S. We derived a
new VIP for ranking and selection with unknown correlation structures. The new
procedure intuitively generalizes VIP for R&S with known correlations, with the
additional ability to incorporate the decision-makers uncertainty about the corre-
lation structure into decision-making. We proved that the value of information is
greater when the correlation structure becomes unknown. We also argued that the
incremental information loss from a single application of approximate Bayesian in-
ference eventually vanishes. We provided numerical results to show the value added
by learning unknown correlations. In particular, we studied a version of the wind
farm placement problem using real data.
In Chapter 5, we considered the problem of optimally choosing prices to max-
imize revenue or profit from transactions with heterogeneous customers. We de-
veloped a learning model that maintains a set of beliefs about the effects of the
173
significant characteristics, but is able to adjust and improve those beliefs as new
data come in. We also developed an optimization algorithm that recommends a
price to maximize average revenue, based on the estimates provided by the predic-
tive and learning models. The uncertainty measured by the learning model is used
as a factor in the price calculation. For example, if the estimate from the data sug-
gests that we should quote a high price, but there is a large amount of uncertainty
suggesting that the estimate is unreliable, the final recommendation made by the
procedure will tend to be more conservative. We conducted simulations to check
the performance of the optimal prices. The results of the simulations suggest that
the optimal bidding algorithm has the potential to substantially increase cumulative
revenue over time.
Our work has initiated some new ideas and points to several other more general
directions for future research. The proposed DiGAR model introduced in Chapter 2
can be used in the application to simulation-based optimization, , for example,
sequential RSM, which was one of the motivations for choosing regression models for
incorporating direct stochastic gradient estimators. How much gains will be realized
in the optimization efficiency from the improved linear regression model? Although
it is reasonable to expect improvements, since the new method does obtain better
fitted values than simple linear regression, both theoretical work and numerical
experimentation are needed to characterize and quantify the improvements.
The GESK method proposed in Chapter 3 use linear extrapolation with the
same step size and assume that only one additional point is extrapolated from each
design point. More sophisticated techniques could use the local response surface
174
information and adaptively determine the extrapolation strategy. This is especially
important in higher-dimensional problems with multiple extreme values. Another
line of research is on the comparison between GESK and SKG. Improvements from
incorporating gradient estimates can be expected from both models. However, the
question is when does one model performs better than the other? To compare these
two models theoretically will provide more insights about this question and lead
to guideline for practitioners as to when to choose each of these two models for
practitioners.
175
A.1 Analytical Results for M/M/1 and U/U/1 Queues
For the M/M/1 queue, the true models for an interarrival mean of 0.2 are
given by
y(2)(x) = x+ x
2
5 + x,
y(3)(x) = x+ 5x
2
(5 + x)2 +
x3(15 + 2x)
(5 + x)3 ,
y(4)(x) = x+ 25x
2
(5 + x)3 +
25x3
(5 + x)4 +
5x3(15 + 2x)
(5 + x)4 +
x4(225 + 50x+ 3x2)
(5 + x)5 ,
y(5)(x) = x+ 125x
2
(5 + x)4 +
250x3
(5 + x)5 +
25x3(15 + 2x)
(5 + x)5 +
5x4(225 + 50x+ 3x2)
(5 + x)6
+25x
4(15 + 2x)
(5 + x)6 +
250x4
(5 + x)6 +
x5(10 + x)(350 + 65x+ 4x2)
(5 + x)7 .
176
For the U/U/1 queue, the true models are given by
y(2) = δ14 −
θ1
2 +
3θ2
2 +
1
δ1
( δ22
12 +
θ21
4 −
θ1θ2
2 +
θ22
4
)
y(3) = 5δ112 − θ1 + 2θ2 +
1
12δ1
(
2δ22 + 9θ21 − 18θ1θ2 + 9θ22
)
−
1
12δ21
(θ1 − θ2)(δ22 + 2θ21 − 4θ1θ2 + 2θ22)
y(4) = 107δ1192 −
25θ1
16 +
41θ2
16 +
1
2880δ1
(750δ22 + 4590θ21 − 9180θ1θ2 + 4590θ22)
−
1
48δ21
(θ1 − θ2)(13δ22 + 35θ21 − 70θ1θ2 + 35θ22)
+ 12880δ31
(13δ42 + 270δ22θ21 − 540δ22θ1θ2 + 270δ22θ22
+ 405θ41 − 1620θ31θ2 + 2430θ21θ22 − 1620θ1θ32 + 405θ42)
y(5) = 221δ1320 −
107θ1
48 +
155θ2
48 +
1
2880δ1
(1070δ22 + 8430θ21 − 16860θ1θ2 + 8430θ22)
−
1
48δ21
(θ1 − θ2)(29δ22 + 99θ21 − 198θ1θ2 + 99θ22)
+ 12880δ31
(49δ42 + 1230δ22θ21 − 2460δ22θ1θ2 + 1230δ22θ22
+ 2325θ41 − 9300θ31θ2 + 13950θ21θ22 − 9300θ1θ32 + 2325θ42)
−
1
720δ41
(θ1 − θ2)(9δ42 + 80δ22θ21 − 160δ22θ1θ2 + 80δ22θ22 + 96θ41
− 384θ31θ2 + 576θ21θ22 − 384θ1θ32 + 96θ42)
A.2 Gradient Estimation for G/G/1 Queue
Let Ak be the interarrival time between the (k − 1)st and kth customer (by
convention, taking A1 to be the time of the 1st arrival), and let Xk be the service
time of the kth customer. The system time of the kth customer, denoted by Tk,
177
satisfies the well-known Lindley equation:
Tk+1 = Xk+1 + (Tk − Ak+1)+, (1)
where a+ = max(a, 0). The infinitesimal perturbation analysis (IPA) estimator is
then obtained by simple differentiation, which for a general parameter θ is given by
( [46]):
dTk+1
dθ =
dXk+1
dθ +
(dTk
dθ −
dAk+1
dθ
)
1{Tk ≥ Ak+1}, k > 1, with
dT1
dθ =
dX1
dθ . (2)
For x a parameter of the (common) customer service time distribution, the
unbiased IPA estimator is
dTk+1
dx =
dXk+1
dx +
dTk
dx 1{Tk ≥ Ak+1},
where dX/dx can be calculated based on the distribution for the random variable
X. For example, if X is exponentially distributed (with mean x), then dX/dx is
simply given by X/x, and (A.2) becomes
dTk+1
dx =
Xk+1
x +
dTk
dx 1{Tk ≥ Ak+1}, k > 1, with
dT1
dx =
X1
x ,
the latter assuming that the system starts empty. This is what is used for the
M/M/1 queue example.
Similarly, for the U/U/1 example, where the interarrival time and service time
distributions are U(θ1−δ1, θ1 +δ1) U(θ2−δ2, θ2 +δ2), respectively, the four unbiased
178
IPA estimators are
∂Tk+1
∂θ1
=
(∂Tk
∂θ1
− 1
)
1{Tk ≥ Ak+1}, k > 1, with
∂T1
∂θ1
= 0,
∂Tk+1
∂θ2
= 1 + ∂Tk∂θ2
1{Tk ≥ Ak+1}, k > 1, with
∂T1
∂θ2
= 1,
∂Tk+1
∂δ1
=
(∂Tk
∂δ1
−
Ak+1 − θ1
δ1
)
1{Tk ≥ Ak+1}, k > 1, with
∂T1
∂δ1
= 0,
∂Tk+1
∂δ2
= Xk+1 − θ2δ2
+ ∂Tk∂δ2
1{Tk ≥ Ak+1}, k > 1, with
∂T1
∂δ2
= X1 − θ2δ2
,
179
Bibliography
[1] M. C. Fu. Optimization via simulation: A review. Annals of Operations
Research, 53:199–248, 1994.
[2] M. C. Fu. Optimization for simulation: Theory vs. practice (Feature Article).
INFORMS Journal on Computing, 14(3):192–215, 2002.
[3] M. C. Fu. Simulation optimization: Evolution or revolution? INFORMS
Journal on Computing, 14(3):226–227, 2002.
[4] Handbook of Simulation Optimization. Springer, 2014.
[5] W. B. Powell and I. O. Ryzhov. Optimal Learning. John Wiley and Sons,
2012.
[6] Y. C. Ho and X. R. Cao. Perturbation Analysis and Discrete Event Dynamic
Systems. Kluwer Academic, 1991.
[7] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Aca-
demic Publishers, Boston, Massachusetts, 1991.
[8] P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer,
New York, 2004.
[9] R. Y. Rubinstein. Monte Carlo Optimization, Simulation and Sensitivity of
Queueing Networks. John Wiley & Sons, 1986.
[10] R. Y. Rubinstein and A. Shapiro. Discrete Event Systems: Sensitivity Analysis
and Stochastic Optimization by the Score Function Method. John Wiley &
Sons, 1993.
[11] M. C. Fu. Stochastic gradient estimation. In S. G. Henderson and B. L.
Nelson, editors, Handbooks of Operations Research and Management Science,
vol. 13: Simulation, pages 575–616. North-Holland Publishing, Amsterdam,
2006.
180
[12] S. Asmussen and P. Glynn. Stochastic Simulation: Algorithms and Analysis.
Springer, New York, 2007.
[13] M. C. Fu. What you should know about simulation and derivatives. Naval
Research Logistics, 55(8):723–736, 2008.
[14] R. R. Barton and M. Meckesheimer. Metamodel-based simulation optimiza-
tion. In S. G. Henderson and B. L. Nelson, editors, Handbooks in Operations
Research and Management Science: Simulation, chapter 18, pages 535–574.
Elsevier, 2006.
[15] R. R. Barton. Simulation optimization using metamodels. In M. D. Rossetti,
R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, editors, Proceedings
of the 2009 Winter Simulation Conference, pages 230–238, Piscataway, New
Jersey, December 2009. Institute of Electrical and Electronics Engineers, Inc.
[16] J.P.C. Kleijnen. Design and Analysis of Simulation Experiments. New York:
Springer, 2008.
[17] Feng Yang, Bruce Ankenman, and Barry L Nelson. Efficient generation of
cycle time-throughput curves through simulation and metamodeling. Naval
Research Logistics, 54(1):78–93, 2007.
[18] N.A.C. Cressie. Statistics for spatial data. Wiley series in probability and
mathematical statistics: Applied probability and statistics. John Wiley and
Sons, New York, 2, revised edition, 1993.
[19] Michael L. Stein. Interpolation of spatial data: some theory for kriging.
Springer series in statistics. Springer-Verlag, New York, 1999.
[20] Jack P. C. Kleijnen, Wim C. M. van Beers, and Inneke van Nieuwenhuyse.
Constrained optimization in expensive simulation: Novel approach. European
Journal of Operational Research, 202(1):164–174, April 2010.
[21] B. E. Ankenman, B. L. Nelson, and J. Staum. Stochastic Kriging for Simula-
tion Metamodeling. Operations research, 58(2):371–382, March 2010.
[22] W. Liu. Development of gradient-enhanced kriging approximations for multi-
disciplinary design optimization. PhD thesis, University of Notre Dame, 2003.
[23] T. J. Santner, B. Williams, and W. Notz. The Design and Analysis of Com-
puter Experiments. Springer-Verlag, 2003.
[24] Y. C. Ho, L. Shi, L. Dai, and W. Gong. Optimizing discrete event dynamic
systems via the gradient surface method. Discrete Event Dynamic Systems:
Theory and Applications, 2:99–120, Jan 1992.
[25] X. Chen, B. E. Ankenman, and B. L. Nelson. Enhancing stochastic kriging
metamodels with gradient estimators. Operations Research, 61(2):512–528,
2013.
181
[26] J. J. Alonso and H. S. Chung. Using gradients to construct cokriging approx-
imation models for high-dimensional design optimization problems. In 40th
AIAA Aerospace Sciences Meeting and Exhibit, AIAA, pages 2002–0317, 2002.
[27] S. S. Gupta and K. J. Miescke. Bayesian look ahead one-stage sampling al-
locations for selection of the best population. Journal of Statistical Planning
and Inference, 54(2):229–244, 1996.
[28] P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge gradient policy for
sequential information collection. SIAM Journal on Control and Optimization,
47(5):2410–2439, 2008.
[29] S. E. Chick and K. Inoue. New procedures to select the best simulated system
using common random numbers. Management Science, 47:1133–1149, 2001.
[30] S. E. Chick, J. Branke, and C. Schmidt. Sequential sampling to myopically
maximize the expected value of information. INFORMS Journal on Comput-
ing, 22(1):71–80, 2010.
[31] H. S. Chang. Converging marriage in honey-bees optimization and applica-
tion to stochastic dynamic programming. Journal of Global Optimization,
35(3):423–441, 2006.
[32] J. Branke, S. E. Chick, and C. Schmidt. Selecting a selection procedure.
Management Science, 53(12):1916–1932, 2007.
[33] E. Cope. Bayesian strategies for dynamic pricing in e-commerce. Naval Re-
search Logistics, 54(3):265–281, 2007.
[34] V. F. Farias and Ben. Van Roy. Dynamic pricing with a prior on market
response. Operations Research, 58(1):16–29, 2010.
[35] A. X. Carvalho and M. L. Puterman. Learning and pricing in an internet
environment with binomial demands. Journal of Revenue and Pricing Man-
agement, 3(4):320–336, 2005.
[36] M. Chhabra and S. Das. Learning the Demand Curve in Posted-Price Digi-
tal Goods Auctions. In Proceedings of the 10th International Conference on
Autonomous Agents and Multi-Agent Systems, pages 63–70, 2011.
[37] Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand
function: Risk bounds and near-optimal algorithms. Operations Research,
57(6):1407–1420, 2009.
[38] A. V. den Boer and B. Zwart. Dynamic pricing and learning with finite
inventories. Submitted for publication, 2011.
[39] J. M. Harrison, N. B. Keskin, and A. Zeevi. Bayesian dynamic pricing policies:
Learning and earning under a binary prior distribution. Management Science,
58(3):570–586, 2012.
182
[40] J. Broder and P. Rusmevichientong. Dynamic pricing under a general para-
metric choice model. Operations Research, 60(4):965–980, 2012.
[41] George E. P. Box and Norman R. Draper. Response surfaces, mixtures, and
ridge analyses. Wiley & Sons, New York, Jan 2007.
[42] M. C. Fu. Gradient estimation. In S. G. Henderson and B. L. Nelson, edi-
tors, Handbooks in Operations Research and Management Science: Simulation,
chapter 19, pages 575–616. Elsevier, 2006.
[43] S. Weisberg. Applied Linear Regression, 3rd Edition. Wiley & Sons, New
York, 2005.
[44] Alvin C. Rencher and G. Bruce. Schaalje. Linear Models in Statistics. Wiley
& Sons, New York, 2007.
[45] R. J. Beckman and R. D. Cook. Outliers. Technometrics, 25(2):119–163, 1983.
[46] R. Suri and M. A. Zazanis. Perturbation analysis gives strongly consistent
sensitivity estimates for the M/G/1 queue. Management Science, 34:39–64,
1988.
[47] M. C. Fu. Convergence of a stochastic approximation algorithm for the
GI/G/1 queue using infinitesimal perturbation analysis. Journal of Opti-
mization Theory and Applications, 65(1):149–160, 1990.
[48] P. L’Ecuyer, N. Giroux, and P. Glynn. Stochastic optimization by simulation:
Numerical experiments with the M/M/1 queue in steady-state. Management
Science, 40(10):1245–1261, 1994.
[49] Anatoly Zhigljavsky, Holger Dette, and Andrey Pepelyshev. A new approach
to optimal design for linear models with correlated observations. Journal of
the American Statistical Association, 105(491):1093–1103, 2010.
[50] J. P. C. Kleijnen and W. C. M. van Beers. Robustness of kriging when interpo-
lating in random simulation with heterogeneous variances: some experiments.
European Journal of Operational Research, 165(3):826 – 834, 2005.
[51] W. Xie, B. L. Nelson, and J. Staum. The influence of correlation functions
on stochastic kriging metamodels. In Proceedings of the 2010 Winter Simu-
lation Conference, pages 1067–1078, Piscataway, New Jersey, December 2010.
Institute of Electrical and Electronics Engineers, Inc, Winter Simulation Con-
ference.
[52] X. Chen, B. E. Ankenman, and B. L. Nelson. The effects of common random
numbers on stochastic kriging metamodels. ACM Transactions on Modeling
and Computer Simulation, 22(2):7:1–7:20, March 2012.
183
[53] F. Zhang and Q. Zhang. Eigenvalue inequalities for matrix product. IEEE
Transactions on Automatic Control, 51(9):1506 –1509, Sept. 2006.
[54] T. Chu, J. Zhu, and H. Wang. Penalized maximum likelihood estimation and
variable selection in geostatistics. The Annals of Statistics, 39(5):2607–2625,
2011.
[55] R. Li and A. Sudjianto. Analysis of computer experiments using penalized
likelihood in gaussian kriging models. Technometrics, 47(2):111–121, May
2005.
[56] P. L’Ecuyer. A unified view of the IPA, SF, and LR gradient estimation
techniques. Management Science, 36:1364–1383, 1990.
[57] W. C. M. van Beers and J. P. C. Kleijnen. Customized sequential designs for
random simulation experiments: Kriging metamodeling and bootstrapping.
European Journal of Operational Research, 186(3):1099 – 1113, 2008.
[58] J. Staum. Better simulation metamodeling: The why, what, and how of
stochastic kriging. In M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin,
and R. G. Ingalls, editors, Proceedings of the 2009 Winter Simulation Con-
ference, pages 119–133, Piscataway, New Jersey, December 2009. Institute of
Electrical and Electronics Engineers, Inc, Winter Simulation Conference.
[59] G. Marmidis, S. Lazarou, and E. Pyrgioti. Optimal placement of wind turbines
in a wind park using Monte Carlo simulation. Renewable Energy, 33(7):1455
– 1460, 2008.
[60] P. Francis, K. Smilowitz, and M. Tzur. The period vehicle routing problem
with service choice. Transportation Science, 40(4):439–454, 2006.
[61] A. Arlotto, N. Gans, and S. Chick. Optimal employee retention when infer-
ring unknown learning curves. In B. Johansson, S. Jain, J. Montoya-Torres,
J. Hugan, and E. Yu¨cesan, editors, Proceedings of the 2010 Winter Simulation
Conference, pages 1178–1188, 2010.
[62] R. E. Bechhofer. A single-sample multiple decision procedure for ranking
means of normal populations with known variances. The Annals of Mathe-
matical Statistics, 25(1):pp. 16–39, 1954.
[63] R. E. Bechhofer, T. J. Santner, and D. M. Goldsman. Design and Analysis of
Experiments for Statistical Selection, Screening, and Multiple Comparisons.
Wiley, 1995.
[64] S.-H. Kim and B. L. Nelson. A fully sequential procedure for indifference-
zone selection in simulation. ACM Transactions on Modeling and Computer
Simulation, 11:251–273, 2001.
184
[65] S.-H. Kim and B. L. Nelson. On the asymptotic validity of fully sequen-
tial selection procedures for steady-state simulation. Operations Research,
54(3):475–488, 2006.
[66] T. Homem-de-Mello. A study on the cross-entropy method for rare-event
probability estimation. INFORMS Journal on Computing, 2006. accepted for
publication.
[67] S.-H. Kim and B. L. Nelson. Selecting the best system. In S.G. Henderson
and B.L. Nelson, editors, Handbooks of Operations Research and Management
Science, vol. 13: Simulation, pages 501–534. North-Holland Publishing, Am-
sterdam, 2006.
[68] S.-H. Kim and B. L. Nelson. Recent advances in ranking and selection. In
S. G. Henderson, B. Biller, M.-H. Hsieh, J. Shortle, J. D. Tew, and R. R.
Barton, editors, Proceedings of the 2007 Winter Simulation Conference, pages
162–172, 2007.
[69] L. J. Hong and B. L. Nelson. A brief introduction to optimization via simula-
tion. In M.D. Rosetti, R.R. Hill, B. Johansson, A. Dunkin, and R.G. Ingalls,
editors, Proceedings of the 2009 Winter Simulation Conference, pages 75–85,
2009.
[70] S.-H Kim. Statistical ranking and selection. In S. I. Gass and M. C. Fu, edi-
tors, Encyclopedia of Operations Research and Management Science. Springer,
2013. To appear.
[71] C. H. Chen, D. He, M. C. Fu, and L. H. Lee. Efficient simulation budget
allocation for selecting an optimal subset. INFORMS Journal on Computing,
2008. accepted for publication.
[72] D. He, L. H. Lee, C. H. Chen, M. C. Fu, and S. Wasserkrug. Simulation
optimization using the cross-entropy method with optimal computing bud-
get allocation. ACM Transactions on Modeling and Computer Simulation,
20(1):4:1–4:22, February 2010.
[73] C.-H. Chen and L. H. Lee. Stochastic Simulation Optimization: An Optimal
Computing Budget Allocation. System Engineering and Operations Research.
World Scientific, 2011.
[74] W. N. Yang and B. L. Nelson. Using common random numbers and control
variates in multiple-comparison procedures. Operations Research, 39:583–591,
1991.
[75] B. L. Nelson and F. J. Matejcik. Using common random numbers for
indifference-zone selection and multiple comparisons in simulation. Manage-
ment Science, 41:1935–1945, 1995.
185
[76] M. C. Fu, J. Q. Hu, C. H. Chen, and X. Xiong. Simulation allocation for
determining the best design in the presence of correlated sampling. INFORMS
Journal on Computing, 19(1):101–111, 2007.
[77] P. I. Frazier, W. B. Powell, and S. Dayanik. The knowledge-gradient policy for
correlated normal rewards. INFORMS Journal on Computing, 21(4):599–613,
2009.
[78] P I. Frazier, J. Xie, and S. E. Chick. Value of information methods for pair-
wise sampling with correlations. In S. Jain, R. R. Creasey, J. Himmelspach,
K. P. White, and M. Fu, editors, Proceedings of the 2011 Winter Simulation
Conference, pages 3974–3986, 2011.
[79] S. E. Chick and K. Inoue. New two-stage and sequential procedures for se-
lecting the best simulated system. Operations Research, 49:732–743, 2001.
[80] R. C. H. Cheng and C. S. M. Currie. Optimization by simulation metamod-
elling methods. In R.G. Ingalls, M.D. Rossetti, J. S. Smith, and B. A. Peters,
editors, Proceedings of the 2004 Winter Simulation Conference, pages 485–490,
2004.
[81] B. Biller and C. G. Corlu. Accounting for parameter uncertainty in large-scale
stochastic simulations with correlated inputs. Operations Research, 59(3):661–
673, 2011.
[82] J. Kadane and R. Trader. A bayesian treatment of multivariate normal data
with observations missing at random. Statistical Decision Theory and Related
Topics, 4(1):225–234, 1988.
[83] F. Dominici, G. Parmigiani, and M. Clyde. Conjugate analysis of multivariate
normal data with incomplete observations. Canadian Journal of Statistics,
28(3):533–550, 2000.
[84] K. Triantafyllopoulos. Missing observation analysis for matrix-variate time
series data. Statistics and Probability Letters, 78(16):2647–2653, 2008.
[85] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman &
Hall, 2000.
[86] M. H. DeGroot. Optimal Statistical Decisions. Wiley-Interscience, New York,
wcl edition, 2004.
[87] S. Kotz and S. Nadarajah. Multivariate t Distributions and Their Applications.
Cambridge University Press, New York, 2004.
[88] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press,
1990.
186
[89] A. Erde´lyi and F. G. Tricomi. The asymptotic expansion of a ratio of gamma
functions. Pacific Journal of Mathematics, 1(1):133–142, 1951.
[90] G. D. Anderson and S.-L. Qiu. A monotoneity property of the gamma function.
Proceedings of The American Mathematical Society, 125(11), 1997.
[91] D. He, S. E. Chick, and C. H. Chen. The opportunity cost and OCBA selection
procedures in ordinal optimization. IEEE Transactions on Systems, Man, and
Cybernetics–Part C, 37:951–961, 2007.
[92] B. A. Cosgrove, D. Lohmann, K. E. Mitchell, P. R. Houser, E. F. Wood,
J. C. Schaake, A. Robock, C. Marshall, J. Sheffield, Q. Duan, L. Luo, R. W.
Higgins, R. T. Pinker, J. D. Tarpley, and J. Meng. Real-time and retrospective
forcing in the north american land data assimilation system (NLDAS) project.
Journal of Geophysical Research, 108(D22):8842–8853, 2003.
[93] G. Gallego and G. Van Ryzin. Optimal dynamic pricing of inventories with
stochastic demand over finite horizons. Management Science, 40(8):999–1020,
1994.
[94] Y. Feng and G. Gallego. Optimal starting times for end-of-season sales and op-
timal stopping times for promotional fares. Management Science, 41(8):1371–
1391, 1995.
[95] E. P. Lazear. Retail pricing and clearance sales. The American Economic
Review, 76(1):14–32, 1986.
[96] T. P. Minka. Bayesian linear regression. Technical report, Microsoft Research,
2000.
[97] H. Qu, I. O. Ryzhov, and M. C. Fu. Ranking and selection with unknown cor-
relation structures. In Proceedings of the 2012 Winter Simulation Conference,
2012. to appear.
[98] T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational
methods. Statistics and Computing, 10(1):25–37, 2000.
[99] David J. Spiegelhalter and Steffen L. Lauritzen. Sequential updating of condi-
tional probabilities on directed graphical structures. Networks, 20(5):579–605,
1990.
[100] S. Kim. Gradient-based simulation optimization. In L. F. Perrone, F. P.
Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, editors,
Proceedings of the 2006 Winter Simulation Conference, pages 159–167, 2006.
[101] D. M. Blei, M. I. Jordan, and J. W. Paisley. Variational bayesian inference
with stochastic search. In Proceedings of the 29th International Conference on
Machine Learning, pages 1367–1374, 2012.
187
[102] J. C. Spall. Introduction to stochastic search and optimization: estimation,
simulation, and control. Wiley-Interscience, 2005.
[103] P. L’Ecuyer. On the interchange of derivative and expectation for likelihood
ratio derivative estimators. Management Science, 41(4):738–747, 1995.
[104] L. Devroye. Random variate generation. In S. G. Henderson and B. L. Nelson,
editors, Handbooks in Operations Research and Management Science: Simu-
lation, chapter 4. Elsevier, 2005.
[105] G. E. Crooks. Logistic approximation to the logistic-normal integral. Technical
report, Lawrence Berkeley National Laboratory, 2009.
188