ABSTRACT Title of dissertation: SIMULATION OPTIMIZATION: NEW APPROACHES AND AN APPLICATION Huashuai Qu, Doctor of Philosophy, 2014 Dissertation directed by: Professor Michael C. Fu Department of Decision, Operations, and Information Technologies Simulation models are commonly used to provide analysis and prediction of the behavior of complex stochastic systems. Simulation optimization integrates op- timization techniques into simulation analysis to capture response surface, to choose optimal decision variables and to perform sensitivity analysis. Objective functions usually cannot be computed in closed form and are computationally expensive to evaluate. Many methods are proposed by researchers for problems with continuous and discrete variables, respectively. The dissertation is comprised of both optimiza- tion methods and a real-world application. In particular, our goal is to develop new methods based on direct gradient estimates and variational Bayesian techniques. The first part of the thesis considers the setting where additional direct gra- dient information is available and introduces different approaches for enhancing regression models and stochastic kriging with this additional gradient information, respectively. For regression models, we propose Direct Gradient Augmented Regres- sion (DiGAR) models to incorporate direct gradient estimators. We characterize the variance of the estimated parameters in DiGAR and compare them analyti- cally with the standard regression model for some special settings. For stochastic kriging, we propose Gradient Extrapolated Stochastic Kriging (GESK) to incorpo- rate direct gradient estimates by extrapolating additional responses. We show that GESK reduces mean squared error (MSE) compared to stochastic kriging under certain conditions on step sizes. We also propose maximizing penalized likelihood and minimizing integrated mean squared error to determine the step sizes. The second part of the thesis focuses on the problem of learning unknown cor- relation structures in ranking and selection (R&S) problems. We proposes a com- putationally tractable Bayesian statistical model for learning unknown correlation structures in fully sequential simulation selection. We derive a Bayesian procedure that allocates simulations based on the value of information, thus anticipating fu- ture changes to our beliefs about the correlations. The proposed approach is able to simultaneously learn unknown mean performance values and unknown correla- tions, whereas existing approaches in the literature assume independence or known correlations to learn unknown mean performance values only. Finally we consider an application in business-to-business (B2B) pricing. We propose an approximate Bayesian statistical model for predicting the win/loss prob- ability for a given price and an approach for recommending target prices based on the approximate Bayesian model. SIMULATION OPTIMIZATION: NEW METHODS AND AN APPLICATION by Huashuai Qu Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2014 Advisory Committee: Professor Michael C. Fu, Chair/Advisor Professor Ilya O. Ryzhov Professor Paul Smith Professor Leonid Koralov Professor Steve Marcus c© Copyright by Huashuai Qu 2014 Dedication I dedicate this dissertation to my loving and supportive wife, Xuan Liu. ii Acknowledgments First, I would like to express my most sincere gratitude to my advisors Pro- fessor Michael C. Fu and Professor Ilya O. Ryzhov, for their vision, advice and patience to help me proceed through my graduate studies. The completion of this dissertation would not have been possible without their consistent guidance and support. Special thanks to my other dissertation committee members, Professor Paul Smith, Professor Steve Marcus and Professor Leonid Koralov, for taking the time to read the thesis and attend my defense. I also would like to thank Professor Grace Yang for teaching me the course on stochastic process and helping in my job searching process. I would like to thank everyone within the department of Mathematics in gen- eral. I would particularly like to thank: Marie Chau, Zhixin Lu, Ran Ji, Changhui Tan and Zi Ding for their friendship and feedback. My family has provided their love and untiring support during the process. Most importantly, doing the research and writing the thesis would be impossible without the support and understanding form my wife, Xuan. I thank my parents, Wenbin Qu and Ping Zhao for their encouragement and their belief in the value of learning. Nothing in a simple paragraph can express the love I have for all of you. iii Table of Contents List of Tables vii List of Figures ix 1 Introduction 1 1.1 Simulation Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Direct Gradient Enhanced Metamodels . . . . . . . . . . . . . . . . . 3 1.3 Variational Bayesian Inference in Simulation Optimization . . . . . . 6 1.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Direct Gradient Augmented Regression 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Standard Linear Regression Model . . . . . . . . . . . . . . . 14 2.2.2 Direct Gradient Augmented Regression . . . . . . . . . . . . . 15 2.2.3 Choice of Weights in α-DiGAR . . . . . . . . . . . . . . . . . 18 2.2.4 Theoretical Comparisons Between Estimators for Special Cases 19 2.2.5 Multi-dimensional Linear Models . . . . . . . . . . . . . . . . 25 2.2.6 DiGAR Model for Gradient Estimates Correlated with Per- formance Outputs . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3.1 Example: M/M/1 Queue . . . . . . . . . . . . . . . . . . . . 33 2.3.2 Regression using a Quadratic Function . . . . . . . . . . . . . 40 2.3.3 Multi-dimensional Example: U/U/1 Queue . . . . . . . . . . . 45 2.3.4 Multi-dimensional Example: Sphere Function . . . . . . . . . 47 2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3 Gradient Extrapolated Stochastic Kriging 56 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.1.1 Stochastic Kriging . . . . . . . . . . . . . . . . . . . . . . . . 58 3.1.2 Stochastic Kriging With Gradient Estimators . . . . . . . . . 60 3.2 Gradient Extrapolated Stochastic Kriging . . . . . . . . . . . . . . . 62 iv 3.2.1 A Two-Point Problem with Single Extrapolated Point . . . . . 68 3.2.2 A k -Point Problem . . . . . . . . . . . . . . . . . . . . . . . . 71 3.3 Implementations of GESK . . . . . . . . . . . . . . . . . . . . . . . . 79 3.3.1 Penalized Maximum Likelihood Estimation . . . . . . . . . . . 80 3.3.2 Minimizing Integrated MSE . . . . . . . . . . . . . . . . . . . 81 3.3.3 Choosing Regularization Parameter . . . . . . . . . . . . . . . 82 3.3.4 Choosing Gradient Estimators . . . . . . . . . . . . . . . . . . 84 3.4 NUMERICAL EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . 85 3.4.1 Experiment on Step Sizes in GESK . . . . . . . . . . . . . . . 86 3.4.2 Comparisons among SK, SKG and GESK . . . . . . . . . . . 91 3.4.2.1 A stochastic simulation example . . . . . . . . . . . 91 3.4.2.2 A stylized example with added noise . . . . . . . . . 93 3.4.2.3 A multidimensional example . . . . . . . . . . . . . . 96 3.5 CONCLUSIONS AND FUTURE RESEARCH . . . . . . . . . . . . . 97 4 Simulation Selection with Unknown Correlation Structure 104 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.2 Learning Unknown Correlation Structures . . . . . . . . . . . . . . . 108 4.2.1 Learning from complete observations . . . . . . . . . . . . . . 109 4.2.2 Learning model for scalar observations . . . . . . . . . . . . . 112 4.2.3 Predictive distribution of the next observation . . . . . . . . . 122 4.2.4 Information loss due to approximate Bayesian inference . . . . 125 4.3 The Value Of Information . . . . . . . . . . . . . . . . . . . . . . . . 133 4.3.1 Computation of the Value of Information . . . . . . . . . . . . 135 4.3.2 Monotonicity of the Value of Information . . . . . . . . . . . . 136 4.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 138 4.4.1 Wind Farm Placement Example . . . . . . . . . . . . . . . . . 139 4.4.2 A Single-Server Queue Selection Problem . . . . . . . . . . . . 146 4.4.3 3-Station Jackson Network . . . . . . . . . . . . . . . . . . . . 148 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5 Bayesian Learning on Logistic Demand Curves 151 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.2 Problem Formulation and Learning Model . . . . . . . . . . . . . . . 154 5.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 154 5.2.2 Bayesian Model For Dynamic Pricing . . . . . . . . . . . . . . 155 5.2.3 Variational Bayesian Approximation . . . . . . . . . . . . . . 157 5.2.4 Minimizing the Kullback-Leibler Divergence . . . . . . . . . . 160 5.3 Dynamic Pricing Policy . . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.3.1 Computation of the Bayes-Greedy Policy . . . . . . . . . . . . 162 5.3.2 Analysis of the Bayes-Greedy Policy . . . . . . . . . . . . . . 163 5.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 v 6 Conclusion 172 A.1 Analytical Results for M/M/1 and U/U/1 Queues . . . . . . . . . . . 176 A.2 Gradient Estimation for G/G/1 Queue . . . . . . . . . . . . . . . . . 177 Bibliography 180 vi List of Tables 2.1 Parameter estimates and performance metrics for M/M/1 queue (boxed entries indicate incorrect sign of slope) . . . . . . . . . . . . . 37 2.2 M/M/1 queue quadratic function optimal value x∗ obtained as a function of # replications (c ≈ 27), where boxed entries indicate a maximum rather than a minimum . . 44 2.3 Parameter estimates and performance metrics for U/U/1 queue (10 replications per design point), based on 10 macroreplications (boxed entries indicate incorrect sign of slope) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.4 Parameter estimates and performance metrics for U/U/1 queue y(2) w.r.t. # replications/design point . . . . . . . . . . . . . . . . . . . . 50 2.5 Parameter estimates and performance metrics for the sphere function for two-level centered full factorial design around (1,−0.6, 0.8,−0.5); 10 replication per design point, gridsize 0.5, 1000 macroreplications. . 54 2.6 Parameter estimates and performance metrics for the sphere function for two-level centered full factorial design around (1,−0.6, 0.8,−0.5); 10 replication per design point, gridsize 0.05, 1000 macroreplications. 55 3.1 Averaged EIMSE from 100 macroreplications for GESK models under six designs (# of design points, # of reps) to predict expected waiting time in the M/M/1 queue example. Three fixed step sizes with those determined by PMLE and IMSE are compared. Standard errors are shown in parenthesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.2 Averaged EIMSE from 100 macroreplications for SK, SKG and GESK with six different designs on estimating the expected steady-state waiting time in an M/M/1 queue problem. The design (6, 50) means 6 design points with 50 replications at each design point. Standard errors are shown in parentheses. . . . . . . . . . . . . . . . . . . . . . 92 3.3 Averaged EIMSE from 100 macroreplications for SK, SKG and GESK with five different designs on y(x) = exp(−1.4x) cos(7pix/2)+. Stan- dard errors are shown in parentheses. . . . . . . . . . . . . . . . . . . 94 4.1 Final opportunity cost and standard errors for the experiments . . . . 141 vii 4.2 Final opportunity cost and standard errors for the queue selection and network selection problems . . . . . . . . . . . . . . . . . . . . . 145 viii List of Figures 2.1 M/M/1 queue: true model, simulation data and several fitted mod- els, where each data point is the sample mean of 10 independent replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2 M/M/1 queue: using correlation matrix estimated based on 100,000 “offline” simulation replications. . . . . . . . . . . . . . . . . . . . . . 40 2.3 M/M/1 queue quadratic fit (c ≈ 27, 10 replications). . . . . . . . . . 42 2.4 U/U/1 queue box plots of four estimated slopes for y(k) based on 10 macroreplications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1 Boxplots of EIMSE from 100 macroreplications for the GESK models under six designs (# of design points, # of reps) to predict expected steady-state waiting time in the M/M/1 queue example. . . . . . . . 88 3.2 Boxplots for step sizes determined by PMLE and IMSE based on 100 macroreplications under six designs (# of design points, # of reps) to predict the expected steady-state waiting time in the M/M/1 queue example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.3 Boxplots of EIMSE from 100 macroreplications for SK, SKG and GESK with six different designs on estimating the expected steady- state waiting time in an M/M/1 queue problem, corresponding to results in Table 3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 3.4 Boxplots of EIMSE from 100 macroreplications for SK, SKG and GESK with five different designs on y(x) = exp(−1.4x) cos(7pix/2) + , corresponding to results in Table 3.3. . . . . . . . . . . . . . . . . . 101 3.5 Boxplots for step sizes determined by PMLE and IMSE based on100 macroreplication in the stylized example with added noise. . . . . . . 102 3.6 Boxplots of EIMSE from 100 macroreplications for SK and GESK with two different Latin-hypercube designs on a four-dimensional function with added noise. . . . . . . . . . . . . . . . . . . . . . . . . 103 4.1 Averaged opportunity cost as the number of samples increases, where the dashed lines are the mean plus or minus two standard errors . . . 142 4.2 Value of information and maximum error as the number of samples increases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 ix 4.3 Contour map of different policies after 200 measurements . . . . . . . 144 4.4 Comparing averaged opportunity cost in M/G/1 queue selection prob- lem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 4.5 Numerical experiment on a three-station Jackson network . . . . . . 149 5.1 Probability of success and corresponding profit curve as a function of the price under three different scenarios . . . . . . . . . . . . . . . . . 168 5.2 Plots of bidding prices, single-period profit and cumulative profits over time under the low-truth scenario . . . . . . . . . . . . . . . . . 168 5.3 Plots of bidding prices, single-period profit and cumulative profits over time under the medium-truth scenario . . . . . . . . . . . . . . . 169 5.4 Plots of bidding prices, single-period profit and cumulative profits over time under the high-truth scenario . . . . . . . . . . . . . . . . . 169 x Chapter 1: Introduction 1.1 Simulation Optimization We consider optimization problems where the objective is to minimize an ex- pected value that cannot be computed in closed form. Instead, the expectation must be estimated via simulation. Therefore, deterministic optimization algorithms are not applicable and simulation optimization algorithms are needed. The general formulation of the simulation optimization problem is as follows: min θ∈Θ J(θ) = min θ∈Θ E [L(θ, ω)] , (1.1) where θ ∈ Θ is a p-dimensional vector of the decision variables and Θ is the feasible region. It assumes that little knowledge (linearity or convexity) of the objective function J(θ) is known, and moreover J(θ) is the expectation of another quantity L(θ, ω), so it cannot be obtained directly. L(θ, ω) is the performance measure of interest and ω represents a simulation replication, which comprises the uncertainty of the system. The optimal decision variable is defined as θ∗ = arg min θ∈Θ J(θ). (1.2) Optimization problems are generally classified into continuous and discrete problems depending on the types of values the decision variables θ can take. An 1 alternative classification involves the size of the feasible region: finite versus infinite for discrete problems, bounded versus unbounded for continuous problems. Exten- sive review on the topic of simulation optimization can be found in [1–5]. Three different approaches are briefly discussed since they are closely related to the new methods proposed in the thesis. Stochastic approximation is a stochastic optimization technique analogous to gradient methods in deterministic optimization problems. Stochastic approximation uses the following recursion to update the solution: xk+1 = xk + ak∇f˜(xk), where ∇f˜(xk) is the gradient estimate and ak is the step size. Two classical meth- ods, Robbins-Monro (RM) and Kiefer-Wolfowitz (KW), estimate the true gradient ∇f(xn) using direct gradient estimates and finite difference gradient estimates, re- spectively. In the stochastic simulation context, direct gradient estimation methods include perturbation analysis (PA) [6–8] and likelihood ratio/score function methods (LR/SF) [9, 10]; see [11] for more details. Metamodels, also known as Response Surface Methodology (RSM), provide a functional relationship between the performance measurements and parameters of interest. Metamodel-based methods decouple optimization from simulation, as metamodels approximate stochastic responses through an algebraic function and deterministic optimization procedures are applied to the metamodel. Polynomial models are one of the commonly used metamodels since they usually have compact forms and are easy to construct and evaluate. However, due to their lack of flexibility, 2 kriging, splines, neural networks and radial basis functions are more adequate to capture global characteristics of a response surface. Both stochastic approximation and metamodeling are generally designed for solving stochastic optimization problems with continuous variables. Statistical rank- ing and selection (R&S) addresses stochastic optimization problem with discrete variables. Three R&S procedures are indifference-zone (IZ) methods, value of infor- mation procedures and optimal computing budget allocation (OCBA). IZ methods guarantee asymptotic lower bounds for the probability of correct selection (PCS), as long as the true underlying performance values are sufficiently far apart. Bayesian models for R&S consider the tradeoff between estimates of the performance values and uncertainty about those estimates, using the concept of “value of information.” OCBA is designed with the flexibility to adapt to both frequentist and Bayesian models. 1.2 Direct Gradient Enhanced Metamodels A metamodel is commonly used in simulation optimization to provide an aux- iliary functional relationship between the input and output of a simulation model. Conducting simulations to collect experimental data is necessary to build metamod- els, where the simulated data collected are usually performance measurements for parameters of interest. However, direct derivative information may also be avail- able in stochastic simulation settings, where the output responses include not only the performance measurement, but also values of the gradient of performance mea- 3 surement with respect to the parameters. Perturbation analysis (PA) [6–8] and likelihood ratio/score function methods (LR/SF) [9, 10] are techniques that aim at estimating the gradient the performance measure. Applications of direct gradient estimates have been studied extensively, including queueing, inventory and finance applications [12,13]. In general, there are two types of metamodeling strategies: iterated local meta- models and global metamodels. An overview of local and global metamodel-based optimization is given in [14] and [15]. Iterated local metamodels, also known as sequential response surface method- ology, rely on low-order polynomial regression. A first-order polynomial is usually used to fit local response surface in a small region to determine the search direction. Following a line search, new regions for the parameters of interest are exploited re- peatedly until the region of most interest is determined. At the final step, a quadratic approximation is chosen and deterministic optimization methods are applied to lo- cate the optimum. Regression techniques and experiment design are critical in this procedure; see [16] for details. In global metamodels, high-order polynomial regression or nonlinear regression techniques based on existing knowledge about the response surface are appropriate; see [17] for an example. To capture global characteristics of a response surface, more flexibility in the models is required. Therefore, kriging, splines, neural networks and radial basis functions are more appropriate for fitting global metamodels. Among all these, kriging has received a lot of attention in the stochastic simulation com- munity over the past decade [18–20]. Recently, [21] proposed stochastic kriging as 4 an extension of kriging, which explicitly takes the uncertainties in simulation noise into consideration. Stochastic kriging is considered to be flexible and promising in fitting global response surfaces, especially in stochastic simulation settings. In the stochastic simulation setting, direct derivative information may be available, i.e., the simulation output may include not only the performance mea- surements, but also estimates of the gradients of performance measurement with re- spect to the parameters. Techniques for estimating gradients, including perturbation analysis (PA) and likelihood ratio/score function methods (LR/SF), are discussed in [7], [10] and [13]; see also references therein. The availability of additional gradient information suggests the potential for improving the quality of metamodels. Combining gradient information has been investigated for building metamodels under deterministic computer simulation set- tings; see [22] and [23] for approaches to approximate response surface with artificial neutral networks and kriging. In stochastic simulation settings, researchers have also made attempts to incorporate gradient estimates into metamodeling approaches. [24] proposed a gradient surface method (GSM) that uses the gradient estimates only to iteratively fit lower-order polynomial models. [25] introduced stochastic kriging with gradient estimators (SKG) to exploit gradient estimates in stochastic kriging, showing that the new approach provides better prediction with smaller mean squared error (MSE). This approach is similar to cokriging proposed in deterministic simula- tions [26], and requires differentiability of the correlation functions, since derivatives of random processes or random fields are used to model gradient estimates. 5 1.3 Variational Bayesian Inference in Simulation Optimization Bayesian statistical models can be used to represent the beliefs of a decision- maker about an uncertain environment. For example, in revenue management, a seller formulates beliefs about customers’ willingness to pay; in energy, we may have a belief about the suitability of a candidate location for a new wind farm. In R&S, Bayesian models consider the tradeoff between estimates of the per- formance values and uncertainty about those estimates. This is known as the “value of information” (VIP) approach, going back to [27] and extended in later work. The- oretical properties of the policy were studied in [28]. VIP-based policies considering unknown measurement noise were developed in [29] and [30]. See [31] and [5] for an extensive up-to-date survey of Bayesian learning techniques. [32] compares sev- eral sequential procedures and concludes that Bayesian procedures are more efficient when the number of alternatives increases. In the context of dynamic pricing, Bayesian statistics have been used to model environmental uncertainty [33, 34], and different pricing strategies have been pro- posed to optimize the balance between revenue and information. For example, [35] proposes a one-step look-ahead strategy for problems with logistic revenue curves, while [36] presents an approach based on multi-armed bandit theory. A recent stream of work, represented by [37], [38], [39], and [40], has focused on establish- ing long-run convergence rates for policies that are mostly myopic, with occasional periods of exploration spaced increasingly further apart. However, in the specific context of B2B pricing, individual transactions typically have high volume (for ex- 6 ample, the seller may be negotiating the price of a year’s supply of raw materials) and incur high costs (e.g. the time and money spent during negotiations), making it important to obtain good performance quickly. Most Bayesian procedures rely on conjugate prior distributions on the un- known model parameters in order to maintain computational tractability. Conju- gate priors model the evolution of these beliefs over time as new information is collected, either from stochastic simulation or field experiments. However, there are relatively few of these conjugate models, and they simply do not exist in many problems of interest. Variational Bayesian inference can be used to create com- putationally tractable, “nearly conjugate” models that optimally approximate the actual belief distributions and enable the use of anticipatory information collection and optimization policies. 1.4 Outline of the Thesis The thesis centers around simulation optimization, including several new meth- ods based on direct gradient estimates and optimal learning approaches. We now outline the thesis and summarize the contents of each chapter. Chapter 2 investigates potential modeling improvements that can be achieved by exploiting additional gradient information in the regression setting. Using least squares and maximum likelihood estimation, we propose various Direct Gradient Augmented Regression (DiGAR) models that incorporate direct gradient estimators, starting with a one-dimensional independent variable and then extending to multi- 7 dimensional input. For some special settings, we are able to characterize the variance of the estimated parameters in DiGAR and compare them analytically with the standard regression model. For a more typical stochastic simulation setting, we investigate the potential effectiveness of the augmented model by comparing it with standard regression in fitting a functional relationship for a simple queueing model, including both a one-dimensional and a four-dimensional example. The preliminary empirical results are quite encouraging, as they indicate how DiGAR can capture trends that the standard model would miss. Even in queueing examples where there is high correlation between the output and the gradient estimators, the basic DiGAR model that does not explicitly account for these correlations performs significantly better than the standard regression model. Chapter 3 introduces an approach for enhancing stochastic kriging in the set- ting where additional direct gradient information is available, e.g., provided by techniques such as perturbation analysis or the likelihood ratio method. The new approach, called Gradient Extrapolated Stochastic Kriging (GESK), incorporates direct gradient estimates by extrapolating additional responses. For two simpli- fied settings, we show that GESK reduces mean squared error (MSE) compared to stochastic kriging under certain conditions on step sizes. Since extrapolation step sizes are crucial to the performance of the GESK model, we propose two different approaches to determine the step sizes: maximizing penalized likelihood and min- imizing integrated mean squared error. Numerical experiments are conducted to illustrate the performance of the GESK model and to compare it with alternative approaches. 8 Chapter 4 proposes the first computationally tractable Bayesian statistical model for learning unknown correlation structures in fully sequential simulation selection. Correlations represent similarities or differences between various design alternatives, and can be exploited to extract much more information from each individual simulation. However, in most applications, the correlation structure is unknown, thus creating the additional challenge of simultaneously learning unknown mean performance values and unknown correlations. Based on our new statistical model, we derive a Bayesian procedure that allocates simulations based on the value of information, thus anticipating future changes to our beliefs about the correlations. Our approach outperforms existing methods for known correlation structures in numerical experiments, including one motivated by the problem of optimal wind farm placement, where real data are used to calibrate the simulation model. Chapter 5 proposes an approximate Bayesian statistical model for predicting the win/loss probability for a given price in business-to-business (B2B) pricing. This model allows us to learn parameters in logistic regression based on binary (win/loss) data and can be quickly updated after each new win/loss observation. We also consider an approach for recommending target prices based on the approximate Bayesian model, thus integrating uncertainty into decision-making. We test the statistical model and the target price recommendation strategy with synthetic data, and observe encouraging empirical results. 9 Chapter 2: Direct Gradient Augmented Regression 2.1 Introduction Because regression analysis arose from physically observed processes, it as- sumes that the only data points generated are measurements of the value of the dependent variable for each combination of values for the independent variable. However, in the stochastic simulation setting that is our primary focus, research over the previous four decades has led to the availability of direct derivative infor- mation – meaning not from finite-difference approximations, e.g., for each design point or value of the independent variable, the output responses generated include not only a value of the dependent variable, but also a value of the gradient of the dependent variable with respect to the independent variable(s). Settings where such direct gradient estimators are available include queueing, inventory, and fi- nance [6–8,10,12,13]. Clearly, in this enhanced data setting, the availability of gradient information should lead to an improvement in the estimated functional relationship between the dependent and independent variables. In this chapter, we investigate such improve- ments in the regression setting. Specifically, we consider a simple modification of the standard linear regression model to incorporate the additional measurements. We 10 call the new method Direct Gradient Augmented Regression (DiGAR, pronounced “digger”). Using the least squares approach and maximum likelihood estimation, we derive the resulting parameter estimates for the proposed DiGAR models and pro- vide a theoretical analysis of the improvements that can be achieved over a standard model. We also conduct some preliminary numerical experiments to empirically in- vestigate the improvements that can be achieved. In particular, we consider a simple queueing system for which there are analytical results available, which can be used to judge the effectiveness of both standard regression and DiGAR using both a linear and quadratic regression function. The main motivation for our work is two-fold: • characterizing a global response surface, a traditional application of regression; • approximating a local response surface with a view towards local improvements in carrying out simulation-based optimization. Broadly speaking, our primary objective is to provide improved means for estimat- ing functional relationships using direct higher-order information. As a first step, we consider regression because of its wide application in many fields and because of the central role it plays in the sequential response surface methodology (RSM) approach to optimization, which “is a metamodel-based optimization method that builds lin- ear or quadratic local approximations” to the response function [14]. Polynomial regression is generally used to fit the response surface, as a series of experiments is used along the steepest descent direction to obtain additional improvement [41]; see also [16] for a detailed discussion on the relevant statistical techniques. In Fig- 11 ure 18.1 of [14], simulation optimization strategies are classified according to the nature of the controllable variables and the response function. For the case where the response function is assumed differentiable, there is a dichotomy between direct gradient methods and metamodel methods, of which RSM is the primary technique. DiGAR serves as an example of merging these two categories. By integrating direct gradient estimates into the regression model, the hope is that the fitted curve better approximates the true model, which should also lead to faster convergence when used in sequential procedures. DiGAR provides a paradigm shift, though it is clearly only applicable in a special setting in which direct gradient estimates are available, which is often the case in stochastic simulation using techniques such as perturbation analysis (PA) and the likelihood ratio/score function (LR/SF) method [6, 7]. These procedures provide estimates in which no resimulation is required, i.e., whenever an estimate of an output performance measure is obtained, an estimate of the derivative(s) with respect to parameters of interest are also obtained at that particular setting of the parameters. This is referred as direct gradient estimates in the thesis, to contrast with indirect estimates obtained by actually changing the value of the parameters and running additional simulations; see [13,42] for a recent survey and tutorial with references. In the setting where direct gradient estimates are available, another approach for incorporating gradient estimates into regression models proposed by [24] for a sequential simulation optimization procedure fits the gradient response surface directly using the gradient estimates only; the function estimates themselves are 12 not used in the procedure. In contrast, our approach uses all sets of measurements that are available in augmenting existing functional estimation procedures such as regression. Thus, DiGAR augments standard regression rather than replacing it. To illustrate the approach in a straightforward way, DiGAR uses least squares and maximum likelihood estimation (for the normally distributed setting), but other criteria such as minimum sum of absolute errors (MSAE) and minimization of the maximum absolute errors (MMAE) could also have been used. Least squares is among the most popular criterion in regression [43, 44], due to its simplicity, but it has the drawback of being sensitive to outliers, as is also evident in our numerical examples reported in Section 2.3. There has been much work in robust regression to overcome the drawback of least square regression [45], but our preliminary numerical results on a queueing example indicate that combining the gradient estimations with the least squares approach can lead to noticeable qualitative correction that mitigates sensitivity to outliers, i.e., when some of the observed dependent variable output values exhibit large fluctuations from their means, DiGAR is able to correct the shape of the fitted curve – the slope for a linear fit and the curvature for a quadratic fit. Not surprisingly, the observed variance of the parameter estimates in the DiGAR is also considerably lower than that of the standard regression model. 2.2 Models In this section, we consider the simplest setting, beginning with a review of the most basic standard linear regression model before introducing the DiGAR models 13 where direct gradient estimators are assumed available. The following assumptions are used in the section for various theoretical results and calculations. Assumption 2.1. E[i] = 0 ∀i. Assumption 2.2. E[′i] = 0 ∀i. Assumption 2.3. (i) Cov(i, j) = 0, i 6= j, and (ii) Var(i) = σ2 ∀i with known σ2. Assumption 2.4. (i) Cov(′i, ′j) = 0 ∀ i 6= j, Cov(i, ′j) = 0 ∀ i, j, and (ii) Var(′i) = σ2g ∀ i with known σ2g . Assumption 2.5. i ∼ N(0, σ2), ′i ∼ N(0, σ2g) ∀i with known σ2 and σ2g . 2.2.1 Standard Linear Regression Model Consider the usual regression setting with independent variable x and de- pendent variable y, where n > 1 data points (x1, y1), ..., (xn, yn) are given. Both independent and dependent variables take values from a continuous domain. The standard linear regression model is the following: yi = β0 + β1xi + i, i = 1, 2, · · · , n, (2.1) where assumptions on the “noise” terms {i} determine properties of resulting esti- mators. 14 The least-squares approach minimizes the sum of squared residuals given by n∑ i=1 (yi − β0 − β1xi)2, leading to the following estimators for β0 and β1: βˆ0 = y¯ − βˆ1x¯, (2.2) βˆ1 = n∑ i=1 (xi − x¯)(yi − y¯) n∑ i=1 (xi − x¯)2 = 1 n n∑ i=1 (xi − x¯)(yi − y¯) 1 n n∑ i=1 (xi − x¯)2 , (2.3) where x¯ and y¯ are the sample means of {xi} and {yi}, respectively. These estimators are unbiased assuming the noise terms have zero mean. In the traditional regression model, it is well known that the maximum like- lihood estimators (MLEs) for normally distributed residuals coincide with the OLS estimators given by (2.2) and (2.3). 2.2.2 Direct Gradient Augmented Regression Now consider the enhanced setting where the n data points are (x1, y1, g1), ..., (xn, yn, gn), with gi representing a direct estimate of the gradient of yi at xi. The basic Direct Gradient Augmented Regression (DiGAR) model is the following: yi = β0 + β1xi + i, (2.4) gi = β1 + ′i, (2.5) where gi, i = 1, 2, · · · , n are the gradient estimates with residuals {′i}. Again using the OLS approach, the function to be minimized is the sum of the squared deviations in both yi and gi, L = n∑ i=1 (yi − β0 − β1xi)2 + n∑ i=1 (gi − β1)2. (2.6) 15 The resulting estimators that minimize (2.6) are βˆ0 = y¯ − βˆ1x¯, βˆ1 = n∑ i=1 xiyi − nx¯y¯ + ng¯ n∑ i=1 x2i − nx¯2 + n = 1 n n∑ i=1 (xi − x¯)(yi − y¯) + g¯ 1 n n∑ i=1 (xi − x¯)2 + 1 , (2.7) where x¯, y¯ and g¯ are the corresponding sample means of xi, yi and gi. Note that in the basic DiGAR model, the form of βˆ0 remains unchanged, whereas βˆ1 has the additional terms g¯ and 1 in the numerator and denominator, respectively, reflecting the added gradient information. An alternative approach is to derive estimators using yi and gi separately and combine them. However, since we would like to combine two sources of data as opposed to two estimators, we choose to use the objective function in (2.6). Later in the chapter, we will build the connections between these two approaches. Under the Assumptions 2.3 - 2.5, the parameter estimators given by (2.7) are unbiased, since E[βˆ1] = n∑ i=1 xiE[yi]− nx¯E[y¯] + nE[g¯] n∑ i=1 x2i − nx¯2 + n = β1, E[βˆ0] = E[y¯]− E[βˆ1]x¯ = β0, 16 and the variances of βˆ1 is calculated as follows: Var(βˆ1) = n∑ i=1 x2i Var(yi)− n2x¯2 Var(y¯) + n2 Var(g¯) ( n∑ i=1 x2i − nx¯2 + n )2 = n∑ i=1 x2iσ2 − n2x¯2σ2/n+ n2σ2g/n ( n∑ i=1 x2i − nx¯2 + n )2 = ( n∑ i=1 x2i − nx¯2 ) σ2 + nσ2g ( n∑ i=1 x2i − nx¯2 + n )2 The DiGAR estimator βˆ1 can also be viewed as a linear combination of the standard linear regression estimator in (2.3) and another unbiased estimator g¯ with weights proportional n and n∑ i=1 (xi − x¯)2, respectively. We are particularly inter- ested in investigating the variance of the estimators, and a known results suggest that the optimal weights for combining two unbiased estimators should be inversely proportional to their variances. A more general form of the least-squares function to be minimized allows relative weighting (convex combination) of the two contributions rather than the equal weighting used in (2.6), i.e., L = α n∑ i=1 (yi − β0 − β1xi)2 + (1− α) n∑ i=1 (gi − β1)2, (2.8) where α ∈ [0, 1]. α = 1 corresponds to standard regression, α = 0.5 corresponds to the OLS DiGAR model introduced earlier, and α = 0 uses only the gradient 17 information. Differentiating with respect to β0 and β1, ∂L ∂β0 = −2α n∑ i=1 (yi − β0 − β1xi), ∂L ∂β1 = −2α n∑ i=1 (yi − β0 − β1xi)xi − 2(1− α) n∑ i=1 (gi − β1). (2.9) Setting both equal to 0 and solving yields the following estimators: βˆ0 = y¯ − βˆ1x¯, βˆ1 = 1 n n∑ i=1 (xi − x¯)(yi − y¯) + 1−αα g¯ 1 n n∑ i=1 (xi − x¯)2 + 1−αα . (2.10) If the estimators for the gradients are also unbiased, i.e., E[′i] = 0, then the α−DiGAR estimators are also unbiased. Remark 2.1. The weight α in (2.8) can be viewed as cost to the sum of squared errors. We will look into the weighted objective function in (2.8) and try to build connections between the DiGAR estimators and the estimator obtained by combining two unbiased estimators. In the following sections, we will show that the optimal α provides the same estimator as the optimal estimator obtained by combining two unbiased estimators. Proposition 2.2. Under Assumptions 2.1 & 2.2, the α−DiGAR estimators given by (2.10) are unbiased. 2.2.3 Choice of Weights in α-DiGAR Instead of thinking α as a fixed cost, we consider the weight α can be chosen in practice. So how should one choose the weight parameter α? One intuitive choice is to choose the relative weights inversely proportional to the corresponding 18 variances, but the variances are unknown and possibly nonhomogeneous across the independent variable range. Another option is to minimize the prediction variance. The predicted value at xi can be written as yˆi = βˆ0 + βˆ1xi = y¯ − βˆ1(x¯− xi), so that minimizing the prediction variance is equivalent to minimizing the variance of βˆ1. However, again an analytical expression for this variance is unavailable in most settings. In the next section we revisit the question of weight selection in some spe- cial settings of homogeneous variances where “optimal” weights can be determined explicitly. 2.2.4 Theoretical Comparisons Between Estimators for Special Cases Both traditional regression and least-squares DiGAR models can be applied in a very general setting, but it is difficult to obtain any analytical results without further assumptions. For the traditional regression model, if the noise terms have zero mean and in addition are uncorrelated with common variance, the theoretical variances can also be computed explicitly. Proposition 2.3. Under Assumptions 2.1 and 2.3, the variances of the estimators 19 (2.2) and (2.3) are given by Var(βˆ0) = σ2     1 n + x¯2 n∑ i=1 (xi − x¯)2     , (2.11) Var(βˆ1) = σ2 n∑ i=1 (xi − x¯)2 . (2.12) Similarly, in the DiGAR model, if all of the residuals are uncorrelated with common variance, it is not difficult to compute the theoretical variances of the α-DiGAR estimators. Proposition 2.4. Under Assumptions 2.1-2.4, the variances of the estimators (2.10) are given by Var(βˆ0) = σ2 n      1 + x¯2 n n∑ i=1 (xi − x¯)2 ( 1 n n∑ i=1 (xi − x¯)2 + 1−αα )2      + σ2g n      x¯2(1−αα )2 ( 1 n n∑ i=1 (xi − x¯)2 + 1−αα )2      ,(2.13) Var(βˆ1) = σ2 n      1 n n∑ i=1 (xi − x¯)2 + (1−αα )2 σ2g σ2 ( 1 n n∑ i=1 (xi − x¯)2 + 1−αα )2      . (2.14) Revisiting Weight Selection in α-DiGAR Under the assumption of homogenous variances for both the output and its gradient, per Assumptions 2.3(ii) and 2.4(ii), selection of weights proportional to variance implies α 1− α = σ2g σ2 , which leads to α = σ2g σ2g + σ2 . (2.15) 20 For the “optimal” choice of weights, which in the previous subsection was said to be equivalent to minimizing the variance of βˆ1, differentiating (2.14) with respect to α and setting equal to 0 gives the same proportional weights as (2.15). For the uncorrelated setting, Proposition 2.4 can be used to find necessary and sufficient conditions for the variance of the slope estimator to be lower for the α-DiGAR estimators than for the standard slope estimator given by (2.3). Proposition 2.5. Under Assumptions 2.1-2.4, σ2g σ2 ≤ 2α 1− α + 1 1 n n∑ i=1 (xi − x¯)2 ⇐⇒ Var(βˆDiGAR1 ) ≤ Var(βˆstandard1 ), where βDiGAR1 and βstandard1 denote the α-DiGAR slope estimator given by (2.10) and the standard slope estimator given by (2.3), respectively. Proof. Comparing the variances given by (2.12) and (2.14), the inequality holds if and only if σ2 n 1 n n∑ i=1 (xi − x¯)2 + (1−αα )2 σ2g σ2 ( 1 n n∑ i=1 (xi − x¯)2 + 1−αα )2 ≤ σ2 n 1 1 n n∑ i=1 (xi − x¯)2 ⇐⇒ (σg σ )2 ≤ ( α 1− α )2      ( 1 n n∑ i=1 (xi − x¯)2 + 1−αα )2 1 n n∑ i=1 (xi − x¯)2 − 1 n n∑ i=1 (xi − x¯)2      = 2 ( α 1− α ) + 1 1 n n∑ i=1 (xi − x¯)2 Intuitively, Proposition 2.5 indicates that if σg is large relative to σ, then it makes sense to decrease the relative weight for the gradient; also when the simulation 21 budget is limited and thus only a relatively small number of design points are possible and experiment conditions make them close to each other (more likely in sequential RSM applications), the benefit of DiGAR should be more noticeable. The following corollary of Proposition 2.5 provides more explicit bounds for the weighting α. Two special cases include the equal weighting and proportional weighting scenarios, again in the uncorrelated setting. Corollary 2.6. Under Assumptions 2.1-2.4, Var(βˆ1) for the α-DiGAR OLS slope estimator given by (2.10) is strictly lower than that for the standard slope estimator given by (2.3) if α ≥ σ2g σ2g + 2σ2 , which includes the following special cases: (i) α = σ 2 g σ2g+σ2 . (ii) α = 0.5 if σ2g ≤ 2σ2. Proof. Main result follows directly from n∑ i=1 (xi−x¯)2 > 0, with the two cases satisfying the requisite condition, or directly from Proposition 2.5, noting that (i) α1−α = σ2g σ2 , (ii) α1−α = 1. Maximum Likelihood Estimation DiGAR If we further assume that the residuals are normally distributed and uncorre- lated (independent) with known variances, we can derive the MLEs for β0 and β1, which will not coincide with the α-DiGAR estimators in general. 22 Note that Assumption 2.5 subsumes Assumptions 2.1 and 2.2, and includes the known variance condition. Proposition 2.7. Under Assumptions 2.3, 2.4, and 2.5, the MLEs for the DiGAR model specified by (2.4) and (2.5) are given by β˜0 = y¯ − β˜1x¯, β˜1 = 1 n n∑ i=1 (xi − x¯)(yi − y¯) + σ 2 σ2g g¯ 1 n n∑ i=1 (xi − x¯)2 + σ2σ2g . (2.16) Proof. Under Assumptions 2.3 and 2.4, yi and gi are independent due to the resid- uals being uncorrelated. Under Assumption 2.5, both σ2 and σ2g are known. The likelihood function is given by L(β0, β1) = (2pi)−n(σσg)−n exp { − 1 2σ2 n∑ i=1 (yi − β0 − β1xi)2 − 1 2σ2g n∑ i=1 (gi − β1)2 } . Differentiating the log-likelihood function with respect to the parameters, ∂ log(L) ∂β1 = 1σ2 n∑ i=1 (yi − β0 − β1xi)xi + 1 σ2g n∑ i=1 (gi − β1), ∂ log(L) ∂β0 = 1σ2 n∑ i=1 (yi − β0 − β1xi), which after setting equal to 0, and some algebra, leads to (2.16). Remark 2.8. In the actual implementation, σ2 and σ2g are usually unknown. The MLE’s of β0, β1, σ2 and σ2g are the solutions of a nonlinear system of equations and are not available in closed form. They would need to be solved using Newton- Raphson or some other iterative numerical methods. Approximate MLE’s for β0 and β1 can be obtained by replacing σ2 and σ2g by the unbiased and consistent estimators given as follows: 23 σˆ2 = 1n− 2 n∑ i=1 (yi − yˆi)2, σˆ2g = 1 n− 1 n∑ i=1 (gi − g¯)2, where yˆi = βˆ0 + βˆ1xi with the estimators βˆ0 and βˆ1 provided in (2.7). Proposition 2.9. Under Assumptions 2.3, 2.4, and 2.5, the variance of the MLE for β1 given in (2.16) is Var(β˜1) = σ2/n 1 n n∑ i=1 (xi − x¯)2 + σ2σ2g . (2.17) As expected, as σg →∞, where the additional gradient estimators provide no additional useful information, the MLE DiGAR slope estimator variance given by (2.17) approaches the traditional slope estimator variance given by (2.12), whereas as σg → 0, Var(β˜1)→ 0 for MLE DiGAR. Comparing the variances given by (2.17) and (2.12), we have the following result. Proposition 2.10. Under Assumptions 2.3, 2.4 and 2.5, and σ/σg > 0, Var(β˜1) for MLE DiGAR given in (2.16) is strictly lower than that for the standard estimator given by (2.3). Thus, in the uncorrelated common variance setting, the MLE DiGAR slope estimator is guaranteed to have lower variance than the standard regression model slope estimator. However, it should also be noted that the MLEs given by (2.16) contain the variances σ2 and σ2g , which are unknown in practice and thus need to be estimated from the data, whereas either of the two sets of DiGAR least-squares estimators given by (2.7) or (2.10) do not contain these terms. However, choosing 24 the “optimal” weights in the α−DiGAR estimator (2.10) as specified by (2.15) leads to the same requirement, since α depends on both σ and σg. Furthermore, this choice actually leads back to the MLEs, as summarized in the following. Proposition 2.11. The α-DiGAR estimators given by (2.10) using the weight α given by (2.15) coincide with the DiGAR MLEs given by (2.16). Remark 2.12. It is mentioned earlier in the chapter that combing the two unbiased estimators, namely βˆ1 in (2.3) and g¯, is an alternative approach. The optimal weighted estimator obtained in this fashion is 1 n∑ i=1 (xi−x¯)2 σ2 + nσ2g     n∑ i=1 (xi − x¯)2 σ2 n∑ i=1 (xi − x¯)(yi − y¯) n∑ i=1 (xi − x¯)2 + nσ2g g¯     . Simplifying this leads to the same estimator α-DiGAR estimators given by (2.10) using the weight α given by (2.15) and the DiGAR MLEs given by (2.16). This suggests that when we combing two sources of data using the loss function in (2.8), we are able to obtain the same estimator as combining two unbiased estimators, namely βˆ1 in (2.3) and g¯. 2.2.5 Multi-dimensional Linear Models Now we consider the d-dimensional multiple regression linear model setting. For standard regression, yi = β0 + d∑ j=1 βjxij + i, i = 1, 2, · · · , n. (2.18) 25 The least-squares approach minimizes the sum of squared residuals given by n∑ i=1 (yi− β0 − ∑d j=1 βjxij)2, leading to the following estimators for β0 and βj: βˆ0 = y¯ − d∑ j=1 βˆjx¯j, (2.19) βˆj = n∑ i=1 (xij − x¯j)(yi − y¯)− ∑ k 6=j βˆk n∑ i=1 (xik − x¯k)(xij − x¯j) n∑ i=1 (xij − x¯j)2 , (2.20) where x¯j is the sample mean of {xij}. The analogous multi-dimensional DiGAR model is given by yi = β0 + d∑ j=1 βjxij + i, (2.21) gij = βj + ′ij, (2.22) where gij, j = 1, 2, · · · , k, i = 1, 2, · · · , n are the gradient estimates with residuals {′ij}. The corresponding least-squares function to be minimized given by α0 n∑ i=1 (yi − β0 − d∑ j=1 βjxij)2 + d∑ j=1 αj n∑ i=1 (gij − βj)2, where ∑d i=0 αi = 1, αi ≥ 0. Taking the partial derivatives and setting equal to zero yields again (2.19) and βˆj = n∑ i=1 (xij − x¯j)(yi − y¯)− ∑ k 6=j βk n∑ i=1 (xik − x¯k)(xij − x¯j) + nαjα0 g¯j n∑ i=1 (xij − x¯j)2 + nαjα0 , j > 0, (2.23) which reduces to the previous expression with α0 = α, α1 = 1 − α, when there is just a single input. We will also refer to these as the α-DiGAR estimators. 26 2.2.6 DiGAR Model for Gradient Estimates Correlated with Perfor- mance Outputs The least-squares estimators for the basic α-DiGAR model, as given by (2.10) and (2.23), were derived without consideration of the correlation structure, although explicit computation of the theoretical variance as given by Proposition 2.4 required further assumptions. Here we consider the generalized least squares (GLS) setting where the outputs yi and gradient estimates gi are correlated as follows: y = Xβ + , (2.24) where for the univariate case y =                 y1 g1 ... yn gn                 , X =                 1 x1 0 1 ... ... 1 xn 0 1                 , β =     β0 β1     , (2.25) and  would contain the noise terms for the corresponding output and gradient estimates. The OLS estimator is given by βˆ = (X′X)−1X′y, (2.26) which matches (2.23) in the uncorrelated case with equal weights (αi = 1/n). The weighted least-squares (WLS) solution is given by βˆW = (X′WX)−1X′Wy, (2.27) 27 where W is a diagonal matrix such that diag(W) = [α1, α2, . . . , α2n]. Now we explicitly consider the setting where the covariance matrix of  is given by V, which is non-diagonal due to the correlations between yi and gi. This can also include the case of common random numbers, which would induce corre- lations within the {yi} and {gi}. Again assume the residuals have zero mean, i.e., E[] = 0 (Assumptions 2.1 and 2.2), so E[y] = Xβ. In the general correlated, non-homogeneous setting of (2.24)/(2.25), (2.26) is not the best (i.e., variance min- imizing) linear unbiased estimator (BLUE) for β, but the Gauss-Markov Theorem for the uncorrelated homogeneous variance setting can be used to prove that the BLUE for the general setting is given by the following GLS estimator: βˆG = (X′V−1X)−1X′V−1y, (2.28) and the covariance matrix for βˆG is Cov(βˆG) = (X′V−1X)−1. Proposition 2.13. Under Assumptions 2.1 and 2.2, the BLUE for the model given by (2.24) and (2.25) is the GLS estimator given by (2.28). Furthermore, if the residuals are assumed to be normally distributed, the MLE of β is the same as the GLS estimator. Proof. The proof uses the Gauss-Markov Theorem. Theorem 2.14 (Gauss-Markov Theorem). For the regression model (2.24) with E[] = 0 and Cov() = σ2I, the OLS estimators have minimum variance among all linear unbiased estimators. Since V is positive definite, there exists an 2n×2n nonsingular matrix P such 28 that V = PP′. Multiplying y = xβ +  by P−1, we obtain P−1y = P−1Xβ + P−1, where E[P−1] = P−1E[] = 0 and Cov(P−1) = P−1 Cov((P−1)′ = P−1σ2V(P−1)′ = σ2P−1PP′(P′)−1 = σ2I. Therefore, the assumptions in the Gauss-Markov Theorem are satisfied, and the least-squares estimator βˆ = [ (P−1X)′(P−1X) ]−1 (P−1X)′P−1y, is the best linear unbiased estimator (BLUE). The estimator βˆ can be written as βˆ = [ (P−1X)′(P−1X) ]−1 (P−1X)′P−1y = [X′(P′)−1P−1X]−1X′(P′)−1P−1y = [X′(PP′)−1X]−1X′(PP′)−1y = (X′V−1X)−1X′V−1y. We now analyze the variance of the slope portion of the GLS estimator by considering a very special case. Assumption 2.6. V is a positive definite matrix such that yi is correlated with gj only when i = j, with correlation ρ. Proposition 2.15. Under Assumptions 2.1, 2.2, 2.3 (ii), 2.4 (ii), and 2.6, for the model given by (2.24)/(2.25), the variance of βˆ1 in (2.28) is given by Var(βˆ1) = σ2/n 1 1−ρ2 1 n n∑ i=1 (xi − x¯)2 + σ2σ2g , (2.29) 29 which vanishes if either σ or σg goes to zero. Proof. The lower right element of the covariance matrix (X′V−1X)−1 gives Var(βˆ1), so writing the covariance matrix of the residuals as V =                 σ2 ρσσg ρσσg σ2g . . . . . . . . . . . . . . . σ2 ρσσg ρσσg σ2g                 , If the residuals are normally distributed, i.e.,  ∼ N (0,V), then y ∼ N (Xβ,V), and the likelihood function is L(β) = 1(2pi)n|V|1/2 exp { −(y −Xβ)′(V)−1(y −Xβ)/2 } , so the log-likelihood function is lnL(β) = −n ln(2pi)− 12 ln(|V|)− 1 2(y −Xβ) ′V−1(y −Xβ). Differentiating with respect to β, ∂ lnL ∂β = − ( X′VXβ −X′V−1y ) , setting equal to zero and solving for β gives the estimator βˆ = (X′V−1y)−1X′V−1y, which is the same as the best linear unbiased estimator (BLUE). 30 In the special correlated setting of Proposition 2.15, it is interesting to note that the variance of the GLS DiGAR slope estimator will go to zero if either of the variances of the underlying noises vanish, whereas for the basic OLS DiGAR estimator, both variances need to be zero in the uncorrelated setting (see Proposition 2.4). Also, by comparing (2.29) with the variance of the standard slope estimator given by (2.12), the following is easily established. Proposition 2.16. Under Assumptions 2.1, 2.2, 2.3 (ii), 2.4 (ii), and 2.6, for the model given by (2.24)/(2.25), if 0 < σ2 <∞ and 0 < σ2g <∞, then for −1 < ρ < 1, the DiGAR slope estimator given by (2.28) has lower variance than the standard slope estimator. Note that the DiGAR estimators given by (2.28) for the uncorrelated case will differ from the DiGAR estimators derived earlier using OLS. Generally, ρ is unknown and must be estimated based on the data. As noted in [16], estimating the corre- lation matrix changes the linear estimator (2.28) into a nonlinear estimator. Thus, although the theoretical analysis indicates potential for variance reduction from estimating the correlation, the extra computational budget spent on estimating ρ (and the resulting nonlinearities) must be traded off with any potential performance gains, an issue that is also investigated empirically in the numerical examples in the next section. 31 2.3 Numerical Examples Queueing systems are one of the main application areas for stochastic simu- lation, and the earliest application of direct gradient estimation in simulation was queueing; thus, we chose a simple queueing model as the setting on which to empir- ically investigate the performance of the DiGAR estimators, comparing them with standard regression in a setting where direct gradient estimates are available but where one or more of the assumptions of the theoretical results are generally not sat- isfied. In particular, although the least-squares α-DiGAR models are applicable in the general correlated setting, it is difficult to compute the theoretical variances ex- plicitly for such queueing settings, so simulation experiments are used to empirically compare their performance with standard regression. Perhaps the simplest queueing system is the first-come, first-served (FCFS), single-server queue considered here. Specifically, we consider the mean total time in system for a customer (queue or delay time plus service time) as a function of the parameters of the (common) service time or interarrival time distribution of customers. Two settings are considered: a univariate setting (single input) and a multivariate setting. In both settings, four outputs, i.e., four different y’s, are considered. In queueing theory, the interest is frequently steady-state performance, but since the focus here is on improving regression models rather than on queueing theory, per se, we consider the 2nd, 3rd, 4th, and 5th customers, for which we can easily obtain analytical results that can be used to compare the quality of the standard linear regression and DiGAR models without having to worry about 32 whether the simulation has reached steady state. Letting Tk denote the system time of the kth customer, the outputs of interest are y(k) = E[Tk], k = 2, 3, 4, 5. The gradient estimators for all of the examples are provided in the Electronic Com- panion Section A.2. The system time performance and its gradient estimate are clearly highly correlated, and the variance of both the performance and its gradient are not homogeneous across the range of independent values considered (although since the queue will be far from the heavy traffic intensity regime, the violation may not be particularly severe). As a result, Assumptions 2.3 and 2.4 are violated and the conclusions of Propositions 2.3 through 2.10 cannot be applied, although Proposition 2.2 holds, as the gradient estimators are unbiased. However, it should be noted that the α-DiGAR estimators are not derived under Assumptions 2.3 and 2.4, which are only sufficient conditions to establish the theoretical results. 2.3.1 Example: M/M/1 Queue The univariate example takes the arrival process to be Poisson with fixed rate and service times to be i.i.d. exponentially distributed, i.e., an M/M/1 queue, where the independent variable x is the mean service time. It is straightforward to compute the true theoretical dependence of the expected system time y on the mean service time x (cf. [46]), which can then be used to judge the quality of the fitted curves obtained by the various regression models. For all experiments in the univariate setting of the M/M/1 queue, the arrival 33 rate is fixed at 0.2, and the number of design points is n = 10, where output y (mean system time) is obtained at the following values of the independent variable x (mean service time): x1 = 3.6, x2 = 3.7, x3 = 3.8, x4 = 3.9, x5 = 4.0, x6 = 4.1, x7 = 4.2, x8 = 4.3, x9 = 4.4, x10 = 4.5. At each of these independent variable values, 10 independent simulation replications (runs) were generated to obtain the system times (and gradients for DiGAR) of the customers; nine additional macroreplications were carried out only for the purposes of estimating the slope variances as described below. The relatively small number of replications highlights the challenges that the standard regression approach faces, since it leads to output estimates with relatively high variances. Two metrics were used to evaluate the quality of fit for each data set: (i) the sample mean-squared error from the true model over the independent variable range of interest defined by L2 = xmax∫ xmin (ŷ(x)− y(x))2dx, where ŷ(x) and y(x) denotes the fitted and true models, respectively; and (ii) the theoretical (from the uncorrelated model) variance of the slope estimator denoted by V̂ar(βˆ1) and estimated using (2.12) and (2.14) for the standard and DiGAR models, respectively, with the pooled (over all input values) sample variance estimators s2 and s2g used in place of the respective output variances. For the M/M/1 queue example, xmin = 3.6 and xmax = 4.5. To provide a reference for the fitted linear models, a “true” linear model was also computed 34 based on fitting a linear model that minimizes the L2 error from the true M/M/1 queue model given in Electronic Companion Section A.1 (since the true function y(x) is not linear). The V̂ar(βˆ1) metric estimated is based on a formula that assumes conditions not satisfied in these queueing examples. It is calculated to see if the relative behavior resembles the actual variances of the slope estimators, estimated using N = 10 macroreplications via s2(βˆ1) = 1 N − 1 N∑ j=1 (βˆji − ¯ˆβi)2, where ¯ˆβ1 = 1 N N∑ j=1 βˆj1. The first set of experiments performed used the sample mean for the output variable yi (and gi for DiGAR) at each of the 10 values of the independent vari- able xi to carry out the fitting using standard regression and the various DIGAR models: OLS DiGAR with parameters given by (2.7) – denoted simply by DiGAR throughout; α-DiGAR models with parameters given by (2.10); MLE DiGAR with parameters given by (2.16) – denoted by DiGARn throughout; and the correlated DIGAR models with parameters given by (2.28) – denoted by DiGAR* and Di- GAR** for the cases where the correlation matrix is estimated by using the given number of replications (initially 10) and 100,000 (off-line) replications, respectively. The results are given in Table 2.1, which provide the values of the estimated parameters, along with the calculation of the various metrics. Overall, the DiGAR models all outperform the standard model on all metrics, especially for the variance of the slope estimators, with a superiority of one to two orders of magnitude. The numbers in the last two columns are also encouraging, in that they indicate that the estimated theoretical variances for all of the estimators are in the same ballpark as 35 the sample variances, and more importantly the ratio of improvement is reasonably consistent, so that this metric appears to provide a pretty accurate estimate of relative performance between the various models. The performance of the DiGAR models is also fairly insensitive to the choice of the weight parameter, with the OLS DiGAR model performing adequately, although the “optimal” choice of weights (approximately 0.072, 0.087, 0.101, 0.114 for y(2), y(3), y(4), y(5), respectively) does show improvement and is usually the best performance overall, indistinguishable from DIGAR**, to be discussed shortly. A clearer visual comparison is provided by the graphs in Figure 2.1, which plot the simulation data, true model, and three fitted models, where the circles are the data points (sample mean of 10 simulated values); the solid line is the true model; the dashed line is the fitted model from standard regression; the line with dots is the fitted OLS DiGAR model; and the dotted line is correlated DiGAR model. The graphs indicate that both the standard and DiGAR models fit the data reasonably well for y(2) and y(3), but there are dramatic differences for y(4), and y(5), in which the DiGAR models correctly capture the orientation of the curve, whereas the slope of the standard linear regression model has the incorrect sign. For all four cases, the normal estimators are indistinguishable from the basic OLS DiGAR model in the graphs, and hence were omitted. Surprisingly, the correlated model clearly performs worse than the OLS model, which suggests that the misspecification caused by estimating the correlation based on 10 points outweighs the potential gains of a better model. The purpose of Di- GAR** – which used 100,000 separate replications to estimate the covariance matrix 36 Table 2.1: Parameter estimates and performance metrics for M/M/1 queue (boxed entries indicate incorrect sign of slope) i y(i) model βˆ1 βˆ0 L2 V̂ar(βˆ1) s 2(βˆ1) 2 standard 1.68 -1.66 0.48 3.75 5.68 DiGAR 1.56 -1.19 0.48 0.057 0.047 DiGAR (α = 0.25) 1.58 -1.26 0.48 0.046 0.029 DiGAR (α = 0.75) 1.56 -1.17 0.48 0.173 0.217 DiGAR (α ∝ 1/σ2) 1.55 -1.14 0.45 0.045 0.030 DiGARn 1.55 -1.14 0.48 0.057 0.029 DiGAR* 1.15 -0.11 1.62 1.99 0.036 DiGAR** 1.53 -1.10 0.52 0.046 0.030 “true” linear 1.69 -1.00 4E-6 3 standard 0.68 4.04 0.27 8.82 7.17 DiGAR 2.04 -1.50 0.11 0.134 0.066 DiGAR (α = 0.25) 1.86 -0.77 0.12 0.092 0.032 DiGAR (α = 0.75) 2.12 -1.80 0.11 0.369 0.300 DiGARn 2.14 -1.91 0.11 0.123 0.029 DiGAR (α ∝ 1/σ2) 2.14 -1.81 0.051 0.090 0.029 DiGAR* 1.53 -0.55 2.00 4.35 0.098 DiGAR** 2.14 -1.85 0.088 0.091 0.028 “true” linear 2.30 -2.18 2E-5 4 standard -2.97 20.2 2.06 3.10 5.35 DiGAR 2.33 -1.26 0.019 0.237 0.074 DiGAR (α = 0.25) 1.63 1.58 0.093 0.046 0.041 DiGAR (α = 0.75) 2.61 -2.41 0.006 0.177 0.265 DiGAR (α ∝ 1/σ2) 2.71 -2.53 0.10 0.046 0.035 DiGARn 2.70 -2.77 0.004 0.081 0.038 DiGAR* 1.98 -1.56 2.52 0.122 0.110 DiGAR** 2.73 -2.78 0.027 0.046 0.036 “true” linear 2.85 -3.43 7E-5 5 standard -1.76 15.2 2.29 1.30 9.18 DiGAR 2.89 -3.63 0.71 0.191 0.114 DiGAR (α = 0.25) 2.27 -1.14 0.77 0.053 0.068 DiGAR (α = 0.75) 3.14 -4.64 0.70 0.094 0.419 DiGAR (α ∝ 1/σ2) 3.22 -4.73 0.35 0.054 0.064 DiGARn 3.19 -4.88 0.70 0.094 0.072 DiGAR* 2.45 -2.98 3.66 0.228 0.096 DiGAR** 3.24 -4.98 0.55 0.055 0.064 “true” linear 3.37 -4.70 2E-4 37 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(2) data points standard model DiGAR model DiGAR* model DiGARn model true model 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(3) 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(4) 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(5) Figure 2.1: M/M/1 queue: true model, simulation data and several fitted models, where each data point is the sample mean of 10 independent replications. 38 to serve as a proxy for the true covariance matrix – was to investigate the conjecture that the error is in fact due to the poor estimation of the correlation matrix. Tables 2.1 indicate that DiGAR** does in fact outperform OLS DiGAR, though the differ- ence is not substantial. Interestingly, as noted earlier, the performance of DIGAR** is nearly identical to that of α-DiGAR with the optimal choice of α. However, it is clear that at least for this example, it is preferable to use OLS DiGAR rather than DiGAR* if only a small number of replications are available to estimate the covariance matrix, which is typically the situation in most of the simulation settings. Since the variances are much smaller in the DiGAR models, it would seem that a reasonable fit could be obtained using DiGAR with fewer replications at each design point. Thus, a followup experiment used only a single output point to measure yi (and gi for DiGAR) at each design point. Since there were 10 simulation runs, this experiment can be performed 10 independent times each. As expected, the variance of the slope estimator is substantially lower for DiGAR, and the ratios of the estimated variances range from 11 to 111, with an overall mean over 60, which represents a substantial improvement. We also have an example showing the incorrect slope sign estimated from the standard regression model happens frequently if only a small number are conducted at each design point, specifically considering the fitted models for just y(2) in each of the 10 individual runs of data set 1. In half (5 out of 10) of the sample, the standard model gives the incorrect sign for the slope. The better match in the slope of the curves is critical in simulation-based optimization procedures such as sequential RSM. Similar results not reported here were also observed for y(3), y(4), 39 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(2) data points standard model DiGAR model DiGAR** model DiGARn model true model 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(3) 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(4) 3.0 3.5 4.0 4.5 5.0x 0 2 4 6 8 10 12 y(5) Figure 2.2: M/M/1 queue: using correlation matrix estimated based on 100,000 “offline” simulation replications. and y(5). 2.3.2 Regression using a Quadratic Function Sequential RSM generally employs a quadratic function to fit the data when the algorithm approaches the optimal value, i.e, the linear fit has slope close to 0. Using the same single-server queue as in Example 1, we consider the following objective function (used previously in many simulation optimization settings, e.g., [47, 48]): y(k)(x) = E[Tk] + c/x, k = 2, 3, 4, 5, (2.30) 40 where c is a given constant. The additional term c/x in (2.30), which can be viewed as a cost on server speed, makes the function convex with a unique minimizer x∗ that can be found analytically for comparison purposes. The standard regression model using a quadratic function (in x) is given by yi = β0+β1xi+β2x2i+i, and the DiGAR model is given by adding gi = β1+2β2xi+′i. The resulting explicit parameter estimators for the OLS DiGAR model are provided below. Consider the loss function L = 12 n∑ i=1 (yi − β2x2i − β1xi − β0)2 + 1 2 n∑ i=1 (gi − 2β2xi − β1)2 Differentiating with respect to β0, β1, β2, ∂L ∂β0 = − n∑ i=1 (yi − β2x2i − β1xi − β0) ∂L ∂β1 = − n∑ i=1 (yi − β2x2i − β1xi − β0)xi − n∑ i=1 (gi − 2β2xi − β1) ∂L ∂β2 = − n∑ i=1 (yi − β2x2i − β1xi − β0)x2i − n∑ i=1 (gi − 2β2xi − β1)2xi (2.31) Setting them equal to 0 and solving yields the following estimators, β = Ay, where β ≡ [β0, β1, β2]T , y = [ n∑ i=1 x2i yi + 2 n∑ i=1 xigi, n∑ i=1 xiyi + n∑ i=1 gi, n∑ i=1 yi ]T , and A = 1abc− af 2 − be2 − cd2 + 2def         bc− f 2 ef − cd df − be ef − cd ac− e2 de− af df − be de− af ab− d2         , with a = n∑ i=1 x4i+4 n∑ i=1 x2i , b = n∑ i=1 x2i , c = n, d = n∑ i=1 x3i+2 n∑ i=1 xi, e = n∑ i=1 x2i , f = n∑ i=1 xi. 41 3.0 3.5 4.0 4.5 5.0x 6 8 10 12 14 16 18 20 22 24 y(2) data points standard model DiGAR model DiGAR* model true model 3.0 3.5 4.0 4.5 5.0x 6 8 10 12 14 16 18 20 22 24 y(3) 3.0 3.5 4.0 4.5 5.0x 6 8 10 12 14 16 18 20 22 24 y(4) 3.0 3.5 4.0 4.5 5.0x 6 8 10 12 14 16 18 20 22 24 y(5) Figure 2.3: M/M/1 queue quadratic fit (c ≈ 27, 10 replications). 42 Again using 10 replications per design point, the true and fitted models for c ≈ 27 (actual value of c was chosen to make x∗ = 4 for y(2)), are plotted in Figure 2.3, including the correlated model indicated by DiGAR*, where the correlation matrix is also estimated using the small number of data points. The differences between both DiGAR models and standard regression are substantial, as the convexity/concavity curvature is often incorrect in the standard regression fitted model, e.g., for y(4) and y(5) in several cases. Once again the OLS DiGAR model outperforms the correlated version, again indicating the inadequacy of 10 replications for estimating the correlation matrix. Two additional cases, c = 1 (where x∗ lies to the left of the interval) and c = 100 (where x∗ lies to the right of the interval), showing the fit of each of the models over a much wider range outside the set of design points, where the incorrect curvature of standard regression for several cases is more evident. Visually, it is clear that DiGAR fits the curve much better than standard regression in all three cases. We also investigated the performance gains when more replications are car- ried out at each design point: 100, 1000, and 10000. indicate that the differences between both DiGAR models and standard regression can be substantial, even up to 1000 replications at each design point, where the standard model still has the incorrect curvature. Furthermore, even with 100 replications at each design point, the DiGAR* models still appear to be inferior to the OLS DiGAR models. Only when 1000 or 10000 replications are used at each design point do the DiGAR* mod- els show better performance than the OLS DiGAR model, but the superiority is relatively slight (and not visually obvious in the figures), again indicating that the 43 Table 2.2: M/M/1 queue quadratic function optimal value x∗ obtained as a function of # replications (c ≈ 27), where boxed entries indicate a maximum rather than a minimum 10 reps 100 reps true standard DiGAR DiGAR* standard DiGAR DiGAR* y(2) 4.0 4.8 4.0 4.2 4.0 4.1 4.1 y(3) 3.5 9.9 3.7 3.9 4.0 3.1 3.5 y(4) 3.4 4.4 3.3 3.3 4.0 1.8 3.1 y(5) 3.1 9.8 3.4 3.6 4.1 1.0 2.8 1000 reps 10000 reps true standard DiGAR DiGAR* standard DiGAR DiGAR* y(2) 4.0 3.9 4.0 4.0 4.3 4.0 4.0 y(3) 3.5 4.2 3.4 3.4 0.3 3.4 3.4 y(4) 3.4 4.4 2.9 3.1 -25.2 3.0 3.0 y(5) 3.1 4.4 2.5 2.8 -6.4 2.6 2.7 OLS DiGAR model may be sufficiently robust for applications such as sequential RSM. Finally, Table 2.2 provides the estimates of x∗ implied by the fitted functions as a function of the number of replications, where it is evident that the quadratic fit provided by standard regression is problematic for optimization purposes even with 10000 replications at each point; with less than 10000 replications per point, the curvature is always in the incorrect direction for the c ≈ 27 case. 44 2.3.3 Multi-dimensional Example: U/U/1 Queue Now we consider a multi-dimensional case using the same single-server queue- ing example. Specifically, we take both the interarrival times and service times to be i.i.d. following uniform distributions, i.e., a U/U/1 queue with respective dis- tribution parameterizations U(θ1 − δ1, θ1 + δ1) and U(θ2 − δ2, θ2 + δ2), so that the input variable is the four-dimensional vector x = (θ1, θ2, δ1, δ2). The output func- tions considered are the same four as in the previous example, i.e., the mean system time of the 2nd, 3rd, 4th, and 5th customers. Again using the Lindley equation, the true theoretical dependence of the expected system time on the input distribu- tional parameters can be calculated analytically and are included in the Electronic Companion Section A.1. The standard regression model is yi = β0 + β1θ1 + β2θ2 + β3δ1 + β4δ2 + i, and the DiGAR model adds gji = βj + ji , where gji , j = 1, 2, 3, 4 represents the deriva- tive of yi with respect to θ1, θ2, δ1, δ2, respectively. We simulated a two-level cen- tered full-factorial design (thus, 17 design points), with center point (θ1, θ2, δ1, δ2) = (10, 8, 8, 7), corresponding to U(2, 18) and U(1, 15) interarrival and service time dis- tributions, respectively, with a spacing of ±0.1 (for all four dimensions) for the design points. The metrics used are analogous to the ones used previously: L2 error and mean squared error (MSE), where the latter also takes into account the bias, i.e., MSE = Var(βˆi) + (βˆi − β∗i )2, where β∗i is the true value obtained using the analytical formula, and the sample mean and sample variance are calculated for each βˆji , i = 1, 2, 3, 4, j = 1, 2, · · · , N, based on N macroreplications with an equal 45 number of replications at each design point: ¯ˆβi = 1 N N∑ j=1 βˆji , s2i = Var(βˆi) = 1 N − 1 N∑ j=1 (βˆji − ¯ˆβi)2. For simplicity, only two DiGAR models are used: the basic one where no weights need to be determined, i.e., (2.23) with αj = α0 ∀j; and α∗-DiGAR, for which the “optimal” (proportional to variance) weights are estimated offline using 10,000 replications. As before, the output at each design point is based on the mean of 10 replications, and the number of macroreplications is also N = 10. The results shown in Table 2.3 indicate that both DiGAR estimators have smaller MSE than the slope estimators from standard linear regression models by about two orders of magnitude, where α∗-DiGAR models reduce MSEs as well as variances further compared to the basic DiGAR models. In terms of L2 errors, both DiGAR models are substantially better than standard linear regression model, but here the improvement achieved by using optimal weights is small. Note that as in the 1-D M/M/1 queue example, there are cases where standard linear regression model gives the incorrect sign for an estimated slope. To visualize the results in terms of the slopes, boxplots for estimators βˆi, i = 1, 2, 3, 4, for the 2nd, 3rd, 4th and 5th customers are shown in Figure 2.4. Each box- plot is labelled as “parameter:model”. For instance, “θ1:Linear Reg” represents the boxplot for βˆ1 estimators using the standard linear regression model. The true gradient values calculated from the analytical formulas are indicated in the boxplots by stars. The boxplots further illustrate that the variances of the estimators ob- tained from standard linear regression are significantly larger than variances of the 46 estimators using the DiGAR models, differing by about two orders of magnitude. The differences between DiGAR and α∗-DiGAR models are not clear from the box- plots, but the sample standard deviations suggest that α∗-DiGAR models are able to reduce variances of the estimators by choosing optimal weights. DiGAR and α∗-DiGAR models also provide better estimators in terms of ab- solute errors. The median values are indicated by red segments in all boxplots. Visually, the median values from the DiGAR models are indistinguishable from the true gradient values indicated by red stars, while median values from standard linear regression models are far away from the true gradient values in most cases. Com- pared to standard linear regression model, the sample means ¯ˆβi from DiGAR and α∗-DiGAR models are also much closer to the true gradient values. The effect of increasing the number of replications at each design point is shown in Table 2.4, which provides experiment results for all slope estimators for y(2). As expected, all of the models show improvement, but the relative advantages of the DiGAR models – well over two orders of magnitude improvement in MSE and between one to two orders of magnitude improvement in L2 – is retained. 2.3.4 Multi-dimensional Example: Sphere Function Lastly, we consider a more stylized example to test the robustness of DiGAR models when the gradient estimates have much larger variances and for different levels of correlations between the response and gradient estimates. We consider the sphere function defined by f(x) = ∑ni=1(xi)2, with partial derivative ∂f(x)/∂xi = 47 Table 2.3: Parameter estimates and performance metrics for U/U/1 queue (10 repli- cations per design point), based on 10 macroreplications (boxed entries indicate incorrect sign of slope) Linear Regression DiGAR α∗-DiGAR y(2) true value value MSE value MSE value MSE βˆ1 -0.375 -2.498 26.1 -0.397 0.0024 -0.378 0.0009 βˆ2 1.375 -1.031 26.2 1.355 0.0030 1.377 0.0009 βˆ3 0.171 4.486 25.9 0.215 0.0036 0.175 0.0008 βˆ4 0.146 0.835 19.9 0.155 0.0038 0.149 0.0021 L2 8.31 0.141 0.139 y(3) βˆ1 -0.720 -2.140 22.7 -0.742 0.0048 -0.729 0.0068 βˆ2 1.720 -1.292 22.5 1.701 0.0078 1.728 0.0069 βˆ3 0.279 3.052 18.3 0.337 0.0060 0.312 0.0032 βˆ4 0.255 -0.164 18.7 0.233 0.0036 0.237 0.0014 L2 6.95 0.088 0.086 y(4) βˆ1 -1.065 -3.243 22.0 -1.029 0.0169 -1.009 0.0141 βˆ2 2.065 0.538 31.2 1.995 0.0203 2.008 0.0142 βˆ3 0.362 5.199 50.1 0.430 0.0108 0.386 0.0029 βˆ4 0.360 -0.319 26.7 0.305 0.0114 0.311 0.0069 L2 11.0 0.220 0.216 y(5) βˆ1 -1.427 -1.544 13.1 -1.309 0.0367 -1.307 0.0327 βˆ2 2.427 1.111 20.4 2.296 0.0396 2.306 0.0328 βˆ3 0.431 3.542 45.0 0.483 0.0059 0.455 0.0024 βˆ4 0.467 1.469 39.4 0.413 0.0099 0.403 0.0064 L2 10.0 0.174 0.171 48 β 1: Line ar R eg β 1: DiG AR θ 1: α∗ - DiG AR β 2: Line ar R eg β 2: DiG AR θ 2: α∗ - DiG AR β 3:L inea r Re g β 3: DiG AR β 3: α∗ - DiG AR β 4:L inea r Re g β 4: DiG AR β 4: α∗ - DiG AR 10 5 0 5 10 Es tim ate d P ara me ter Va lue s β 1: Line ar R eg β 1: DiG AR θ 1: α∗ - DiG AR β 2: Line ar R eg β 2: DiG AR θ 2: α∗ - DiG AR β 3:L inea r Re g β 3: DiG AR β 3: α∗ - DiG AR β 4:L inea r Re g β 4: DiG AR β 4: α∗ - DiG AR 10 8 6 4 2 0 2 4 6 8 Es tim ate d P ara me ter Va lue s β 1: Line ar R eg β 1: DiG AR θ 1: α∗ - DiG AR β 2: Line ar R eg β 2: DiG AR θ 2: α∗ - DiG AR β 3:L inea r Re g β 3: DiG AR β 3: α∗ - DiG AR β 4:L inea r Re g β 4: DiG AR β 4: α∗ - DiG AR 10 5 0 5 10 15 20 Es tim ate d P ara me ter Va lue s β 1: Line ar R eg β 1: DiG AR θ 1: α∗ - DiG AR β 2: Line ar R eg β 2: DiG AR θ 2: α∗ - DiG AR β 3:L inea r Re g β 3: DiG AR β 3: α∗ - DiG AR β 4:L inea r Re g β 4: DiG AR β 4: α∗ - DiG AR 10 5 0 5 10 15 20 Es tim ate d P ara me ter Va lue s Figure 2.4: U/U/1 queue box plots of four estimated slopes for y(k) based on 10 macroreplications 49 Table 2.4: Parameter estimates and performance metrics for U/U/1 queue y(2) w.r.t. # replications/design point Linear Regression DiGAR α∗-DiGAR # reps true value value MSE value MSE value MSE 1 βˆ1 -0.375 -3.168 155 -0.397 0.025 -0.371 0.009 βˆ2 1.375 -9.977 270 1.265 0.035 1.370 0.009 βˆ3 0.171 6.012 141 0.236 0.019 0.182 0.007 βˆ4 0.146 -6.365 67.2 0.057 0.036 0.116 0.026 L2 54.0 1.227 1.213 10 βˆ1 -0.375 -2.498 26.1 -0.397 0.0024 -0.378 0.0009 βˆ2 1.375 -1.031 26.2 1.355 0.0030 1.377 0.0009 βˆ3 0.171 4.486 25.9 0.215 0.0036 0.175 0.0008 βˆ4 0.146 0.835 19.9 0.155 0.0038 0.149 0.0021 L2 8.31 0.141 0.139 100 βˆ1 -1.065 -0.914 3.08 -1.004 0.00434 -1.005 0.00422 βˆ2 2.065 2.074 2.39 2.006 0.00482 2.005 0.00424 βˆ3 0.362 0.996 4.84 0.379 0.00084 0.373 0.00026 βˆ4 0.360 -0.484 1.61 0.309 0.00278 0.316 0.00227 L2 1.03 0.036 0.037 1000 βˆ1 -0.375 -0.338 0.230 -0.377 0.00005 -0.378 0.00003 βˆ2 1.375 1.673 0.162 1.380 0.00005 1.378 0.00003 βˆ3 0.171 0.294 0.105 0.174 0.00002 0.173 0.00001 βˆ4 0.146 0.181 0.124 0.146 0.00002 0.145 0.00001 L2 0.0525 0.0009 0.0008 50 2xi. We used n = 4 in our numerical experiments, so the standard and DiGAR models are the same as in the previous example. To control the variances and correlations, the noises i and (1i , 2i , . . . , 4i ) form a random vector in R5 following a multivariate normal distribution with mean 0 and covariance matrix Σ. Four different levels of correlations are considered: the independent case (ρ = 0), and relatively low, medium, and high levels ρ = 0.2, 0.5, 0.8. The noise variances were set at 10, 20, 30, 40, 50 for i, 1i , 2i , 3i , 4i , respectively. A two-level centered full factorial design with center point (x1, x2, x3, x4) = (1.0,−0.6, 0.8,−0.5) was considered. Two sets of experiments were performed at two different grid sizes for the two-level design, namely 0.5 and 0.05, which correspond to cases where design points are spread out or fairly close. The mean at each design point was based on 10 replications, and 1000 macroreplications were used to estimate the MSE, with the results summarized in Tables 2.5 and 2.6, where DiGAR refers to the OLS DiGAR and α∗-DiGAR uses “optimal” (proportional to variance) weights. In Table 2.5, results across different levels of correlations are consistent. The α∗-DiGAR model is the best among three models. Since the variances increase from 1i to 4i , this leads to increases in MSE from βˆ1 to βˆ4 in both DiGAR models, but both DiGAR models still outperform the standard linear regression models and the α∗-DiGAR model reduces the MSE further from the OLS DiGAR model. The results in Table 2.6 show higher MSEs, due to the more tightly spaced design points, which leads to larger variance in the estimators. For the standard model, the MSEs are about 100 times higher, whereas the DiGAR models are only two to three times higher, i.e., the relative advantage of the DiGAR models in- 51 creases significantly in this setting, consistent with Proposition 2.5 and the remarks following it. As expected, increasing the variances of gradient estimators leads to inflation of MSE in the DiGAR models; however, the relative advantage of the DiGAR models is retained. The effects of changing the level of correlation does not affect the performances of both DiGAR models under these two experiment designs as shown in Table 2.5 and 2.6. We also conducted experiments at three different levels of negative correlations. In this case, the MSE increases noticeably for all models with increasing magnitude of the level of correlation, but the relative performance of the models is similar to those for positive correlation, with the advantage of the DiGAR models even greater for the larger magnitudes. 2.4 Conclusion In this chapter we proposed an augmented regression method that exploits the availability of direct gradient estimators in certain simulation settings. In some basic settings, we analytically characterized the improvement obtained over the standard model by calculating the variance of the estimated parameters and showing under which conditions guaranteed performance improvement can be expected. A simple queue was then used to numerically investigate the improvements, and the numer- ical results indicated great promise for the approach, with the general observation that the DiGARs models are qualitatively able to capture trends that the standard model might miss, e.g., there were several cases where the standard model gave the 52 incorrect sign of the slope or was oriented in the opposite direction for the quadratic fit. Not surprisingly, the DiGAR slope estimators had much smaller variance than the standard estimator in all of the experiments conducted, retaining the relative advantage even in the higher-dimensional numerical examples. Although only a small portion of results are provided here, these observations hold for many other experiments for this simple queueing model. Of particular note is the observation that in the queueing example where the gradient estimators are highly correlated with the dependent output data, the α- DiGAR models clearly outperform the standard regression models and for all prac- tical purposes do as well as the correlated DiGAR models, and significantly better when the number of simulation replications at each design point is relatively small. Since the basic DiGAR estimators are quite easy to implement, this has practical implications for immediate use in those settings in which gradient estimators are available. Thus, we recommend using an α-DiGAR estimator in settings where the number of replications is relatively low and the gradient estimate is reasonably ac- curate, which is generally the case if IPA is applicable. However, more investigation into the effects of misspecification of the correlation structure is warranted before more conclusive statements can be made for the DiGAR* GLS estimators, and the effects are likely to be highly dependent on the application setting. Furthermore, using design of experiments, e.g., along the lines of [49], in choosing the design points could possibly ameliorate the misspecification problem. 53 Table 2.5: Parameter estimates and performance metrics for the sphere function for two-level centered full factorial design around (1,−0.6, 0.8,−0.5); 10 replication per design point, gridsize 0.5, 1000 macroreplications. Linear Regression DiGAR α∗-DiGAR correlation true value value MSE value MSE value MSE zero ρ = 0 βˆ1 2.0 2.001 0.249 2.007 0.093 2.006 0.088 βˆ2 -1.2 -1.238 0.259 -1.206 0.124 -1.215 0.109 βˆ3 1.6 1.589 0.265 1.615 0.174 1.605 0.127 βˆ4 -1.0 -1.016 0.265 -0.976 0.202 -0.993 0.136 low ρ = 0.2 βˆ1 2.0 2.009 0.241 1.999 0.091 2.000 0.083 βˆ2 -1.2 -1.235 0.250 -1.200 0.132 -1.209 0.106 βˆ3 1.6 1.571 0.239 1.602 0.159 1.591 0.113 βˆ4 -1.0 -1.013 0.258 -1.002 0.207 -1.007 0.140 medium ρ = 0.5 βˆ1 2.0 2.002 0.252 2.019 0.084 2.017 0.076 βˆ2 -1.2 -1.203 0.266 -1.192 0.127 -1.195 0.108 βˆ3 1.6 1.597 0.252 1.605 0.165 1.602 0.125 βˆ4 -1.0 -1.002 0.265 -1.000 0.202 -1.001 0.137 high ρ = 0.8 βˆ1 2.0 2.016 0.246 1.980 0.093 1.985 0.087 βˆ2 -1.2 -1.191 0.249 -1.220 0.123 -1.212 0.104 βˆ3 1.6 1.612 0.230 1.584 0.163 1.594 0.120 βˆ4 -1.0 -1.017 0.262 -1.033 0.200 -1.026 0.128 54 Table 2.6: Parameter estimates and performance metrics for the sphere function for two-level centered full factorial design around (1,−0.6, 0.8,−0.5); 10 replication per design point, gridsize 0.05, 1000 macroreplications. Linear Regression DiGAR α∗-DiGAR correlation true value value MSE value MSE value MSE zero ρ = 0 βˆ1 2.0 2.070 23.660 2.007 0.116 2.007 0.115 βˆ2 -1.2 -1.395 24.683 -1.194 0.169 -1.195 0.169 βˆ3 1.6 1.584 24.645 1.607 0.248 1.607 0.247 βˆ4 -1.0 -1.099 24.780 -1.002 0.301 -1.003 0.300 low ρ = 0.2 βˆ1 2.0 2.087 24.469 1.991 0.124 1.992 0.124 βˆ2 -1.2 -1.481 26.743 -1.201 0.182 -1.202 0.182 βˆ3 1.6 1.503 25.016 1.608 0.228 1.608 0.226 βˆ4 -1.0 -0.919 26.534 -0.991 0.311 -0.990 0.310 medium ρ = 0.5 βˆ1 2.0 2.003 25.409 1.980 0.124 1.980 0.125 βˆ2 -1.2 -1.206 25.095 -1.232 0.177 -1.232 0.177 βˆ3 1.6 1.672 26.327 1.590 0.222 1.590 0.221 βˆ4 -1.0 -1.144 24.180 -1.011 0.290 -1.012 0.288 large ρ = 0.8 βˆ1 2.0 1.839 24.427 1.979 0.116 1.979 0.116 βˆ2 -1.2 -1.367 25.998 -1.221 0.171 -1.222 0.170 βˆ3 1.6 1.494 25.830 1.563 0.229 1.563 0.227 βˆ4 -1.0 -1.052 24.366 -1.042 0.281 -1.043 0.278 55 Chapter 3: Gradient Extrapolated Stochastic Kriging 3.1 Introduction Consider the same stochastic simulation setting as in Chapter 2, where direct gradient information are available. DiGAR model proposed in Chapter 2 incor- porate gradient estimates into regression models by building a separate model for the gradient estimates. Aside from regression, kriging has also been studied exten- sively in the deterministic simulation community (see, for example, [23] and [50]). Stochastic kriging was introduced by [21] as an extension of kriging in the stochas- tic simulation setting. Stochastic kriging provides flexible metamodels of simulation output performance measurements while taking simulation noise into consideration. [25] introduced stochastic kriging with gradient estimators (SKG) to exploit gradient estimates in stochastic kriging, showing that the new approach provides better prediction with smaller mean squared error (MSE). This approach is similar to cokriging proposed in deterministic simulations [26], and requires differentiability of the correlation functions, since derivatives of random processes or random fields are used to model gradient estimates. We take a different approach to incorporate gradient estimates into stochastic kriging and investigate the potential improvements. A new approach called Gradient 56 Extrapolated Stochastic Kriging (GESK) is proposed, which extrapolates additional responses in the neighborhood of each design point using the original responses and gradient estimates. These additional responses, which might be biased, lead to better predictions than stochastic kriging if step sizes for extrapolations are chosen carefully. The main idea is to further explore the response surface with simulation responses and gradient estimates, so that a metamodel with better overall accuracy can be constructed. This suggests that GESK models are superior when there are limited number of design points or a response surface contains multiple extreme values. To investigate the performance of GESK, we analyze the possible reduction in MSE of the GESK model over the standard stochastic kriging model, under two simplified and tractable settings. Conditions that guarantee reduction in MSE are provided as well. We also conduct numerical experiments to illustrate the effective- ness of the GESK model. Numerical results show that GESK performs comparably well or outperforms competing approaches such as stochastic kriging and SKG. Moreover, in certain cases, GESK captures fluctuations of the response surface that are usually missed by the other two approaches. An important part of implementing the GESK model is the choice of step size. Large step sizes usually lead to large approximation errors and deteriorate prediction accuracy; small step sizes gain little information from extrapolations and might lead to numerical stability issues. We formalize two different strategies, penalized maximum likelihood estimation (PMLE) and minimizing integrated mean squared error (IMSE), to determine optimal step sizes. A cross validation method is 57 presented to determine the regularization parameters required by each of the PMLE and IMSE approaches. We discuss pros and cons for each approach and compare them empirically with numerical examples. In this section, we review stochastic kriging introduced in [21] and stochastic kriging with gradient estimators (SKG) introduced in [25], and then present the GESK approach. 3.1.1 Stochastic Kriging Stochastic kriging was introduced in [21], focusing on modeling unknown response surfaces in stochastic simulation settings. Given an experiment design {(xi, ni)}, i = 1, 2, · · · , k, ni simulation replications are run at each design point xi. Let Yj(xi) be the simulation output from replication j at design point xi, j = 1, . . . , ni, and xi = (xi1, xi2, . . . , xid)ᵀ ∈ Rd. Stochastic kriging models the output as Yj(xi) = f(xi)ᵀβ + M(xi) + j(xi), (3.1) where f(xi) ∈ Rp is a vector with known functions of xi, β ∈ Rp is a vector with unknown parameters to be estimated. Components in f(xi) can be viewed as basis functions and a polynomial basis is usually adopted in the literature. The term f(xi)ᵀβ represents the trend of the overall response surface. It is assumed that M is a realization of a zero-mean stationary random process (or random field) of the second order. This assumption is inherited from the deterministic kriging literature, where the stochastic nature of M is imposed on the problem so that statistical 58 inference can be applied. For this reason, M is sometimes referred to as extrinsic uncertainty. This is contrasted with the term j(xi), which is the simulation noise for replication j taken at xi. The uncertainty in j(xi) comes from the nature of stochastic simulation, and it is sometimes referred to as intrinsic uncertainty. Given the simulation responses {Yj(xi)}nij=1, i = 1, 2, . . . , k, the sample mean of response output and simulation noise values at xi are denoted by Y¯(xi) = 1 ni ni∑ j=1 Yj(xi), ¯(xi) = 1 ni ni∑ j=1 j(xi). (3.2) The averaged responses Y¯(xi) at xi is modeled as Y¯(xi) = f(xi)ᵀβ + M(xi) + ¯(xi). Suppose that we would like to predict the response Y(x0) at any point x0. Let Y¯ = ( Y¯(x1), Y¯(x2), . . . , Y¯(xk) )ᵀ. Let ΣM be the k× k covariance matrix implied by the random field M and Σ be the k×k covariance matrix implied by the simulation noise across all design points {x1,x2, · · · ,xk}. Let ΣM(x0, ·) be the k × 1 vector (Cov(M(x0),M(x1)), . . . ,Cov(M(x0),M(xk)))ᵀ, which represents spatial covariances between a prediction point x0 and all design points. Also, define the k × p design matrix F as F = (f(x1), f(x2), . . . , f(xk))ᵀ. Suppose that ΣM, Σ and β are known. Then the MSE-optimal predictor at x0 is of the form Ŷ(x0) = f(x0)ᵀβ + ΣM(x0, ·)ᵀ[ΣM + Σ]−1(Y¯ − Fβ), (3.3) with corresponding MSE MSE ( Ŷ(x0) ) = ΣM(x0,x0)−ΣM(x0, ·)ᵀ [ΣM + Σ]−1 ΣM(x0, ·), (3.4) 59 where ΣM(x0,x0) is the spatial variance of the random field at x0. To build a stochastic kriging metamodel in practice requires imposing some structure on the spatial covariance matrix ΣM(·, ·). It is usually assumed that the spatial covariance between M(xi) and M(xj) is ΣM(xi,xj) = Cov [M(xi),M(xj)] = τ 2RM(xi,xj; θ), (3.5) where τ 2 is the spatial variance of the random field and RM is a correlation function with parameter θ. The assumption that M is second-order stationary allows us to write RM(xi,xj; θ) = RM(|xi − xj|; θ), i.e., the correlation depends only on the distance between xi and xj. Common candidates for the correlation function include the triangular correlation function, the Gaussian correlation function and the Mate´rn correlation function, etc. See [51] for a detailed discussion on effects of using different correlation functions in stochastic kriging. 3.1.2 Stochastic Kriging With Gradient Estimators We review the framework of stochastic kriging with gradient estimators (SKG) introduced by [25]. SKG builds stochastic kriging models for gradient estimators upon the stochastic kriging model for simulation responses. These two types of models are estimated together and applied to approximate response surfaces. Suppose that we observe not only the simulation responses Yj(xi), but also unbiased gradient estimates Gj(xi) ∈ Rd for the jth simulation replication at design point xi. Given an experimental design {(xi, ni)}ki=1, let the gradient estimate from replication j at design point xi be Gj(xi) = ( G1j (xi), . . . ,Gdj (xi) )ᵀ. In the SKG 60 framework, each response Yj(xi) is modeled the same as in stochastic kriging and each gradient estimator Grj (xi), r = 1, . . . , d, is modeled as Grj (xi) = (∂f(xi) ∂xir )ᵀ β + ∂M(xi)∂xir + δrj (xi). (3.6) This is valid under the following conditions: • The function f(xi) is differentiable with respect to xi. • The second-order mixed derivative of the correlation function RM in (3.5) exists and is continuous. Let G¯r(xi) and δ¯r(xi), r = 1, . . . , d, be the sample average of the gradient estimates and simulation noise, respectively, associated with xi: G¯r(xi) = 1 ni ni∑ j=1 Grj (xi), δ¯r(xi) = 1 ni ni∑ j=1 δrj (xi). The SKG framework models the averaged simulation responses and gradient esti- mates as follows: Y¯(xi) = f(xi)ᵀβ + M(xi) + ¯(xi), G¯r(xi) = (∂f(xi) ∂xir )ᵀ β + ∂M(xi)∂xir + δ¯r(xi). To satisfy the conditions required for (3.6) to hold, a common choice for the cor- relation function is the Gaussian correlation function. Let ΣM+ be the variance- covariance matrix including spatial covariances induced by M, spatial covariances induced by derivatives of M and those between M and its partial derivatives. Let ΣM+(x0, ·) be the vector analogous to ΣM(x0, ·) in stochastic kriging. We assume replications across design points are independent. In addition, simulation noise j 61 and δj are assumed to be independent from M. The covariance matrix Σ+ induced by simulation noise can be estimated by the sample covariances in practice. Let Y¯+ be the vector containing sample averages of response estimates and gradient estimates at all design points: Y¯+ = ( Y¯(x1), . . . , Y¯(xk), G¯1(x1), . . . , G¯1(xk), . . . , G¯d(x1), . . . , G¯d(xk) )ᵀ . The design matrix F in Section 2.1 now becomes F+, which can be written as F+ = ( f(x1), . . . , f(xk), (∂f(x1) ∂x11 ) , . . . , (∂f(xk) ∂xk1 ) , . . . , (∂f(x1) ∂x1k ) , . . . , (∂f(xk) ∂xkk ))ᵀ . When β is known, the SKG predictor and the corresponding MSE can be obtained by substituting Y¯+,F+,ΣM+ ,ΣM+(x0, ·) and Σ+ for Y¯ ,F,ΣM,ΣM(x0, ·) and Σ in (3.3) and (3.4), respectively. Under some simplified settings, [25] shows that SKG can reduce MSE by incorporating gradient estimates. Numerical experiments also demonstrate the advantage of SKG in improving prediction performance over stochastic kriging. 3.2 Gradient Extrapolated Stochastic Kriging We propose a different approach for incorporating the gradient estimates called Gradient Extrapolated Stochastic Kriging (GESK). Again, let Gj(xi) ∈ Rd be the gradient estimator at xi from replication j. Instead of modeling gradient esti- mates Gj(xi) as partial derivatives of the response surface, the gradient estimates are simply viewed as noisy measurements of the true gradient G(xi) ∈ Rd, i.e., Gj(xi) = G(xi) + δj(xi), where {δj(xi)}nij=1 represent the zero-mean independent 62 identically distributed noise across different replications at the design point xi. De- note the sample mean of gradient estimates at xi by G¯(xi) = 1 ni ni∑ j=1 Gj(xi). Notice that the response estimate Yj(xi) and the gradient estimate Gj(xi) within the same replication j are generally correlated. To incorporate gradient estimates into stochastic kriging, we extrapolate in the neighborhood of the original design points xi, i = 1, 2, · · · , k. Specifically, linear extrapolation is used to obtain additional responses as follows: x+i = xi + ∆xi, Yj(x+i ) = Yj(xi) + Gj(xi)ᵀ∆xi, (3.7) where ∆xi = (∆xi1,∆xi2, . . . ,∆xid)ᵀ, and the step size ∆xi needs to be small rela- tive to the spacing of xi. For simplicity, we assume that only one additional point is added in the neighborhood of xi and that the same step size is used for all design points, i.e., ∆xi = ∆x, i = 1, 2, . . . , k. Extensions include using more sophisti- cated extrapolation techniques and extrapolating multiple additional responses in the neighborhood of xi. Let Y¯(x+i ) be the sample average of these extrapolated response outputs, which is defined similarly as Y¯(xi) in (3.2). For ease of notation, let Y¯i = Y¯(xi) and Y¯+i = Y¯(x+i ). Let Y¯+ be the 2k × 1 vector containing both the original responses and the additional responses: Y¯+ = ( Y¯1, Y¯2, · · · , Y¯k, Y¯+1 , Y¯+2 , · · · , Y¯+k )ᵀ . 63 Similarly, x+ is defined as x+ = (x1,x2, · · · ,xk,x+1 ,x+2 , · · · ,x+k )ᵀ. The sample average of the additional responses Y¯+i are modeled similarly to the original responses Y¯i, i.e., Y¯+i = Y¯(x+i ) = f(x+i )ᵀβ + M(x+i ) + ¯(x+i ). It is worth mentioning that this approach of incorporating gradient information is not restricted to stochastic kriging, but it is a general approach that can be applied to other metamodel approaches. The following assumptions are made: Assumption 3.1. 1. Simulations across design points are conducted indepen- dently, i.e., the use of common random numbers (CRN) is not considered. 2. For any design point xi, the noise j(xi) are independent across replications. 3. The random field M is independent of all noise j(xi) and j(x+i ), for each design point xi and replication j. 4. The simulation noise ¯(xl) is independent of ¯(x+h ) for h 6= l. [52] finds that using CRN in stochastic kriging inflates mean squared errors generally. Assuming independence across replications and independence between M and simulation noise is inherited from the stochastic kriging literature. The last assumption says that the original simulation response is correlated with its corresponding extrapolated response, but not other extrapolated responses. 64 Let Σ+M be the 2k×2k variance-covariance matrix implied by the extrinsic spa- tial correlation model with 2k design points, including extrapolated design points: Σ+M =                       Cov[M(x1),M(x1)] · · · Cov[M(x1),M(xk)] Cov[M(x1),M(x + 1 )] · · · Cov[M(x1),M(x + k )] ... ... ... ... ... ... Cov[M(xk),M(x1)] · · · Cov[M(xk),M(xk)] Cov[M(xk),M(x + 1 )] · · · Cov[M(xk),M(x + k )] Cov[M(x+1 ),M(x1)] · · · Cov[M(x + 1 ),M(xk)] Cov[M(x + 1 ),M(x + 1 )] · · · Cov[M(x + 1 ),M(x + k )] ... ... ... ... ... ... Cov[M(x+k ),M(x1)] · · · Cov[M(x + k ),M(xk)] Cov[M(x + k ),M(x + 1 )] · · · Cov[M(x + k ),M(x + k )]                       , where each entry in Σ+M can be computed by (3.5) with a given correlation function RM and spatial variance τ 2. Let ¯+ ∈ R2k be the augmented vector of mean simulation noise: ¯+ = ( ¯(x1), . . . , ¯(xk), ¯(x+1 ), . . . ¯(x+k ) )ᵀ . Under Assumption 3.1, let Σ+ be the 2k × 2k variance-covariance matrix induced by ¯+, which can be expressed as Σ+ =                     Var[¯(x1)] 0 . . . 0 Cov[¯(x1), ¯(x+1 )] 0 . . . 0 0 Var[¯(x2)] . . . 0 0 . . . . . . 0 ... ... . . . ... ... ... . . . ... 0 0 . . . Var[¯(xk)] 0 0 . . . Cov[¯(xk), ¯(x+k )] Cov[¯(x+1 ), ¯(x1)] 0 . . . 0 Var[¯(x + 1 )] 0 . . . 0 0 . . . . . . 0 0 Var[¯(x+1 )] . . . 0 ... ... . . . ... ... ... . . . ... 0 0 . . . Cov[¯(x+k ), ¯(xk)] 0 0 . . . Var[¯(x + k )]                     . Let x0 be a prediction point and Σ+M(x0, ·) be a 2k × 1 vector Σ+M(x0, ·) = ( Cov[M(x0),M(x1)], . . . ,Cov[M(x0),M(x+k )] )ᵀ , 65 which represents spatial covariances between x0 and design points, including those extrapolated design points. The augmented design matrix F+ can be expressed as F+ = ( f(x1), . . . , f(xk), f(x+1 ), . . . , f(x+k ) )ᵀ . When β, Σ+M and Σ+ are known, the MSE-optimal predictor from the GESK model and its corresponding MSE can be constructed by substituting Y¯+, F+, Σ+M(x0, ·), Σ+M and Σ+ for Y¯ , F, ΣM(x0, ·), ΣM and Σ in (3.3) and (3.4), respectively. In practice, β, Σ+M and Σ+ are unknown and need to be estimated. The augmented matrix Σ+M is characterized by the spatial variance τ 2 and correla- tion function with parameters θ. We assume that the simulation noise vectors +j = ( j(x1), . . . , j(xk), j(x+1 ), . . . j(x+k ) )ᵀ are multivariate normally distributed with mean zero and covariance matrix Σ+ . Given the assumption, we first esti- mate Σ+ . Our approach to estimate Var[¯(xi)], Var[¯(x+i )] and Cov[¯(xi), ¯(x+i )], i = 1, 2, . . . , k will be described in the following. Estimation of Var[¯(xi)] is V̂ar[¯(xi)] = 1 ni [ 1 ni − 1 ni∑ j=1 ( Yj(xi)− Y¯(xi) )2 ] . Estimation for Var[¯(x+i )] can be done in a similar fashion by replacing Yj(xi) by Yj(x+i ). The covariance Cov[¯(xi), ¯(x+i )] is estimated by the sample covariance as Ĉov[¯(xi), ¯(x+i )] = 1 ni [ 1 ni − 1 ni∑ j=1 ( Yj(xi)− Y¯(xi) ) ( Yj(x+i )− Y¯(x+i ) ) ] . This provides us an estimate Σ̂+ for Σ+ . Combining this with normality assump- tions, we can estimate the set of parameters (β, τ 2,θ) together using maximum likelihood estimators (MLEs) as described in [21]. 66 Key to implementing the GESK model is the choice of step sizes for the ex- trapolated points, which depends on analyzing the potential improvements in per- formance from the GESK model as well as the approximation errors introduced by extrapolation. A good GESK model should take this bias-variance type tradeoff into consideration. We consider two tractable models: a two-point problem and a k- point problem with known model parameters. Under these two settings, we analyze the potential improvement in MSE by the GESK model over the stochastic kriging model, and provide conditions under which such improvement can be guaranteed. In addition, we also analyze the effects of step sizes on MSE following the discussions of the two-point problem and the k-point problem in Sections 3.2.1 and 3.2.2. Understanding the effects of step sizes will provide insights for determining step sizes, which will be discussed later in detail in Section 3.3. In the following discussion, we continue to assume that the same step size is used for extrapola- tion at each design point and only one additional response is extrapolated in the neighborhood of each original design point. The step size ∆x determines the MSE of the GESK predictor through several factors: the biases ζi in the extrapolated responses; the correlation ρi between the simulation noise of original responses and extrapolated responses; and the variances σ2i+ of the simulation noise in extrapolated responses. Since linear extrapolation is employed in the GESK model, the bias ζi in the extrapolated response Y¯+i is bounded by K||∆x||2 for some K > 0. This suggests that the upper bound of biases can be controlled by choosing different step sizes. The correlation ρi depends on both the step size and the covariance between the simulation noise of the responses 67 and those of the gradient estimators. A larger step size or smaller covariance leads to smaller correlation factor ρi, whereas σ2i+ changes as the step size changes, but also depends on the sign of the correlation ρi. Effects of these factors will be discussed in detail in the following using the two-point problem and k-point problem as in Sections 3.2.1 and 3.2.2. 3.2.1 A Two-Point Problem with Single Extrapolated Point Consider a one-dimensional problem (d = 1) of two design points x1 and x2 with numbers of replications n1 and n2, respectively. Without loss of generality, let x1 < x2 and the prediction point be x0 ∈ [x1, x2]. The simulation outputs include responses {Yj(xi)}nij=1 for i = 1, 2 at both design points and gradient estimators {Gj(x1)}n1j=1 at x1 only. A constant trend is used to represent the overall surface mean, i.e., f(xi)ᵀβ = β0. All parameters (β0, τ 2, θ) are assumed to be known. Let the spatial variance τ 2 > 0 and ril be the correlation between M(xi) and M(xl), i, l = 0, 1, . . . , k. The correlation ril can be calculated from the correla- tion function RM(xi, xl; θ), but no specific correlation function is specified in this discussion. Let the variance of the simulation noise at xi from replication j be Var[j(xi)] = σ2i . Let Y¯ = (Y¯1, Y¯2)ᵀ be the vector containing the sample means of responses at x1 and x2. The stochastic kriging predictor at x0 is given as Yˆ(x0) = β0 +τ 2 (r1(τ 2 + σ 2 2 n2 )− r2τ 2r12)(Y¯1 − β0) + (r2(τ 2 + σ 2 1 n1 )− r1τ 2r12)(Y¯2 − β0) (τ 2 + σ 2 1 n1 )(τ 2 + σ 2 2 n2 )− τ 4r212 , (3.8) 68 with corresponding MSE MSE ( Yˆ(x0) ) = τ 2 ( 1− τ 2 (r201 + r202)τ 2 + r201σ 2 2 n2 + r 2 02σ 2 1 n1 − 2r01r02r12τ 2 (τ 2 + σ 2 1 n1 )(τ 2 + σ 2 2 n2 )− τ 4r212 ) . (3.9) With a pre-specified step size ∆x, a new design point x+1 = x1 + ∆x in the interval [x1, x2] is added and GESK extrapolates its response as Yj(x+1 ) = Yj(x1)+∆xGj(x1). This additional response output is modeled as Yj(x+1 ) = β0 + M(x+1 ) + j(x+1 ). To address approximation error introduced by extrapolation, we assume that j(x+1 ) is normally distributed with mean ζi = ζ(xi) and variance σ21+ ; thus, the extrapolated responses Yj(x+1 ) at x+1 are biased unless ζi = 0. Let Y¯+1 be the sample mean of responses at x+1 and the vector Y¯+ = (Y¯1, Y¯2, Y¯+1 )ᵀ. Let ρ1 be the correlation between ¯(x1) and ¯(x+1 ) and ri1+ be the correlation between M(xi) and M(x+1 ) for i = 0, 1, 2. The variance-covariance matrix Σ+ = Σ+M + Σ+ takes the form Σ+ = τ 2         1 r12 r11+ r12 1 r21+ r11+ r21+ 1         +         σ21 n1 0 ρ1 σ1σ1+ n1 0 σ 2 2 n2 0 ρ1 σ1σ1+ n1 0 σ 2 1+ n1         =     Σ b bᵀ c     , where Σ is the 2 × 2 covariance matrix of the vector (Y¯1, Y¯2)ᵀ, b is a 2 × 1 vec- tor and c = τ 2 + σ21+/n1. The vector containing covariances between M(x0) and ( M(x1),M(x2),M(x+1 ) )ᵀ is Σ+M(x0, ·) = τ 2         r01 r02 r01+         =     ΣM(x0, ·) τ 2r01+     . 69 The new predictor at x0 from the GESK model is Ŷ+(x0) = Ŷ(x0) + 1 v [ bᵀΣ−1(Y¯ − β012)− (Y¯+1 − β0) ] [ ΣM(x0, ·)ᵀΣ−1b− τ 2r01+ ] , (3.10) where Ŷ(x0) is defined in (3.8) and v = c− bᵀΣ−1b. The following theorem provides an expression for MSE(Ŷ+(x0)) and conditions under which the GESK predictor in (3.10) has smaller MSE than that in (3.8). Theorem 3.1. The MSE of the predictor in (3.10) can be expressed as MSE ( Ŷ+(x0) ) = MSE ( Ŷ(x0) ) + (ζ21 v2 − 1 v ) [ ΣM(x0, ·)ᵀΣ−1b− τ 2r01+ ]2 , (3.11) and the GESK predictor has a smaller MSE if ζ21 < v. Proof. The MSE of GESK predictor MSE ( Yˆ+(x0) ) follows from straightforward calculation and details are provided in the Appendix. Both Σ+ and Σ are variance- covariance matrices, and det ( Σ+ ) = det (Σ) det ( c− bᵀΣ−1b ) = v det (Σ) , and it follows that v > 0 since both det(Σ+) and det(Σ) are positive. Since v > 0, the condition ζ21 < v is well defined. Under this condition, the GESK predictor has a smaller MSE than the stochastic kriging predictor. Theorem 3.1 provided the change in MSE at a prediction point x0: ∆MSE = (ζ21 v2 − 1 v ) [ ΣM(x0, ·)ᵀΣ−1b− τ 2r01+ ]2 . We summarize our findings in this setting as follows: 70 1. The bias ζ1 must satisfy ζ21 < v as shown in Theorem 3.1 to guarantee reduction in MSE. Since ζ1 is proportional to (∆x)2, intuitively the step size should be relatively small. 2. The greater the correlation ρ1, the greater the reduction in MSE. The param- eter ρ1 also depends on the correlation between j(x1) and δj(x1), namely, the simulation noise of output responses and gradient estimators. The parameter ρ1 increases as the correlation between j(x1) and δj(x1) increases. 3. The parameter σ21+ represents the noise in an extrapolated response Yj(x + 1 ). The reduction in MSE is greater if σ21+ is smaller. All conditions seem to prefer using smaller step sizes. However, other diffi- culties arise if the step sizes are too small: first, because the quantity v becomes smaller as ∆x becomes smaller, the condition ζ21 < v may not hold; second, as ∆x approaches zero, the correlation ρ1 approaches 1 and this may make the matrix Σ+ ill-conditioned, which leads to numerical issues. 3.2.2 A k -Point Problem In this section, we consider a tractable problem with k design points, where xi ∈ Rd, under the following assumptions: 1. Along with the response outputs Yj(xi), gradient estimators Gj(xi) are also collected from simulations at design points {xi}ki=1. 2. Only one additional response is extrapolated in the neighborhood of each 71 design point. 3. The trend f(xi)ᵀβ = β0 and all parameters (β0, τ 2,θ) are known. Within the region of interest, an additional response Y+j (xi) is extrapolated using each pair of observations (Yj(xi),Gj(xi)). All extrapolated design points should be in the interior of the design region; therefore, extrapolations from design points at the boundary should be done cautiously. Let Y¯+ be the 2k × 1 vector that consists of sample means of all response outputs Y¯+ = (Y¯1, Y¯2, . . . , Y¯k, Y¯+1 , Y¯+2 , . . . , Y¯+k )ᵀ, where Y¯i = Y¯(xi) and Y¯+i = Y¯(x+i ). The sample mean of original responses at xi are modeled as in Section 3.2. The sample mean of extrapolated responses are modeled similarly, i.e., Y¯(x+i ) = β0 + M(x+i ) + ¯(x+i ). Let ρi denote the correlation between ¯(xi) and ¯(x+i ). The spatial correlations between original design points and extrapolated design points are denoted as ril = Corr[M(xi),M(xl)], ril+ = Corr[M(xi),M(x+l )] and ri+l+ = Corr[M(x+i ),M(x+l )] for i, l = 1, 2, . . . , k. The 2k × 2k variance-covariance matrix Σ+ = Σ+M + Σ+ can be expressed in a block form Σ+ =     Σ B Bᵀ C     , (3.12) 72 where Σ =             τ 2 + σ21/n1 r12 · · · r1k r12 τ 2 + σ22/n2 · · · r2k ... ... . . . ... r1k r2k · · · τ 2 + σ2k/nk             , B =             τ 2r11+ + ρ1 σ1σ1+ n1 τ 2r12+ · · · τ 2r1k+ τ 2r12+ τ 2r22+ + ρ2 σ2σ2+ n2 · · · τ 2r2k+ ... ... . . . ... τ 2r1k+ τ 2r2k+ · · · τ 2rkk+ + ρk σkσk+ nk             , C =             τ 2 + σ21+/n1 τ 2r1+2+ · · · τ 2r1+k+ τ 2r1+2+ τ 2 + σ22+/n2 · · · τ 2r2+k+ ... ... . . . ... τ 2r1+k+ τ 2r2+k+ · · · τ 2 + σ2k+/nk             . Given a prediction point x0, let Σ+M(x0, ·) be a 2k × 1 vector that consists of the spatial covariances between x0 and all design points, Σ+M(x0, ·) = ( ΣM(x0,x1), . . . ,ΣM(x0,xk),ΣM(x0,x+1 ), . . . ,ΣM(x0,x+k ) )ᵀ = ( ΣᵀM(x0, ·) Σ ᵀ M+(x0, ·) )ᵀ , where the both ΣM+(x0, ·) and ΣM+(x0, ·) are k × 1 vectors. As in the analysis of the two-point problem, an important issue to address is the approximation error introduced by extrapolation. Let the noise terms (x+i ) 73 at x+i follow normal distributions with means ζi = ζ(xi), which implies that the additional response outputs Yj(x+i ) are biased if ζi 6= 0. We will analyze the effects of incorporating them in the following. Let the vector ζ ∈ R2k be ζ = (0, 0, . . . , 0, ζ1, ζ2, . . . , ζk)ᵀ = (0ᵀk ζ ᵀ k) ᵀ , which represents the expectation of the 2k × 1 noise vector ¯+. Let Ŷ+(x0) be the GESK predictor at x0. The MSE of the GESK predictor for this k-point problem is MSE ( Ŷ+(x0) ) = Σ+M(x0,x0)−Σ+M(x0, ·)ᵀ [ Σ+M + Σ+ ]−1 Σ+M(x0, ·) + ( Σ+M(x0, ·)ᵀ [ Σ+M + Σ+ ]−1 ζ )2 = MSE ( Ŷ2k(x0) ) + ( Σ+M(x0, ·)ᵀ [ Σ+M + Σ+ ]−1 ζ )2 . (3.13) The first term MSE ( Ŷ2k(x0) ) is the MSE of prediction that one would obtain if unbiased responses are collected at 2k design points, namely, running simulations at x+i to collect response estimates rather than extrapolating additional response estimates. The second term is the inflation of MSE caused by approximation errors ζ in the additional extrapolated responses. Let Ŷ(x0) be the stochastic kriging predictor with k design points. Our interest is to compare the MSE of the GESK predictor Ŷ+(x0) with that of Ŷ(x0). To achieve this, we begin by looking into the MSE of Ŷ2k(x0). Using the Woodbury matrix identity and block inverse formula in linear alge- bra, the MSE of Ŷ2k(x0) can be expressed as MSE ( Ŷ2k(x0) ) = Σ+M(x0,x0)−Σ+M(x0, ·)ᵀ ( Σ+ )−1 Σ+M(x0, ·) = MSE ( Ŷ(x0) ) − ωᵀVω, (3.14) 74 where ω = BᵀΣ−1ΣM(x0, ·)−ΣM+(x0, ·) and V = (C−BᵀΣ−1B)−1. Lemma 3.2. The matrix V = (C−BᵀΣ−1B)−1 is positive definite. Proof. Consider the 2k × 2k covariance matrix Σ+, Σ+ =     Σ B Bᵀ C     . First it is easy to see that Σ+ is positive definite, so [ u v ]     Σ B Bᵀ C         u v     > 0, for any k × 1 vector u,v ∈ Rk. This leads to uᵀΣu + 2vᵀBᵀu + vᵀCv > 0. For a fixed vector v, consider f(u) = uᵀΣu + 2vᵀBᵀu + vᵀCv as a function of u. The first-order condition shows that the minimum of f(u) is min u f(u) = vᵀ(C−BᵀΣ−1B)v, which has to be positive for any v ∈ Rk. Therefore the matrix C − BᵀΣ−1B is positive definite and its inverse V = (C−BᵀΣ−1B)−1 is also positive definite. Since the matrix V is positive definite, it follows immediately that MSE ( Ŷ2k(x0) ) = MSE ( Ŷ(x0) ) − ωᵀVω ≤ MSE ( Ŷ(x0) ) , where equality only holds if and only if ω = 0. Thus, not surprisingly, the MSE is reduced if the k additional response outputs are unbiased. 75 Next we investigate the effect of the extrapolated bias on the overall MSE. Combining (3.13) and (3.14) gives MSE ( Ŷ+(x0) ) = MSE ( Ŷ2k(x0) ) + ( Σ+M(x0, ·)ᵀ [ Σ+M + Σ+ ]−1 ζ )2 = MSE ( Ŷ(x0) ) − ωᵀVω + ( Σ+M(x0, ·)ᵀ [ Σ+M + Σ+ ]−1 ζ )2 = MSE ( Ŷ(x0) ) − ωᵀVω + (( ΣᵀM+(x0, ·)V −Σ ᵀ M(x0, ·)Σ−1BV ) ζk )2 = MSE ( Ŷ(x0) ) − ωᵀVω + (ωᵀVζk)2 = MSE ( Ŷ(x0) ) + ωᵀVζkωᵀVζk − ωᵀVω = MSE ( Ŷ(x0) ) + ωᵀVζkζᵀkVω − ωᵀVω = MSE ( Ŷ(x0) ) + ωᵀ (VζkζᵀkV −V)ω. (3.15) The next theorem provides a sufficient condition under which MSE ( Ŷ+(x0) ) is smaller than MSE ( Ŷ(x0) ) . Theorem 3.3. Let λi(A) denote the ith largest eigenvalue of matrix A, i = 1, 2, . . . , k. The symmetric matrix W = VζkζᵀkV −V is negative definite if ζᵀkζk ≤ λk(V) [λ1(V)]2 . (3.16) Proof. Using Weyl’s inequality in matrix theory and Corollary 11 in [53], the largest eigenvalue λ1(W) of W satisfies λ1(W) = λ1 (VζkζᵀkV −V) ≤ λ1 (VζkζᵀkV) + λ1(−V) = λ1 (VζkζᵀkV)− λk(V) ≤ [λ1(V)]2 λ1(ζkζᵀk)− λk(V). 76 The k × k matrix ζkζᵀk is known as a dyad, which has one positive eigenvalue ζ ᵀ kζk and k−1 zero eigenvalues provided that ζk 6= 0. It follows that the largest eigenvalue of ζkζᵀk is λ1(ζkζ ᵀ k) = ζ ᵀ kζk. Applying condition (3.16), we have λ1(W) < 0, namely, the largest eigenvalue of W is negative, so all eigenvalues of W are negative, and therefore the matrix W is negative definite. When the matrix W is negative definite, the quantity ωᵀWω is always neg- ative unless ω = 0, so the GESK model reduces MSE for the k-point problem if (3.16) holds. Remark 3.4. The condition in (3.16) is well defined, as the matrix V is shown to be positive definite in Lemma 3.2. Theorem 3.3 shows that when the biases are relatively small, the reduction in MSE from including the additional extrapolated points still exceeds the inflation in MSE introduced from the bias of the extrapolated points. In addition to assumptions in Section 3.2.2, we also assume that the k design points are widely spread such that the spatial correlation between design points is approximately 0. A similar assumption is used in [25] to isolate the impacts of incorporating gradient estimators from spatial covariances. This implies that the matrix Σ in (3.12) is a diagonal matrix. As the step size ∆x is usually small, we assume the same property holds for B and C in (3.12) also. The change in MSE of the GESK predictor is ∆MSE = ωᵀ (VζkζᵀkV −V)ω, 77 where ω = BᵀΣ−1ΣM(x0, ·) − ΣM+(x0, ·) and V = (C−BᵀΣ−1B)−1. The effects of ∆x, ρi and σ2i+ are summarized as follows: 1. Theorem 3.3 suggests that the quantity ζᵀkζk needs to be small enough to guarantee that GESK can reduce MSE. This condition requires the step size ∆xj in each dimension to be small. 2. Regarding the correlation ρi, the preferable sign of the correlation actually depends on the location of the prediction point x0. A condition between ∆x and xi−x0 determines the preferable sign of ρi. A specific analytical form for the condition depends on the type of correlation function, larger |ρi| is better in each favorable case. 3. If the correlation ρi ≈ 0, smaller σ2i+ is preferable, since it suggests that there is less noise in the extrapolated responses. The same conclusion holds when the correlation ρi is positive. However, if the correlation ρi < 0, smaller σ2i+ is not necessarily better, as there exists an optimal σ2i+ that reduces MSE the most. Analyzing the effects of step size on MSE in a general setting is more difficult, especially in multidimensional problems. For example, step size used in a multidi- mensional problem may be different along different directions. Choosing good step sizes is crucial for building the GESK models. In the next section, we propose two different approaches to determine the optimal step size. 78 3.3 Implementations of GESK In this section, we focus on two important questions in the implementation of the GESK model: choosing step sizes and choosing gradient estimators. We pro- vide two different techniques for determining steps sizes and discuss their pros and cons. We also make recommendations between the infinitesimal perturbation anal- ysis (IPA) and the likelihood ratio/score function (LR/SF) techniques for gradient estimation. As discussed in the previous section, central to building a GESK model is the determination of appropriate step sizes. A good choice of step size is crucial to the performance of the GESK model. Different step sizes, even with the same data set, may lead to dramatical performance differences of GESK models. The linear extrapolation used in GESK is only appropriate in a small neighborhood of the design points, so the step size cannot be too large. If the step size is too small, the additional points obtained from linear extrapolations provide little information and it may cause numerical stability issues. Two natural choices for determining step sizes are maximum likelihood esti- mation (MLE) and minimizing integrated mean squared error (IMSE). However, the MLE approach is not suitable for determining step sizes in this context. The MLE approach leads to step sizes as small as possible, which results in numerical stability issues when building the GESK model. One unique characteristic of the GESK model is that biases are introduced during extrapolations. Although the biases are unknown, they should be taken into consideration during parameter esti- 79 mations. To accomplish this, penalty terms are introduced in MLE and IMSE. We formalize two approaches for determining step sizes: penalized maximum likelihood estimation and minimizing integrated mean squared error. 3.3.1 Penalized Maximum Likelihood Estimation One natural choice for determining the step size ∆x is to treat it as a new parameter in addition to the other parameters (β, τ 2,θ). Under Assumption 1 in [21], we can write down the likelihood function. However, as mentioned earlier, naive MLE is not suitable for choosing step sizes in this case. Assuming the correlation function in (3.5) is used, we propose a penalized maximum likelihood method where the penalized likelihood function takes the following general form: Q(β, τ 2,θ,∆x) = − ln [ (2pi)k ] − 1 2 ln [∣ ∣Σ+M + Σ+ ∣ ∣ ] − 1 2(Y¯ + − F+β)ᵀ [ Σ+M + Σ+ ]−1 (Y¯+ − F+β)− pλ(∆x), where pλ(·) is a given nonnegative penalty function with a regularization param- eter λ. Common choices of penalty functions include L1 penalty, L2 penalty and smoothly clipped absolute deviation (SCAD). The proposed penalty function is pλ(∆x) = λ||∆x||−2, where ∆x = (∆x1,∆x2, . . . ,∆xd) and ||∆x||−2 := d∑ i=1 (∆xi)−2. Therefore the pro- 80 posed penalized likelihood function is Q(β, τ 2,θ,∆x) = − ln [ (2pi)k ] − 1 2 ln [∣ ∣Σ+M + Σ+ ∣ ∣ ] − 1 2(Y¯ + − F+β)ᵀ [ Σ+M + Σ+ ]−1 (Y¯+ − F+β)− λ||∆x||−2. (3.17) Penalized maximum likelihood estimation (PMLE) has been used to do vari- able selection [54] and overcome flat likelihood function issues [55] for kriging. One key difference is that previous PMLE approaches try to improve the quality of esti- mates for (β, τ 2,θ), while we propose to use PMLE for choosing step sizes. 3.3.2 Minimizing Integrated MSE Another view is to connect the problem of finding step sizes with design of experiments (DOE). Choosing step sizes is similar to adding new design points in DOE. In deterministic and stochastic kriging literature, many criteria have been proposed to find the “best” experiment design, most of which are based on MSE. Using integrated mean squared error (IMSE) as the objective function, the problem can be formulated as Minimize ∆x IMSE = ∫ x0∈Ω MSE+ ( Ŷ(x0; ∆x) ) dx0, (3.18) where Ω is the region of interest. Lower IMSE suggests smaller deviation associated with the approximation over the region of interest. In practice, a penalty term involving step sizes is added 81 to MSE+, MSE+(Ŷ(x0; ∆x)) = Σ+M(x0,x0)−Σ+M(x0, ·)ᵀ[Σ+M + Σ+ ]−1Σ+M(x0, ·) + λ||∆x||2, (3.19) where the Euclidean norm of ∆x is used. Adding a penalty term that is proportional to ||∆x||2 in MSE+ follows the discussion earlier. The PMLE approach estimates all the parameters (β, τ 2,θ) with ∆x simulta- neously. However, the IMSE approach requires τ 2 and θ to be known in advance. In practice, a two-stage strategy is proposed to address this issue: 1. In Stage 1, use the original dataset { xi, Y¯(xi) }k i=1 to obtain MLEs for (βˆ, τˆ 2, θˆ). 2. Calculate MSE+ ( Ŷ(x0; ∆x) ) in (3.18) with the estimated (βˆ, τˆ 2, θˆ) and a predetermined penalty constant λ. 3. In Stage 2, minimize the IMSE in (3.19) to find the optimal step size. In our implementation, we use the optimization routine fmincon in Matlab. 3.3.3 Choosing Regularization Parameter The question of selecting the regularization parameter λ in both approaches remains to be answered. We propose to use cross validation (CV), which is widely used in statistics and machine learning community, to choose the regularization parameters. Cross validation allows us to assess the performance with different regularization parameters without running additional simulations. When a J-fold cross validation is applied for a given regularization parameter λ, a corresponding 82 score CV(λ) can be calculated. Given the design points {xi}ki=1 with the averaged simulation output {Y¯(xi), G¯(xi)}ki=1, the CV score is calculated as follows: 1. Split the dataset D = {xi, Y¯(xi), G¯(xi)}ki=1 into J subsets D1,D2, . . . ,DJ . All the design points located on the boundary are handled separately so that extrapolated design points are still in the space of interest. 2. For j = 1, 2, . . . , J , choose all the design points in Dj as prediction points and build a GESK model using D\Dj. Predict the response Ŷ(x) for every x ∈ Dj. 3. Compute the CV score for a given parameter λ as a sum of squared errors between the prediction Ŷ(xi) and the averaged output Y¯(xi) on Dj, CV(λ) = J∑ j=1 ∑ (xi,Y¯i)∈Dj ( Y¯i − Ŷ(xi) )2 . To start the cross validation, we need to choose a set of regularization parameters Λ = {λ1, λ2, . . . , λL}. We compute the CV score for each λl ∈ Λ and choose the best regularization parameter λ∗ as λ∗ = arg min λl∈Λ CV(λl). Regarding the set of parameters Λ, if Λ contains a sufficiently large number of points and computational time is not an issue, cross validation will choose the best regularization parameter for building a GESK model. In practice, one way to choose the candidate parameter linearly on a logarithmic scale, for example. 10−1, 100, . . . , 103. This is essentially an optimization problem. When the size of the set Λ is small, an exhaustive search algorithm can find λ∗ easily. If the size of 83 Λ get large, randomized search algorithm can be applied as well. In our numerical experiments, CV score is computed for each parameter in Λ to find λ∗. The performance of the two proposed approaches, PMLE and IMSE, will be investigated via numerical examples in Section 3.4 using different test problems. We summarize their main features and differences as follows: • Both methods need a regularization parameter to be calibrated, as unknown biases are taken into consideration in both methods. We propose using cross validation methods to determine regularization parameters. • PMLE uses a penalty function to overcome small step size issues in using naive MLE. The PMLE approach takes biases into consideration, but does not guarantee good performance in MSE or IMSE. However, in high-dimensional problems, maximizing a penalized likelihood is usually computationally faster than integrating MSE. • IMSE minimizes the IMSE over the design region, but, in high-dimensional problems, numerical integrations become computationally expensive, requiring Monte Carlo methods with long computation time. 3.3.4 Choosing Gradient Estimators In this chapter, we only consider direct gradient estimators. Specifically, we focus on the infinitesimal perturbation analysis (IPA) and likelihood ratio/score function (LR/SF) methods. Under mild conditions, both techniques are able to provide unbiased gradient estimators, but we would like to know which technique is 84 preferable in building a GESK model, provided that both methods are applicable. Observations made in [56] and [25] suggest the following: (i) at a given point, correlations between the responses and the corresponding gradient estimates are higher when IPA is applied, (ii) IPA gradient estimators usually have smaller vari- ances than LR/SF estimators, (iii) IPA gradient estimators have better performance when applied in stochastic kriging with gradient estimators (SKG). Discussions in Sections 3.2.1 and 3.2.2 suggests that the GESK model prefers gradient estimators that are highly correlated with response estimates and have smaller variances. Therefore, under most settings, IPA gradient estimators are preferable to build a GESK model. IPA gradient estimators are employed in the M/M/1 example conducted in Section 3.4. 3.4 NUMERICAL EXAMPLES In this section, several numerical experiments are conducted to illustrate the proposed GESK model. Our goal in this section is three-fold: To demonstrate the effects of different step sizes on the performance of the GESK model; to empiri- cally compare the effectiveness of the PMLE and IMSE approaches in determining step sizes; to examine the performance of the GESK model in different settings and compare it with stochastic kriging [21] and stochastic kriging with gradient estima- tors (SKG) [25]. Implementation of SKG and GESK are built upon software for stochastic kriging downloaded from http://www.stochastickriging.net. Across all experiments, we assume little information is known about the re- 85 sponse surface and choose constant trends for all models, i.e., f(x)ᵀβ = β0. A Gaussian correlation function RM(x,x′) = exp{−θ‖x − x′‖2} is used for all the experiments, since it satisfies the conditions required by SKG. We implemented both PMLE and IMSE approaches discussed in Section ?? to determine step sizes, together with the cross validation method to choose regu- larization parameters. The corresponding GESK models are named GESK-PMLE and GESK-IMSE. The measure of performance we chose is the Empirical IMSE (EIMSE), as used in [57] and other kriging literature: EIMSE = 1N N∑ i=1 ( Ŷ(xi)− Y(xi) )2 , (3.20) where N is the number of predictions, Ŷ(xi) is the predicted response at xi and Y(xi) is the true value at xi. 3.4.1 Experiment on Step Sizes in GESK We investigate the effects of using different step sizes in the GESK models using an M/M/1 queue example [58]. The M/M/1 queue has arrival rate 1 and service rate x ∈ [1.1, 2]. We are interested in the steady-state expected waiting time y(x), which has an analytical solution y(x) = 1/(x(x− 1)). In our simulation, each sample path was initialized in steady state and simulated for 5000 customers. The outputs collected were the average waiting time and its derivative with respect to the service rate x. Six different experiment designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200), (20, 200), were used in the experiment, where the first element in each pair gives the 86 Design GESK-1 GESK-2 GESK-3 GESK-PMLE GESK-IMSE (6, 50) 0.085 (0.0036) 0.167 (0.0035) 0.618 (0.0149) 0.042 (0.0020) 0.027 (0.0017) (6, 200) 0.094 (0.0023) 0.181 (0.0019) 0.731 (0.0064) 0.034 (0.0011) 0.024 (0.0010) (6, 1000) 0.092 (0.0011) 0.180 (0.0010) 0.747 (0.0037) 0.038 (0.0006) 0.021 (0.0004) (8, 200) 0.006 (0.0004) 0.016 (0.0006) 0.194 (0.0023) 0.005 (0.0003) 0.002 (0.0002) (10, 200) 0.006 (0.0005) 0.007 (0.0011) 0.017 (0.0012) 0.007 (0.0007) 0.004 (0.0003) (20, 200) 0.002 (0.0008) 0.006 (0.0003) 0.048 (0.0020) 0.003 (0.0003) 0.001 (0.0001) Table 3.1: Averaged EIMSE from 100 macroreplications for GESK models under six designs (# of design points, # of reps) to predict expected waiting time in the M/M/1 queue example. Three fixed step sizes with those determined by PMLE and IMSE are compared. Standard errors are shown in parenthesis. number of design points and the second element represents the number of replica- tions at each design point. With equally spaced design points, three predetermined step sizes were chosen for each design, which correspond to 1/10, 1/5 and 1/2 of the length of the subin- terval. GESK models built with these step sizes are labelled as GESK-1, GESK-2 and GESK-3, respectively. We ran the experiments for 100 macroreplications. Within each macroreplication, we chose N = 1000 to estimate the EIMSE in (3.20). Table 3.1 shows the sample mean and standard errors of EIMSE, and Figure 3.1 contains boxplots for the EIMSE. Our findings are summarized as follows: • Predetermined step sizes vs. Optimal step sizes. Performances of 87 GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE 0.0 0.2 0.4 0.6 0.8 1.0 1.2 E I M S E Design(6, 50) Design(6, 200) Design(6, 1000) (a) Design (6, 50), (6, 200), (6, 1000) GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE GESK-1GESK-2GESK-3 PMLE IMSE 0.00 0.05 0.10 0.15 0.20 0.25 E I M S E Design(8, 200) Design(10, 200) Design(20, 200) (b) Design (8, 200), (10, 200), (20, 200) Figure 3.1: Boxplots of EIMSE from 100 macroreplications for the GESK models under six designs (# of design points, # of reps) to predict expected steady-state waiting time in the M/M/1 queue example. 88 the two optimal step sizes are better than those of predetermined step sizes, especially when the number of design points is small. This is expected, as the choice of step sizes should adapt to the experiment design and simulation output. • PMLE vs. IMSE. The performance of IMSE is better than that of PMLE under most experiment designs, in terms of having smaller averaged EIMSE, smaller variances of EIMSE and smaller number of outliers. Figure 3.2 shows boxplots for step sizes determined by PMLE and IMSE under all six designs. • Effect of number of design points. When the number of design points is small, for example k = 6, improvements in EIMSE are more significant. How- ever, when there are already enough design points, improvements are hardly noticeable. In addition, for both PMLE and IMSE, the relative step size (ra- tio to the size of the subinterval) generally increases as the number of design points increases. • Effect of number of replications. As the number of replications increases, the variances of EIMSE become smaller as shown in Table 3.1 and Figure 3.1. However, changes in the averaged IMSE are not significant. Variances of the chosen step sizes seem to decrease as well, as shown in Figure 3.2. 89 0.004 0.006 0.008 0.010 0.012 0.014 0.016 Step sizes Design (6, 50) Design (6, 200) Design (6, 1000) Design (8, 200) Design (10, 200) Design (20, 200) (a) Step sizes determined by PMLE 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 Step sizes Design (6, 50) Design (6, 200) Design (6, 1000) Design (8, 200) Design (10, 200) Design (20, 200) (b) Step sizes determined by IMSE Figure 3.2: Boxplots for step sizes determined by PMLE and IMSE based on 100 macroreplications under six designs (# of design points, # of reps) to predict the expected steady-state waiting time in the M/M/1 queue example. 90 3.4.2 Comparisons among SK, SKG and GESK In this section, we will compare the performances of three different meta- models: stochastic kriging, stochastic kriging with gradient estimators (SKG) and gradient extrapolated stochastic kriging (GESK) in three different experiments. The theoretical analysis of GESK in Section ?? assumes that all parameters are known, whereas the comparison here is empirical when all parameters must be estimated. 3.4.2.1 A stochastic simulation example We used the same M/M/1 queue example as in Section 3.4.1. Six different ex- periment designs were adopted as well. We ran the experiments for 100 macrorepli- cations. Within each macroreplication, we chose N = 1000 to estimate the EIMSE in (3.20). Results are shown in Table 3.2 and Figure 3.3. These 100 macroreplica- tions used the same random numbers as those used in Section 3.4.1, so numbers for the two GESK models in Table 3.3 and corresponding boxplots in Figure 3.4 are the same as those in Section 3.4.1. Our findings are summarized as follows: • SK vs. SKG vs. GESK. Not surprisingly, SKG and GESK perform better than SK, as incorporating gradient estimators provides more informa- tion about the response surface. GESK-IMSE and SKG perform better than GESK-PMLE in most cases, since objective adopted by PMLE is not directly related to MSE. Both GESK models perform comparably well or better than the SKG model. It is not easy to distinguish GESK-IMSE and SKG, since 91 Design SK SKG GESK-PMLE GESK-IMSE (6, 50) 0.313 (0.0134) 0.031 (0.0036) 0.042 (0.0020) 0.027 (0.0017) (6, 200) 0.324 (0.0062) 0.016 (0.0007) 0.034 (0.0011) 0.024 (0.0010) (6, 1000) 0.328 (0.0027) 0.016 (0.0003) 0.038 (0.0006) 0.021 (0.0004) (8, 200) 0.054 (0.0019) 0.002 (0.0003) 0.005 (0.0003) 0.002 (0.0002) (10, 200) 0.009 (0.0004) 0.004 (0.0014) 0.007 (0.0007) 0.004 (0.0003) (20, 200) 0.004 (0.0002) 0.004 (0.0004) 0.003 (0.0003) 0.001 (0.0001) Table 3.2: Averaged EIMSE from 100 macroreplications for SK, SKG and GESK with six different designs on estimating the expected steady-state waiting time in an M/M/1 queue problem. The design (6, 50) means 6 design points with 50 repli- cations at each design point. Standard errors are shown in parentheses. 92 their performances are really close. • Number of design points. Incorporating gradient estimators improves per- formance considerably when the design points are sparse. For example, both SKG and GESK have more significant improvement over stochastic kriging when k = 6. As the number of design points increases, performance of most models improves. • Number of replications. As the number of replications increases with a fixed number of design points, the variance of EIMSE decreases for all three methods, as shown in Figures 3.3(a). However, the averaged EIMSE does not improve significantly. 3.4.2.2 A stylized example with added noise We consider a one-dimensional example from [23], where the true response surface is Y(x) = exp(−1.4x) cos(7pix/2) with x ∈ [−2, 0]. The presence of mul- tiple local extreme values on the response surface makes building a good meta- model difficult. The simulation response output at x from replication j is Yj(x) = exp(−1.4x) cos(7pix/2) + j(x), with j(x) ∼ N (0, 1). Direct gradient estimates are assumed of the form Gj(x) = Y′(x) + δj(x) as the gradient estimate at x from sim- ulation replication j, with δj(x) ∼ N (0, 25). We let δj(x) have a larger variance in order to empirically investigate the performances of SKG and GESK when gradient estimates are noisier. We ran the experiments for 100 macroreplication. Within each macroreplica- 93 Design SK SKG GESK-PMLE GESK-IMSE (6, 50) 39.616 (0.0374) 2.044 (0.0106) 1.909 (0.0181) 1.828 (0.0111) (6, 200) 39.586 (0.0192) 2.023 (0.0049) 1.830 (0.0091) 1.757 (0.0050) (6, 1000) 39.581 (0.0084) 7.652 (0.8823) 1.829 (0.0033) 1.758 (0.0023) (8, 200) 2.793 (0.0039) 0.069 (0.0009) 0.063 (0.0026) 0.068 (0.0025) (10, 200) 0.949 (0.0026) 1.243 (0.4204) 0.178 (0.0008) 0.012 (0.0007) (20, 200) 0.008 (0.0002) 0.001 (0.0001) 0.046 (0.0004) 0.004 (0.0004) Table 3.3: Averaged EIMSE from 100 macroreplications for SK, SKG and GESK with five different designs on y(x) = exp(−1.4x) cos(7pix/2)+. Standard errors are shown in parentheses. tion, we chose N = 1000 to estimate the EIMSE in (3.20). Six different experiment designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200) and (20, 200) were adopted, with results shown in Table 3.3 and Figure 3.4. Notice that ln(EIMSE) values are shown in Figure 3.4, as EIMSE results from the three models differ substantially. • SK vs. SKG vs. GESK. As shown in Figure 3.4, both the SKG and the GESK models are better than SK when there is a limited number of design points. The GESK models perform better than SKG when k = 6. The expla- nation is that the response surface has several fluctuation and extrapolation allows GESK models to explore and approximate the response surface better than the others. SKG performs better when there are enough design point, for example, k = 20. SKG experiences numerical issues under designs (6, 1000) and (10, 200). 94 • Number of replications. When the number of replications increases, the variance of EIMSE decreases in general, as shown in Table 3.3 and Fig- ure 3.4(a), except for SKG in design (6, 1000). However, the averaged EIMSE does not change much as the number of replications increases, similar to the M/M/1 queue example. • Number of design points. We fixed the number of replications at 200 and increases the number of design points up to 20. Boxplots are shown in Figures 3.4(b). EIMSE results for all models improve as the number of design points increases, with the exception of SKG and GESK-PMLE with design (10, 200). • Step sizes. Step sizes determined by the PMLE and IMSE approaches are shown in Figure 3.5. The plots suggest relationships between experiment de- signs and step sizes: (i) relative step size (ratio to the size of the subinterval) increases generally when the number of design points increases, (ii) the vari- ability of step sizes decreases as the number of replications increases. • Remark. Performance of SKG with design (10, 200) shown in Table 3.3 and Figure 3.4(b) doesn’t seem to match each other. The reason is that several outliers outside of the range shown in Figure 3.4(b) are omitted. 95 3.4.2.3 A multidimensional example Lastly, we consider a stylized multidimensional example to test the perfor- mance of GESK models, especially when gradient estimates have much larger vari- ances. We consider the sphere function defined by Y(x) = 2∑ i=1 x2i + 4∑ i=3 10x2i . We chose the experimental design space as [−1, 1]4. The simulated response at x = (x1, x2, x3, x4) from replication j is Yj(x) = Y(x) + j(x) with the noise j(x) ∼ N (0, 1). The gradient estimate with respect to xr at x from replication j is given by Grj (x) = ∂Y(x)∂xr + δ r j (x) with δrj (x) ∼ N (0, 25). The added noise terms are mutually independent. We chose two different experiment designs: (20, 500) and (40, 500), which cor- respond to 20-point and 40-point Latin-hypercube designs with 500 independent replications at each design point, respectively. We collected simulation responses Yj(x) and gradient estimates Grj (x) for r = 1, 2, 3, 4, j = 1, 2, . . . , 500 to build metamodels. We ran the experiments for 100 macroreplications. Within each replication, we chose N = 1000 to estimate the EIMSE in (3.20). Figure 3.6 contains boxplots for the EIMSE calculated from the 100 macroreplications. Our findings are summarized as follows: • SKG and both GESK models perform better than SK. As the number of design points increases, the performances of all models improve. Under design (20500), GESK-IMSE seems to be the best; under design (40, 500), SKG is preferred due to its low average and low variance in EIMSE. 96 • Between the two GESK models, PMLE scales better than IMSE for high- dimensional problems. The IMSE approach requires multidimensional integra- tions to determine step sizes, which is expensive and depends on the accuracy of integration approximation in high-dimensional problems. • Step sizes determined by PMLE are generally much larger than those deter- mined by IMSE. Along each dimension, step sizes determined by PMLE and IMSE have similar behavior. If the step size chosen by PMLE is relatively smaller on one dimension, so is the step size chosen by IMSE. Step sizes cho- sen for a dimension with higher gradient values are not necessarily smaller than others. 3.5 CONCLUSIONS AND FUTURE RESEARCH In this paper we investigated gradient extrapolated stochastic kriging (GESK), which exploits the availability of direct gradient estimates in stochastic simulation settings. The performance of the GESK models was analyzed theoretically and nu- merically, with a focus on analyzing the approximation errors introduced by extrap- olation. Since step sizes are crucial to GESK models, two methods for determining step sizes were proposed and tested in numerical examples, which indicated substan- tial gains in performance over SK in all of the experiments. Between the proposed PMLE and IMSE approaches for determining step sizes, IMSE demonstrated better performance in numerical experiments, but it becomes computationally expensive for high-dimensional problems. The numerical experiments showed comparable per- 97 formance for GESK and SKG, except when the number of design points is very small, where GESK shows some advantage. From our analysis and numerical experiments, we offer the following overall conclusions: • GESK can be especially effective when the number of design points is relatively small, e.g., in the setting where simulation is expensive. • GESK offers additional flexibility in choosing the correlation function; in par- ticular, differentiability is not a constraint. • For high-dimensional problems, GESK using the PMLE is recommended, since its computation does not increase with dimension, whereas the computational burden increases exponentially for IMSE minimization and at least quadrati- cally for SKG. Our work points to several other directions for future research. The first direction is to focus on the extrapolation strategy in GESK. For this paper, we use linear extrapolation with the same step size and assume that only one additional point is extrapolated from each design point. More sophisticated techniques could use the local response surface information and adaptively determine the extrapolation strategy. This is especially important in higher-dimensional problems with multiple extreme values. Another direction is to further investigate a comparison of the SKG and GESK models. Improvements from incorporating gradient estimates can be expected from both models. However, it would be valuable to be able to characterize when one 98 model is likely to be more effective. A theoretical analysis of various properties comparing the two models can lead to useful guidelines for practitioners. 99 SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE 0.0 0.2 0.4 0.6 0.8 1.0 E I M S E Design(6, 50) Design(6, 200) Design(6, 1000) (a) Design (6, 50), (6, 200), (6, 1000) SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 E I M S E Design(8, 200) Design(10, 200) Design(20, 200) (b) Design (8, 200), (10, 200), (20, 200) Figure 3.3: Boxplots of EIMSE from 100 macroreplications for SK, SKG and GESK with six different designs on estimating the expected steady-state waiting time in an M/M/1 queue problem, corresponding to results in Table 3.2. 100 SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 l o g ( E I M S E ) Design(6, 50) Design(6, 200) Design(6, 1000) (a) Design (6, 50), (6, 200), (6, 1000) SK SKG PMLE IMSE SK SKG PMLE IMSE SK SKG PMLE IMSE −8 −6 −4 −2 0 2 l o g ( E I M S E ) Design(8, 200) Design(10, 200) Design(20, 200) (b) Design (8, 200), (10, 200), (20, 200) Figure 3.4: Boxplots of EIMSE from 100 macroreplications for SK, SKG and GESK with five different designs on y(x) = exp(−1.4x) cos(7pix/2) + , corresponding to results in Table 3.3. 101 0.000 0.005 0.010 0.015 0.020 Step sizes Design (6, 50) Design (6, 200) Design (6, 1000) Design (8, 200) Design (10, 200) Design (20, 200) (a) Step sizes determined by PMLE 0.000 0.002 0.004 0.006 0.008 0.010 Step sizes Design (6, 50) Design (6, 200) Design (6, 1000) Design (8, 200) Design (10, 200) Design (20, 200) (b) Step sizes determined by IMSE Figure 3.5: Boxplots for step sizes determined by PMLE and IMSE based on100 macroreplication in the stylized example with added noise. 102 SK SKG PMLE IMSE SK SKG PMLE IMSE −4 −2 0 2 4 6 8 l o g ( E I M S E ) Design(20, 500) Design(40, 500) Figure 3.6: Boxplots of EIMSE from 100 macroreplications for SK and GESK with two different Latin-hypercube designs on a four-dimensional function with added noise. 103 Chapter 4: Simulation Selection with Unknown Correlation Struc- ture 4.1 Introduction Consider a decision-maker who must identify the best among a finite set of design alternatives with unknown performance values. Stochastic simulation is used to estimate the performance of an alternative. More simulation experiments will produce better estimates; however, these experiments are expensive and time- consuming, limiting the simulation budget. We must use this budget efficiently to maximize the quality of the final selection decision. In many applications, the simulation budget is comparable to, or smaller than, the number of design alternatives. However, there may be correlations between the underlying mean performance values. Correlations can potentially allow us to handle much larger problems: a single piece of information about one alternative can now be used to learn about other alternatives with “similar” values. However, these similarities can be difficult to quantify or guess heuristically. Consider the following examples: 1. Wind farm placement. Given a set of candidate locations for a new wind farm 104 installation, we wish to select the one with the highest average power output. However, power output depends on volatile wind speeds and other physical factors like pressure gradient, frictional forces, wind currents, and topographi- cal features. These factors are difficult to quantify, but simulation can be used to estimate the net result [59]. Physical and topographical similarities induce complex correlations between locations. 2. Logistics management. In a vehicle routing problem with service choice [60], customers can request privilege for early delivery through bidding. The service provider will accept a set of requests if the total bid price exceeds the additional cost incurred by deviating from the optimal route. The number of acceptable sets of requests grows combinatorially. To solve this problem, we have to use the routing cost computed for one set of requests to infer the costs of other sets that contain one or more of the same customers. 3. Call center control. A call center administrator assigns agents in shifts to minimize average call waiting time. The administrator is uncertain about employee efficiency [61], making it difficult to determine the best assignment. Simulation can be used to test performances from different assignments. The performance of two different assignments will be correlated if the assignments involve the same agents. Simulation selection procedures consist of a statistical model of the decision- maker’s estimates of the performance values, and an optimization algorithm for choosing an alternative to simulate based on the current statistical estimates. In 105 the classical literature on ranking and selection (R&S), estimates are constructed using frequentist statistics, and decisions are made using the indifference-zone (IZ) approach pioneered by [62]; see also [63] for an overview of classical results. IZ methods guarantee asymptotic lower bounds for the probability of correct selec- tion (PCS), as long as the true underlying performance values are sufficiently far apart. The best-performing IZ methods include those by [64,65] and [66]. Numerous reviews and surveys are available, including [67], [68], [69] and [70]. Bayesian models for R&S consider the tradeoff between our estimates of the performance values and our uncertainty about those estimates. This is known as the “value of information” approach, going back to [27] and extended in later work. See [31] or [5] for an overview of value of information procedures (VIP). The optimal computing budget allocation (OCBA) methodology (see e.g. [71–73]), designed to maximize a Bayesian version of PCS, can also be included in the Bayesian category. Both frequentist [66] and Bayesian [29,30] methods are able to handle problems where the simulation output has unknown variance. However, most work on R&S typically makes independence assumptions on the estimates of performance values: under this assumption, a single experiment only provides information about a single alternative, making it difficult to handle large problems with a small simulation budget. Correlations have largely been studied in the context of common random numbers inside simulators; see [74] and [75] for IZ methods in this setting. [76] considers this problem from the perspective of OCBA. The present chapter, however, uses the term “correlation” in a broader sense. In the Bayesian setting, correlations can be used inside a distribution of belief as 106 a measure of the inherent similarities or differences between alternatives (e.g., the geographical similarities between two wind farm locations). As we show in this chapter, correlated beliefs can significantly improve the performances of simulation selection, even when the simulation output is completely independent. Recently, [77] studied Bayesian R&S with correlated beliefs ( [78] extends this analysis to include correlated simulation output), but under the restrictive assumption that the correlation structure was correctly specified by the decision-maker. By contrast, we develop a model where the correlation structure is unknown, and has to be learned together with the performance values. Our model has the ability to correct inaccurate prior beliefs as new information arrives. If the simulation output is correlated, we can also learn that correlation structures provided that some prior information about it is available. To our knowledge, [79] is the only work on Bayesian R&S to consider unknown correlation structures. Bayesian R&S procedures rely on conjugate prior distributions on the unknown model parameters in order to maintain computational tractability. The Wishart distribution is a well-known conjugate prior for an unknown covariance matrix, as- suming that we can simultaneously observe the performance of every alternative. See e.g. [80] or [81] for applications of the Wishart distribution in simulation meta- modeling and input uncertainty. However, in fully sequential R&S, we only sam- ple from one alternative at a time. There is no standard conjugate prior for this problem, although the statistics community has made several attempts to create one; see [82], [83] or [84] for examples. Unfortunately, these models either present computational difficulties in an R&S setting or cannot extract information about 107 multiple alternatives from a single scalar observation. [79] resolves this problem by imposing restrictions on the sampling procedure: the simulation budget is allocated equally among a certain subset of alternatives. We propose a different approach, where we have the flexibility to simulate any one alternative at any time. Although our prior is not exactly conjugate, we create an optimal approximation of conjugacy by minimizing the Kullback-Leibler divergence between the true posterior and the normal-Wishart distribution, lead- ing to a computationally efficient learning model. The approximate model enables us to derive a new VIP that generalizes previous procedures on R&S with known correlations. We establish intuitive analogies between the new model and classical statistical results on unknown sampling variance. We also show that, all else be- ing equal, information has greater value when the correlation structure is unknown, making it important to consider this uncertainty when allocating the next simulation experiment. 4.2 Learning Unknown Correlation Structures Let {1, 2, · · · , K} be a set of alternatives. Let Ŷ be a multivariate normal random vector in RK with mean µ = (µ1, · · · , µK) and covariance matrix Σ. Our goal is to discover the alternative x with the largest underlying mean µx. Assuming that Σ is invertible, we define R = Σ−1 to be the precision matrix of Ŷ. For ease of computation and presentation, we will work with the precision matrix instead of the covariance matrix throughout this chapter. 108 The vector Ŷ describes the behavior of K alternatives, all observed concur- rently. We suppose that both µ and R are unknown. Let yˆx represents the simula- tion output of the behavior of the xth alternative. The sampling distribution of yˆx given µ and R is univariate normal with the following probability density function p(yˆx|µ,R) ∝ 1 |e′xR−1ex| 1 2 exp { − (yˆx − µx)2 2e′xR−1ex } , (4.1) where ex = (0, . . . , 1, . . . , 0) is a K × 1 vector, with 1 at the xth component, and 0 at others. The prime denotes transpose. We allow the precision matrix R to be non-diagonal, implying correlations between components of Ŷ. As the rest of this section will show, when µ and R are both unknown, a set of beliefs about R will induce correlations between our beliefs about different components of µ, implying similarities and differences between alternatives. We can expect that a single observation yˆx should also provide some information about other alternatives that are correlated with x. However, the nature of this information is not clear as the correlation structure is unknown. 4.2.1 Learning from complete observations Taking the Bayesian viewpoint, we view the unknown mean vector µ and the precision matrix R as a random vector and a random matrix, respectively. In accordance with the Bayesian approach, we assume that our prior knowledge about these unknown quantities is reflected by a prior distribution, which we write as µ|R ∼ NK(θ0, q0R), R ∼ WK(b0,B0). 109 The precision matrix R is assumed to follow a Wishart distribution parametrized by a scalar b0 and a K × K matrix B0. The conditional distribution of µ given R is multivariate normal with mean vector θ0 and precision matrix q0R, where θ0 is a K vector and q0 is a scalar. The probability density function of the Wishart distribution, see e.g. [85], is given as p(R) = 1Z(b0,B0) |R| b0−K−1 2 exp { − 1 2 tr(B 0R) } , with a normalizing constant Z(b0,B0) = piK(K−1)4 ∣ ∣ ∣ ∣ B0 2 ∣ ∣ ∣ ∣ − b 0 2 K∏ i=1 Γ (b0 + 1− i 2 ) . Therefore the joint prior distribution of µ and R is p0(µ,R) = 1Z(b0,B0) |R| b0−K−1 2 exp { − 1 2 tr(B 0R) }( q0 2pi )K 2 |R| 1 2 exp { − q0 2 (µ− θ 0)′R(µ− θ0) } . The Wishart distribution has the property that E(R) = b0(B0)−1, whence E(Σ) = B0b0−K+1 . The matrix B0 can be viewed as a generalized “sum of squares”. If the prior parameters are constructed from historical data (known as a “first-stage sample” in [79]), the diagonal entries of B0 will be the sums of squared deviations of the first-stage observations from their means. The scalar b0 is analogous to the size of the first-stage sample, so that B0b0−K+1 is precisely the empirical covariance matrix constructed from the first-stage data. The parameter q0 is also analogous to a sample size; if first-stage sampling is used, R−1q0 will be the covariance matrix of the sample mean µ. If our distribution of belief at stage n is normal-Wishart, and our next ob- servation is the entire vector Ŷn+1 ∼ NK(µ,R), standard results from Bayesian 110 analysis [86] tell us that the posterior density pn+1(µ,R|Ŷn+1) = p(Ŷ n+1|µ,R)pn(µ,R) ∫∫ p(Ŷn+1|µ,R)pn(µ,R)dµdR is another normal-Wishart distribution with parameters qn+1 = qn + 1, (4.2) bn+1 = bn + 1, (4.3) θn+1 = q nθn + 1 qn + 1 , (4.4) Bn+1 = Bn + q n qn + 1(θ n − Ŷn+1)(θn − Ŷn+1)′. (4.5) This is known as the conjugacy property of the normal-Wishart distribution. Con- jugacy allows us to represent a distribution of belief with a finite, small number of parameters, which can be easily updated after each new observation. It is conve- nient to denote these parameters by the notational shorthand Sn = (qn, bn,θn,Bn), representing the state of our beliefs at time n. If we are able to observe the entire vector Ŷn+1 (also known as a “complete observation”), the decision-maker’s objective can be easily formulated as follows. Let (Ω,F ,P) be an appropriate probability space, and define a filtration Fn, where Fn is the σ-algebra generated by the first n observations Y1, ...,Yn. Then, θn = E(µ|Fn) intuitively represents our time-n beliefs about µ. We wish to find E(max x E(µx|FN)), (4.6) where N is the simulation budget. The maximum inside the outer expectation in (4.6) represents the decision-maker’s implementation decision: at time N , we 111 will select the alternative that appears to be the best. If complete observations are available, (4.6) is simply a statistical estimation problem with no elements of simulation optimization. However, we will now move to a setting where only a single component of Ŷn+1 can be observed at a time, giving rise to the problem of choosing the right component. 4.2.2 Learning model for scalar observations As is common in fully sequential ranking and selection, we now suppose that at stage n we sample from alternative x only, with an observation yˆn+1x ∼ N (µx, (e′xR−1ex) −1). The distribution of yˆn+1x is simply the marginal distribution of the xth component of Ŷn+1. Using Bayes’ rule, the joint posterior distribution of µ and R, given the observation yˆn+1x , can be written as pn+1(µ,R|yˆn+1x ) ∝ 1 Z(bn,Bn) |R| bn−K−1 2 exp { − 1 2 tr(B nR) }( qn 2pi )K 2 |R| 1 2 · exp { − qn 2 (µ− θ n)′R(µ− θn) } 1 (2piR−1) 1 2 xx exp { − (yˆn+1x − µx)2 2(R−1)xx } . (4.7) After decomposing the posterior distribution in (4.7) into the conditional posterior distribution of µ given R and the marginal posterior distribution of R, we ob- serve that the conditional distribution of µ given R is multivariate normal, but the marginal distribution of R is no longer a Wishart distribution. Computational difficulties arise from here. Equation (4.7) suggests that the conjugacy property of the normal-Wishart distribution is lost if we can no longer observe the entire vector Ŷn+1. Conjugacy of the prior distribution is necessary 112 in order to build a computationally tractable learning model and develop efficient sequential decision procedures that make sampling choices based on a small set of belief parameters. In what follows, we force conjugacy using the density projection technique. To be precise, by minimizing the Kullback-Leibler (KL) divergence, we find the best approximation for the posterior distribution in (4.7) from the normal- Wishart family. The posterior distribution is then replaced by its normal-Wishart approximation, and the decision-maker’s beliefs are assumed to be normal-Wishart after each successive observation. Let ξ(µ,R) be a distribution from the normal-Wishart family with parameters (q, b,θ,B) such that µ|R ∼ NK(θ, qR), R ∼ WK(b,B). Define DnKL(ξ‖pn+1) to be the Kullback-Leibler (KL) divergence between ξ(µ,R) and pn+1(µ,R|yˆn+1x ), which is given by DnKL(ξ‖pn+1) = Eξ ( log ξ(µ,R)pn+1(µ,R|yˆn+1x ) ) , (4.8) where Eξ[·] is the expectation with respect to ξ(µ,R). This quantity, bounded below by zero, is used to measure the “distance” between the distributions ξ and pn+1. Lower KL divergence suggests that there is more similarity between the two distributions. For simplicity of notation, we write DnKL(ξ‖p) instead of DnKL(ξ‖pn+1). We wish to find (qn+1, bn+1,θn+1,Bn+1) = arg min(q,b,θ,B)DnKL(ξ‖p), (4.9) the set of parameters that projects (according to KL divergence) the normal-Wishart 113 distribution onto the true posterior in (4.7). We first give a closed-form expression for (4.8), and then solve (4.9). We briefly sketch the proof of the solution, but readers are referred to the Appendix for the complete details. Proposition 4.1. The KL divergence DnKL(ξ‖p) defined in (4.8) can be expressed as DnKL(ξ‖p) = bn+1 − bn 2 ( − log ∣ ∣ ∣ ∣ Bn+1 2 ∣ ∣ ∣ ∣+ K∑ i=1 ψ (bn+1 − i+ 1 2 )) − bn+1K 2 + b n+1 2 tr ( Bn(Bn+1)−1 ) + log Z(b n,Bn) Z(bn+1,Bn+1) + 1 2 logB n+1 xx + 12 [ K log q n+1 qn +K qn qn+1 −K + q n(θn − θn+1)′bn+1(Bn+1)−1(θn − θn+1) ] − 1 2ψ (bn+1 −K + 1 2 ) + 12qn+1 + 1 2(yˆ n+1 x − θn+1x )2 bn+1 −K + 1 Bn+1xx + C, where ψ(x) = d ln Γ(x)/dx is the digamma function and C is some constant that does not depend on the parameters of ξ. Proof. Proof: First notice that the posterior distribution in (4.7) can be written as pn+1(µ,R|yˆn+1x ) = pn+1(µ|R)pn+1(R)p(yˆn+1x |µ,R) p(yˆn+1x ) . 114 Then the KL divergence is given as DnKL(ξ|p) = Eξ ( log ξ(µ,R)pn+1(µ,R|yˆn+1x ) ) = ∫∫ ξ(µ,R) log ξ(R)ξ(µ|R)p(yˆ n+1 x ) pn+1(µ|R)pn+1(R)p(yˆn+1x |µ,R) dµdR = ∫∫ ξ(µ,R) log ξ(R)pn+1(R)dµdR (4.10) + ∫∫ ξ(µ,R) log ξ(µ|R)pn+1(µ|R)dµdR (4.11) − ∫∫ ξ(µ,R) log p(yˆn+1x |µ,R)dµdR (4.12) + ∫∫ ξ(µ,R) log p(yˆn+1x )dµdR, (4.13) where (4.10) can be computed as ∫∫ ξ(µ,R) log ξ(R)pn+1(R)dµdR = ∫ ξ(R) log ξ(R)pn+1(R)dR = ∫ ξ(R) log { Z(bn,Bn) Z(bn+1,Bn+1) |R| bn+1−bn 2 exp { − 1 2 tr(B n+1R)− 12 tr(BR) }} dR = b n+1 − bn 2 Eξ (log |R|) + log Z(bn,Bn) Z(bn+1,Bn+1) − 1 2Eξ [ tr((Bn+1 + Bn)R) ] = b n+1 − bn 2 ( − log ∣ ∣ ∣ ∣ Bn+1 2 ∣ ∣ ∣ ∣+ K∑ i=1 ψ (bn+1 − i+ 1 2 )) + log Z(b n,Bn) Z(bn+1,Bn+1) + b n+1 2 tr ( Bn(Bn+1)−1 ) − bn+1 2 K, (4.14) 115 the term in (4.11) can be computed as ∫∫ ξ(µ,R) log ξ(µ|R)pn+1(µ|R)dµdR = ∫∫ ξ(µ,R) log { |qn+1R| 12 exp { −12(µ− θn+1)′qn+1R(µ− θn+1) } |qnR| 12 exp { −12(µ− θn)′qnR(µ− θn) } } = ∫∫ ξ(µ,R) [K 2 log qn+1 qn − 1 2(µ− θ n+1)qn+1R(µ− θn+1) + 12(µ− θ) ′qnR(µ− θ) ] dµdR = K2 log qn+1 qn + K 2 qn qn+1 − K 2 + qn 2 (θ n+1 − θn)′bn+1(Bn+1)−1(θn+1 − θn), (4.15) and the term in (4.12) can be computed as ∫∫ ξ(µ,R) log p(yˆn+1|µ,R)dµdR = ∫∫ ξ(µ,R) log [ (2piR−1)− 1 2 xx exp [ − 1 2(R −1)−1xx (yˆn+1x − µx)2 ]] dµdR =− 12Eξ ( log(R−1)xx ) − 1 2 log(2pi)− 1 2(yˆ n+1 x − θn+1x )2E ( (R−1)−1xx ) − 1 2qn+1 =− 12 ( logBn+1xx − log 2− ψ (bn+1 −K + 1 2 )) − 1 2 log(2pi)− 1 2qn+1 − 1 2(yˆ n+1 x − θn+1x )2 bn+1 −K + 1 Bn+1xx . (4.16) Notice that p(yˆn+1x ) is the marginal distribution of yˆn+1x , which is not a function of µ and R, so (4.13) doesn’t depend on the parameters of ξ. The proof then follows from equations (4.14), (4.15) and (4.16). Theorem 4.2. There exists a finite value ∆bn s.t. the solution to (4.9) can be 116 expressed as qn+1 = qn + 1K , (4.17) bn+1 = bn + ∆bn, (4.18) θn+1 = ( qnbn+1(Bn+1)−1 + b n+1 −K + 1 e′xBn+1ex exe′x )−1 · ( qnbn+1(Bn+1)−1θn + b n+1 −K + 1 e′xBn+1ex yˆn+1x ex ) , (4.19) Bn+1 = b n+1 bn B n + b n+1 bn + 1 ( qn(yˆn+1x − θnx)2 qnbn+1 bn+1−K+1 + 1 − Bnxx bn ) Bnexe′xB n (Bnxx)2 . (4.20) Proof. Proof: Taking derivatives of (4.8) with respect to the parameters and apply- ing matrix calculus, we obtain ∂DnKL(ξ‖p) ∂qn+1 = 1 2 { K qn+1 − qnK (qn+1)2 − 1 (qn+1)2 } , (4.21) ∂DnKL(ξ‖p) ∂bn+1 = 1 2 {(yˆn+1x − θnx)2 Bn+1xx (qn)2bn+1(K − 1) (qnbn+1 + bn+1 −K + 1)2 + bnK + 1 bn+1 −K +b n+1 − bn 2 K∑ i=1 ψ′ (bn+1 − i+ 1 2 ) − 1 2ψ ′ (bn+1 −K + 1 2 )} , (4.22) ∂DnKL(ξ‖p) ∂θn+1 = q nbn+1(Bn+1)−1(θn − θn+1) + b n+1 −K + 1 Bn+1xx (exe′x) (yˆn+1x ex − θn+1), (4.23) ∂DnKL(ξ‖p) ∂Bn+1 = − 1 2q nbn+1 ( Bn+1 )−1 (θn − θn+1)(θn − θn+1)′(Bn+1)−1 + b n 2 (B n+1)−1 − bn+1 2 (B n+1)−1Bn(Bn+1)−1 + ( 1 2Bn+1xx − bn+1 −K + 1 2 (yˆn+1x − θn+1x )2 (Bn+1xx )2 ) exe′x. (4.24) Setting (4.21) - (4.24) to zero and solving, we notice that (4.17) follows immediately from (4.21). However, (4.22)-(4.24) are more difficult to solve because each equation depends on multiple parameters. We denote the change in degrees of freedom by 117 ∆bn ≡ bn+1− bn. Then we can derive (4.19) and (4.20) as functions of ∆bn. Finally, we compute ∆bn itself (see Remark 4.4 for a discussion). Setting (4.23) to zero, we obtain ( qnbn+1(Bn+1)−1 + b n+1 −K + 1 Bn+1xx ) θn+1 = qnbn+1(Bn+1)−1θn+b n+1 −K + 1 Bn+1xx exe′xyˆn+1x ex, solving for θn+1 and it yields (4.19). Setting (4.24) to zero and multiplying Bn+1 from left and right, we obtain, Bn+1 = b n+1 bn B n + q nbn+1 bn (θ n+1 − θn)(θn+1 − θn)′ − 1 bn ( 1 Bn+1xx − (bn+1 −K + 1)(yˆn+1x − θn+1x )2 (Bn+1xx )2 ) Bn+1exe′xB n+1. Since (θn+1 − θn)(θn+1 − θn)′ = (yˆ n+1 x − θnx)2 (γn)2 Bn+1exe′xB n+1 (Bn+1xx )2 , and (yˆn+1x − θn+1x )2 = (yˆn+1x − θnx)2(γn − 1)2 (γn)2 , where γn = 1 + q nbn+1 bn+1 −K + 1 , it follows that Bn+1 = b n+1 bn B n + (qnbn+1(yˆn+1x − θnx)2 bnγn − Bn+1xx bn ) Bn+1exe′xB n+1 (Bn+1xx )2 = b n+1 bn B n + b n+1 bn ( qn(yˆn+1x − θnx)2 qnbn+1 bn+1−K+1 + 1 − Bn+1xx bn+1 ) Bn+1exe′xB n+1 (Bn+1xx )2 . (4.25) The matrix Bn+1 shows up in both sides of (4.25). We will show how to derive updating equations for all entries in the matrix Bn+1. 118 Consider Bn+1xx , it follows from (4.25) that bnBn+1xx = bn+1Bnxx + qnbn+1(yˆn+1x − θnx)2 γn −B n+1 xx , then Bn+1xx = bn+1 bn + 1B n xx + 1 bn + 1 qnbn+1(yˆn+1x − θnx)2 γn . It follows from symmetry of the matrix Bn that Bn+1ix = Bn+1xi . Consider Bn+1xi and Bn+1ix for i 6= x, bnBn+1xi = bn+1Bnxi + (qnbn+1(yˆn+1x − θnx)2 γn −B n+1 xx ) Bn+1xx Bn+1xi (Bn+1xx )2 , then it follows that Bn+1xi = bn+1 bn + 1B n xi + 1 bn + 1 qnbn+1(yˆn+1x − θnx)2 γn Bnxi Bnxx . (4.26) The following result from (4.26)is worth mentioning, Bn+1xi Bn+1xx = B n xi Bnxx . Consider Bn+1ij for i, j 6= x, bnBn+1ij = bn+1Bnij + (qnbn+1(yˆn+1x − θnx)2 γn −B n+1 xx ) Bn+1xi Bn+1xj (Bn+1xx )2, and it follows that Bn+1ij = bn+1 bn B n ij + 1 bn + 1 (qnbn+1(yˆn+1x − θnx)2 γn − bn+1 bn B n xx ) BnxiBnxj (Bnxx)2 . Using the Sherman-Morrison-Woodbury formula [?, see e.g.]]GoLo96, we can rewrite (4.19) without using inverse matrices as θn+1 = θn + yˆ n+1 x − θnx qnbn+1 bn+1−K+1Bnxx +Bnxx Bnex. (4.27) 119 The most crucial aspect of (4.27) is that a single scalar observation is now used to update the entire posterior mean vector through the matrix Bn. Similar behavior occurs in the Kalman filter-like update used by [77] in the case of known correlation structures. In that setting, the updating equation incorporates both the variance of the current belief and the known variance of the observations. However, when the correlation structure is unknown, the matrix Bn is used to estimate both types of variances. Equations (4.17)-(4.20) allow us to conveniently represent and update a joint distribution of belief about µ and R using a finite number of parameters, which can be compactly encoded in the belief state Sn. We can now connect the mechanism of approximate Bayesian inference back to a formal objective function. Recall from Section 4.2.1 that the sampling model is defined on a probability space (Ω,F ,P), where P is the law of the process Sn. Essentially, our approximate Bayesian learn- ing model replaces P by an alternate probability measure P¯ under which (µ,R) is normal-Wishart, given Fn with parameters obtained through KL minimization. We use the notation EP¯(·) for expectations under the probability measure P¯. Given a measurement budget of N , the experimenter chooses a measurement pol- icy, which is a function xpi mapping a belief state Sn to an alternative xpi(Sn) ∈ {1, . . . , K}. Under the probability space (Ω,F , P¯), the policy makes measurement decisions sequentially. As before, Fn is the σ-algebra generated by all the decisions made in the first n stages, as well as the observations collected. Let pi be a mea- surement policy. The notation Epi indicates that the expectation is taken when the measurement policy pi is applied. We also define Π as the set of measurement poli- 120 cies. The challenge is to choose a measurement policy pi for allocating the simulation budget one measurement at a time, and our objective can be written as sup pi∈Π EpiP¯ ( max x EpiP¯(µx|FN) ) . (4.28) As in (4.6), the maximum in (4.28) represents the decision-maker’s imple- mentation decision to select the alternative that seems to be the best at time N . However, unlike (4.6), equation (4.28) now contains the optimization problem of choosing a policy pi, i.e., a sequence of measurement decisions. We close our discussion of the learning model with several remarks on the in- terpretation of the model parameters. The approximate updating equations (4.17)- (4.20) are intuitive generalizations of the conjugate update in (4.2)-(4.5). For ex- ample, in (4.5), the squared error matrix (θn− Yˆn+1)(θn− Yˆn+1)′ is used to update Bn. In (4.20), the full matrix is not available, so the update uses the scalar squared error (θnx − yˆn+1x )2 to update all covariances between x and other alternatives. Remark 4.3. The parameter qn in the prior distribution is intended to be a reflec- tion of prior precision relative to the sample size that is tunable by the researcher or practitioner to reflect their prior confidence. Recall from (4.2) that, when we have complete observations, this parameter is always increased by 1. By analogy, if we only collect information about one out of K alternatives, qn is increased by 1/K. Remark 4.4. Although one might expect bn to behave in the same way as qn, this is not exactly the case. The parameter bn increases by 1 when we have complete ob- servations. However, when we sample from only one alternative, the increment ∆bn actually depends on (qn, bn, yˆn+1x , θnx , Bnxx). The quantity ∆bn does not have a closed- 121 form expression, but can easily be obtained numerically via a bisection procedure or Newton’s method on the interval [0, 1]. We have also observed in our numeri- cal experiments that the optimal values of ∆bn appear to be smaller than 1/K and approach 1/K asymptotically over time. Remark 4.5. Note that the computational complexity of the updates (4.17)-(4.20) is O(K2), identical to that of the conjugate updates for R&S with known correlations; see equations (2.22) and (2.23) in [5]. The number of iterations of the bisection method needed to compute ∆bn within a fixed, pre-specified tolerance level does not depend on K. However, the effort needed for a single iteration of the bisection method is O(K), since the terms in (4.22) have to be recomputed when different values of ∆bn are considered. 4.2.3 Predictive distribution of the next observation Given the prior distribution on the unknown parameters, the distribution of yˆn+1x represents the decision-maker’s beliefs about the next observation (assuming that alternative x will be measured). For this reason, it is known as the predictive distribution. In Section 4.3, we introduce a policy that uses the predictive distri- bution to look ahead to the outcome of a simulation decision. In preparation for this discussion, we now present the predictive distribution for the normal-Wishart model under the approximate Bayesian learning scheme of Section 4.2.1. That is, we assume that the decision-maker’s beliefs at time n are represented by a normal- Wishart distribution whose parameters are contained in the state Sn, and use this 122 assumption to characterize yˆn+1x . For completeness, we provide the definition of the multivariate t-distribution [87]. Definition 4.1. A p-dimensional random vector X = (X1, . . . , Xp) is said to have the p-variate t-distribution with ν degrees of freedom, mean vector µ, and correlation matrix V if its joint pdf is given by f(x) = Γ( ν+p 2 ) (piν)p/2|V|1/2Γ(ν2 ) [ 1 + 1ν (x− µ) ′V−1(x− µ) ](ν+p)/2 . The predictive distribution of yˆn+1x follows from the following two lemmas. Lemma 4.6. Suppose that (µ,R) follows a normal-Wishart distribution with pa- rameters (qn, bn,θn,Bn). Then the predictive distribution of a complete observation Ŷn+1 is a multivariate t-distribution with bn−K+1 degrees of freedom, mean vector θn and correlation matrix (qn + 1)Bn/qn(bn −K + 1). Proof. Proof: The predictive distribution of the vector Ŷn+1, p(Ŷn+1) = ∫∫ p(Ŷn+1,µ,R)dµdR, and p(Ŷn+1,µ,R) ∝ |R| b n−K+1 2 exp { − 1 2(Ŷ n+1 − µ)′R(Ŷn+1 − µ) } · exp { − qn 2 (µ− θ n)′R(µ− θn) } exp { − 1 2 tr(B nR) } . It can be verified that (Ŷn+1 − µ)′R(Ŷn+1 − µ) + qn(µ− θn)′R(µ− θn) = (qn + 1)(µ− θ¯n)′R(µ− θ¯n) + q n qn + 1(θ − Ŷ n+1)′R(θ − Ŷn+1), 123 with θ¯n = q nθn + Ŷn+1 qn + 1 . It follows that p(Ŷn+1,µ,R) ∝ |R| 12 exp { − qn + 1 2 (µ− θ¯ n)′R(µ− θ¯n) } |R| bn−K 2 exp { − 1 2 tr(B¯ nR) } , where B¯n = Bn + q n qn + 1(θ n − Ŷn+1)(θn − Ŷn+1)′. Integration with respect to µ and R yields p(Ŷn+1) ∝ ( 1 + q n qn + 1(Ŷ n+1 − θn)′(Bn)−1(Ŷn+1 − θn) )− b n+1 2 . This shows that Ŷn+1 has a multivariate t-distribution with degrees of freedom bn −K + 1, mean vector θn and correlation matrix qn+1qn(bn−K+1)Bn. Lemma 4.7. The predictive distribution of yˆn+1x is yˆn+1x ∼ t ( bn −K + 1, θnx , qn(bn −K + 1) (qn + 1)Bnxx ) , (4.29) which is a univariate Student’s t-distribution with bn − K + 1 degrees of freedom, mean θnx and variance q n+1 qn(bn−K−1)Bnxx. Proof. Proof: This result follows by combining Lemma 4.6 together with results in Section 1.10 from [87]. Using the predictive distribution found in (4.29), we can derive another ex- 124 pression for the updating equation of θn in (4.27). Define T n = yˆ n+1 x − θnx ( qn+1 qn(bn−K+1)Bnxx )1/2 , s˜n(qn, bn,Bn, x) = ( qn+1 qn(bn−K+1) )1/2 ( qnbn+1 bn+1−K+1 + 1 ) (Bnxx)1/2 Bnex. Then, (4.27) can be rewritten as θn+1 = θn + s˜n(qn, bn,Bn, x)T n, (4.30) where T n has a Student’s t-distribution with bn −K + 1 degrees of freedom, mean 0 and scale parameter 1. At time n, the vector θn+1 of future beliefs is unknown. However, we see from (4.30) that our uncertainty about this vector originates from a single scalar random variable. This is in line with previous work on ranking and selection with known correlation structures [77], where the scalar random variable is normally distributed. When the correlations are unknown, we use Student’s t-distribution, forming a precise analogy to classical frequentist statistics. 4.2.4 Information loss due to approximate Bayesian inference The KL divergence DnKL(ξ ‖ p) can be thought of as the incremental infor- mation loss incurred by forcing conjugacy after the (n + 1)st observation, under the assumption that our beliefs at time n are accurately represented by a normal- Wishart distribution. This section shows that, under the probability measure P¯, the incremental information loss converges to zero in probability as n → ∞. That 125 is, if conjugacy is maintained up to time n, the error due to a single application of approximate Bayesian inference at time n + 1 will become vanishingly small for large enough n. This result is intended to provide the intuition that, over time, the learning model with scalar observations bears greater resemblance to a conjugate learning model. As in Section 4.2.3, we assume that the decision-maker’s beliefs at time n are represented by the normal-Wishart distribution. We begin by showing in Proposi- tion 4.8 that the degrees of freedom parameter bn goes to infinity, eventually leading to the result that the incremental loss from one additional observation vanishes to zero. Proposition 4.8. If b0 is sufficiently large, then ∆bn ∈ (0, 1) P¯-a.s. and bn → ∞ as n→∞. Proof. Proof: Let ∆bn = 1 in (4.22) and from which we have ∂Dn(ξ|p) ∂bn+1 ∣ ∣ ∣ bn+1=bn+1 ≥ 1 2 (bnK + 1 bn + 1 −K ) + 14 K∑ i=1 ψ′ (bn − i+ 2 2 ) − 1 4ψ ′ (bn −K + 2 2 ) = − K − 12(bn + 1) + 1 4 K−1∑ i=1 ψ′ (bn − i+ 2 2 ) ≥ − K − 1 2(bn + 1) + 1 4 K−1∑ i=1 2 bn − i+ 2 > − K − 12(bn + 1) + K − 1 2(bn + 1) = 0. Let ∆bn = 0 in (4.22), we can show that for any  > 0, there exists sufficiently large bn such that the first term is less than . We also observe that 1 bn − 1 2ψ ′ (bn −K + 1 2 ) < 1bn − 1 bn −K + 1 < 0. 126 Since (4.22) is a continuous function of ∆bn on [0, 1], we know that ∆bn ∈ (0, 1), whence bn has a limit by the monotone convergence theorem. In the following, we will prove that bn goes to infinity by contradiction. Assume that there exists M <∞ such that bn converges to M . This suggests that ∆bn converges to zero. Taking the limit of (4.22) as n goes to infinity yields, lim n→∞ (yˆn+1x − θnx)2 Bn+1xx (qn)2bn+1(K − 1) (qnbn+1 + bn+1 −K + 1)2 = ψ ′ (M −K + 1 2 ) − 1 M (4.31) From Lemma 4.7, the predictive distribution of yˆn+1x is a Student’s t-distribution with bn −K + 1 degrees of freedom, mean θnx and variance q n+1 qn(bn−K+1Bnxx. It follows that (yˆn+1x − θnx) (Bnxx)1/2 = T n ( qn + 1 qn(bn −K + 1) )1/2 . (4.32) As qn →∞, lim n→∞ (qn)2bn+1(K − 1) (qnbn+1 + bn+1 −K + 1)2 = 1 M , whence (4.31) can be rewritten as lim n→∞ l(T n) = [ ψ′ (M −K + 1 2 ) − 1 M ] M2, (4.33) where l(T n) is a function of the random variable T n. Since bn → M , the random variable T n converges weakly to a Student’s t-distribution with M−K+1 degrees of freedom. That means that (4.33) cannot hold almost surely. Therefore, we conclude that the degrees of freedom bn goes to infinity as n→∞. The fact that the degrees of freedom parameter bn goes to infinity is a key to the other results in this section. We next show several preliminary results concerning the updating equation for Bn. 127 Proposition 4.9. Let Mnx = ( bn+1 bn + 1 )( qn(bn+1 −K + 1) qnbn+1 + bn+1 −K + 1 (yˆn+1x − θnx)2 (Bnxx)2 − 1 bnBnxx ) . Then, MnxBnxx converges to zero in P¯-probability. Proof. Proof: First note that MnxBnxx = ( bn+1 bn + 1 )( qn(bn+1 −K + 1) qnbn+1 + bn+1 −K + 1 (yˆn+1x − θnx)2 Bnxx − 1 bn ) , therefore showing that MnxBnxx converges to zero in probability is equivalent to show- ing that (yˆn+1x − θnx)2/Bnxx converges to zero in probability. For any  > 0, using (4.32) and Chebyshev’s inequality, we know that P¯ (∣ ∣ ∣ ∣ (yˆn+1x − θnx) (Bnxx)1/2 ∣ ∣ ∣ ∣ >  ) ≤ qn + 1 qn(bn −K + 1) 1 2 . Then we have lim n→∞ P¯ (∣ ∣ ∣ ∣ (yˆn+1x − θnx) (Bnxx)1/2 ∣ ∣ ∣ ∣ >  ) = 0, as required. If we view the updating equation (4.13) from the viewpoint of stochastic ap- proximation, then the quantity MnxBnxx can be considered as the stepsize. Since the stepsize converges to zero, this guarantees that the change in the matrix Bn will not be too large. We will provide two propositions related to the determinant and the trace of the matrix Bn. Instead of checking the matrix Bn componentwise, these two results provide the changes in the determinant and the trace analytically, which are useful for studying the asymptotic behavior of the matrix Bn. 128 Proposition 4.10. The determinant of Bn is updated recursively through det(Bn+1) = det(Bn) (bn+1 bn )K ( 1 + b n bn+1M n xBnxx ) . (4.34) Proof. Proof: First rewrite (4.20) as Bn+1 = Bn + Bn (∆bn bn IK +M n x exe ′ xB n ) It follows from the matrix determinant lemma [88] that det(Bn+1) = det(Bn) det ( IK + ∆bn bn IK +M n x exe ′ xB n ) = det(Bn) det (bn+1 bn IK +M n x exe ′ xB n ) = det(Bn) det (bn+1 bn IK )( 1 + b n bn+1M n x e ′ xB nex ) = det(Bn) (bn+1 bn )K ( 1 + b n bn+1M n xBnxx ) , as desired. Proposition 4.11. tr ( Bn(Bn+1)−1 ) = b nK + 1 bn+1 − qn 2 (θ n+1−θn)′(Bn+1)−1(θn+1−θn)−12(yˆ n+1 x −θn+1x )2 bn+1 −K + 1 Bn+1xx . Proof. Proof: Multiplying (4.24) by Bn+1 from the left yields bn 2 IK − bn+1 2 B n(Bn+1)−1 − q nbn+1 2 (θ n+1 − θ)(θn+1 − θn)′(Bn+1)−1 + ( 1 2Bn+1xx − bn+1 −K + 1 2(Bn+1xx )2 (yˆn+1x − θn+1x )2 ) Bn+1exe′x = 0 Taking trace on both sides, it gives bnK 2 − bn+1 2 tr ( Bn(Bn+1)−1 ) − qnbn+1 2 (θ n+1 − θn)′(Bn+1)−1(θn+1 − θn) ( 1 2Bn+1xx − bn+1 −K + 1 2(Bn+1xx )2 (yˆn+1x − θn+1x )2 ) Bn+1xx = 0 (4.35) Solving for tr (Bn(Bn+1)−1) from (4.35) completes the proof. 129 The next lemma finds the limit of a sequence of expressions involving the gamma and digamma functions. The limit will appear repeatedly in the proof of our main results later. Lemma 4.12. For any α, β, γ ∈ R, lim x→∞ log Γ(x+ α)Γ(x+ β) − (α− β)ψ(x+ γ) = 0. (4.36) Proof. Proof: Notice that log Γ(x+ α)Γ(x+ β) − (α− β)ψ(x+ γ) = log (Γ(x+ α) Γ(x+ β)e −(α−β)ψ(x+γ) ) , then to prove (4.36) is equivalent to prove lim x→∞ Γ(x+ α) Γ(x+ β)e −(α−β)ψ(x+γ) = 1. From [89], we have the asymptotic expansion Γ(x+ α) Γ(x+ β) = x α−β [ 1 + (α− β)(α + β − 1)2x +O(x −2) ] . (4.37) For x > 0, we have [90] log x− 1x < ψ(x) < log x− 1 2x. Without loss of generality, we assume that α > β. This gives (x+ γ)−(α−β)e α−β 2(x+γ) < e−(α−β)ψ(x+γ) < (x+ γ)−(α−β)e α−β x+γ . Therefore, by (4.37), we find that lim x→∞ Γ(x+ α) Γ(x+ β)e −(α−β)ψ(x+γ) = 1. 130 We now state the key theorem. As the number of measurements goes to infinity, the KL divergence converges to zero in probability. This suggests that the incremental information loss incurred by forcing conjugacy vanishes. Theorem 4.13. As n→∞, the KL divergence DnKL(ξn+1 ‖ pn+1) converges to zero in P¯-probability. Proof. Proof: The constant C omitted in Proposition (4.1) can be given explicitly as C = 12 log qn qn + 1 − 1 2 logB n xx + log ( Γ (bn −K + 2 2 )) − log ( Γ (bn −K + 1 2 )) − bn −K + 2 2 log ( 1 + q n (qn + 1)Bnxx (yˆn+1x − θx)2 ) , 131 whence the KL divergence DnKL(ξn+1 ‖ pn+1) can be expressed as DnKL(ξn+1 ‖ pn+1) = bn 2 log det(Bn+1) det(Bn) + bn+1 2 tr ( Bn(Bn+1)−1 ) − bn+1 2 K (4.38) + q nbn+1 2 (θ n+1 − θn)′(Bn+1)−1(θn+1 − θn) + 12(yˆ n+1 x − θn+1x )2 bn+1 −K + 1 Bn+1xx (4.39) − bn −K + 2 2 log ( 1 + q n (qn + 1)Bnxx (yˆn+1x − θnx)2 ) (4.40) + b n+1 − bn 2 K∑ i=1 ψ (bn+1 − i+ 1 2 ) + K∑ i=1 log Γ (bn − i+ 1 2 ) − K∑ i=1 log Γ (bn+1 − i+ 1 2 ) (4.41) − 1 2ψ (bn+1 −K + 1 2 ) + log Γ (bn −K + 2 2 ) − log Γ (bn −K + 1 2 ) (4.42) + 12K log (qn+1 qn ) + 12K qn qn+1 − 1 2K + 1 2qn+1 + 1 2 log qn qn + 1 (4.43) + 12 logB n+1 xx − 1 2 logB n xx. (4.44) Following Propositions 4.10 and 4.11, the terms in (4.38) and (4.39) can be simplified as bnK 2 log bn+1 bn + bn 2 log ( 1 + b n bn+1M n xBnxx ) + b nK + 1 2 − bn+1K 2 . As bn →∞, it is easy to show that lim n→∞ bnK 2 log bn+1 bn + bnK 2 − bn+1K 2 = 0. Also, we can show that lim n→∞ bn 2 log ( 1 + b n bn+1M n xBnxx ) − bn −K + 2 2 log ( 1 + q n (qn + 1)Bnxx (yˆn+1x − θnx)2 ) = −12 . 132 This suggests that the sum of (4.38), (4.39) and (4.40) approaches zero as n→∞. Using Lemma 4.12, we can show that both (4.41) and (4.42) go to zero as n→∞. It is easy to check that (4.43) approaches zero as n → ∞. It follows from (4.20) that Bn+1xx = bn+1 bn + 1B n xx + 1 bn + 1 qnbn+1(yˆn+1x − θnx)2 1 + qnbn+1bn+1−K+1 . Therefore, log B n+1 xx Bnxx = log ( bn+1 bn + 1 + 1 bn + 1 qnbn+1 1 + qnbn+1bn+1−K+1 (yˆn+1x − θnx)2 Bnxx ) , which is easily shown to converge to zero. This completes the proof. 4.3 The Value Of Information Value of information procedures allocate the simulation budget by evaluating the potential of new observations to improve the current estimate of the best value (see [31] for a survey). The information potential is defined in terms of the expected difference in the estimated objective value before and after the next observation occurs. We do not know exactly how an observation of alternative x will change our beliefs about the best alternative, but we can compute an expectation over the predictive distribution in (4.30). In this way, we can “look ahead” to the random outcome of the next observation, attempting to anticipate the results before we see them. If we sample from alternative x at time n and collect observation yˆn+1x , the value of information is defined as Vx(Sn) = En [ max i θn+1i | xn = x ] −max i θni , 133 where En is the conditional expectation taken with respect to the decision-maker’s distribution of belief at time n, and xn denotes the alternative measured at time n. Note that the predictive distribution of θn+1 depends on qn, bn and Bn only through the vector s˜n from (4.30). As a result, the expected value of information can be rewritten as V(θn, s˜(Sn, x),m) = En [ max i θni + s˜(Sn, xn)Tm | xn = x ] −max i θni , (4.45) where V is defined by V(a,b,m) = E[max i ai + biTm] −max i ai, a and b are K × 1 vectors. The random variable Tm follows a univariate Student’s t-distribution with degrees of freedom m, mean 0 and variance 1. Once again, (4.45) assumes that the decision-maker’s beliefs are represented by a normal-Wishart distribution at each time step. In practice, the normal-Wishart distribution is an approximation of the true posterior beliefs, updated using (4.17)- (4.20). By using this approximation to represent our uncertainty, we can solve (4.45) in closed form, leading to a computationally efficient procedure. We introduce a fully sequential policy called Projected Learning of Unknown Correlations with Knowledge Gradients (PLUCK). The PLUCK policy chooses an alternative by computing xPLUCK(Sn) = arg max x V(θn, s˜(Sn, x),m). (4.46) We now show how the value of information can be computed exactly under P¯. 134 4.3.1 Computation of the Value of Information To compute the expected value of information, we start by defining a function h : R 7→ {1, 2, . . . , K} as h(t) := max(argmax i ai + bit). The function h tells us which alternative is the best among {1, 2, · · · , K} in the sense of having largest value of ai + bit given Tm = t. The largest index is chosen if multiple alternatives tie. Instead of calculating V(a,b,m) directly, we notice that max i ai + biTm = ah(Tm) + bh(Tm)Tm, and rewrite ah(Tm) + bh(Tm)Tm as a telescoping sum, ah(0)+bh(0)Tm+   h(Tm)−1∑ i=h(0) (ai+1 − ai) + (bi+1 − bi)Tm  +   h(0)−1∑ i=h(Tm) (ai − ai+1) + (bi − bi+1)Tm   . Using standard techniques (see Section 5.3 of [5]), we can find a non-decreasing sequence {ci}Ki=0 such that h(z) = i if and only if z ∈ [ci−1, ci). It follows that V(a,b,m) can be written as V(a,b,m) = K−1∑ i=1 (bi+1 − bi)E[(Tm − |ci|)+]. To continue the computational procedure, we need an analytical form for the tail expectation of a univariate Student’s t-distribution. Denote the pdf and cdf of a standard Student’s t-distribution with m degrees of freedom as gm(·) and Gm(·), respectively. We can easily rewrite E [(Tm − |ci|)+] as E [ (Tm − |ci|)+ ] = E(Tm · 1{Tm>|ci|})− |ci|(1−Gm(|ci|)). 135 It also can be shown [30] that E(Tm · 1{Tm>|ci|}) = m+ c2i m− 1 gm(|ci|). With the analytical form for the tail expectation, the value of information can be expressed as: V(a,b,m) = K−1∑ i=1 (bi+1 − bi) (m+ c2i m− 1 gm(|ci|)− |ci|(1−Gm(|ci|)) ) . (4.47) We note that (4.46) has the same computational complexity as the analo- gous VIP for R&S with known correlation structures [5, 77]. The breakpoints ci can be computed in O(K logK) time. Repeating this for every alternative yields O(K2 logK). Just as in the learning model of Section 4.2.1, we can account for unknown correlations for the same computational cost. 4.3.2 Monotonicity of the Value of Information The value of information calculated in (4.47) depends on the degrees of free- dom m of the Student’s t-distribution. Lemma 4.7 shows that in the nth stage, the predictive distribution of the new observation yˆx follows a univariate Student’s t-distribution with degrees of freedom m = bn−K+1. The parameter bn is updated through (4.18) and increases as the PLUCK policy collects information. The rela- tionship between the value of information and the degrees of freedom is summarized in the next theorem. Theorem 4.14. For fixed K×1 vectors a and b, the value of information V(a,b,m) is a decreasing function in the degrees of freedom parameter m. 136 Proof. Proof: Let V(a,b,m) and V(a,b, n) be the values of information for two different values of the degrees of freedom parameter, with m ≥ n. Since a and b are fixed, it is sufficient to consider E(Tm − |ci|)+ and E(Tn − |ci|)+. Let gm(t) and gn(t) be the probability density function of Student’s t distribu- tions with m and n degrees of freedom, respectively. There exists c∗ > 0 such that gm(c∗) = gn(c∗) with gm(t) ≤ gn(t) on [c∗,∞) and gm(t) > gn(t) on [0, c∗). We will consider two cases: (i) If |ci| ≥ c∗, then gm(t) ≤ gn(t) for t ∈ [|ci|,∞). E(Tm − |ci|)+ = ∞∫ |ci| (t− |ci|)φm(t)dt ≤ ∞∫ |ci| (t− |ci|)φn(t)dt = E(Tn − |ci|)+. (ii) If |ci| < c∗, then gm(t) > gn(t) for t ∈ [0, |ci|). E(Tm − |ci|)+ − E(Tn − |ci|)+ = c∗∫ |ci| (t− |ci|)gm(t)dt+ ∞∫ c∗ (t− |ci|)gm(t)dt− c∗∫ |ci| (t− |ci|)gn(t)dt− ∞∫ c∗ (t− |ci|)gn(t)dt = c∗∫ |ci| (t− |ci|)gm(t)dt− c∗∫ |ci| (t− |ci|)gn(t)dt+ ∞∫ c∗ (t− |ci|)gm(t)dt− ∞∫ c∗ (t− |ci|)gn(t)dt = c∗∫ |ci| (t− |ci|)gm(t)dt− c∗∫ |ci| (t− |ci|)gn(t)dt+ c∗∫ 0 (t− |ci|)gm(t)dt− c∗∫ 0 (t− |ci|)gn(t)dt = |ci|∫ 0 (t− |ci|)gn(t)dt− |ci|∫ 0 (t− |ci|)gm(t)dt < 0 It follows from (i) and (ii) that for any |ci| > 0, V(a,b,m) = K−1∑ i=1 (bi+1− bi)E(Tm− |ci|)+ ≤ K−1∑ i=1 (bi+1− bi)E(Tn− |ci|)+ = V(a,b, n). Therefore the value of information decreases in the number of degrees of freedom. 137 The theorem suggests that the value of information decreases as the degrees of freedom increase, with all else being equal. In other words, the same information, under the same estimated means and covariances, is less valuable when we have already accumulated many other observations. Theorem 4.14 leads to an interesting comparison with earlier work on R&S with known correlations. It is a well-known result that Tm converges weakly to a standard normal random variable as the degrees of freedom m goes to infinity. Recall that, when the correlation structure is known, we calculate a version of (4.45) using a standard normal random variable; see (5.16) in [5]. Theorem 4.14 suggests that, given the same estimated means and conditional covariances, the value of information is inherently higher when the correlations are unknown. That is, a single measurement provides more information when we are learning both means and covariances. 4.4 Numerical Experiments We present experimental results demonstrating the value added by learning unknown correlations using PLUCK. Throughout this section, we considered six policies: the PLUCK policy, the correlated KG (CKG) policy in [77], a sequential modified version of proportional to variance (PTV) policy, a sequential OCBA policy designed for opportunity cost in [91], a Greedy policy and the LL policy with linear loss in [79]. We briefly explain the distinctions between the remaining policies below. Both PLUCK and CKG are designed to sample sequentially, with CKG assum- 138 ing a known covariance structure and using a conjugate Bayesian learning model. This comparison allows us to see the value added by incorporating unknown corre- lations into our decision-making. The PTV and greedy policy are also sequential, and make simulation decisions at time n in the following ways: PTV policy chooses the alternative with the highest variance; the greedy policy chooses the alternative arg maxx θnx . For both of these methods, we use our approximate Bayesian learning model in (4.17)-(4.20) to update our beliefs about the alternatives. This compari- son allows us to see the value added by using the PLUCK policy to make decisions, in addition to the value of learning unknown correlations. Lastly, the LL policy of [79] first screens out a subset of alternatives, then allocates the simulation budget equally among the rest. This structure allows a conjugate normal-Wishart prior to be used, with the drawback that the policy often samples alternatives that do not provide a lot of useful information. The LL policy can also be extended to allow multiple screening stages; however, this approach works best with large sampling budgets. In our experiments, we consider problems where the simulation budget is comparable to the number of alternatives, making it difficult to run LL for more than two stages. 4.4.1 Wind Farm Placement Example Our study is based on the wind farm placement problem mentioned in the introduction. For the purpose of these experiments, we use the wind speed at a location as a stand-in for power output. In practice, wind speed data are collected 139 at each location simultaneously, rather than sequentially. However, this also allows us to use the data to demonstrate the effectiveness of our policy by comparing its performance to how well we could have done. Practical applications of sequential simulation in wind farm placement use complex physics-based models incorporating factors other than wind speed, thus necessitating the use of sequential simulation. For our purposes, the public availability of wind-speed data allows us to create a realistic test setting for the PLUCK algorithm. We used hourly wind speed data [92] across the United States with latitude and longitude resolution of 0.125 degrees. The data consist of two components: the zonal component u (in the west-east direction) and the meridional component v (in the north-south direction). Assuming for the purpose of this example that all the wind turbines can be placed in the right direction, we focus on the magnitude of the wind speed, which is defined as √ u2 + v2. The objective is to select the location with the highest wind speed over a set of 64 locations. We considered data from four regions across the United States: Kansas, Wash- ington, Iowa and Oklahoma. All regions have had a high percentage of wind power generation, or a large amount of wind capacity installed, in recent years. For each of the four regions, we selected 64 different locations sitting on an 8×8 grid (the areas of these grids range from 3500 to 4500 square miles) within the region as alternatives for wind farm placement. We used 1800 days of data to estimate the mean and covariance matrix of a multivariate normal distribution. These fitted parameters were taken to repre- sent the “true” underlying sampling distribution. In our experiments, individual 140 Policies Experiment Performance Measure PLUCK CKG Greedy OCBA PTV LL Kansas Opportunity Cost 0.0314 0.1270 0.3073 0.2760 0.0537 0.2848 Standard Errors 0.0026 0.0068 0.0045 0.0040 0.0036 0.0047 Washington Opportunity Cost 0.0640 0.0917 0.2651 0.2021 0.1125 0.1647 Standard Errors 0.0051 0.0050 0.0018 0.0053 0.0060 0.0058 Iowa Opportunity Cost 0.0512 0.0892 0.1917 0.1526 0.1229 0.1335 Standard Errors 0.0037 0.0040 0.0046 0.0036 0.0034 0.0040 Oklahoma Opportunity Cost 0.0509 0.0911 0.2413 0.1401 0.1816 0.2031 Standard Errors 0.0040 0.0050 0.0013 0.0017 0.0039 0.0042 Table 4.1: Final opportunity cost and standard errors for the experiments observations were generated by simulating from normal distributions with the true parameter values. However, the policies were not allowed to see these true values when making decisions. It is critical to collect information efficiently when the decision-maker’s prior beliefs are inaccurate or misleading. To show that the PLUCK policy is particularly effective in such a situation, we used a small number of data points to create a prior for which the location that appeared to be the best was quite different from the true best location. We discuss this issue further below; for now, we note that each policy was given a budget of N = 200 measurements to correct this initial error. Table 4.1 gives the final opportunity cost, which is defined as Cpi = max x µx − µ(arg maxx θNx ) after N measurements for each policy pi, averaged over 500 sample paths. Lower opportunity cost suggests that a policy selects an alternative closer to the best. The 141 0 20 40 60 80 100 120 140 160 180 2000 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy (a) 64 locations from Kansas 0 20 40 60 80 100 120 140 160 180 2000 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy (b) 64 locations from Washington 0 20 40 60 80 100 120 140 160 180 2000 0.05 0.1 0.15 0.2 0.25 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy (c) 64 locations from Iowa 0 20 40 60 80 100 120 140 160 180 2000 0.05 0.1 0.15 0.2 0.25 0.3 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicySOCBA PolicyPTV Policy (d) 64 locations from Oklahoma Figure 4.1: Averaged opportunity cost as the number of samples increases, where the dashed lines are the mean plus or minus two standard errors PLUCK policy outperforms all other competing policies by a statistically significant amount based on Table 4.1, while the CKG policy has smaller final opportunity cost than the other policies in three experiments. The PTV policy performs better than the sequential OCBA policy and the greedy policy. The LL policy performs poorly in all four experiments, possibly because the simulation budget is quite small relative to the number of alternatives. Figure 4.1 shows how this performance measure changes as the number of sam- 142 (a) Expected Improvement 0 20 40 60 80 100 120 140 160 180 2000.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 Number of Samples Max imum Erro r Test4: 64 locations from Washington CKG PolicyPLUCK PolicyGreedy PolicyPTV Policy (b) Maximum Error Figure 4.2: Value of information and maximum error as the number of samples increases ples N varies from 1 to 100. The bands indicate the mean performance measures plus or minus two standard errors. The LL policy is omitted from these figures because it allocates simulations in batch rather than sequentially, by dividing them uniformly across any alternatives that were not screened out. The opportunity cost for PLUCK tends to decrease over time. For CKG, sometimes there is a degra- dation in performance at the beginning. We conjecture that this behavior arises because CKG assumes a known covariance structure. If the prior beliefs about the correlations are inaccurate, this misdirects the way in which CKG incorporates new information into the posterior. A small amount of information can thus make CKG produce even worse performance than what could be obtained with just the prior. The sequential OCBA policy and the greedy policy tend to work poorly on all cases, and performance of the PTV policy differs dramatically among cases. We also considered a different set of experiments in which results were averaged 143 across multiple priors constructed from a small sample of wind speed data. Overall, we found that PLUCK still outperformed the competition, with the caveat that all policies were more heavily affected by the initial degradation in performance (the early iterations needed to get a handle on the true correlation structure). We make two interesting observations from the experimental results. Fig- ure 4.2(a) shows the logarithm of the value of information as computed by both PLUCK and CKG (for a particular experiment), while Figure 4.2(b) shows the max- imum absolute difference between the posterior and true means for various policies. Figure 4.2(a) shows that the value of information is much higher when we consider unknown correlations, as suggested by Theorem 4.14. Figure 4.2(b) shows that the True Mean 2 4 6 8 2 4 6 8 Prior Mean 2 4 6 8 2 4 6 8 • 39 • 107 • 52 • 2 Post. Mean Using PLUCK 2 4 6 8 2 4 6 8 • 43 • 10 • 45 • 4 • 40 • 14 • 44 Post. Mean Using CKG 2 4 6 8 2 4 6 8 • 49 • 151 Post. Mean Using PTV 2 4 6 8 2 4 6 8 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 • 7 Post. Mean Using LL 2 4 6 8 2 4 6 8 Figure 4.3: Contour map of different policies after 200 measurements 144 Policies Experiment Performance Measure PLUCK CKG Greedy OCBA PTV LL Queue (correlated) Opportunity Cost 0.3181 0.4677 0.6979 1.0521 0.4862 2.1105 Standard Errors 0.0380 0.0423 0.0631 0.0645 0.0451 0.034 Network Opportunity Cost 0.1938 0.2902 0.3339 0.3439 0.3420 0.2374 Standard Errors 0.0024 0.0063 0.0047 0.0055 0.0066 0.0045 Table 4.2: Final opportunity cost and standard errors for the queue selection and network selection problems PLUCK policy does a better overall job of estimating the true values. Figure 4.3 gives contour maps of the true means, prior means and posterior means after 100 measurements with four different policies. The sequential OCBA policy and the greedy policy are omitted due to its poor performance. The num- ber of times that each alternative is measured is also shown on the contour maps (zeros are omitted). Red colors indicate higher values. We can see that the true best alternative is in the upper-left corner, whereas the prior misdirects us toward bottom-left. After 200 measurements, the PLUCK policy captures the general trend of the true values, whereas CKG and PTV are still stuck on beliefs that resemble the prior. Observe that both PLUCK and CKG measure the true best alternative in the upper-left corner almost equally often. However, the statistical model used by PLUCK provides more accurate posterior beliefs, leading to a better implemen- tation decision. The LL policy performs poorly and the identified best is far away from the true best. Also, its batch structure allocates many samples to alternatives that do not provide a lot of useful information. 145 4.4.2 A Single-Server Queue Selection Problem In simulation, correlations may arise due to common random numbers. How- ever, it is important to keep in mind that correlated beliefs reflect inherent similar- ities or differences between alternatives, even when the actual simulation output is completely independent. The following example demonstrates that correlated beliefs can enhance performance even when no correlations are present in the simulation output. Consider 20 first-come, first-served M/G/1 queues. The interarrival times follow an exponential distribution with λ = 0.05 and the service times follow Pareto distributions with mean service rates 23(0.1 + 0.05(i − 1)), i = 1, 2, ..., 20. Suppose that the administrator of these queues wishes to reduce costs by closing the worst server, i.e., the one with the largest expected waiting time. System 1 is the worst, having the smallest service rate. However, this is unknown to the administrator. Observe that, due to the structure of the problem, the performance of queues i and j will exhibit greater similarity if |i − j| is smaller. Thus, even though these queues function independently, our beliefs about their performance can be corre- lated. Of course, we do not know the problem structure, but we can use an empirical covariance matrix computed from a small sample of observations to initialize our prior distribution, and use PLUCK to improve on this prior. Table 4.2 gives the final opportunity cost. Figure 4.4(a) shows the performance of PLUCK over time when the prior matrix parameter B0 is diagonal, while 4.4(b) shows performance with B0 computed using small-sample empirical covariances. We see that, although 146 0 10 20 30 40 50 60 70 80 90 1000 0.5 1 1.5 2 2.5 3 3.5 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicyPTV PolicySOCBA Policy (a) With independent belief 0 10 20 30 40 50 60 70 80 90 1000 0.5 1 1.5 2 2.5 3 3.5 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicyPTV PolicySOCBA Policy (b) With correlated belief Figure 4.4: Comparing averaged opportunity cost in M/G/1 queue selection problem the queues function independently, PLUCK can leverage correlated beliefs to learn much more quickly than the other policies. As before, a small sample of 10 replications was used to create a prior for the covariance matrix.We then compared PLUCK, CKG, PTV, LL, the sequen- tial OCBA policy and the greedy policy by running 1000 macroreplications. The Pollaczek-Khinchin formula can be applied to compute the true expected waiting time. Figure 4.4(b) shows that PLUCK outperforms the competing policies, espe- cially in early stages. The PTV policy and CKG policy are indistinguishable most of the time. The greedy policy and the sequential OCBA policy work poorly in this experiment. The performance of the CKG policy and PTV policy are behind the PLUCK policy initially, but they eventually catch up and lag behind PLUCK slightly. In summary, this experiment suggests that we are learning the similarities- between alternatives, enabling us to discover the optimal solution more quickly even when the actual simulation outputs are independent. 147 4.4.3 3-Station Jackson Network Consider a classical 3-station open Jackson network shown in Figure 4.5(a), where the interarrival times and service times follow exponential distributions. Let λ be the total external arrival rate to the system, and let µj represent the service rate at station j. Upon completing service at station i, a job leaves the network with probability pi0 or is routed to station j with probability pij. The goal of the administrator is to minimize the average time spent by cus- tomers in the system subject to a constraint on the overall service rate. Suppose that all the available agents can achieve an overall service rate of 3 for stations 2 and 3. Consider 10 different assignments where the service rate at station 2 is 1 + 0.1i, i = 1, 2, · · · , 10. The performance of different assignments will exhibit correlation due to similarities in the service rates. We chose λ = 0.5 and the routing probability matrix P = [pij] as P =         pi1 pi2 pi3 pi0 0 0.7 0.3 0 0.3 0 0 0.7 0.2 0 0 0.8         Again, a small sample of 10 replications was used to created priors for the mean and covariance matrix. We compared PLUCK, CKG, PTV, LL, the sequential OCBA policy and the greedy policy, where each policy is given a sampling budget of 50. The true expected times in the system for different assignments are computed analytically. The final opportunity costs averaged over 500 sample paths are shown 148 in Figure 4.5(b), and Table 4.2 gives the final opportunity cost. The PLUCK policy again outperforms all the other policies. (a) A three-station Jackson network 0 5 10 15 20 25 30 35 40 45 500.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Number of Samples Oppo rtuni ty Co st CKG PolicyPLUCK PolicyGreedy PolicyPTV PolicyOCBA Policy (b) Averaged opportunity cost Figure 4.5: Numerical experiment on a three-station Jackson network 4.5 Conclusion We have presented the first computationally tractable statistical learning model for fully sequential ranking and selection with unknown correlation structures. The model uses approximate Bayesian inference to represent and update our beliefs about unknown performance means and unknown covariances using the normal-Wishart distribution. We have also derived a value of information procedure that antici- pates new information about both the true values and the true correlations when allocating simulations. Previous work in this area has required known correlation structures, an assumption that is likely to be violated in many applications. We re- lax this assumption, but retain the ability to learn about multiple alternatives from a single observation, for the same computational cost as the known-covariance case. 149 We believe that our work offers a useful way to tackle large learning problems with difficult correlation structures, and opens up new applications for Bayesian optimal learning. 150 Chapter 5: Bayesian Learning on Logistic Demand Curves 5.1 Introduction The problem of business-to-business (B2B) pricing arises in high-volume com- mercial transactions between businesses. For example, consider the problem faced by a supplier of raw materials negotiating a long-term contract with a large manu- facturing concern. After a period of negotiation, the seller quotes a price, which can be accepted or rejected. If the pricing offer is rejected, the seller loses a substantial amount of revenue, but it may not be clear exactly how much lower the offer should have been. If the offer is accepted, the seller makes a profit, but is left wondering whether a somewhat higher offer would still have been accepted. The seller’s goal is to maximize total revenue from a sequence of contracts, in the face of uncertainty about buyer behavior. Dynamic pricing in general is subject to uncertainty. Classic models in revenue management often assume stochastic demand for a product [93, 94], or uncertain customer valuations of it [95]. Recent work, however, has considered the additional dimension that the uncertainty may be environmental, that is, the seller does not even know the distribution from which customer valuations are drawn. In practice, this distribution must be estimated, and the estimate must be adjusted over time 151 as new transactions are observed. This gives rise to the problem of “learning and earning,” in which the seller does not always prefer the decision that appears to be optimal based on the current demand model (referred to as the “myopic” decision), but rather may engage in more exploratory or experimental behavior. For example, an online retailer may increase or decrease some prices for a period of time, simply to observe the effect on sales. Although this behavior may result in lost revenue, it provides new information that produces a more accurate demand model, enabling better pricing decisions in the future. The literature has used Bayesian statistics to model environmental uncer- tainty [33, 34], and different pricing strategies have been proposed to optimize the balance between revenue and information. For example, [35] proposes a one-step look-ahead strategy for problems with logistic revenue curves, while [36] presents an approach based on multi-armed bandit theory. A recent stream of work, represented by [37], [38], [39], and [40], has focused on establishing long-run convergence rates for policies that are mostly myopic, with occasional periods of exploration spaced in- creasingly further apart. However, in the specific context of B2B pricing, individual transactions typically have high volume (for example, the seller may be negotiating the price of a year’s supply of raw materials) and incur high costs (e.g. the time and money spent during negotiations), making it important to obtain good performance quickly. We consider an application where information arrives in the form of binary win/loss observations, representing customers’ yes/no responses to the seller’s pric- ing offers (or “bids”). A common demand model in this setting (used e.g. by [35]) 152 assumes that these binary outcomes follow a logistic distribution, which also allows us to relate the win probability to a set of regression features representing additional information about the product or customer type. Although this is a fairly natural choice of demand model (essentially just an instance of logistic regression), it is quite challenging to connect to the Bayesian way of representing new information and using it to update the seller’s beliefs. While, for linear regression, the multi- variate normal distribution offers an intuitive and easy-to-use conjugate prior [96], no such model is available for logistic regression, making it difficult to represent a belief over a continuous space of logistic curves. We approach this problem with approximate Bayesian inference, using the technique of density projection to create a multivariate normal posterior distribution that is “approximately conjugate,” in the sense of minimizing the Kullback-Leibler divergence from the actual posterior. See e.g. [97] for an application of this technique to the problem of learning unknown correlation structures in ranking and selection. In the context of logistic regression, our approach is similar to the variational ap- proximation by [98], but involves an additional optimization step using infinitesimal perturbation analysis (see e.g. [11] or [13]) to further improve the quality of the approximation. Using this statistical technique to efficiently update a multivariate normal prior on the parameters of the logistic demand curve, we then apply a policy that optimizes a myopic estimate of the expected revenue curve (see e.g. Ch. 11 of [5]). Our numerical experiments provide evidence in favor of both the approxi- mate Bayesian learning model and the Bayes-greedy pricing policy. Although our Bayesian model has numerous applications outside pricing, in this particular context 153 it enables the seller to compactly model a set of beliefs about win probabilities for a wide range of customer and product segments, and then quickly update this belief in real time. 5.2 Problem Formulation and Learning Model Section 5.2.1 introduces the demand and revenue curves optimized by a seller in the B2B pricing problem. In Section 5.2.2, we discuss the challenge of developing a Bayesian model for learning the parameters of the demand curve. Then, Sections 5.2.3 and 5.2.4 outline our proposed approach for overcoming this challenge. 5.2.1 Problem Formulation Consider a seller who must quote prices for a sequence of corporate clients. The (n+ 1)st client will accept a price offer pn ≥ 0 with probability ρ, which may also depend on additional properties of the client or product. The function ρ is called the demand curve, and is not known exactly to the seller. However, the seller does observe the client’s response, modeled as a binary variable Y n+1, where Y n+1 = 1 with probability ρ, representing a sale (or “win”), and Y n+1 = 0 represents a “loss.” The seller’s expected revenue from the client is R(pn) = pnρ, pn ≥ 0, (5.1) where the demand curve ρ usually depends on the price pn. In most applications, we need to consider the marginal cost c for the product, and work with the expected 154 profit Π(pn) = (pn − c)ρ, pn ≥ c. (5.2) We assume that Y n follows a logistic distribution, allowing us to write the demand curve as ρ(xn) = P(Y n+1 = 1) = 11 + e−µᵀxn , (5.3) where xn is a vector of features, observed by the seller, providing relevant informa- tion for the (n+ 1)st pricing decision. In the simplest possible model, the customers are assumed to be homogeneous, xn = [1, pn]ᵀ, and the parameter vector µ consists only of an intercept and a slope term. We use this simple model in our examples throughout this chapter. However, our analysis is readily applicable to the general case, where xn may also contain information about the product (type or volume) and the client (region, industry, history with the seller). In all of these cases, the parameter vector µ is unknown to the seller and must be inferred using a combination of prior knowledge and incoming win/loss results. The shape of the demand curve is extremely sensitive to the parameter values, making it important to obtain accurate estimates of the parameters as quickly as possible. We now propose a Bayesian framework for learning the demand curve. 5.2.2 Bayesian Model For Dynamic Pricing We adopt the Bayesian view, and represent our uncertainty about the vector µ using a multivariate normal prior distribution, that is, µ ∼ N (θ,Σ). (5.4) 155 The multivariate normal distribution offers a compact and powerful way to model correlations between our beliefs about different components of µ. Because the ob- servation Y n+1 provides information about an entire vector xn, our beliefs about different components of this vector should become correlated due to their depen- dence on the same observations. A second convenience of the multivariate normal distribution (important for computational purposes) is that the linear combination µᵀxn follows a univariate normal distribution. In linear regression, where a continuous response variable is related to a linear combination of features, the multivariate normal prior possesses the property of conjugacy. That is, if the residual errors are i.i.d. normal, the posterior distribution of the regression parameters, conditional on a sequence of observations, will remain normal [96]. This model makes the learning process highly efficient computationally, as one only needs to recursively update the mean vector and covariance matrix of the belief distribution after each observation. Unfortunately, in logistic regression, there is no known prior distribution that is conjugate with logistic observations. To see this, we first assume that µ ∼ N (θn,Σn), and write the likelihood function of Y n+1 as P (Y n+1) = g(Hn+1(µ)), (5.5) where g(z) = (1 + e−z)−1 and Hn+1(µ) = (2Y n+1− 1)(µᵀxn). Equation (5.5) allows us to represent the win/loss probability in a concise form. Applying Bayes’ rule, the posterior distribution, given the bidding price pn and the observation Y n+1, can be 156 written as p(µ|pn, Y n+1) ∝ g(Hn+1(µ))|Σn|−1/2 exp { − 1 2(µ− θ n)ᵀ(Σn)−1(µ− θn) } , (5.6) which is clearly non-normal. We would like to retain the multivariate normal prior due to its power in modeling correlated beliefs. However, we are now required to use the techniques of approximate Bayesian inference to develop a multivariate normal posterior that is “approximately conjugate.” Several such approaches have been proposed, includ- ing approximation methods based on Laplace approximation [99] and variational bounds [98]. We take a variational Bayesian approach to approximate the poste- rior distribution by minimizing the Kullback-Leibler divergence between the true posterior distribution and a multivariate normal distribution. 5.2.3 Variational Bayesian Approximation Suppose that, after observing n responses, our beliefs about µ are multivari- ate normal with parameters (θn,Σn). Let P (µ|pn, Y n,θn,Σn) be the likelihood function of this distribution. The variational Bayesian approach approximates the posterior distribution of µ, given Y n+1, with a normal distribution Q(µ|θn+1,Σn+1) by minimizing the Kullback-Leibler (KL) divergence. The KL divergence between P (µ|pn, Y n+1,θn,Σn) and Q(µ|θn+1,Σn+1) is defined as D(Q q P ) := EQ ( log Q(µ|θ n+1,Σn+1) P (µ|pn, Y n+1,θn,Σn) ) , (5.7) where the expectation is taken with respect to Q. This definition can be partially simplified, as stated in the following result. 157 Proposition 5.1. The KL divergence can be written as D(Q q P ) = EQ [ log ( 1 + e−Hn+1(µ) )] + h(θn,Σn,θn+1,Σn+1), (5.8) with the second component specified as h(θn,Σn,θn+1,Σn+1) = 12 [ tr ( (Σn)−1Σn+1 ) + (θn − θn+1)ᵀ(Σn)−1(θn − θn+1)− k − ln |Σ n+1| |Σn| + C ] , where C is a constant that does not depend on θn+1 and Σn+1. To minimize the KL divergence, the first step is to take the gradient of D(Q q P ) with respect to its parameter θn+1 and Σn+1. Unfortunately, a closed-form expression for the gradient is not available, because the expectation in equation (5.8) is intractable. However, if our goal is to minimize an expected value, a connection to gradient-based stochastic search [100] comes naturally to mind. The work by [101] uses such an approach, where a likelihood ratio estimate [102] of the gradient is constructed. However, this approach leads to a noisy simulation optimization problem, whose dimensionality is quadratic in the number of features, presenting substantial computational difficulties. Instead of optimizing with respect to (θn+1,Σn+1), we utilize a dimension reduction technique and propose the following form for θn+1 and Σn+1: θn+1 = Σn+1 ( (Σn)−1 θn + ( Y n+1 − 12 ) xn ) (5.9) Σn+1 = ( (Σn)−1 + λxn (xn)ᵀ )−1 (5.10) 158 Applying the Sherman-Morrison formula to (5.9) and (5.10), we obtain θn+1 = θn + Y n+1−1/2 λ − (xn)ᵀθn 1 λ + (xn)ᵀΣnxn Σnxn, (5.11) Σn+1 = Σn − Σ nxn(xn)ᵀΣn 1 λ + (xn)ᵀΣnxn . (5.12) In this form, there is only one parameter λ to be determined. We minimize the KL divergence with respect to λ to find the optimal multivariate normal posterior distribution from the parametrized family in (5.11)-(5.12). Aside from the compu- tational convenience of reducing the size of the problem, we choose precisely this form because it resembles the Kalman-filter-like equations used for Bayesian linear regression; the parameter λ is analogous to the precision of the residuals, while Y n+1−1/2 λ stands in for the continuous observation. In this way, our learning model for logistic regression makes an intuitive connection to the well-understood linear setting. Moreover, previous work on logistic regression, including [98] and [99], has derived updating rules with very similar form, based on different approximation techniques for the posterior likelihood function. For additional convenience, we apply the transformation v = 1λ and find v∗ = argmin v D(Q q P ). (5.13) The parameter v is analogous to the variance of the residuals in a linear regression model. Since no such explicit parameter is given in logistic regression, we simply find the value that produces the most accurate approximation. 159 5.2.4 Minimizing the Kullback-Leibler Divergence We now propose a stochastic approximation method to solve the minimization problem in (5.13), which requires estimations of the gradient of D(Q q P ) with respect to the single parameter v. This results in ∇vD(Q q P ) = ∇vEQ [ log ( 1 + e−Hn+1(µ) )] +∇vh(θn,Σn,θn+1,Σn+1). (5.14) Since we do not have a close-form expression for ∇vEQ [ log ( 1 + e−Hn+1(µ) )] , we propose to use infinitesimal perturbation analysis (IPA) to obtain noisy samples of the gradient (see e.g. [100] or [13] for an introduction). First, we transform the expectation in (5.14) into an integration with respect to a standard univariate normal distribution, E[f¯(Z)], where Z ∼ N (0, 1) and f¯ (z) = log ( 1 + exp { −(2Y n+1 − 1) [( xn(pn(µ))ᵀΣn+1xn(pn) )1/2 z + (θn+1)ᵀxn(pn) ]}) . The next result shows that the conditions for IPA [103] hold. Proposition 5.2. ∇vE [ f¯ (Z) ] = E [ ∇vf¯ (Z) ] . The IPA estimator itself is given as E [ ∇vf¯(Z) ] ≈ 1 N N∑ i=1 ∇vf¯(Z(i)), where Z(i) are independent samples from a standard normal distribution. We denote the gradient estimator by ∇̂vEQ [ log ( 1 + e−Hn+1(µ) )] and plug it into (5.14) for ∇vEQ [ log ( 1 + e−Hn+1(µ) )] . This produces an estimator of ∇̂vD(Q q P ), and we can apply the Robbins-Monro stochastic approximation algorithm vn+1 = vn − an∇̂vD(Q q P ), 160 for some suitably chosen stepsize an, to find the optimal v∗ and thus the optimal λ∗. Then we can apply the updating rules in (5.11) and (5.12) to determine the approximate posterior distribution after collecting each observation Y n. 5.3 Dynamic Pricing Policy We have shown a way in which the seller’s beliefs can be updated after observ- ing customer response to a price. It remains to address how that price can be chosen in the first place. In this section, we expand upon the notion of a “Bayes-greedy” pricing policy introduced in Ch. 11 of [5]. Greedy and semi-greedy policies have been widely studied in the literature on dynamic pricing under environmental uncertainty (see e.g. [39]), and our policy may also be viewed as part of that realm. However, in the setting of Bayesian logistic regression, the concept of “greedy” admits important nuances. Ideally, the seller would like to choose the price that maximizes the true revenue curve, p∗ = arg max p p 1 + e−(µᵀx(p)) , (5.15) where we emphasize that x depends on p since p is typically one component of the vector of features. A simple “greedy” policy will simply replace µ in (5.15) by the current posterior mean vector θn. This is typically the approach used in frequentist models (e.g. in [40]) where an MLE estimator is used in place of θn. In the Bayesian setting, however, this approach will under-perform, because it does not use all of the available information. In particular, it does not account for 161 the uncertainty in our beliefs, expressed by Σn. The covariance matrix is important because it specifies a whole family of possible revenue curves, parametrized by µ ∼ N (θn,Σn). Thus, a Bayes-greedy policy will still myopically optimize the expected single-period revenue, but the expectation will be over the entire space of revenue curves. That is, pn = argmax p E [R(p)] = argmax p E [ p 1 + e−(µᵀxn(p)) ] , (5.16) where the expectation is taken with respect to the (approximate) posterior joint distribution of the parameters. 5.3.1 Computation of the Bayes-Greedy Policy In order to use the Bayes-greedy policy, we require the ability to compute the expectation in (5.16). The approximate Bayesian model suggests that the posterior distribution is multivariate normal, which leads to another convenient dimension reduction. If µ ∼ N (θn,Σn) after collecting n observations, then µᵀx ∼ N ((θn)ᵀx,xᵀΣnx) for arbitrary x. Therefore, let W = µᵀx, noticing that W is actually a function of p, and rewrite (5.16) as pn = argmax p E [ p 1 + e−W ] , (5.17) where the expectation is now taken with respect to a univariate normal distribution with appropriately chosen mean and variance. 162 The expectation in (5.17) is known as the logistic-normal integral [104], which plays an important role in statistics. However, this integral is impossible to compute analytically. It may be computed using Monte Carlo simulation, in particular using IPA (it can be shown that the relevant conditions hold). However, [105] offers a tractable approximation E [ 1 1 + e−W ] ≈ 1 1 + e− E(W ) γ , where γ = √ 1 + pi8 Var(W ). This leads to an approximate Bayes-greedy policy that can be written as pn = argmax p p 1 + e− (θn)ᵀxn(p) γn(p) , (5.18) where γn(p) = √ 1 + pi8 x n (p)ᵀ Σnxn (p). This approximation gives us a closed-form expression for the expected revenue func- tion, so that making a pricing decision using (5.18) is computationally easier. 5.3.2 Analysis of the Bayes-Greedy Policy It can be easily shown that the objective function optimized by the point- estimate policy, RPE(p) = p 1 + e−(θn)ᵀxn(p) , is log-concave (but not concave). As a consequence, this function has a single globally optimal price. We show that the Bayes-greedy objective function in (5.16) 163 possesses the same property, whence it follows that the idea of a “Bayes-greedy price” is well-defined. Theorem 5.3. The Bayes-greedy objective function RBG(p) = E[R(p)] = E [ p 1 + e−(µᵀxn(p)) ] is quasi-concave in p when p > 0. An important consequence of Theorem 5.3 is that, if we apply IPA to optimize RBG, we are guaranteed to converge to the optimal price. In general, IPA is only guaranteed to find a local optimum. However, in this case, we can apply stochastic approximation to solve the problem directly instead of using the approximation in (5.18). However, we are still interested in understanding the approximate problem, since it is easier to solve. One can observe that (5.18) resembles the point-estimate objective, but with an additional factor γn (p) incorporating our uncertainty about the regression coefficients. The following proposition summarizes structural proper- ties of this factor. Proposition 5.4. The factor γn(p) ≥ 1 is convex in p. The variance factor can be viewed as the risk we have to take when choose a price. In most problem instances that we have observed, the factor γn(p) is not only a convex function, but also an increasing function within the domain of bidding prices. This suggests that the risk is higher when we take a higher price, but the possible reward is also higher. Moreover, the probability of success decreases when we choose a higher price. However, since the factor γn(p) is greater than 1, the 164 Bayes-greedy policy tends to explore higher prices than the point-estimate policy, leading to possible higher profit. 5.4 Numerical Experiments In this section, we present numerical experiments using the approximate Bayesian learning model with stochastic approximation proposed in Section 5.2 and the Bayes- greedy policy proposed in Section 5.3. We compare this with several alternative approaches, described as follows: 1. The standard frequentist logistic regression approach with the point-estimate policy. In this approach, logistic regression is reapplied after collecting each observation to estimate parameters in the demand function. Then, a pricing decision is made using the point-estimate policy mentioned in Section 5.3. 2. The variational lower bound approach in [98], using the point-estimate policy to make pricing decisions. 3. The variational lower bound approach in [98], with the proposed Bayes-greedy policy. This is used to show the advantage of choosing the parameter λ opti- mally using IPA. Suppose that we are running a computer company, and one of our standard desktop computer models has production cost c. When we set a price for selling, we restrict the prices to be within the range [pl, pu], where the lower bound pl ≥ c. This means that we never set a price that is lower than the cost, and it is very unlikely 165 that we will make a sale if the selling price is above pu. Our objective is to maximize the profit function as in (5.2). In our experiment, we choose c = 300, pl = 300 and pu = 500. We consider a finite number of possible bidding prices from 300 to 500 in increments of 10. For the purposes of this example, we use a two-parameter model, that is, µ = (µ1, µ2) and x (p) = [1, p]ᵀ. We begin with a prior mean θ0 = [500,−1]ᵀ. The corresponding demand and profit curves, based on this prior, are shown as the solid green lines in Figure 5.1. The figure also shows three different realizations of µ in which the maximum possible profits are “low,” “medium,” and “high.” Minor changes in the regression parameters can significantly alter the shape of the profit curve. From Figure 5.1(a), we see that the prior is essentially telling us that any customer is highly likely to purchase the computer for prices between $300 and $500. Considering the upper and lower bounds we have chosen for our price, this type of prior can be understood as uninformative (since our belief suggests that a customer will buy the product for almost any price in the range). The prior covariance matrix we choose for one specific setting is given by Σ0 =     100 0 0 0.01     . Instead of interpreting this as our uncertainty about the parameters in the prior belief, it may be more meaningful to consider Σ0 as the uncertainty about possible prices that buyers will pay. Notice that the magnitudes of the two variances are quite different, again due to the extreme sensitivity of the profit curve to small changes in the regression parameters, particularly the price sensitivity µ2. If the 166 variance of µ2 is too large, this essentially means that µ2 has a high probability of being positive, which is quite unlikely to occur in practice. Furthermore, in a practical application of the Bayes-greedy policy, we may have x (p) = [x˜, p]ᵀ, where x˜ contains product and customer attributes unrelated to the price. For the purpose of myopically optimizing the price, this model is equivalent to a two-parameter model where multiple features are embedded into µ1, in which case the variance of µ1 actually represents the variance of a sum of random variables, and should be much larger than the variance of µ2. We compare the performances, including pricing decisions, single-period profit and cumulative profit, of the approach proposed in this chapter (referred to as “IPA- Bayes”) and three alternative approaches. The results are reported for the first 20 iterations and averaged over 1000 sample paths, with the true value of µ fixed according to the three scenarios shown in Figure 5.1. In all numerical experiments, the Bayes-greedy policy refers to the approximate policy in equation (5.18). We briefly discuss each of the three scenarios below. Low-truth scenario. The parameters of the low truth setting are µ1 = 40 and µ2 = −0.115. With pre-specified values for µ1 and µ2, the optimal bidding price is 340. As shown in Figure 5.2, all four methods start with the same bidding price initially and converge after 6 iterations, but the values they converge to are different. All methods converge to prices below the optimal selling price, but the prices from IPA-Bayes and the logistic regression method are closer to optimal, and produce similar profits. IPA-Bayes adjusts more quickly to new information than the 167 Low truth Med. truth High truth Prior 300 350 400 450 500 0.0 0.2 0.4 0.6 0.8 1.0 Prices Pro ba bil ity of Su cce ss (a) Probability of success Low truth Med. truth High truth Prior 300 350 400 450 500 0 50 100 150 Prices Pro fit (b) Profit curves Figure 5.1: Probability of success and corresponding profit curve as a function of the price under three different scenarios (a) Bidding prices (b) Single-period-profit (c) Cumulative profit Figure 5.2: Plots of bidding prices, single-period profit and cumulative profits over time under the low-truth scenario other three methods, without the volatile behavior observed for frequentist logistic regression. Additionally, IPA-Bayes shows advantages in single-period profit during the first 5 iterations, resulting in higher cumulative profit. Medium-truth scenario. The true parameters are µ1 = 32.5 and µ2 = −0.08, and the optimal bidding price is 380. Figure 5.3 shows that all four methods converge to a price close to optimal. Both the single-period profit and the cumulative profit from the IPA-Bayes method dominate those from the other three methods. Note that, while the frequentist method explores higher prices than IPA-Bayes for some 168 (a) Bidding prices (b) Single-period-profit (c) Cumulative profit Figure 5.3: Plots of bidding prices, single-period profit and cumulative profits over time under the medium-truth scenario (a) Bidding prices (b) Single-period-profit (c) Cumulative profit Figure 5.4: Plots of bidding prices, single-period profit and cumulative profits over time under the high-truth scenario time, this behavior is actually too aggressive, and produces lower profits. High-truth scenario. The parameters of the high-truth setting are µ1 = 55 and µ2 = −0.125, with the optimal price being $420. As shown in Figure 5.4, both IPA-Bayes and Bayes-greedy start from a lower bidding price than the other meth- ods, due to the effect of the uncertainty factor γn (p). However, IPA-Bayes quickly adjusts and increases the bidding price, eventually getting close to optimal, and dominating the other methods in terms of single-period profit. After 10 iterations, frequentist logistic regression catches up and produces similar single-period profit, at the expense of volatile behavior and smaller profits in the early iterations. 169 Discussion. Frequentist logistic regression generally performs well after a few iter- ations. However, in the sequential setting, we have to refit a new logistic regression model after every observation, which becomes more time-consuming as the number of observations increases. The Bayesian learning model, while less accurate (due to the approximation of conjugacy), provides a quick and efficient way to update pa- rameters, and generally produces a “smoother” sequence of prices; essentially, the uncertainty encoded in the covariance matrix smooths the pricing decision, com- pared to the volatile prices chosen by the frequentist method in the early iterations. Among the methods using approximate Bayesian inference, IPA-Bayes is consis- tently the best. In these examples, some competing policies tend to perform very similarly to IPA-Bayes after about 10 iterations. However, in the specific context of B2B pricing, the early iterations are especially important because each individual contract tends to have much higher value, and the opportunity cost of pricing suboptimally is more severe. To give some perspective, if the value of every contract is on the order of hundreds of thousands, or millions, of dollars, the overall planning horizon will be shorter, and the first 5− 10 iterations will become very significant. 5.5 Conclusion We have presented an approximate Bayesian approach for learning the pa- rameters in a logistic regression model, with specific application to learning revenue curves in B2B pricing problems. We use infinitesimal perturbation analysis and 170 stochastic search to improve the quality of the approximation. We also consider a pricing policy that incorporates uncertainty about the parameters into the esti- mated expected revenue curve, and chooses a price that optimizes this aggregated function. The proposed model and pricing policy show encouraging results in our empirical experiments. Future work will test the proposed approach on real-world pricing data, where the underlying statistical model can be high-dimensional. 171 Chapter 6: Conclusion In this thesis we have proposed two different metamodeling approaches that incorporates direct gradient estimates for solving simulation optimization problems with continuous variables and a knowledge-gradient method that employs variational Bayesian technique for ranking and selection problems. We have also proposed an approximate Bayesian statistical model and price recommendation strategy in business-to-business (B2B) pricing context. In Chapter 2, we analyzed regression models that explicitly incorporate di- rect gradient estimators, and derived the corresponding parameter estimators. We provided preliminary evidence for the potential gains from the DiGAR approach by comparing with standard regression both theoretically via analytical calculations un- der settings with more restrictive assumptions, and empirically via simple queueing examples where the assumptions under which the theoretical results are established do not hold. More generally, we investigated the idea of augmenting statistical mod- els when direct gradient estimators are available, motivated by stochastic simulation settings. We provided an alternative model for the local improvement step in the sequential RSM approach used in experimental design for optimization. In Chapter 3, we investigated the idea of incorporating gradient estimates into 172 stochastic kriging by extrapolating additional responses using the original responses and gradient estimates. This approach is not restricted to stochastic kriging, but can be applied to other metamodeling approaches as well. We analyzed the proposed GESK model theoretically under simplified settings and showed that it provides predictions with smaller MSE than stochastic kriging. We also conducted numerical experiments and illustrated the performance of the GESK model. We presented two different strategies, namely PMLE and IMSE, to determine extrapolation step sizes used in GESK. Effectiveness of these two strategies were compared using numerical examples. In Chapter 4, we created a new Bayesian model for simultaneously learning unknown means and unknown correlations in fully sequential R&S. We derived a new VIP for ranking and selection with unknown correlation structures. The new procedure intuitively generalizes VIP for R&S with known correlations, with the additional ability to incorporate the decision-makers uncertainty about the corre- lation structure into decision-making. We proved that the value of information is greater when the correlation structure becomes unknown. We also argued that the incremental information loss from a single application of approximate Bayesian in- ference eventually vanishes. We provided numerical results to show the value added by learning unknown correlations. In particular, we studied a version of the wind farm placement problem using real data. In Chapter 5, we considered the problem of optimally choosing prices to max- imize revenue or profit from transactions with heterogeneous customers. We de- veloped a learning model that maintains a set of beliefs about the effects of the 173 significant characteristics, but is able to adjust and improve those beliefs as new data come in. We also developed an optimization algorithm that recommends a price to maximize average revenue, based on the estimates provided by the predic- tive and learning models. The uncertainty measured by the learning model is used as a factor in the price calculation. For example, if the estimate from the data sug- gests that we should quote a high price, but there is a large amount of uncertainty suggesting that the estimate is unreliable, the final recommendation made by the procedure will tend to be more conservative. We conducted simulations to check the performance of the optimal prices. The results of the simulations suggest that the optimal bidding algorithm has the potential to substantially increase cumulative revenue over time. Our work has initiated some new ideas and points to several other more general directions for future research. The proposed DiGAR model introduced in Chapter 2 can be used in the application to simulation-based optimization, , for example, sequential RSM, which was one of the motivations for choosing regression models for incorporating direct stochastic gradient estimators. How much gains will be realized in the optimization efficiency from the improved linear regression model? Although it is reasonable to expect improvements, since the new method does obtain better fitted values than simple linear regression, both theoretical work and numerical experimentation are needed to characterize and quantify the improvements. The GESK method proposed in Chapter 3 use linear extrapolation with the same step size and assume that only one additional point is extrapolated from each design point. More sophisticated techniques could use the local response surface 174 information and adaptively determine the extrapolation strategy. This is especially important in higher-dimensional problems with multiple extreme values. Another line of research is on the comparison between GESK and SKG. Improvements from incorporating gradient estimates can be expected from both models. However, the question is when does one model performs better than the other? To compare these two models theoretically will provide more insights about this question and lead to guideline for practitioners as to when to choose each of these two models for practitioners. 175 A.1 Analytical Results for M/M/1 and U/U/1 Queues For the M/M/1 queue, the true models for an interarrival mean of 0.2 are given by y(2)(x) = x+ x 2 5 + x, y(3)(x) = x+ 5x 2 (5 + x)2 + x3(15 + 2x) (5 + x)3 , y(4)(x) = x+ 25x 2 (5 + x)3 + 25x3 (5 + x)4 + 5x3(15 + 2x) (5 + x)4 + x4(225 + 50x+ 3x2) (5 + x)5 , y(5)(x) = x+ 125x 2 (5 + x)4 + 250x3 (5 + x)5 + 25x3(15 + 2x) (5 + x)5 + 5x4(225 + 50x+ 3x2) (5 + x)6 +25x 4(15 + 2x) (5 + x)6 + 250x4 (5 + x)6 + x5(10 + x)(350 + 65x+ 4x2) (5 + x)7 . 176 For the U/U/1 queue, the true models are given by y(2) = δ14 − θ1 2 + 3θ2 2 + 1 δ1 ( δ22 12 + θ21 4 − θ1θ2 2 + θ22 4 ) y(3) = 5δ112 − θ1 + 2θ2 + 1 12δ1 ( 2δ22 + 9θ21 − 18θ1θ2 + 9θ22 ) − 1 12δ21 (θ1 − θ2)(δ22 + 2θ21 − 4θ1θ2 + 2θ22) y(4) = 107δ1192 − 25θ1 16 + 41θ2 16 + 1 2880δ1 (750δ22 + 4590θ21 − 9180θ1θ2 + 4590θ22) − 1 48δ21 (θ1 − θ2)(13δ22 + 35θ21 − 70θ1θ2 + 35θ22) + 12880δ31 (13δ42 + 270δ22θ21 − 540δ22θ1θ2 + 270δ22θ22 + 405θ41 − 1620θ31θ2 + 2430θ21θ22 − 1620θ1θ32 + 405θ42) y(5) = 221δ1320 − 107θ1 48 + 155θ2 48 + 1 2880δ1 (1070δ22 + 8430θ21 − 16860θ1θ2 + 8430θ22) − 1 48δ21 (θ1 − θ2)(29δ22 + 99θ21 − 198θ1θ2 + 99θ22) + 12880δ31 (49δ42 + 1230δ22θ21 − 2460δ22θ1θ2 + 1230δ22θ22 + 2325θ41 − 9300θ31θ2 + 13950θ21θ22 − 9300θ1θ32 + 2325θ42) − 1 720δ41 (θ1 − θ2)(9δ42 + 80δ22θ21 − 160δ22θ1θ2 + 80δ22θ22 + 96θ41 − 384θ31θ2 + 576θ21θ22 − 384θ1θ32 + 96θ42) A.2 Gradient Estimation for G/G/1 Queue Let Ak be the interarrival time between the (k − 1)st and kth customer (by convention, taking A1 to be the time of the 1st arrival), and let Xk be the service time of the kth customer. The system time of the kth customer, denoted by Tk, 177 satisfies the well-known Lindley equation: Tk+1 = Xk+1 + (Tk − Ak+1)+, (1) where a+ = max(a, 0). The infinitesimal perturbation analysis (IPA) estimator is then obtained by simple differentiation, which for a general parameter θ is given by ( [46]): dTk+1 dθ = dXk+1 dθ + (dTk dθ − dAk+1 dθ ) 1{Tk ≥ Ak+1}, k > 1, with dT1 dθ = dX1 dθ . (2) For x a parameter of the (common) customer service time distribution, the unbiased IPA estimator is dTk+1 dx = dXk+1 dx + dTk dx 1{Tk ≥ Ak+1}, where dX/dx can be calculated based on the distribution for the random variable X. For example, if X is exponentially distributed (with mean x), then dX/dx is simply given by X/x, and (A.2) becomes dTk+1 dx = Xk+1 x + dTk dx 1{Tk ≥ Ak+1}, k > 1, with dT1 dx = X1 x , the latter assuming that the system starts empty. This is what is used for the M/M/1 queue example. Similarly, for the U/U/1 example, where the interarrival time and service time distributions are U(θ1−δ1, θ1 +δ1) U(θ2−δ2, θ2 +δ2), respectively, the four unbiased 178 IPA estimators are ∂Tk+1 ∂θ1 = (∂Tk ∂θ1 − 1 ) 1{Tk ≥ Ak+1}, k > 1, with ∂T1 ∂θ1 = 0, ∂Tk+1 ∂θ2 = 1 + ∂Tk∂θ2 1{Tk ≥ Ak+1}, k > 1, with ∂T1 ∂θ2 = 1, ∂Tk+1 ∂δ1 = (∂Tk ∂δ1 − Ak+1 − θ1 δ1 ) 1{Tk ≥ Ak+1}, k > 1, with ∂T1 ∂δ1 = 0, ∂Tk+1 ∂δ2 = Xk+1 − θ2δ2 + ∂Tk∂δ2 1{Tk ≥ Ak+1}, k > 1, with ∂T1 ∂δ2 = X1 − θ2δ2 , 179 Bibliography [1] M. C. Fu. Optimization via simulation: A review. Annals of Operations Research, 53:199–248, 1994. [2] M. C. Fu. Optimization for simulation: Theory vs. practice (Feature Article). INFORMS Journal on Computing, 14(3):192–215, 2002. [3] M. C. Fu. Simulation optimization: Evolution or revolution? INFORMS Journal on Computing, 14(3):226–227, 2002. [4] Handbook of Simulation Optimization. Springer, 2014. [5] W. B. Powell and I. O. Ryzhov. Optimal Learning. John Wiley and Sons, 2012. [6] Y. C. Ho and X. R. Cao. Perturbation Analysis and Discrete Event Dynamic Systems. Kluwer Academic, 1991. [7] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Aca- demic Publishers, Boston, Massachusetts, 1991. [8] P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer, New York, 2004. [9] R. Y. Rubinstein. Monte Carlo Optimization, Simulation and Sensitivity of Queueing Networks. John Wiley & Sons, 1986. [10] R. Y. Rubinstein and A. Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. John Wiley & Sons, 1993. [11] M. C. Fu. Stochastic gradient estimation. In S. G. Henderson and B. L. Nelson, editors, Handbooks of Operations Research and Management Science, vol. 13: Simulation, pages 575–616. North-Holland Publishing, Amsterdam, 2006. 180 [12] S. Asmussen and P. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer, New York, 2007. [13] M. C. Fu. What you should know about simulation and derivatives. Naval Research Logistics, 55(8):723–736, 2008. [14] R. R. Barton and M. Meckesheimer. Metamodel-based simulation optimiza- tion. In S. G. Henderson and B. L. Nelson, editors, Handbooks in Operations Research and Management Science: Simulation, chapter 18, pages 535–574. Elsevier, 2006. [15] R. R. Barton. Simulation optimization using metamodels. In M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, editors, Proceedings of the 2009 Winter Simulation Conference, pages 230–238, Piscataway, New Jersey, December 2009. Institute of Electrical and Electronics Engineers, Inc. [16] J.P.C. Kleijnen. Design and Analysis of Simulation Experiments. New York: Springer, 2008. [17] Feng Yang, Bruce Ankenman, and Barry L Nelson. Efficient generation of cycle time-throughput curves through simulation and metamodeling. Naval Research Logistics, 54(1):78–93, 2007. [18] N.A.C. Cressie. Statistics for spatial data. Wiley series in probability and mathematical statistics: Applied probability and statistics. John Wiley and Sons, New York, 2, revised edition, 1993. [19] Michael L. Stein. Interpolation of spatial data: some theory for kriging. Springer series in statistics. Springer-Verlag, New York, 1999. [20] Jack P. C. Kleijnen, Wim C. M. van Beers, and Inneke van Nieuwenhuyse. Constrained optimization in expensive simulation: Novel approach. European Journal of Operational Research, 202(1):164–174, April 2010. [21] B. E. Ankenman, B. L. Nelson, and J. Staum. Stochastic Kriging for Simula- tion Metamodeling. Operations research, 58(2):371–382, March 2010. [22] W. Liu. Development of gradient-enhanced kriging approximations for multi- disciplinary design optimization. PhD thesis, University of Notre Dame, 2003. [23] T. J. Santner, B. Williams, and W. Notz. The Design and Analysis of Com- puter Experiments. Springer-Verlag, 2003. [24] Y. C. Ho, L. Shi, L. Dai, and W. Gong. Optimizing discrete event dynamic systems via the gradient surface method. Discrete Event Dynamic Systems: Theory and Applications, 2:99–120, Jan 1992. [25] X. Chen, B. E. Ankenman, and B. L. Nelson. Enhancing stochastic kriging metamodels with gradient estimators. Operations Research, 61(2):512–528, 2013. 181 [26] J. J. Alonso and H. S. Chung. Using gradients to construct cokriging approx- imation models for high-dimensional design optimization problems. In 40th AIAA Aerospace Sciences Meeting and Exhibit, AIAA, pages 2002–0317, 2002. [27] S. S. Gupta and K. J. Miescke. Bayesian look ahead one-stage sampling al- locations for selection of the best population. Journal of Statistical Planning and Inference, 54(2):229–244, 1996. [28] P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008. [29] S. E. Chick and K. Inoue. New procedures to select the best simulated system using common random numbers. Management Science, 47:1133–1149, 2001. [30] S. E. Chick, J. Branke, and C. Schmidt. Sequential sampling to myopically maximize the expected value of information. INFORMS Journal on Comput- ing, 22(1):71–80, 2010. [31] H. S. Chang. Converging marriage in honey-bees optimization and applica- tion to stochastic dynamic programming. Journal of Global Optimization, 35(3):423–441, 2006. [32] J. Branke, S. E. Chick, and C. Schmidt. Selecting a selection procedure. Management Science, 53(12):1916–1932, 2007. [33] E. Cope. Bayesian strategies for dynamic pricing in e-commerce. Naval Re- search Logistics, 54(3):265–281, 2007. [34] V. F. Farias and Ben. Van Roy. Dynamic pricing with a prior on market response. Operations Research, 58(1):16–29, 2010. [35] A. X. Carvalho and M. L. Puterman. Learning and pricing in an internet environment with binomial demands. Journal of Revenue and Pricing Man- agement, 3(4):320–336, 2005. [36] M. Chhabra and S. Das. Learning the Demand Curve in Posted-Price Digi- tal Goods Auctions. In Proceedings of the 10th International Conference on Autonomous Agents and Multi-Agent Systems, pages 63–70, 2011. [37] Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009. [38] A. V. den Boer and B. Zwart. Dynamic pricing and learning with finite inventories. Submitted for publication, 2011. [39] J. M. Harrison, N. B. Keskin, and A. Zeevi. Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58(3):570–586, 2012. 182 [40] J. Broder and P. Rusmevichientong. Dynamic pricing under a general para- metric choice model. Operations Research, 60(4):965–980, 2012. [41] George E. P. Box and Norman R. Draper. Response surfaces, mixtures, and ridge analyses. Wiley & Sons, New York, Jan 2007. [42] M. C. Fu. Gradient estimation. In S. G. Henderson and B. L. Nelson, edi- tors, Handbooks in Operations Research and Management Science: Simulation, chapter 19, pages 575–616. Elsevier, 2006. [43] S. Weisberg. Applied Linear Regression, 3rd Edition. Wiley & Sons, New York, 2005. [44] Alvin C. Rencher and G. Bruce. Schaalje. Linear Models in Statistics. Wiley & Sons, New York, 2007. [45] R. J. Beckman and R. D. Cook. Outliers. Technometrics, 25(2):119–163, 1983. [46] R. Suri and M. A. Zazanis. Perturbation analysis gives strongly consistent sensitivity estimates for the M/G/1 queue. Management Science, 34:39–64, 1988. [47] M. C. Fu. Convergence of a stochastic approximation algorithm for the GI/G/1 queue using infinitesimal perturbation analysis. Journal of Opti- mization Theory and Applications, 65(1):149–160, 1990. [48] P. L’Ecuyer, N. Giroux, and P. Glynn. Stochastic optimization by simulation: Numerical experiments with the M/M/1 queue in steady-state. Management Science, 40(10):1245–1261, 1994. [49] Anatoly Zhigljavsky, Holger Dette, and Andrey Pepelyshev. A new approach to optimal design for linear models with correlated observations. Journal of the American Statistical Association, 105(491):1093–1103, 2010. [50] J. P. C. Kleijnen and W. C. M. van Beers. Robustness of kriging when interpo- lating in random simulation with heterogeneous variances: some experiments. European Journal of Operational Research, 165(3):826 – 834, 2005. [51] W. Xie, B. L. Nelson, and J. Staum. The influence of correlation functions on stochastic kriging metamodels. In Proceedings of the 2010 Winter Simu- lation Conference, pages 1067–1078, Piscataway, New Jersey, December 2010. Institute of Electrical and Electronics Engineers, Inc, Winter Simulation Con- ference. [52] X. Chen, B. E. Ankenman, and B. L. Nelson. The effects of common random numbers on stochastic kriging metamodels. ACM Transactions on Modeling and Computer Simulation, 22(2):7:1–7:20, March 2012. 183 [53] F. Zhang and Q. Zhang. Eigenvalue inequalities for matrix product. IEEE Transactions on Automatic Control, 51(9):1506 –1509, Sept. 2006. [54] T. Chu, J. Zhu, and H. Wang. Penalized maximum likelihood estimation and variable selection in geostatistics. The Annals of Statistics, 39(5):2607–2625, 2011. [55] R. Li and A. Sudjianto. Analysis of computer experiments using penalized likelihood in gaussian kriging models. Technometrics, 47(2):111–121, May 2005. [56] P. L’Ecuyer. A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36:1364–1383, 1990. [57] W. C. M. van Beers and J. P. C. Kleijnen. Customized sequential designs for random simulation experiments: Kriging metamodeling and bootstrapping. European Journal of Operational Research, 186(3):1099 – 1113, 2008. [58] J. Staum. Better simulation metamodeling: The why, what, and how of stochastic kriging. In M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, editors, Proceedings of the 2009 Winter Simulation Con- ference, pages 119–133, Piscataway, New Jersey, December 2009. Institute of Electrical and Electronics Engineers, Inc, Winter Simulation Conference. [59] G. Marmidis, S. Lazarou, and E. Pyrgioti. Optimal placement of wind turbines in a wind park using Monte Carlo simulation. Renewable Energy, 33(7):1455 – 1460, 2008. [60] P. Francis, K. Smilowitz, and M. Tzur. The period vehicle routing problem with service choice. Transportation Science, 40(4):439–454, 2006. [61] A. Arlotto, N. Gans, and S. Chick. Optimal employee retention when infer- ring unknown learning curves. In B. Johansson, S. Jain, J. Montoya-Torres, J. Hugan, and E. Yu¨cesan, editors, Proceedings of the 2010 Winter Simulation Conference, pages 1178–1188, 2010. [62] R. E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathe- matical Statistics, 25(1):pp. 16–39, 1954. [63] R. E. Bechhofer, T. J. Santner, and D. M. Goldsman. Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons. Wiley, 1995. [64] S.-H. Kim and B. L. Nelson. A fully sequential procedure for indifference- zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11:251–273, 2001. 184 [65] S.-H. Kim and B. L. Nelson. On the asymptotic validity of fully sequen- tial selection procedures for steady-state simulation. Operations Research, 54(3):475–488, 2006. [66] T. Homem-de-Mello. A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing, 2006. accepted for publication. [67] S.-H. Kim and B. L. Nelson. Selecting the best system. In S.G. Henderson and B.L. Nelson, editors, Handbooks of Operations Research and Management Science, vol. 13: Simulation, pages 501–534. North-Holland Publishing, Am- sterdam, 2006. [68] S.-H. Kim and B. L. Nelson. Recent advances in ranking and selection. In S. G. Henderson, B. Biller, M.-H. Hsieh, J. Shortle, J. D. Tew, and R. R. Barton, editors, Proceedings of the 2007 Winter Simulation Conference, pages 162–172, 2007. [69] L. J. Hong and B. L. Nelson. A brief introduction to optimization via simula- tion. In M.D. Rosetti, R.R. Hill, B. Johansson, A. Dunkin, and R.G. Ingalls, editors, Proceedings of the 2009 Winter Simulation Conference, pages 75–85, 2009. [70] S.-H Kim. Statistical ranking and selection. In S. I. Gass and M. C. Fu, edi- tors, Encyclopedia of Operations Research and Management Science. Springer, 2013. To appear. [71] C. H. Chen, D. He, M. C. Fu, and L. H. Lee. Efficient simulation budget allocation for selecting an optimal subset. INFORMS Journal on Computing, 2008. accepted for publication. [72] D. He, L. H. Lee, C. H. Chen, M. C. Fu, and S. Wasserkrug. Simulation optimization using the cross-entropy method with optimal computing bud- get allocation. ACM Transactions on Modeling and Computer Simulation, 20(1):4:1–4:22, February 2010. [73] C.-H. Chen and L. H. Lee. Stochastic Simulation Optimization: An Optimal Computing Budget Allocation. System Engineering and Operations Research. World Scientific, 2011. [74] W. N. Yang and B. L. Nelson. Using common random numbers and control variates in multiple-comparison procedures. Operations Research, 39:583–591, 1991. [75] B. L. Nelson and F. J. Matejcik. Using common random numbers for indifference-zone selection and multiple comparisons in simulation. Manage- ment Science, 41:1935–1945, 1995. 185 [76] M. C. Fu, J. Q. Hu, C. H. Chen, and X. Xiong. Simulation allocation for determining the best design in the presence of correlated sampling. INFORMS Journal on Computing, 19(1):101–111, 2007. [77] P. I. Frazier, W. B. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal rewards. INFORMS Journal on Computing, 21(4):599–613, 2009. [78] P I. Frazier, J. Xie, and S. E. Chick. Value of information methods for pair- wise sampling with correlations. In S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, editors, Proceedings of the 2011 Winter Simulation Conference, pages 3974–3986, 2011. [79] S. E. Chick and K. Inoue. New two-stage and sequential procedures for se- lecting the best simulated system. Operations Research, 49:732–743, 2001. [80] R. C. H. Cheng and C. S. M. Currie. Optimization by simulation metamod- elling methods. In R.G. Ingalls, M.D. Rossetti, J. S. Smith, and B. A. Peters, editors, Proceedings of the 2004 Winter Simulation Conference, pages 485–490, 2004. [81] B. Biller and C. G. Corlu. Accounting for parameter uncertainty in large-scale stochastic simulations with correlated inputs. Operations Research, 59(3):661– 673, 2011. [82] J. Kadane and R. Trader. A bayesian treatment of multivariate normal data with observations missing at random. Statistical Decision Theory and Related Topics, 4(1):225–234, 1988. [83] F. Dominici, G. Parmigiani, and M. Clyde. Conjugate analysis of multivariate normal data with incomplete observations. Canadian Journal of Statistics, 28(3):533–550, 2000. [84] K. Triantafyllopoulos. Missing observation analysis for matrix-variate time series data. Statistics and Probability Letters, 78(16):2647–2653, 2008. [85] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall, 2000. [86] M. H. DeGroot. Optimal Statistical Decisions. Wiley-Interscience, New York, wcl edition, 2004. [87] S. Kotz and S. Nadarajah. Multivariate t Distributions and Their Applications. Cambridge University Press, New York, 2004. [88] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990. 186 [89] A. Erde´lyi and F. G. Tricomi. The asymptotic expansion of a ratio of gamma functions. Pacific Journal of Mathematics, 1(1):133–142, 1951. [90] G. D. Anderson and S.-L. Qiu. A monotoneity property of the gamma function. Proceedings of The American Mathematical Society, 125(11), 1997. [91] D. He, S. E. Chick, and C. H. Chen. The opportunity cost and OCBA selection procedures in ordinal optimization. IEEE Transactions on Systems, Man, and Cybernetics–Part C, 37:951–961, 2007. [92] B. A. Cosgrove, D. Lohmann, K. E. Mitchell, P. R. Houser, E. F. Wood, J. C. Schaake, A. Robock, C. Marshall, J. Sheffield, Q. Duan, L. Luo, R. W. Higgins, R. T. Pinker, J. D. Tarpley, and J. Meng. Real-time and retrospective forcing in the north american land data assimilation system (NLDAS) project. Journal of Geophysical Research, 108(D22):8842–8853, 2003. [93] G. Gallego and G. Van Ryzin. Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Science, 40(8):999–1020, 1994. [94] Y. Feng and G. Gallego. Optimal starting times for end-of-season sales and op- timal stopping times for promotional fares. Management Science, 41(8):1371– 1391, 1995. [95] E. P. Lazear. Retail pricing and clearance sales. The American Economic Review, 76(1):14–32, 1986. [96] T. P. Minka. Bayesian linear regression. Technical report, Microsoft Research, 2000. [97] H. Qu, I. O. Ryzhov, and M. C. Fu. Ranking and selection with unknown cor- relation structures. In Proceedings of the 2012 Winter Simulation Conference, 2012. to appear. [98] T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000. [99] David J. Spiegelhalter and Steffen L. Lauritzen. Sequential updating of condi- tional probabilities on directed graphical structures. Networks, 20(5):579–605, 1990. [100] S. Kim. Gradient-based simulation optimization. In L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, editors, Proceedings of the 2006 Winter Simulation Conference, pages 159–167, 2006. [101] D. M. Blei, M. I. Jordan, and J. W. Paisley. Variational bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, pages 1367–1374, 2012. 187 [102] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation, and control. Wiley-Interscience, 2005. [103] P. L’Ecuyer. On the interchange of derivative and expectation for likelihood ratio derivative estimators. Management Science, 41(4):738–747, 1995. [104] L. Devroye. Random variate generation. In S. G. Henderson and B. L. Nelson, editors, Handbooks in Operations Research and Management Science: Simu- lation, chapter 4. Elsevier, 2005. [105] G. E. Crooks. Logistic approximation to the logistic-normal integral. Technical report, Lawrence Berkeley National Laboratory, 2009. 188