ABSTRACT

Title of thesis: BAYESIAN BINOMIAL REGRESSION: PROBABILITY OF SUCCESS IN PRO FOOTBALL

Albert de Hombre, Master of Arts, 2006

Thesis directed by: Professor Benjamin Kedem, Department of Mathematics

Bayesian regression provides a flexible alternative to the standard GLM, as it allows the user more control over the analysis. The Bayesian may use his or her prior knowledge to influence the outcome of an experiment. While the GLM is a robust tool that applies to virtually any type of data, it leaves little room for manipulation on the part of a user who may have additional "expert" knowledge beyond the raw data. The assumptions of both approaches are outlined and applied to binomial data from the 2005 NFL season to highlight the advantages and disadvantages of each. Under ideal conditions the Bayesian analysis provides more accurate estimators and features relatively simple computations.

BAYESIAN BINOMIAL REGRESSION: PROBABILITY OF SUCCESS IN PRO FOOTBALL

By Albert de Hombre

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Master of Arts, 2006

Advisory Committee:
Professor Benjamin Kedem
Professor Paul Smith
Professor Lawrence Washington

Copyright by Albert de Hombre, 2006

Contents

1 Bayesian Binomial Regression
  1.1 Assumptions
  1.2 Inference
  1.3 Prediction
2 GLM
  2.1 Assumptions
  2.2 The Binomial Case
3 Application to NFL Data
  3.1 Experimental Design
  3.2 Results
  3.3 Test Data
  3.4 An Altered Prior
  3.5 Case Deletion Diagnostics
4 Simulation
5 Comments
6 References

List of Figures

1 Prior and Posterior Densities at Locations 1 and 2
2 Prior and Posterior Densities at Locations 3, 4, and 5
3 Predictive Probabilities of Winning as a Function of Rushing Yardage for Fixed NP and TO
4 Predictive Probabilities of Winning as a Function of Rushing Yardage for Fixed NP and TO
5 Predictive Probabilities of Winning as a Function of Passing Yardage for Fixed NR and TO
6 Predictive Probabilities of Winning as a Function of Passing Yardage for Fixed NR and TO
7 Posterior Densities at Locations 1 and 2 with Altered Prior
8 Posterior Densities at Locations 3, 4, and 5 with Altered Prior
9 Posterior Densities at Locations 1 and 2 with Diffuse Prior
10 Posterior Densities at Locations 3, 4, and 5 with Diffuse Prior
11 $D^p_i$ Values for Winning and Losing Teams
12 Differences in Predicted Success Probabilities After Removing Case 122
1 Bayesian Binomial Regression

1.1 Assumptions

The following approach for binomial data can be found in Bedrick, Christensen, and Johnson (BCJ 1997); a similar approach is illustrated by Leonard (1972). Suppose that we have data $(y_i, x_i')$, where each $y_i$ is a success proportion from independent binomial$(N_i, p_i)$ random variables and each of the $n$ vectors $x_i'$ is a known vector of covariates. We may assume that the probability of success for a given vector of covariates is $p_i = r(x_i'\beta)$, where $\beta$ is an unknown vector of regression coefficients. Here the transformation $r$ may be any cdf, but we will use the logistic transformation exclusively. Namely,

$$p_i = r(x_i'\beta) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} \qquad (1)$$

Under the above assumptions we may write the likelihood as a function of $\beta$ given the vector of success proportions $Y$ and the corresponding known covariates $X$ as follows:

$$L(\beta|Y) = \prod_{i=1}^{n} \frac{N_i!}{(N_i - N_i y_i)!\,(N_i y_i)!} \,[r(x_i'\beta)]^{N_i y_i} [1 - r(x_i'\beta)]^{N_i - N_i y_i} \qquad (2)$$

For a given prior $\pi(\beta)$, the posterior of $\beta$ is:

$$\pi(\beta|Y) = \frac{L(\beta|Y)\pi(\beta)}{\int L(\beta|Y)\pi(\beta)\,d\beta} \qquad (3)$$

Inference and prediction can be based on a discrete approximation to the posterior distribution. An interesting feature of the procedure is that the prior for $\beta$ is derived from prior assumptions on the probability of success, $p$, rather than on the obscure regression coefficients. A Beta prior for $p$ is assigned, and a change of variables then gives the "proper prior (BCJ 1997)" for $\beta$. The reason for the use of Beta priors for the $\tilde{p}_i$ will be clear shortly: combined with the transformation function $r(x'\beta)$ above, such priors provide a convenient form for the posterior of $\beta$. Specifically, one may assign an empirical prior distribution $\pi_0$ (we will use Beta$(a_i, b_i)$ priors for each $\tilde{p}_i$) of $P$ for a given set of covariates $\tilde{X}$ and derive the induced prior $\pi(\beta)$ for $\beta$ as follows. Let $\pi_0(P) = \pi_0(R(\tilde{X}\beta))$. Then

$$\pi(\beta) = \pi_0(R(\tilde{X}\beta)) \left| \frac{\partial R(\tilde{X}\beta)}{\partial \beta} \right| \qquad (4)$$

where $R$ is the vector transformation that applies $r$ to each row $\tilde{x}_i'$ of the design matrix $\tilde{X}$, which must be a nonsingular square matrix of the same dimension, say $k$, as $\beta$, the unknown vector of regression coefficients. It turns out that under the logistic transformation and the assumed Beta$(a_i, b_i)$ priors for the $\tilde{p}_i = r(\tilde{x}_i'\beta)$, where $a_i = \tilde{N}_i\tilde{y}_i$, $b_i = \tilde{N}_i - \tilde{N}_i\tilde{y}_i$, and $i = 1, 2, \ldots, k$, the induced prior for $\beta$ has the same form as the likelihood. We may see this from the following calculation, using change of variables and assuming independence of the $\tilde{p}_i$:

$$\pi(\beta) = \pi_0(R(\tilde{X}\beta)) \left| \frac{\partial R(\tilde{X}\beta)}{\partial \beta} \right| = \prod_{i=1}^{k} \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)}\, [r(\tilde{x}_i'\beta)]^{a_i - 1} [1 - r(\tilde{x}_i'\beta)]^{b_i - 1} \left| \frac{\partial R(\tilde{X}\beta)}{\partial \beta} \right|$$

Note that $\partial R(\tilde{X}\beta)/\partial \beta$ is the $k \times k$ matrix with $(i,j)$ entry $\partial r(\tilde{x}_i'\beta)/\partial \beta_j$, where

$$\frac{\partial r(\tilde{x}_i'\beta)}{\partial \beta_j} = \frac{(1 + e^{\tilde{x}_i'\beta})\, e^{\tilde{x}_i'\beta}\, \tilde{x}_{ij} - e^{\tilde{x}_i'\beta}\, e^{\tilde{x}_i'\beta}\, \tilde{x}_{ij}}{(1 + e^{\tilde{x}_i'\beta})^2} = \tilde{x}_{ij}\, \frac{e^{\tilde{x}_i'\beta}}{1 + e^{\tilde{x}_i'\beta}} \left[ 1 - \frac{e^{\tilde{x}_i'\beta}}{1 + e^{\tilde{x}_i'\beta}} \right] = \tilde{x}_{ij}\, r(\tilde{x}_i'\beta)\, [1 - r(\tilde{x}_i'\beta)]$$

Therefore $\partial R(\tilde{X}\beta)/\partial \beta = \mathrm{diag}\{ r(\tilde{x}_i'\beta)[1 - r(\tilde{x}_i'\beta)] \}\, \tilde{X}$, and

$$\left| \frac{\partial R(\tilde{X}\beta)}{\partial \beta} \right| = |\det(\tilde{X})| \prod_{i=1}^{k} r(\tilde{x}_i'\beta)\, [1 - r(\tilde{x}_i'\beta)]$$

Thus, given a particular response vector $\tilde{y}$ of success proportions and corresponding design matrix of covariates $\tilde{X}$, under the logistic transformation the induced prior for $\beta$ is:

$$\pi(\beta) = |\det(\tilde{X})| \prod_{i=1}^{k} \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)}\, [r(\tilde{x}_i'\beta)]^{a_i} [1 - r(\tilde{x}_i'\beta)]^{b_i} \qquad (5)$$

The design matrix $\tilde{X}$ must be nonsingular and of dimension $k \times k$, where $k$ is the dimension of the unknown regression coefficient vector $\beta$. Here one may include an additional component to allow for the incorporation of an intercept in the analysis. Typically the design matrix is chosen to represent covariate locations that are rather extreme within the data. These covariates and their corresponding prior distributions can be thought of as "prior knowledge." The choice of a design matrix is discussed further in the context of our application to the NFL data set.
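Since the computations in this thesis were carried out in R, a minimal sketch of the induced prior (5) is given here for concreteness. This is not the original code: the function and argument names are illustrative, and the density is computed on the log scale for numerical stability.

```r
# Logistic transformation (1) and the log of the induced prior (5).
# 'Xtilde' is the nonsingular k x k design matrix; 'a' and 'b' hold the
# Beta(a_i, b_i) parameters. All names here are illustrative.
r <- function(eta) exp(eta) / (1 + exp(eta))

log.induced.prior <- function(beta, Xtilde, a, b) {
  p <- r(drop(Xtilde %*% beta))          # prior-location success probabilities
  log(abs(det(Xtilde))) +
    sum(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(p) + b * log(1 - p))
}
```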
1.2 Inference

The Bayesian approach to estimating regression parameters is quite different from that of standard GLM. The philosophy of the Bayesian, of course, is to estimate (the mean of) a posterior distribution given the data for some prior, while the assumptions of GLM assign the data to some family of distributions in order to estimate such parameters as the mean and variance. Both schools of thought are based on estimating the mean and variance of some assumed type of distribution. As opposed to attempting some maximization problem (the GLM approach), we will proceed with importance sampling to estimate the mean and covariance matrix of $\beta$. It is important to select an importance density function $g(\beta)$ that is "similar in shape to that of the known kernel of the posterior $L(\beta|Y)\pi(\beta)$ (BCJ 1997)." The tails of $g(\beta)$ should be heavy enough to include all extreme values in the data, so that these locations are not over-weighted in the estimation process. To get an estimator for $\beta$, consider the following estimate of the conditional mean. From the density function $g(\beta)$ sample $\beta_1, \beta_2, \ldots, \beta_t$. Then, by the Law of Large Numbers, we can get a discrete approximation to the conditional mean as follows:

$$E(\beta|Y) = \int \beta\,\pi(\beta|Y)\,d\beta = \frac{\int \beta\,[L(\beta|Y)\pi(\beta)/g(\beta)]\,g(\beta)\,d\beta}{\int [L(\beta|Y)\pi(\beta)/g(\beta)]\,g(\beta)\,d\beta} \approx \sum_{i=1}^{t} \beta_i\, \frac{L(\beta_i|Y)\pi(\beta_i)/g(\beta_i)}{\sum_{j=1}^{t} L(\beta_j|Y)\pi(\beta_j)/g(\beta_j)} \qquad (6)$$

Now let

$$q_i = q(\beta_i) = L(\beta_i|Y)\pi(\beta_i)/g(\beta_i) \qquad (7)$$

and

$$\tilde{q}_i = \frac{q_i}{\sum_{j=1}^{t} q_j} \qquad (8)$$

Thus we get the estimator for $\beta$,

$$\hat{\beta} = \sum_{i=1}^{t} \beta_i\, \tilde{q}_i \qquad (9)$$

and the estimator for the covariance matrix,

$$\widehat{\mathrm{cov}}(\beta|Y) = \sum_{i=1}^{t} \beta_i \beta_i'\, \tilde{q}_i - \hat{\beta}\hat{\beta}' \qquad (10)$$

1.3 Prediction

For a new trial $y$ with covariate $x$, the predictive probability of success is

$$p(y = 1|Y, x) = E[r(x'\beta)|Y, x] = \int r(x'\beta)\,\pi(\beta|Y)\,d\beta$$

We may use a discrete approximation to the posterior to get

$$\hat{p}(y = 1|Y, x) = \sum_{j=1}^{t} r(x'\beta_j)\,\tilde{q}_j \qquad (11)$$

Here $Y$ represents all previous observations, while $y$ is the unknown outcome, which has the above probability of success.
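The estimators (7)-(11) reduce to a few lines of R. The sketch below assumes a $t \times k$ matrix `B` of draws from $g$, and functions `log.lik`, `log.prior`, and `log.g` returning the log likelihood (2), log induced prior (5), and log importance density; these names are illustrative rather than taken from the original code.

```r
# Importance-sampling estimators (7)-(11), computed with log-scale weights.
log.q <- apply(B, 1, function(beta) log.lik(beta) + log.prior(beta) - log.g(beta))
q.tilde <- exp(log.q - max(log.q))      # (7), rescaled to avoid underflow
q.tilde <- q.tilde / sum(q.tilde)       # (8): normalized weights

beta.hat <- drop(t(B) %*% q.tilde)      # (9): posterior mean estimate
cov.hat <- t(B) %*% (B * q.tilde) - tcrossprod(beta.hat)   # (10)

# (11): predictive probability of success for a new covariate vector x
p.hat <- function(x) sum(r(drop(B %*% x)) * q.tilde)
```

A single dominant weight in `q.tilde` signals a poorly matched importance density, which is precisely the diagnostic used in section 3.1 below.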
2 GLM

2.1 Assumptions

The versatility of the Generalized Linear Model allows for application to many assumed distributions of data. Instead of imposing a specific distribution, GLM assumes only that the distribution of the data belongs to the exponential family. The data may even be correlated, as described by Kedem and Fokianos (2002). Suppose that we have response data $y_i$ with expected value $E(y_i) = \mu_i$ and density function $f(y_i, \theta_i)$ from the exponential family. Therefore $f(y_i, \theta_i)$ may be written as:

$$f(y_i, \theta_i) = e^{y_i\theta_i + b(\theta_i) + c(y_i)}$$

The expected value of the $y_i$ is related to the covariate vector $x_i$ and an unobservable regression coefficient vector $\beta$ by way of a link function $g$. Particularly, $g(\mu_i) = x_i'\beta$. In this sense a function of the expected value is linear.

2.2 The Binomial Case

The binomial distribution of $y_i$ with parameters $n_i$ and $p_i$ is of the exponential family, as it can be written as (Rencher 2000):

$$f(y_i; p_i) = \frac{n_i!}{(n_i - y_i)!\,y_i!}\, p_i^{y_i} (1 - p_i)^{n_i - y_i} \qquad (12)$$
$$= e^{y_i \ln p_i - y_i \ln(1 - p_i) + n_i \ln(1 - p_i) + \ln\left(\frac{n_i!}{(n_i - y_i)!\,y_i!}\right)}$$
$$= e^{y_i \ln\left(\frac{p_i}{1 - p_i}\right) + n_i \ln(1 - p_i) + \ln\left(\frac{n_i!}{(n_i - y_i)!\,y_i!}\right)}$$
$$= e^{y_i\theta_i + b(\theta_i) + c(y_i)}$$

where $\theta_i = \ln\left(\frac{p_i}{1 - p_i}\right)$, $b(\theta_i) = n_i \ln(1 - p_i) = -n_i \ln(1 + e^{\theta_i})$, and $c(y_i) = \ln\left(\frac{n_i!}{(n_i - y_i)!\,y_i!}\right)$.

The likelihood function is given by:

$$L(\beta) = \prod_{i=1}^{n} e^{y_i\theta_i + b(\theta_i) + c(y_i)} \qquad (13)$$

The estimator of $\beta$ is obtained by maximizing the log-likelihood function:

$$\ln L(\beta) = \sum_{i=1}^{n} y_i\theta_i + \sum_{i=1}^{n} b(\theta_i) + \sum_{i=1}^{n} c(y_i) \qquad (14)$$

For the exponential family $E(y_i) = \mu_i = -b'(\theta_i)$ (Venables and Ripley 2002), which gives the relation between $\theta_i$ and the link function $g$: $g(\mu_i) = x_i'\beta$. The solutions to the corresponding maximization equations (obtained by differentiating the log-likelihood with respect to each $\beta_i$) are found iteratively. This highlights a significant difference from the Bayesian procedure, where the estimators are found using a discrete approximation to the posterior of $\beta$. Also notice that for the binomial distribution

$$-b'(\theta_i) = n\,\frac{e^{\theta}}{1 + e^{\theta}} = np \quad \text{since} \quad \theta = \ln\left(\frac{p}{1 - p}\right) \Rightarrow p = \frac{e^{\theta}}{e^{\theta} + 1} = r(\theta)$$

Therefore $g(x) = r^{-1}(x) = \ln\left(\frac{x}{1 - x}\right)$ is the natural link function for binomial data.
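For comparison, the GLM fit described above is available directly in R through `glm`, which carries out the iterative maximization. The data frame `nfl` and its column names below are hypothetical placeholders for the data set of section 3.

```r
# Logit-link binomial regression fit by iteratively reweighted least squares.
# 'nfl' with columns win, NR, NP, TO, home is a hypothetical data frame.
fit <- glm(win ~ NR + NP + TO + home, family = binomial(link = "logit"), data = nfl)
summary(fit)$coefficients   # estimated coefficients and standard errors
head(fitted(fit))           # fitted success probabilities p_i
```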
3 Application to NFL Data

The Bayesian procedure outlined above provides an alternative approach to logistic regression. This method may provide more accuracy than standard GLM, as it allows for a prior manipulation of the data based on the "expert opinion" of the user. As a lifelong football fan, I will put my opinion to work with the Bayesian method and compare the results to GLM with the previously mentioned link function. For simplicity the data are assumed to be independent.

The data set consists of outcomes from the 2005 NFL regular season. For each game $i$, where $i = 1, 2, \ldots, 256$, one team was randomly selected and the following were recorded: $y_i = 1$ if the selected team won and $y_i = 0$ if not, and the corresponding vector of covariates, $x$, consisting of the selected team's net rushing yards ("NR"), net passing yards ("NP"), turnover differential ("TO"), and a 1 or 0 indicating whether the selected team was the home or away team, respectively. The goal was to determine which of these factors have the most effect on the outcome of a pro football game. Of course, only one of the above listed covariates (home or away) would actually be known before an unplayed game, so this approach is best suited to finding the impact of the previously listed measurable performance indicators on the outcome of a game.

The nature of the data set and the Bayesian method provide plenty of room for manipulation on the part of the user. One important fact to note is that since each covariate vector $x_i$ consists of 5 components, there were no repeated identical observations. In particular, the data are Bernoulli, and thus each success proportion $y_i$ is either 0 or 1. This fact does cause some technical difficulties for assigning Beta priors for the $\tilde{p}_i$. Specifically, if the observed covariate $\tilde{x}_i$ occurs only once, then the corresponding prior distribution will be of the form Beta(0,1) or Beta(1,0), which are both undefined. Under either circumstance we may still symbolically derive a well defined induced prior for $\beta$ for inference, but this inhibits our ability to compare the prior and posterior distributions of the $\tilde{p}_i$. To avoid this problem we employ our knowledge of the data set to find, for each $\tilde{x}_i$, a set of covariate locations that have empirically similar odds of success. We can therefore regard each observation in such a set as a repeated observation of the selected design covariate $\tilde{x}_i$. In that sense we will specify values $\tilde{N}_i > 0$ and $0 < \tilde{y}_i < 1$ in order to get well defined priors for the $\tilde{p}_i$.

3.1 Experimental Design

Before we can derive a prior for $\beta$, we must specify a design matrix $\tilde{X}$. The authors BCJ (1997) suggest selecting "values of the variables that were relatively extreme within the data, but still had substantial probabilities (of success)." To do this we employ our "expertise" of the subject at hand and consider different choices for the design. As mentioned above, for each of the selected design covariate locations $\tilde{x}_i$ we selected from the data other covariate vectors $x_j$ which satisfied similar conditions as the design location $\tilde{x}_i$. This process of course requires intimate knowledge of the data in order to determine what specific conditions must be met to achieve such probabilistic similarity. There is a widespread belief throughout all levels of organized football that three of the biggest factors in determining the outcome of a game are turnovers, success rushing the ball, and home field advantage (drbobsports.com). With this in mind we specify our design:

$$\tilde{X} = \begin{pmatrix} 1 & -123 & -34 & 1 & 0 \\ 1 & -100 & 112 & 1 & 1 \\ 1 & -12 & 18 & -1 & 1 \\ 1 & 100 & -12 & 1 & 1 \\ 1 & 139 & -44 & 0 & 0 \end{pmatrix} \qquad (15)$$

$$\tilde{x}_i' = (\text{intercept}, \text{NR}, \text{NP}, \text{TO}, \text{H/A})$$

A 1 in the turnover column indicates that that team won the turnover battle by 1, i.e., that particular team had one less turnover than their counterpart that day. Similarly, a -123 in the rushing column indicates that that team was out-rushed by 123 yards in that game. The philosophy of this design was to select observations that had a low turnover differential and then contrast high, medium, and low net rushing totals with home and away. The belief here is that while keeping the turnovers relatively constant, we may be able to determine the relative importance of rushing versus being the home or away team. This design could be interpreted as being relatively centered in the turnover dimension and extreme within the rushing and home-or-away dimensions.

As previously mentioned, the data are Bernoulli, and some analysis must be done to determine an appropriate prior for each $\tilde{p}_i$. For each design location $\tilde{x}_i$ the empirical odds of success were determined by selecting from the data other covariate vectors that had similar components. For example, consider $\tilde{x}_2' = (1, -100, 112, 1, 1)$. This team's performance can be characterized as "getting out-rushed by a lot," "gaining a lot more passing," and having a low turnover differential while playing at home. We then made suitable numeric conditions that reflect this characterization and found 10 games that satisfied the criterion. Of these 10 games, only 2 teams won with this type of performance, and the associated prior for $\tilde{p}_2 = r(\tilde{x}_2'\beta)$ is a Beta(2,8). A sketch of this selection process is given below.
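In R this selection might look as follows. The numeric cutoffs are invented purely for illustration, since the thesis chose its conditions by inspecting the data rather than from a fixed rule.

```r
# Illustrative elicitation of the prior at location 2: out-rushed badly, a
# large passing edge, a small turnover advantage, playing at home. The
# cutoffs below are hypothetical, not the ones actually used.
similar <- with(nfl, NR <= -75 & NP >= 75 & TO %in% c(1, 2) & home == 1)
N2 <- sum(similar)            # number of empirically similar games (10 here)
a2 <- sum(nfl$win[similar])   # wins among them (2 here), giving a Beta(2, 8) prior
b2 <- N2 - a2
```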
This is an admittedly crude process, but it reflects our belief that such a performance results in a low probability of success, while maintaining much uncertainty about the actual distribution of the probability of success. For each of the covariate locations $\tilde{x}_i$ listed above, the following priors were assigned:

$$\pi_0(\tilde{p}_1) = \text{Beta}(1,5)$$
$$\pi_0(\tilde{p}_2) = \text{Beta}(2,8)$$
$$\pi_0(\tilde{p}_3) = \text{Beta}(2,2)$$
$$\pi_0(\tilde{p}_4) = \text{Beta}(19,1)$$
$$\pi_0(\tilde{p}_5) = \text{Beta}(7,3)$$

Choosing an appropriate importance density function is of great importance for the accuracy of prediction. The authors BCJ (1997) suggest a multivariate t distribution centered at the MLE for $\beta$ with dispersion "proportional to the asymptotic covariance matrix evaluated at the mode." The goal is to get a density that is "similar in shape to the known kernel $L(\beta|Y)\pi(\beta)$" with "tails that decay less rapidly than the likelihood (BCJ 1997)." If the importance density function is a poor match, then some of the resulting probabilities $\tilde{q}_i$ can be too heavily weighted, causing the posterior distributions to be inaccurate. After several trial runs and using a numerical maximization method, we found that a multivariate t density with 6 degrees of freedom, mean

$$\mu = (-0.29713023,\ 0.01005308,\ 0.00337546,\ 0.52726051,\ -0.02104360) \qquad (16)$$

and dispersion

$$\Sigma = \mathrm{diag}(0.263^2,\ 0.003^2,\ 0.002^2,\ 0.131^2,\ 0.366^2) \qquad (17)$$

works well for the data. If a poorly fitting importance density function is chosen, it will be clear from the resulting discrete approximation of the posterior, as some of the sampled $\beta_t$ will have an extremely high relative weight $\tilde{q}_t$. In trial runs with t densities of different means we sampled 8000 $\beta_t$ and found that in each of these trials there were sampled $\beta_t$ with associated probabilities $\tilde{q}_t$ of nearly .98. This situation is very problematic for acquiring an appropriate posterior for the $\tilde{p}_i$. While these particular $\beta_t$ with a high associated sampled probability were pretty good estimates of the posterior mean, this was not evident in the resulting posterior for the $\tilde{p}_i$, since all of the other sampled $\beta_t$ had an associated probability of less than $10^{-3}$. The result is a posterior density that is virtually indistinguishable among all of the prior covariate locations $\tilde{x}_i$. For example, in one trial with a poorly fitting t density we found that the largest associated probability for any of the sampled $\beta_t$ was approximately .97 and the corresponding $\tilde{p}_1 = r(\tilde{x}_1'\beta_t) \approx 0.15$, a seemingly reasonable probability of success for that particular covariate vector. However, the mean of the entire augmented sample $r(\tilde{x}_1'\beta_j)$ for $j = 1, 2, \ldots, 8000$ was nearly .5, and the resulting discrete approximation to the posterior for $\tilde{p}_1$ had the form of a Beta with a similar mean. In fact the means of all of the $r(\tilde{x}_i'\beta_j)$ for $i = 1, 2, \ldots, 5$ and $j = 1, 2, \ldots, 8000$ were virtually the same. Clearly this is not a desirable situation, as it dilutes any of the information regarding the accuracy of the posterior prediction. We can make the best of this situation by centering the importance t density at the vector $\beta_t$ with the disproportionately high associated probability $\tilde{q}_t$. Choosing the mean vector for the importance t density in this way, and adjusting the degrees of freedom to allow for slower decaying tails, provides a good strategy for finding an appropriate importance density function.

3.2 Results

We sampled vectors $\beta_t$ for $t = 1, 2, \ldots, 8000$ from a multivariate t density with the above named parameters for inference and prediction.
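Drawing the 8000 $\beta_t$ is straightforward with, for example, the mvtnorm package; the thesis code was written from scratch, so this sketch only reproduces the sampling step under the stated parameters.

```r
# Sampling 8000 beta_t from the multivariate t importance density with
# 6 degrees of freedom, mean (16), and diagonal dispersion (17).
library(mvtnorm)
mu <- c(-0.29713023, 0.01005308, 0.00337546, 0.52726051, -0.02104360)
Sigma <- diag(c(0.263, 0.003, 0.002, 0.131, 0.366)^2)
B <- rmvt(8000, sigma = Sigma, df = 6, delta = mu)   # 8000 x 5 matrix of draws
log.g <- function(beta) dmvt(beta, delta = mu, sigma = Sigma, df = 6, log = TRUE)
```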
The estimated vector of regression coefficients was:

$$\hat{\beta} = \begin{pmatrix} -0.47015 \\ 0.01490 \\ 0.00286 \\ 0.58997 \\ 0.33056 \end{pmatrix} \qquad (18)$$

compared to the estimated coefficients using GLM:

$$\hat{\beta}_{glm} = \begin{pmatrix} -0.71893 \\ 0.01999 \\ 0.00607 \\ 0.79196 \\ 0.77375 \end{pmatrix} \qquad (19)$$

and the covariance matrix from the Bayesian procedure:

$$\widehat{\mathrm{cov}}(\beta|Y) = \begin{pmatrix} 2.389\text{e-}02 & 3.605\text{e-}05 & -4.037\text{e-}06 & 3.174\text{e-}04 & -1.276\text{e-}02 \\ 3.605\text{e-}05 & 2.698\text{e-}06 & -2.800\text{e-}08 & 7.388\text{e-}06 & 2.267\text{e-}05 \\ -4.037\text{e-}06 & -2.800\text{e-}08 & 1.885\text{e-}06 & 4.120\text{e-}06 & -1.133\text{e-}05 \\ 3.174\text{e-}04 & 7.388\text{e-}06 & 4.120\text{e-}06 & 7.896\text{e-}03 & -1.511\text{e-}03 \\ -1.276\text{e-}02 & 2.267\text{e-}05 & -1.133\text{e-}05 & -1.511\text{e-}03 & 4.741\text{e-}02 \end{pmatrix} \qquad (20)$$

Therefore the variances of the estimator $\hat{\beta}$ are:

$$\widehat{\mathrm{var}}(\beta|Y) = \begin{pmatrix} 2.3897\text{e-}02 \\ 2.6982\text{e-}06 \\ 1.8858\text{e-}06 \\ 7.8967\text{e-}03 \\ 4.7410\text{e-}02 \end{pmatrix} \qquad (21)$$

compared to the variances of the estimator found using GLM:

$$\mathrm{var}(\hat{\beta}_{glm}) = \begin{pmatrix} 5.1166\text{e-}02 \\ 9.7969\text{e-}06 \\ 4.0763\text{e-}06 \\ 1.7192\text{e-}02 \\ 1.3417\text{e-}02 \end{pmatrix} \qquad (22)$$

It is encouraging that the estimated regression coefficients and the estimated variances of the two respective methods are similar. The fact that the variances of the estimators under the Bayesian method are slightly smaller than those found using GLM suggests that we may have more accurate estimators with the Bayesian method, which could translate to better prediction. To compare the prediction accuracy of the two respective methods we find the predicted probability of success with the estimated parameters for each one of the 256 observations (games). For a given game with observed covariate vector $x_i$, the predicted probability of success under the GLM is:

$$p_i = \frac{e^{x_i'\hat{\beta}_{glm}}}{1 + e^{x_i'\hat{\beta}_{glm}}}$$

If $p_i > .5$ then we predict the outcome to be a win, and if not, a loss. The predicted outcomes were then compared to the actual outcomes, with the following results. The Bayesian procedure with the above given importance t density and design matrix $\tilde{X}$ correctly predicted 84.765 percent of the games (217 out of 256), while GLM with the logit link function and the above estimated parameters correctly predicted 83.98 percent (215 out of 256).

An approximation to the posterior density of each of the $\tilde{p}_i$ for $i = 1, 2, \ldots, 5$ was found by smoothing a random sample from the discrete approximation that takes the value $r(\tilde{x}_i'\beta_j)$ with probability $\tilde{q}_j$. Histograms of the smoothed samples as well as the prior densities for each of the prior locations $\tilde{x}_i$ are given below (figures 1 and 2). It is apparent that the means of the posterior densities are close to what we anticipated as the probability of success for each of the design locations. For example, the prior mean of design location 1 was $1/6 \approx .17$, and from the histogram the posterior mean appears to be close to that. The actual mean of the smoothed sample for location 1 was .24. So in this sense our prior beliefs are reinforced, but we get a much better understanding of the actual distribution of the probability of success for such covariates. There is a significant difference between the prior and posterior densities as far as shape and allowance of extreme events.

Figure 1: Prior and Posterior Densities at Locations 1 and 2

Figure 2: Prior and Posterior Densities at Locations 3, 4, and 5

It is exciting that the Bayesian procedure provides better prediction for the given data, but it may be more interesting to investigate the particular covariate locations at which the two models differ in predicted probability and to consider altering the assigned priors.
To compare the significance of rushing yardage and turnover differential to the widely accepted home field advantage, we consider predictive probabilities of winning as a function of rushing yardage with different combinations of passing and turnover totals for home and away teams. The first diagram (figures 3 and 4) contrasts predictive probabilities as a function of rushing yardage for the fixed values of net passing yardage NP = -100 and turnover differential TO = -2 for home vs. away teams. The second is identical except that TO = 2. Then we consider the same values for TO and adjust the net passing to NP = 100. The respective plots give insight into the effect of turnover differential for fixed passing yardage as a function of rushing yardage for home and away teams. A comparison of the respective distributions shows the relative importance of passing yardage to rushing yardage for fixed turnover differentials, TO = -2 and TO = 2, that can be characterized as "relatively low."

We can draw some very interesting conclusions from the above distributions. First notice that the shape, particularly the rate of growth, is very similar among the 4 distributions with common turnover differentials (counting home vs. away). For those with TO = -2 the growth is relatively slow compared to those with TO = 2. This suggests that it requires a certain amount of success in the running game to overcome turnovers. The plots also strongly indicate that net passing yards is relatively insignificant regardless of turnovers. For example, both distributions with a fixed TO = -2 indicate an approximately equal need of between 75 and 85 rushing yards by the home team to have pr(win) > .5, regardless of having a 100 yard deficit in the passing game. The plots with TO = 2 are similar with respect to the small effect of passing yardage on pr(win), but show a clear difference in that the rushing yardage necessary to have pr(win) > .5 is significantly less.

Figure 3: Predictive Probabilities of Winning as a Function of Rushing Yardage for Fixed NP and TO

Figure 4: Predictive Probabilities of Winning as a Function of Rushing Yardage for Fixed NP and TO

As expected, pr(win|home) >= pr(win|away) in all distributions, but it is interesting to note where the significant separation in pr(win) occurs between the home and away team. It appears that different NR and TO combinations directly affect home field advantage. Apparently home field advantage becomes negligible for extreme NR and TO totals, as indicated in both figure 3 and figure 4. In particular, there is little difference between pr(win) for the home and away team for negative values of TO and large (in the negative sense) net rushing totals. Also, there is little difference between pr(win) for the home and away team for positive values of TO and large net rushing totals. This observation is not surprising, as it simply means that teams that run the ball well and win the turnover battle, which usually means that the particular team plays good defense, effectively eliminate any disadvantage of playing on the road; it also indicates that these teams would have an even greater advantage when playing at home.

In light of the above observations we consider distributions for pr(win|home) vs. pr(win|away) as functions of net passing yardage, NP, for fixed values of NR and TO (figures 5 and 6). The values TO = 2, TO = -2, NR = 50, and NR = -50 are chosen to give a clear contrast while maintaining significant win and loss probabilities. A sketch of how such curves may be traced follows.
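Curves like those in figures 5 and 6 can be traced by sweeping NP over a grid while holding the other covariates fixed. The sketch below reuses the hypothetical `p.hat` from the importance-sampling sketch of section 1.3.

```r
# Tracing predictive probability of winning as a function of net passing
# yardage for fixed NR = 50, TO = 2, home vs. away (cf. figure 5).
NP.grid <- seq(-200, 200, by = 10)
p.home <- sapply(NP.grid, function(np) p.hat(c(1, 50, np, 2, 1)))
p.away <- sapply(NP.grid, function(np) p.hat(c(1, 50, np, 2, 0)))
plot(NP.grid, p.home, type = "l", ylim = c(0, 1),
     xlab = "net passing yards", ylab = "pr(win)")
lines(NP.grid, p.away, lty = 2)   # dashed curve: away team
```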
It is clear from figures 5 and 6 that NP has a relatively small effect on pr(win) compared to NR and TO. For example, in figure 5 we have the fixed values NR = 50 and TO = 2, and the range of pr(win|home) is approximately between .75 and .9 for values of NP between -200 and 200. This means that a 400 yard swing in the passing game has a minimal effect on pr(win). Also, in figure 6, for the fixed values NR = -50 and TO = -2, pr(win) changes very little and barely exceeds .2 for the home team and .15 for the away team. Again the indication here is that winning the turnover battle and out-rushing the other team have the most effect on pr(win). We also see that there is a significant "gap" between the home and away team in both figure 5 and figure 6. Since the rushing totals chosen here were not extremely large or small, we see the home field advantage at work.

Figure 5: Predictive Probabilities of Winning as a Function of Passing Yardage for Fixed NR and TO

Figure 6: Predictive Probabilities of Winning as a Function of Passing Yardage for Fixed NR and TO

3.3 Test Data

For an additional measure of the relative accuracy of the two methods, all computations were repeated using only the first 226 observations. Then the same prediction decision as before was employed to predict the outcome of the final 30 observations. The results of the two methods were identical. Both correctly predicted 26 of the remaining 30 games (26/30 ≈ 0.8667), which is slightly better than the rate obtained using all 256 observations.

3.4 An Altered Prior

To consider the effect of the chosen prior, the procedure was repeated with both an empirically poorly fitting prior and the diffuse prior, i.e., $\pi(\beta) = 1$. For a poorly fitting prior we simply swap each $a_i$ with $b_i$:

$$\pi_0(\tilde{p}_1) = \text{Beta}(5,1)$$
$$\pi_0(\tilde{p}_2) = \text{Beta}(8,2)$$
$$\pi_0(\tilde{p}_3) = \text{Beta}(2,2)$$
$$\pi_0(\tilde{p}_4) = \text{Beta}(1,19)$$
$$\pi_0(\tilde{p}_5) = \text{Beta}(3,7)$$

Each $\pi_0$ is the opposite of the original prior, except for $\pi_0(\tilde{p}_3)$, which is identical. Computation with the above prior yielded the following results:

$$\hat{\beta} = \begin{pmatrix} -0.42812 \\ 0.00658 \\ 0.00367 \\ 0.597518 \\ 0.086264 \end{pmatrix} \qquad (23)$$

$$\widehat{\mathrm{cov}}(\beta|Y) = \begin{pmatrix} 2.3399\text{e-}02 & 6.9333\text{e-}08 & 1.3754\text{e-}05 & -1.9367\text{e-}03 & -1.7714\text{e-}02 \\ 6.9333\text{e-}08 & 2.2376\text{e-}06 & 2.3135\text{e-}07 & -1.9549\text{e-}05 & -5.1948\text{e-}05 \\ 1.3753\text{e-}05 & 2.3134\text{e-}07 & 1.4561\text{e-}06 & 8.8129\text{e-}06 & -3.0065\text{e-}05 \\ -1.9367\text{e-}03 & -1.9549\text{e-}05 & 8.8129\text{e-}06 & 6.1055\text{e-}03 & 8.1005\text{e-}05 \\ -1.7714\text{e-}02 & -5.1948\text{e-}05 & -3.0065\text{e-}05 & 8.1005\text{e-}05 & 4.0605\text{e-}02 \end{pmatrix} \qquad (24)$$

As before, each $x_i$ was predicted to be a success if $\hat{p}_i > .5$ and a failure otherwise. This time the results were not as favorable, but still respectable: prediction in this manner with the "poor" prior correctly predicted 0.8047 of the games, compared to 0.84765 before. Clearly the new prior has an undesirable effect on prediction. Considering the chosen priors, it seems that extreme valued $\tilde{p}_i$ would be impacted the most, since the design locations with an empirically low probability of success were assigned Beta distributions with a large mean, and vice versa. With this in mind, consider the posterior densities for each $\tilde{p}_i$ (figures 7 and 8). There are some interesting, apparent differences among the respective posterior densities under the two priors. The posterior means of the first two locations are noticeably larger, while the third location is relatively unchanged. The densities of the last two locations, particularly location 5, are extremely skewed compared to the original results. The skewing of the densities is to be expected, as the new prior has provided empirically false information.
For example, in the case of location 5, the left hand tail (figure 8) decays more slowly than with the original prior, allowing for a greater probability of failure.

Figure 7: Posterior Densities at Locations 1 and 2 with Altered Prior

Figure 8: Posterior Densities at Locations 3, 4, and 5 with Altered Prior

We now perform all calculations using the diffuse prior, $\pi(\beta) = 1$. Results:

$$\hat{\beta} = \begin{pmatrix} -0.4324 \\ 0.0139 \\ 0.0031 \\ 0.6057 \\ 0.2856 \end{pmatrix} \qquad (25)$$

$$\widehat{\mathrm{cov}}(\beta|Y) = \begin{pmatrix} 2.7576\text{e-}02 & 1.4572\text{e-}05 & -3.7450\text{e-}06 & -4.8933\text{e-}04 & -1.6528\text{e-}02 \\ 1.4572\text{e-}05 & 3.8754\text{e-}06 & -3.7560\text{e-}08 & 1.9311\text{e-}05 & 9.5400\text{e-}06 \\ -3.7450\text{e-}06 & -3.7560\text{e-}08 & 2.1300\text{e-}06 & 1.1266\text{e-}05 & 1.0243\text{e-}05 \\ -4.8933\text{e-}04 & 1.9311\text{e-}05 & 1.1266\text{e-}05 & 7.5078\text{e-}03 & 4.2294\text{e-}04 \\ -1.6528\text{e-}02 & 9.5400\text{e-}06 & 1.0244\text{e-}05 & 4.2294\text{e-}04 & 5.5042\text{e-}02 \end{pmatrix} \qquad (26)$$

The mean of $\beta$ differs slightly from the estimates found using both the original prior and GLM. For example, the estimate found using GLM puts more emphasis on home field advantage, and it is interesting to note that the "poor" prior provides estimates for $\beta_2$ and $\beta_5$, the coefficients for NR and home field advantage, that differ significantly from the others. This is of particular interest since the design $\tilde{X}$ was chosen to contrast these two parameters. Posterior densities for the $\tilde{p}_i$ (figures 9 and 10) are not noticeably different from those found with the original prior, aside from a slight difference in the rate of decay of the tails at some locations.

Figure 9: Posterior Densities at Locations 1 and 2 with Diffuse Prior

Figure 10: Posterior Densities at Locations 3, 4, and 5 with Diffuse Prior

3.5 Case Deletion Diagnostics

We examine the effect on the predictive probabilities of removing "prior" observations. Specifically, we would like to know if there are any data that will significantly affect the predictive probabilities upon deletion. Following the notation of BCJ (1997), if $Y_{(i)}$ denotes the data without $y_i$, then the likelihood of $\beta$ for all data except $y_i$ is:

$$L(\beta|Y_{(i)}) = \frac{L(\beta|Y)}{L(\beta|y_i)} \qquad (27)$$

Therefore the posterior of $\beta$ from the reduced data is:

$$\pi(\beta|Y_{(i)}) = \frac{L(\beta|Y_{(i)})\pi(\beta)}{\int L(\beta|Y_{(i)})\pi(\beta)\,d\beta} = \frac{\pi(\beta|Y)/L(\beta|y_i)}{\int \pi(\beta|Y)/L(\beta|y_i)\,d\beta} \qquad (28)$$

As before we use the discrete approximation to the posterior, which takes values $\beta_j$ with probability $\tilde{q}_{j(i)}$, where

$$\tilde{q}_{j(i)} = \frac{\tilde{q}_j / L(\beta_j|y_i)}{\sum_{k=1}^{t} \tilde{q}_k / L(\beta_k|y_i)} \qquad (29)$$

Thus for each observation $y_i$ we may compute the new predictive probability of success for any covariate vector $x_j$, having removed the observation $y_i$ from the data, and compare the results to the full data set. "The symmetric Kullback-Leibler (KL) divergence is used to measure the discrepancy between full and reduced data predictive distributions (BCJ 1997)." The predictive distribution of a single trial is Bernoulli, and the KL divergence between two Bernoulli distributions with probabilities $p$ and $q$ is

$$J(p,q) = (p - q)\log\left[\frac{p(1-q)}{q(1-p)}\right] \qquad (30)$$

We would like a measure of the effect on all predictive probabilities of removing $y_i$ from the data. The authors BCJ (1997) define a "symmetric predictive divergence diagnostic" as

$$D^p_i = \sum_{j=1}^{n} J\big(p(y=1|Y, x_j),\ p(y=1|Y_{(i)}, x_j)\big) \qquad (31)$$

where $p(y=1|Y_{(i)}, x_j)$ is the probability of success having removed $y_i$ from the data, and $p(y=1|Y, x_j)$ is the predictive probability based on all the data. We may estimate the probability of success based on the reduced data, having removed case $i$, by using the weights $\tilde{q}_{j(i)}$; then

$$p(y=1|Y_{(i)}, x) = \int r(x'\beta)\,\pi(\beta|Y_{(i)})\,d\beta \approx \sum_{j=1}^{t} r(x'\beta_j)\,\tilde{q}_{j(i)} \qquad (32)$$
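A sketch of this diagnostic follows, assuming the weights `q.tilde` and draws `B` from the earlier sketches, an $n \times 5$ covariate matrix `X` with 0/1 responses `y`, and the logistic `r`; `lik.case` and the other names are illustrative.

```r
# Case-deletion weights (29) and predictive divergence diagnostic (31).
lik.case <- function(beta, x, y) { p <- r(sum(x * beta)); p^y * (1 - p)^(1 - y) }
J <- function(p, q) (p - q) * log(p * (1 - q) / (q * (1 - p)))   # (30)

R.all <- r(X %*% t(B))              # n x t matrix of r(x_j' beta_t)
p.full <- drop(R.all %*% q.tilde)   # p(y = 1 | Y, x_j) for every game j

D.p <- sapply(seq_len(nrow(X)), function(i) {
  Li <- apply(B, 1, lik.case, x = X[i, ], y = y[i])
  q.del <- (q.tilde / Li) / sum(q.tilde / Li)   # (29): weights with case i removed
  sum(J(p.full, drop(R.all %*% q.del)))         # (31)
})
```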
Clearly a larger $D^p_i$ value indicates that the observation $y_i$ is more extreme within the data, and perhaps contributes more to the variability of the estimates, which is of course undesirable. In the application to the NFL data set the maximum $D^p_i$ values were .034, .033, and .026, the last of which occurred twice. For simplicity the $D^p_i$ were computed in two groups: teams that won and teams that lost. It is clear from the plots of the respective $D^p_i$ (figure 11) that the above values are outliers and have a relatively large effect on the predictive probabilities. The $D^p_i$ values above correspond to the following observations:

.034 → (49, -67, -4, 0)
.033 → (-112, -156, 4, 0)
.026 → (22, -149, 3, 0) and (53, 51, 3, 1)

Upon further review of the data it was discovered that the largest $D^p_i$ value of .034 corresponded to a data vector that was incorrect. The data vector $x_{122} = (49, -67, -4, 0)$ corresponds to a team that won the game but has a very low predicted probability of winning, $p = .0974$. The correct data vector should include a TO total of 4 instead of -4. This change of course drastically alters the predicted probability of winning: for the corrected vector $x_{122} = (49, -67, 4, 0)$, $p = .914$. We would therefore like to know the effect on all predicted probabilities of removing this observation from the data set (figure 12). Note that in the following calculations the data vector $x_{122}$ is still in its original incorrect form, but we can consider the effect of removing it from the data set by looking at the difference $pr(\text{win}\,|\,\text{all data}) - pr(\text{win}\,|\,\text{all data except case 122})$.

Most of the differences are close to zero, but there are some outliers. For example, it is interesting that among the winning teams the case with the largest decrease in predicted probability was $x_{29} = (87, 20, -2, 1)$, which is similar to the removed $x_{122}$. It is reasonable that data vectors similar to $x_{122}$ would have a decreased predicted probability of success after removing this case, since $x_{122}$ has a very low predicted probability of success but was characterized as a win. The vector $x_{65} = (-112, -156, 4, 0)$ experienced the largest increase in predicted probability of success among winning teams, and it is virtually the opposite of the original $x_{122}$. The most obvious conclusion that can be drawn from these two observations is that the incorrectly recorded $x_{122}$ skewed the weighting of the effect of turnovers on predicted probabilities. The original data vector describes a team that won the game despite giving up 4 more turnovers than the opposition. Thus the detrimental effect of turning the ball over was slightly diminished by this extreme case.

Figure 11: $D^p_i$ Values for Winning and Losing Teams

Figure 12: Differences in Predicted Success Probabilities After Removing Case 122

4 Simulation

The results of inference and prediction with the NFL data set were favorable towards the Bayesian procedure. To get an idea of which procedure may provide better results in general, both are applied to simulated data. Consider the stationary (Shumway and Stoffer 2000) AR(2) time series

$$z_t = \frac{1}{6}z_{t-1} - \frac{5}{6}z_{t-2} + \epsilon_t$$

and let $x' = (1, z_{199}, z_{198})$ with response $y = 1$ if $z_{200} > 0$ and $y = 0$ otherwise. Repeating this 200 times gives a binomial data set with 200 observations; a sketch of this step is given below.
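One replication of this data-generating step might be coded as follows; `arima.sim` is part of base R, while `sim.one` is an illustrative name.

```r
# Simulate one binomial data set of 200 observations: each response is driven
# by its own stationary AR(2) series z_t = (1/6) z_{t-1} - (5/6) z_{t-2} + e_t.
sim.one <- function() {
  t(replicate(200, {
    z <- arima.sim(model = list(ar = c(1/6, -5/6)), n = 200)
    c(y = as.numeric(z[200] > 0), int = 1, z199 = z[199], z198 = z[198])
  }))
}
dat <- sim.one()   # 200 x 4 matrix: response y and covariates (1, z_199, z_198)
```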
Then $\hat{\beta}$ and the predicted probabilities are computed using both methods. This process was repeated 100 times, giving 100 $\widehat{\mathrm{var}}(\hat{\beta}_i)$ and 20,000 $\hat{p}_i$ collectively acquired from the simulation, with the following results.

Average variance of $\hat{\beta}$ from the Bayesian procedure:

$$\frac{1}{100}\sum_{i=1}^{100} \widehat{\mathrm{var}}(\hat{\beta}_i) = \begin{pmatrix} 0.0156 \\ 0.0055 \\ 0.0144 \end{pmatrix}$$

Average variance of $\hat{\beta}$ from GLM:

$$\frac{1}{100}\sum_{i=1}^{100} \widehat{\mathrm{var}}(\hat{\beta}_i) = \begin{pmatrix} 0.0436 \\ 0.0157 \\ 0.0495 \end{pmatrix}$$

Mean squared error from the Bayesian procedure:

$$\frac{1}{20000}\sum_{i=1}^{20000} (\hat{p}_i - y_i)^2 = 0.14018$$

Mean squared error from GLM:

$$\frac{1}{20000}\sum_{i=1}^{20000} (\hat{p}_i - y_i)^2 = 0.12954$$

The results of the two procedures are very similar. Bayesian inference produces $\hat{\beta}$'s with smaller variance, while on average the GLM predicted probabilities have a slightly smaller error. It should be noted that to automate the simulation process, a general algorithm was developed to select a prior and design matrix for each simulated data set. This of course has an effect on inference and prediction. In practice the user has the ability to select the prior and design matrix more carefully, and thus achieve better results, as was the case with the NFL data.

5 Comments

There is a rather large body of literature related to modeling NFL data, but most research focuses on predicting the point spread of a future contest. Harville (1977, 1980) provides an extensive treatment of prediction for high school, college, and pro football using a mixed linear model, which assumes a Normal distribution for the point spreads. Thompson (1975) gives an MLE approach with similar Normal assumptions, while Glickman and Stern (1998) describe an MCMC analysis with Bayesian features. Most of the aforementioned procedures use only past point spreads and do not account for such factors as rushing and passing yards as described here. Of course, with effective predictors for the performance indicators used here, one could use these results to predict point spreads with perhaps more accuracy than the previously mentioned methods, since these performance indicators give a more complete picture of a team's strengths and weaknesses. Football handicappers, such as drbobsports.com, have noticed some of the trends relating football success to such factors as rushing yardage and turnovers, yet it is unclear whether these conclusions were reached numerically or intuitively. The computations used in the Bayesian procedure outlined herein are relatively simple compared to the numerical maximization methods used for GLM, although the use of any statistical computation software renders this point moot. The software package used here was R, and there is no supplemental package that will perform these computations with the logit link function, so considerable effort was put into writing the necessary code from scratch.

6 References

Ashburn, J. R. and Colvert, P. M. (2006), "A Bayesian Mean-Value Approach with a Self-Consistently Determined Prior Distribution for the Ranking of College Football Teams"

Bedrick, E. J., Christensen, R., and Johnson, W. (1997), "Bayesian Binomial Regression: Predicting Survival at a Trauma Center", The American Statistician, Vol. 51, No. 3, 211-218

Glickman, M. E. and Stern, H. S. (1998), "A State-space Model for National Football League Scores", Journal of the American Statistical Association, Vol. 93, No. 441, 25-35

Harville, D. (1977), "The Use of Linear-Model Methodology to Rate High School or College Football Teams", Journal of the American Statistical Association, Vol. 72, No. 358, 278-289

Harville, D. (1980), "Predictions for National Football League Games Via Linear-Model Methodology", Journal of the American Statistical Association, Vol. 75, No. 371, 516-524
Kedem, B. and Fokianos, K. (2002), Regression Models for Time Series Analysis, John Wiley and Sons, Inc.

Leonard, T. (1972), "Bayesian Methods for Binomial Data", Biometrika, Vol. 59, No. 3, 581-589

Pollard, R. (1973), "Collegiate Football Scores and the Negative Binomial Distribution", Journal of the American Statistical Association, Vol. 68, No. 342, 351-352

Rencher, A. C. (2000), Linear Models in Statistics, John Wiley and Sons, Inc.

Shumway, R. H. and Stoffer, D. S. (2000), Time Series Analysis and Its Applications, Springer-Verlag

Stern, H. (1991), "On the Probability of Winning a Football Game", The American Statistician, Vol. 45, No. 3, 179-183

Thompson, M. (1975), "On Any Given Sunday: Fair Competitor Orderings with Maximum Likelihood Methods", Journal of the American Statistical Association, Vol. 70, No. 351, 536-541

Venables, W. N. and Ripley, B. D. (2002), Modern Applied Statistics with S, Springer-Verlag

www.drbobsports.com