ABSTRACT

Title of Dissertation: DATA-INFORMED CALIBRATION AND AGGREGATION OF EXPERT JUDGMENT IN A BAYESIAN FRAMEWORK

Calvin Homayoon Shirazi, Doctor of Philosophy, 2009

Dissertation Directed by: Professor Ali Mosleh, Reliability Engineering Program, Department of Mechanical Engineering

Historically, decision-makers have used expert opinion to supplement a lack of data. Expert opinion, however, is applied with much caution, because judgment is subjective and contains estimation error with some degree of uncertainty. The purpose of this study is to quantify the uncertainty surrounding the unknown of interest, given an expert opinion, in order to reduce the error of the estimate. This task is carried out by data-informed calibration and aggregation of expert opinion in a Bayesian framework. Additionally, this study evaluates the impact of the number of experts on the accuracy of the aggregated estimate. The objective is to determine the correlation between the number of experts and the accuracy of the combined estimate in order to recommend an expert panel size.

DATA-INFORMED CALIBRATION AND AGGREGATION OF EXPERT OPINION IN A BAYESIAN FRAMEWORK

By Calvin Homayoon Shirazi

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2009

Advisory Committee:
Professor Ali Mosleh, Chair
Professor Mohammad Modarres
Professor Jeffrey Herrmann
Professor Michel Cukier
Professor Gregory Baecher (Dean's Representative)

© Copyright by Calvin Homayoon Shirazi 2009

Dedication

To my parents... to my wife and son... who waited patiently for me to reach higher!

Acknowledgement

I express my sincere appreciation to Professor Ali Mosleh for his guidance and support in the completion of this dissertation. I also extend my gratitude to all who contributed to this study.

Table of Contents

Dedication
Acknowledgement
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: Literature Review
  2.1 Eliciting Expert Opinion
  2.2 Utilizing Expert Opinion
Chapter 3: Data Collection and Characterization
  3.1 Data Collection
  3.2 Description of Case Studies
  3.3 Data Characterization
  3.4 Selection of Forecast Accuracy Measure
Chapter 4: Bayesian Formalism
  4.1 Introduction
  4.2 Governing Model
  4.3 Construction of the Likelihood and Posterior: Homogenous Pool
  4.4 Construction of Likelihood and Posterior: Nonhomogenous Pool
  4.5 Construction of Likelihood and Posterior: Hybrid Pool
Chapter 5: Data-Informed Calibration of Expert Opinions
  5.1 Introduction
  5.2 Methodology
  5.3 Performance Assessment of Case-Specific Likelihood Functions
  5.4 Performance Assessment of Generic Likelihood Functions
  5.5 Conclusion
Chapter 6: Data-Informed Aggregation of Expert Opinions
  6.1 Introduction
  6.2 Mathematical Model
  6.3 Aggregation by Simulation
  6.4 Simulation Results
    6.4.1 Aggregation Performance
    6.4.2 Dependent Experts Performance
    6.4.3 Size of Expert Panel
  6.5 Aggregation Using Empirical Data
    6.5.1 Aggregation Performance
    6.5.2 Expert Panel Size
Chapter 7: Summary of Results
  7.1 Research Contribution
  7.2 Data-Informed Calibration of Expert Judgment
  7.3 Data-Informed Aggregation of Expert Judgment
  7.4 Research Limitations
  7.5 Future Research
References

List of Tables

Table 1. Representation of Homogenous Data
Table 2. Representation of Non-Homogenous Data
Table 3. Representation of Hybrid Data
Table 4. Bayesian Treatment of Homogenous Pool Using Case-Specific Likelihood Function
Table 5. Bayesian Treatment of Non-Homogenous Pool Using Case-Specific Likelihood Function
Table 6. Bayesian Treatment of Hybrid Pool Using Case-Specific Likelihood Function
Table 7. Bayesian Treatment of Non-Homogenous (NH), Homogenous (H) and Hybrid Pools Using Case-Specific Likelihood Function
Table 8. Best Fitted Distribution for Expert Relative Errors
Table 9. Numerical Example to Measure Performance of Generic Likelihood Function Using Mean (μ) of Posterior
Table 10. Numerical Example to Measure Performance of Generic Likelihood Function Using Median (u50) of Posterior
Table 11. Numerical Example for Aggregation Procedure Illustration
Table 12. Bayesian Update and Relative Error for Aggregation Example
Table 13. Continuation of Aggregation Example
Table 14. Aggregation Results for Example Data
Table 15. Example for Aggregation Procedure: Out-of-Order Data
Table 16. Aggregation Results for Example: Out-of-Order Data
Table 17. Aggregation Performance: Case-Specific Likelihood
Table 18. Aggregation Performance Summary: Generic Likelihood - Mean
Table 19. Aggregation Performance Summary: Generic Likelihood - Median

List of Figures

Figure 1. Process of Selecting Forecast Accuracy Measure
Figure 2. Construction of Likelihood Functions for Homogenous Data
Figure 3. Treatment of Homogenous Data
Figure 4. Treatment of Non-Homogenous Data
Figure 5. Process Flow of Bayesian Treatment
Figure 6. Histogram of Accumulated Homogenous and Nonhomogenous Data
Figure 7. Distribution Identification for Accumulated Expert Relative Errors
Figure 8. Improvement by Bayesian Treatment in All Empirical Cases
Figure 9. Histogram of All Relative Errors
Figure 10. Lognormal (3P) Distribution for All Relative Errors
Figure 11. Aggregation Simulation Approach
Figure 12. Aggregation Simulation for Dependent Experts
Figure 13. Performance of Aggregation Methods
Figure 14. Simulation Results: Expert Panel Size vs. % Estimates Improved
Figure 15. Fitted Line Plot: Improvement vs. Expert Panel Size
Figure 16. Bayesian Treatment vs. Expert Panel Size

Chapter 1: Introduction

Historically, decision-makers have utilized expert judgment to supplement insufficient data or to carry out a task proficiently. A major source of information in estimating the parameters of risk and reliability models is expert knowledge.
Cases involving new designs, very rare events, and processes beyond our direct experience call for the use of expert opinion as a surrogate source of information. Experts can extensively influence key decisions in the political, financial, legal, and social arenas. Although expert estimates are treated as scientific data, they are applied with much caution. This is because an opinion is not a fact verified by an experiment; it is a person's assessment or judgment about a specific subject. According to the RAND Corporation, opinion is a blend of knowledge and speculation (Forrester, 2005). In the Oxford English Dictionary, speculation denotes assumptions with minimal or no supporting evidence, and knowledge is defined as the theoretical or practical understanding of a subject. Considering these definitions, uncertainty in judgment simply translates into a range of possible outcomes, given the current state of expert knowledge. Though it can be argued that other types of data can also be uncertain, the human psyche introduces a unique category of complications of its own. This means that there are degrees of inherent variation in expert judgment.

Problems in expert judgment studies begin with the identification of the attributes by which one can qualify an individual as an "expert". There is no established intra- or interdisciplinary taxonomy based on the relation between expert qualifications and the accuracy of judgment. Expert selection is often founded on uncorroborated ideas or subjective criteria, such as sufficient knowledge or experience in a discipline. Of course, this kind of general approach is subject to interpretation, which in turn results in inconsistencies across the board. Additionally, the majority of the developed models used for the assessment of expert accuracy are based on the historical performance of the individual expert. Therefore, decision makers need to be aware of the prior performance of the expert. When such information is not available, analysts are puzzled about the quality of the opinion or the degree of confidence to place on the judgment. In practice, decision makers remain uncertain about the proper procedure to evaluate the accuracy of expert judgment.

In contrast to the many studies revealing deficiencies in expert judgment, this research study assesses how well experts are able to make predictions. This task is carried out by data-informed calibration of experts in a Bayesian framework. The Bayesian method begins with the analyst's prior belief about an unknown. Once the expert estimate is obtained, this prior belief is updated using Bayes' theorem to establish a posterior distribution describing the analyst's updated knowledge of the unknown of interest. The main problem in applying the Bayesian method is the complication associated with developing a proper likelihood function. This distribution is a probabilistic model for the data and must capture the interrelationships among the estimates and the unknown. The first part of this research is dedicated to the development and validation of proper likelihood functions. In the beginning, a comprehensive database of observed relative errors of experts in various fields is assembled to determine the distribution of errors. Realizing the norm and the spread of errors, a unique generic likelihood is developed, independent of discipline and capable of improving the expert estimate.
The generic likelihood, along with the case-specific likelihood distributions developed by Droguett and Mosleh (2003), is then tested using empirical data to reveal their ability to reduce future prediction error. To the author's best knowledge, no comparably comprehensive study employing such sizeable empirical data from various fields has been conducted.

This study also considers the impact of the number of experts on the accuracy of the aggregated estimate in a Bayesian framework. Because expert opinion is considered uncertain, it seems logical to consult multiple experts in an attempt to build a more inclusive database, or at least to gather more information. Speculations about a positive correlation between prediction accuracy and the number of experts assert that the more experts are elicited, the higher the accuracy of the combined estimate. The question remains whether empirical data actually support this assertion and, if so, to what extent this link has an impact on practical cases. The second part of this study answers this question. Collected expert judgments are combined in a Bayesian framework using the likelihood distributions developed in the first part of the research study. The total number of estimates with reduced errors is depicted against the corresponding expert panel size. The objective is to determine the correlation between the number of experts and the accuracy of the combined estimate, in order to recommend an expert panel size.

The material presented in this research begins with a comprehensive literature review of eliciting and aggregating expert opinion in Chapter 2. Chapter 3 characterizes the collected empirical data and explains the rationale for the selection of the forecast accuracy measure. An introduction to the Bayesian methodology, as well as detailed mathematical formulations of the likelihood functions and posterior distributions, is presented in Chapter 4. Chapter 5 is dedicated to the results of the calibration studies as well as the performance evaluation of the developed generic likelihood function. Chapter 6 presents the results of the aggregation analysis via empirical data. In this chapter, the Bayesian mathematical aggregation method is evaluated and compared with representative models of axiomatic methods. Additionally, an expert panel size is suggested based on the accuracy of the aggregated estimate achieved using the formulated likelihood functions. The last chapter, Chapter 7, wraps up the topics discussed in this research and summarizes the results of the study for quick reference.

Chapter 2: Literature Review

In the absence of complete scientific information, decision-makers have to rely on their own intuition or on expert opinion (Baldwin, 1975). Expert judgment represents the expert's state of knowledge at the time of response to a question (Keeney and von Winterfeldt, 1991). According to Booker and Meyer (1996), expert opinion is used in the structuring of technical problems, including the determination of relevant information for analysis. It is also used in direct qualitative or quantitative estimates of uncertainties and probabilities. Lannoy and Procaccia (2001) assert that recourse to expert judgment is required in completing, validating, interpreting, and integrating existing data, as well as in predicting the rate of future events and the consequences of a decision.
Other situations requiring expert judgment include determining the present state of knowledge in a field and providing the basis for decision-making in the presence of several options. Issues surrounding the use of expert opinion fall into the two broad categories of eliciting and utilizing the opinion, which include the selection of experts, the determination of expert panel size, the choice of calibration and aggregation methods, and so on. In line with the scope of this research, a brief review of the literature related to eliciting and aggregating expert opinion is presented in this chapter.

2.1 Eliciting Expert Opinion

DeGroot (1988) believed that the range of people who can be considered experts extends from "anyone or any system that will give you a prediction" to "someone whose prediction you will simply adopt as your own posterior probability without modification". Nevertheless, expert judgments should be used with caution, not to replace "hard science" (Apostolakis, 1990). Sources of poor quality in expert judgment can be broadly classified as those associated with the individual expert (i.e., attributes, expert definition or distinction), the actual estimates or judgments, the elicitation process (formal vs. informal elicitation), aggregation or combining of estimates, calibration (performance measures of experts and expertise), and the available technical documents (Mosleh and Forrester, 2005). According to Garthwaite et al. (2005), the quality of expert judgments can be controlled by a formal procedure of expert elicitation and documentation. The application of formal elicitation processes has been recommended by Hora and Iman (1989) and Keeney and von Winterfeldt (1991), among many others.

The formal elicitation of expert judgment started with the establishment of the RAND Corporation in the United States after World War II (Cooke, 1991). RAND developed two formal methods for eliciting expert opinion, Delphi and Scenario Analysis, through a collaborative project with the U.S. Air Force and Douglas Aircraft in 1946 (Ayyub, 2001). Herman Kahn is regarded as the father of scenario analysis (Cooke, 1991). In this method, scenarios, or hypothetical sequences of events, are set forth to concentrate on decision-making processes (Kahn and Wiener, 1967). Helmer and Dalkey were the founders of the Delphi method (Günaydin, 2009). According to Helmer (1977), the Delphi method facilitates communication among experts and therefore assists the formation of a group judgment. Wissema (1982) states that the Delphi procedure was developed to make discussion between experts possible without permitting certain social interactive behavior. By 1974, the Delphi study count exceeded 10,000 (Linstone and Turoff, 1975). The Delphi method has been widely used to generate forecasts in technology, education, and other fields (Cornish, 1977). Delphi is based on a structured process for collecting and refining data from a group of experts by means of a series of questionnaires interspersed with controlled opinion feedback (Adler and Ziglio, 1996).

Many researchers have suggested that performance feedback is a particularly effective method for improving calibration (e.g., Fischhoff, 1982). Perhaps the most intensive study using performance feedback was conducted by Lichtenstein and Fischhoff (1980). Subjects completed 11 training sessions of 200 general-knowledge questions each. At the completion of each training session, they were given personalized feedback, including performance measures of calibration and overconfidence.
This feedback was then discussed with all the subjects for about 5 to 10 minutes. The result of the training was a clear improvement in calibration (Stone, 2000). In some fields, experts have shown relatively well-calibrated judgments. The typical example is meteorology, where forecasts of precipitation and of maximum and minimum daily temperatures have been shown to be well calibrated (Murphy and Winkler, 1977). In contrast, financial analysts have been shown to significantly overestimate corporate earnings growth (Chatfield et al., 1989; Dechow and Sloan, 1997). In the context of environmental risk analysis, Hawkins and Evans (1989) found that industrial hygienists provided reasonably accurate estimates of the mean and 90th percentile of a distribution of personal exposure of chemical-industry workers. Walker et al. (2003) found that experts provided reasonably well-calibrated estimates of mean and 90th percentile ambient, indoor, and personal exposures to benzene.

Human decision-making is a function of heuristics and biases (Tversky and Kahneman, 1974). An important point to consider is elicitation from an expert who has some personal interest in the prediction outcome (Kadane and Winkler, 1988). Also, experts and novices may experience the same biases in decision-making (Ericsson and Staszewski, 1989). Perhaps the most widely used heuristic is judgment by anchoring and adjustment (Tversky and Kahneman, 1974). With this strategy, an expert estimates an unknown by starting from an initial value. This estimate is then adjusted to obtain a nominal value. The adjustment away from the initial value (which is named the anchor) is usually too small (Slovic, 1972), a phenomenon called anchoring. An experiment conducted by Tversky and Kahneman (1974) demonstrated this problem. Subjects were asked to estimate various quantities, stated in percentages (e.g., the percentage of African countries in the United Nations). They were given randomly chosen starting values and had to adjust them to their best estimates. Subjects whose starting values were high ended up with substantially higher estimates than those who started with low values. For example, the median estimates of the percentage of African countries in the U.N. were 25% for subjects who received 10% as their starting point and 45% for those who received 65%.

Another aspect of using expert judgment is the problem of adjusting for overconfidence (Alpert and Raiffa, 1982; Morgan and Henrion, 1990). Shlyakhter et al. (1994) developed an empirical model for adjusting individual expert distributions to account for overconfidence. The model uses a single parameter to calibrate the spread of an expert distribution. Hammitt and Shlyakhter (1999) use this model in their study of expert assessments related to global climate change. Other situations to consider include convergence and conflict among experts (Hynes and Vanmarcke, 1977). Expert elicitation has been criticized in many ways as well, such as for the selection method of experts and the accurate expression of expert knowledge (O'Hagan and Oakley, 2004). Simon and Chase (1973) suggest that for most domains it takes a minimum of ten years of experience to gain expertise. According to Ericsson, Krampe, and Tesch-Römer (1993), expert knowledge is only achieved through continuing involvement in the subject matter. Wilson (1994) states that expert knowledge is more coherent and structured than novice knowledge.
Although there are certainly instances of positive correlations between experience and expertise, there is little reason to expect this relation to apply universally (Shanteau, 2002). Vegelin (2003) states that experience significantly influences accuracy. In the context of Bayesian analysis, elicitation often arises as a method for specifying the prior distribution for an unknown of interest (O'Hagan et al., 2004). Eliciting a prior distribution is difficult due to the subjective nature of the prior (O'Hagan, 1998). An excellent literature review of the elicitation of prior beliefs in the Bayesian framework is presented by Kadane and Wolfson (1998). Expert elicitation has been applied in many studies, such as future climate change (Arnell et al., 2005; Miklas et al., 1995), performance assessment of proposed nuclear waste repositories (Hora and Jensen, 2005; McKenna et al., 2003; Draper et al., 1999; Hora and von Winterfeldt, 1997; Zio and Apostolakis, 1996; Morgan and Keith, 1995; DeWispelare et al., 1995; Bonano and Apostolakis, 1991; Bonano et al., 1990), estimation of parameter distributions (Parent and Bernier, 2003; Geomatrix Consultants, 1998; O'Hagan, 1998), development of Bayesian networks (Pike, 2004; Stiber et al., 1999, 2004; Ghabayen et al., 2006), and interpretation of seismic images (Bond et al., 2007).

Another question in elicitation is determining the number of experts needed. Ashton and Ashton (1985) studied judgmental forecasts of the number of advertising pages in Time magazine. The conclusion was that by combining the forecasts of four experts, the error of the estimates was reduced by 3.5%. The study reported that accuracy improved as the panel size increased up to 13 experts. Hogarth's model (1978) suggested using at least six but no more than 20 experts. Libby and Blashfield (1978) showed an improvement in forecast accuracy when increasing the size of the expert panel from one to three, but recommended an optimum size of between five and nine. Batchelor and Dua (1995) showed an increase in accuracy as the panel grew from 10 to 22 economists; their study also revealed only a small further improvement from adding the remaining 12.

2.2 Utilizing Expert Opinion

In uncertain situations, combining data can reduce error (Armstrong, 2001). For example, Klugman (1945) found that combining judgments led to greater improvements for estimates of heterogeneous items (irregularly shaped lima beans in a jar) than of homogeneous items (identically sized marbles in a jar). Krishnamurti et al. (1999), in a study of short-term weather forecasts, concluded that combining six or seven estimates is needed for accurate predictions. Winkler and Poses (1993) examined physicians' predictions of survival for 231 patients who were admitted to an intensive care unit. Physicians sometimes received unambiguous and timely feedback, so those with more experience were more accurate. They grouped the physicians into four classes based on their experience: 23 interns, four fellows, four attending physicians, and four primary care physicians. The group averages were then averaged. Accuracy improved substantially as they included two, three, and then all four groups. The error measure dropped by 12% when they averaged all four groups across the 231 patients (compared to that of just one group).

The two well-established mathematical approaches to aggregating opinions are axiomatic and Bayesian models (Boring, 2007; Clemen and Winkler, 1997). Many different methodologies have been developed for axiomatic aggregation.
Previous research has considered simple averaging as a mental model of the aggregation process (Anderson, 1981; Dawes, 1979; Einhorn and Hogarth, 1975; Einhorn, Hogarth, and Klempner, 1977; Hastie, 1986; Sniezek and Henry, 1989). Many studies have suggested simple averaging of individual opinions as a method for improving the accuracy of predictions (Armstrong, 1985; Ashton, 1986; Hill, 1982; Hogarth, 1978; Zajonc, 1962; Zarnowitz, 1984). Stone (1961) proposed a linear opinion pool in which the aggregation result is expressed as a linear combination of the estimates. A linear opinion pool provides a very simple mechanism for representing unequal degrees of expertise. The determination of expertise (the weights) can be a subjective matter, prone to numerous assumptions and interpretations (Genest and McConway, 1990). Cooke's classical method is a linear opinion pool, applied widely in Europe (Clemen and Winkler, 1993), including in major studies of nuclear-power risks, among others (Cooke, 1994; Goossens and Harper, 1998; Jones et al., 2001). Morris (1983, 1986) introduced an axiomatic approach to expert aggregation. French (1985) and Genest and Zidek (1986) provide critical reviews of the axiomatic aggregation literature.

The first formal proposal to apply the Bayesian method in expert judgment studies was offered by Morris (1974, 1977). Since the original research by Morris, many forms of Bayesian procedures have been introduced in various papers. Mendel and Sheridan (1989) developed a Bayesian model that allows for the aggregation of non-normal probability distributions. Clemen and Winkler (1993) proposed subjective aggregation of point estimates using "influence diagrams". A Bayesian hierarchical model (where the prior depends on parameters not addressed in the likelihood) was presented by Lipscomb, Parmigiani, and Hasselblad (1998). Wisse, Bedford, and Quigley (2005) introduced a "moment method" to avoid the computational complications of continuous probability distributions. In addition, Genest and Schervish (1986) consider the problem of aggregating expert judgments when the experts do not provide the decision maker with complete probabilistic assessments of the required distributions, but instead offer certain moments of the distributions.

A major issue in aggregation is the problem of dependence among experts. Judgments of multiple experts about a parameter can be extremely informative when the experts are probabilistically independent, conditional on the "true" value. Clemen and Winkler (1985) derive the number of independent experts whose combined information is equivalent to that of a larger number of dependent experts. Dependence is both central to the proper combination of expert judgments and difficult to evaluate (Kallen and Cooke, 2002). Jouini and Clemen (1996) propose a copula-based approach to combining distributions. This approach provides a flexible method for representing dependence among experts. A copula function (e.g., Nelsen, 1999) provides a way to write a joint distribution function as a function of its marginal distributions. Hammitt and Shlyakhter (1999) and Lacke (1998) use copula aggregation models in the contexts of global climate change and colon cancer risk modeling, respectively. Clemen and Reilly (1999) suggest using the multivariate normal copula, which does not require that experts be treated symmetrically and so permits greater flexibility in modeling dependence.
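As a concrete illustration of the copula construction reviewed above (a minimal sketch under assumed inputs, not the model of any cited paper), the following Python code induces dependence between two experts' relative errors with a Gaussian copula; the correlation value and the lognormal marginals are assumptions chosen for demonstration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Assumed pairwise correlation between the two experts' assessments.
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])

# Step 1: draw correlated standard normals (the Gaussian copula).
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=10_000)

# Step 2: map them to correlated uniforms via the standard normal CDF.
p = stats.norm.cdf(z)

# Step 3: push the uniforms through each expert's marginal distribution
# (assumed lognormal relative-error marginals with different spreads).
expert1 = stats.lognorm(s=0.3).ppf(p[:, 0])
expert2 = stats.lognorm(s=0.5).ppf(p[:, 1])

# Marginals are preserved while the samples remain dependent.
print(np.corrcoef(expert1, expert2)[0, 1])  # roughly 0.6
```

The appeal of this construction, as the studies above note, is that the dependence structure is specified separately from each expert's marginal error distribution.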
Overall, identifying a likelihood function for expert probability assessments is considered one of the principal difficulties in applying the Bayesian approach. Recent research studies, such as Mosleh and Forrester (2005), indicate multiple attempts to tackle the problem of developing proper likelihood functions. A likelihood model in which each expert provides a normal distribution for the target parameter was developed by Winkler (1981) and studied by Winkler and Makridakis (1983), Clemen and Winkler (1985), Schmittlein et al. (1990), Chhibber and Apostolakis (1993), and Chandrasekharan et al. (1994). Difficulties with the axioms themselves are discussed by French (1985) and Genest and Zidek (1986). Lindley (1985) gives an example of the failure of both axioms. Genest and Zidek (1986), Winkler (1968), French (1985), and Lindley (1985) all ruled in favor of the Bayesian approach. The limited available evidence on the relative performance of combination methods suggests that simple averages often perform nearly as well as the theoretically superior Bayesian methods (Clemen and Winkler, 1999; Kallen and Cooke, 2002). A comprehensive review of the aggregation literature, including dependence, can be found in French (1985), Ouchi (2004), Genest and Zidek (1986), and French and Ríos Insua (2000).

Chapter 3: Data Collection and Characterization

3.1 Data Collection

Generally, in assessing uncertainty about an unknown of interest, information can come in the form of existing evidence about the unknown, evidence on the credibility of the expert's estimate, evidence on the applicability and relevance of the judgment, and data provided by the expert (Droguett and Mosleh). Experts provide qualitative information or quantitative estimates in the form of a probability distribution, point estimate, range, statement, or partial evidence of the unknown. Classically, data refers to a collection of organized information, often the result of experience, observation, or experiment. In this research, data is subjective information and refers to expert point estimates in discrete or continuous form. Estimates are generated by experts or produced by forecasting models using expert input, review, or final adjustment.

A data collection plan was first established to populate a database with a large number of expert estimates and their corresponding seed (calibration), target (acceptance criterion or specification), true (real), or observed (experimental) values in different disciplines. The search for evidence on expert accuracy began with a general survey of the literature, internet publications, books, and refereed and non-refereed sources. Additionally, a broad exploration of the relevant Dissertation Abstracts database was performed to identify work across expert judgment studies and disciplines. The wide literature search included databases such as EconPapers, Elsevier, PubMed, the IEEE Digital Library, the University of Maryland Digital Library, Medline, the TU Delft database, DOE's Information Bridge, the ACM Digital Library, WorldCat, the CE Database, and Waste Management Research Abstracts. Over 2000 sources and publications dating from the 1930s onward were initially flagged for general relevance. Of these sources, approximately 500 were selected. Each source was examined for significance to the elicitation and aggregation of expert judgment. Additionally, the TU Delft expert judgment database was used, which reports the assessments of over 800 experts on over 4000 variables, representing 80,000 elicited questions.
From the selected sources in this stockpile, over 1900 point estimates were collected in more than 60 different disciplines. In the next section, the data sources utilized in this research are introduced.

3.2 Description of Case Studies

In this section, a brief description of the case studies used as data sources is presented. An attempt is made to echo the objective of each case and convey any explanations or rationale offered by the authors to address the expert error.

3.2.1 Case #1

This study was conducted by the National Human Exposure Assessment Survey (NHEXAS) using the estimates of seven experts to obtain exposure assessments of residential ambient, residential indoor, and personal air benzene concentrations (µg/m³) in the U.S. Environmental Protection Agency's Region V, experienced by the nonsmoking, non-occupationally exposed population. These experts were selected by a peer nomination process. Individually elicited judgments were gathered from the experts during a 2-day workshop. (Walker, K. et al. Use of expert judgment in exposure assessment - Part 1. Characterization of personal exposure to benzene. Journal of Exposure Analysis and Environmental Epidemiology, 2003 (11):308-322, and Part 2. Calibration of expert judgments about personal exposures to benzene. Journal of Exposure Analysis and Environmental Epidemiology, 2003 (13):1-16)

3.2.2 Case #2

This study focuses on value-added forecasting. It claims that, due to internal politics, personal agendas, and financial performance requirements that skew the process, much of the management effort directed toward forecasting actually makes the forecast worse. (Gilliland, M. Is Forecasting a Waste of Time? Supply Chain Management Review, 2002)

3.2.3 Case #3

This article examines weather trends for eight locations in Kansas to determine the relationship between rainfall, yields, and farm income. Wheat, grain sorghum, corn, and soybean yields are predicted using yield prediction formulas and historical monthly precipitation. The predicted yields are then compared to the actual county average yield for a given crop and year. Data are obtained from Kansas Agricultural Statistics for the years 1970-2001 in Colby, Tribune, Garden City, Hays, Hutchinson, Manhattan, Ottawa, and Parsons counties. (Dumler, T. J. Rainfall and Farm Income. Risk and Profit Conference, 2003)

3.2.4 Case #4

This study lists the criteria for selecting an appropriate error measure in forecasts of hotel occupancy. The reported data are taken from a 166-room hotel in the midwestern United States. They contain two sets of figures, the predicted and the actual daily occupancies for the month of September 1996. The predicted figures are the combined product of expert predictions and the input of hotel managers based on their experience and expectations. (Schwartz, Z. Monitoring the Accuracy of Multiple Occupancy Forecasts)

3.2.5 Case #5

The objective of this study is to compare the clinical acumen in paediatric cardiovascular examination across various hospital paediatrician grades. Pre-echocardiography clinical diagnoses are compared with echocardiography results according to the grade of the referring hospital doctor (ranging from houseman to consultant). The results show that echocardiographers had the highest clinical accuracy and the most attempts at reaching a clinical diagnosis. Accuracy and attempts at diagnosis decreased as the doctor's hospital grade decreased, from consultant to houseman.
It is reported that the echocardiographers are the most accurate in the clinical detection of cardiac pathology, or its absence, because the echocardiographers have the greatest experience. It is stated that doctors with less paediatric cardiology exposure naturally experience more difficulty, and housemen or senior house officers attempted the fewest diagnoses. The study concludes that experienced doctors are more likely to differentiate between normal and abnormal hearts. (Spiteri, A., Torpiano, J., Bailey, M., Mercieca, V. & Grech, V. A comparison of clinical paediatric murmur assessment with echocardiography. Malta Medical Journal, November 2004, (16):4)

3.2.6 Case #6

A weather precipitation case study among expert meteorologists at the University of Maryland, College Park was performed. The objectives of the study were to predict the absolute percentage error (APE) of experts given their estimates and to determine the effect of expertise on expert performance. The study involved four experts who were asked to make 48-hour precipitation forecasts. In the field of meteorology, a 48-hour forecast of precipitation is considered moderately difficult and requires specialized skills. The forecasts were made on three different days for the cities of Orlando, Seattle, San Francisco, New Orleans, and Detroit. (Forrester, Y. 2005. The Quality of Expert Judgment: An Interdisciplinary Investigation. Weather precipitation research study among expert meteorologists at UMCP)

3.2.7 Cases #7, 8, 9, 10

This study describes an evaluation of forecasting model accuracy and induced demand representation over a 10-year period in an integrated land use and transportation model, the 2000 Sacramento MEPLAN model. It is reported that error may be due to a developer model with limited sensitivity, to a process set too low, or to large zones in the outer regions, which tend to underestimate the travel time. (Rodier, C. J. 2005. Verifying the accuracy of a land use model used in transportation and air quality planning: a case study in the Sacramento, California region. MTI Report 05-02)

3.2.8 Case #11

This article evaluates the labor force, employment by industry, and occupation projections that the BLS made in 1989 for the year 2000. The different causes of forecast errors, such as the participation rate, are reported. The results show that in most cases the accuracy of the BLS projections is comparable to estimates obtained from naïve extrapolative models and, hence, of low accuracy. (Stekler, H. O. & Thomas, R. Evaluating BLS Labor Force, Employment and Occupation Projections for 2000)

3.2.9 Case #12

The Bureau of Labor Statistics (BLS) has made labor force projections since the late 1950s. Beginning in 1968, the Bureau has not considered the projection process complete until it assesses the accuracy of its projections. This article examines the errors in the labor force projections to 1995 and the sources of those errors. The analysis compares projected and actual (most recent Current Population Survey estimate) levels of the labor force. The reported causes of error include immigration, the projection period, and participation by age, sex, and race. The analysis also shows that gradual improvement in the accuracy of the projections occurs over time. (Fullerton, H. N., BLS. Evaluating the 1995 BLS labor force projections)

3.2.10 Case #13

This study analyzes the accuracy of the United Nations'
(UN) population forecasts in the past, based on six Southeast Asian countries: Indonesia, Malaysia, Singapore, the Philippines, Thailand, and Vietnam. The study uses available projected and estimated age-structured data published by the UN from 1950 onwards. The study reveals that there are inconsistencies in the accuracy of the UN projections for different countries and that the errors are age-specific. The analysis also shows that gradual improvement in the accuracy of the projections occurs over time. The fluctuation in the amount of error is reported to be due to wrong assumptions made in various past projections. (Abdullah Khan, H. T. A Comparative Analysis of the Accuracy of the United Nations' Population Projections for Six Southeast Asian Countries. IR-03-015)

3.2.11 Cases #14 & 15

In this study, Census 2000 counts are used to measure forecast error in projections for April 1, 2000. The reported causes of error include up-and-down swings in population growth, projection outliers, and the forecast evaluation of the detailed demographic components. The analysis also shows that gradual improvement in the accuracy of the projections occurs over time. (Campbell, R. Evaluating Forecast Error in State Population Projections Using Census 2000 Counts. U.S. Bureau of the Census, Population Division Working Paper Series No. 57, 2002)

3.2.12 Case #16

In this article, a number of forecasts as well as actual data are provided for a monthly electric bill from January 1991 through December 2000 for educational purposes. The paper claims that the values provide a real dataset for applications ranging from simple graphical analysis through a variety of time series forecasting methods. (McLaren, C. H. & McLaren, B. J. 2003. Electric Bill Data. Journal of Statistics Education [Online], 11(1))

3.2.13 Case #17

This work involves forecasting the number of domestic and international airline passengers in Saudi Arabia. Annual data from 1975 to 1986 were used and categorized into 16 variables. The forecasts were obtained using the ModelQuest Miner package, using part of the historical data to develop the model before proceeding to an evaluation phase. The period used for developing the model for the number of passengers was 18 years, while the period used for evaluation was 6 years, for the five cities of Dhahran, Madina, Riyadh, Jeddah, and Taif in Saudi Arabia. (BaFail, A. O. Applying Data Mining Techniques to Forecast Number of Airline Passengers in Saudi Arabia, Domestic and International Travels. King Abdul Aziz University, 2004)

3.2.14 Case #18

These data were obtained from Dr. Ali Mosleh of the University of Maryland, Mechanical Engineering Department, Reliability Engineering Program, and report repair times for mechanical and electrical equipment. (Forrester, Y. The Quality of Expert Judgment: An Interdisciplinary Investigation, 2005)

3.2.15 Case #19

The case study contains experts' responses to 11 questions on adult weight management, together with a brief inquiry about the experts' expertise. The full set of expert attributes is used to predict the performance of the experts. A weight management survey instrument was administered to registered dieticians with varying degrees of expertise. Experts were given a clinical nutrition diagnostic problem regarding the recommended "very low calorie diet" for an obese girl and were asked to make a judgment about the maximum recommended kcal per day. (Forrester, Y.
The Quality of Expert Judgment: An Interdisciplinary Investigation, 2005)

3.2.16 Case #20

A) In this study, the Foodborne Illness Risk Ranking Model (FIRRM) is developed, a decision-making tool that quantifies and compares the relative burden to society of 28 food-borne pathogens. An expert elicitation survey was designed and implemented in which experts were asked to estimate, for each pathogen, the percentage of illnesses attributable to each food vehicle. The survey was developed, with the aid of Dr. Paul Fischbeck of Carnegie Mellon University, a recognized authority in the field of expert elicitation, using standard methodologies found in the literature (Morgan et al. 1990; Cooke 1991). The survey included 11 major pathogens and elicited uncertainty bounds around responses. The survey was sent to a peer-reviewed list of 101 scientists, public health officials, and food safety policy experts, and received 45 responses. The data include experts' best-judgment estimates of attribution percentages for Campylobacter and Listeria and outbreak data. (Batz, M. B., et al. Identifying the Most Significant Microbiological Food-borne Hazards to Public Health: A New Risk Ranking Model. Food Safety Research Consortium, Discussion Paper Series Number (1) - FIRRM Food Attribution Percentages for Illnesses from Foodborne Campylobacter and Listeria monocytogenes, 2004)

B) Hoffmann et al. develop a formal protocol for expert elicitation with large, cross-functional expert panels and use formal survey methods to take advantage of variation in individual expert uncertainty and inconsistency among experts as a means of quantifying and comparing sources of uncertainty about parameters of interest. The pool of respondents represents a broad range of workplaces; three respondents reported having significant work experience in multiple institutional settings, and the remainder were evenly distributed among government, academia, and industry. It is reported that experts' backgrounds and experiences, as well as self-reported pathogen expertise, help explain the variation in individual experts' ranges. Respondents who identify government as their primary career setting have tighter ranges than those whose careers have been primarily in academia, industry, or multiple sectors. Those with significant career experience in multiple sectors have the largest ranges, followed by those in industry, followed by academia. Highest degree also explains variation in range: those with master's degrees have the least confidence in their best estimates, and Doctors of Veterinary Medicine (DVMs) have the most. (Hoffmann, S., et al. Eliciting Information on Uncertainty from Heterogeneous Expert Panels: Attributing U.S. Foodborne Pathogen Illness to Food Consumption. RFF DP 06-17, April 2006)

3.2.17 Case #21

This research employed 11 experts who estimated an exposure parameter (the percentages of four nickel species) in 12 workplaces in the nickel primary production industry, providing a large dataset from which useful inferences can be drawn about the quality of expert judgments and the variability among the experts. It describes the application of Bayesian ideas to the comparison of expert opinions, mathematically combining expert opinions and refining these combined opinions with actual workplace measurements.
The study reports that expertise does not necessarily require intimate familiarity with the workplace; nevertheless, expert knowledge did enhance the quality of the combined expert judgment. (Ramachandran, G. et al. Expert Judgment and Occupational Hygiene: Application to Aerosol Speciation in the Nickel Primary Production Industry)

3.2.18 Case #22

The accuracy of cause-specific mortality assessment by physician review is reported in this article. Data are drawn from a multi-center validation study of 796 adult deaths that occurred in hospitals in Tanzania, Ethiopia, and Ghana. The study reveals that physician review shows high diagnostic accuracy. (Quigley, M. A., et al. Diagnostic accuracy of physician review, expert algorithms and data-derived algorithms in adult verbal autopsies. International Epidemiological Association, International Journal of Epidemiology, 1999 (28):1081-1087)

3.2.19 Case #23

In this article, four forecasts are evaluated for relative forecast accuracy by examining their performance over a specified period of time. The reported actual price data and individual forecast series extracted are quarterly observations on, and forecasts of, the USDA seven-market-average hog price for barrows and gilts (200-220 lb.) from the third quarter of 1973 through the second quarter of 1986. According to the article, the individual forecast data are an expert's forecasts: one-quarter-ahead cash price forecasts made by Glen Grimes, professor of Agricultural Economics. The futures forecast prices correspond directly to the expert forecasts. The futures forecast for each period is the closing price quoted in the annual Yearbook of the Chicago Mercantile Exchange for the day Grimes' forecast was published and for the contract expiring as close as possible to the end of the one-quarter lead time. The results of this study reveal that it would have been better for the analyst to use a composite forecast rather than attempting to identify a "best" individual value obtained from each of the forecasts. (McIntosh, S. & Bessler, A. Forecasting Agricultural Prices Using a Bayesian Composite Approach. Southern Journal of Agricultural Economics, December 1988)

3.2.20 Case #24

In this article, AEPCO and the University of Arizona, Department of Agricultural and Resource Economics (AREC), collaborated during the fall semester of 2005 on a project to improve forecasts of next-day electricity load, reported in megawatts. The project was conducted as part of an AREC graduate class in applied econometrics. Mr. Cathers of AEPCO developed a detailed proposal outlining specific objectives for improving forecast accuracy. Dr. Gary Thompson of the University of Arizona, AREC, agreed to coordinate the department's efforts and conduct the project in connection with his graduate course, Advanced Applied Econometrics. Students developed econometric models for forecasting next-day hourly load profiles. The particular econometric models developed are known as ARIMA (autoregressive, integrated, moving average) models. The paper concludes that existing methods using expert judgment appear to have been sufficiently accurate for AEPCO's current load levels, and it is thus suggested that AEPCO may continue to employ expert judgment methods while comparing its daily forecasts to those derived from statistical models. (Cathers, C. A. & Thompson, G. D. 2006. Forecasting Short-Term Electricity Load Profiles. Sierra Southwest Cooperative Services, Inc.
The University of Arizona, Cardon Research Papers)

3.2.21 Cases #25 & 26

The Tennessee Valley Authority produces its own forecasts of regional economic activity based on forecasts of the national economy developed by a forecasting service, Global Insight. These forecasts are publicly distributed throughout the Tennessee Valley. The reported data are the TVA five-year economic forecasts of gross product, in billions of dollars, from 1980 to 1995. It is stated in the study appendix that the improvement in regional economic forecast performance can be attributed, in part, to the better performance of the national forecasts and to improvements in the TVA economic forecasting process, including validation procedures. (Tennessee Valley Authority (TVA). Appendix B - Methodology and Results from Socioeconomic Modeling. Final Environmental Assessment)

3.2.22 Case #27

This paper considers a dilemma an analyst faces as an influential forecaster. It states that clients request an unbiased forecast, but pressures sometimes exist to provide a biased forecast. The impact of these pressures on the quality of forecasts is evaluated, and the different causes of error are reported, such as the difference between forecasting and decision-making or the lack of control over new product launches. (Ehrman, C. M. & Shugan, S. M. 1995. The Forecaster's Dilemma. Marketing Science, 14(2):123-127, Springer)

3.2.23 Case #28

Over the last fifteen years, the Delft University of Technology (both the Safety Science Group and the Department of Mathematics of TU Delft) has developed methods and tools to support the formal application of expert judgment. Over 800 experts assessed over 4000 variables, in total representing more than 80,000 elicited questions. Applications were made in a variety of sectors, such as the nuclear, chemical, and gas industries, toxicity of chemicals, external effects (pollution, waste disposal sites, inundation, volcano eruptions), the aerospace and aviation sector, the occupational sector, the health sector, and the banking sector. Expert judgment data were provided by Dr. R. M. Cooke in 2009. (Goossens, L. H. J.; Cooke, R. M.; Hale, A. R. & Rodić-Wiersma, Lj. Fifteen years of expert judgement at TU Delft. Safety Science, 46 (2008):234-244)

3.3 Data Characterization

In the expert judgment case studies where empirical data are collected, there is a range of reasons explaining the expert error, including, but not limited to, career affiliation, academic degree, field of expertise, and years of experience. The errors of expert estimates vary by subject matter as well. For example, the study conducted by Hoffmann, Fischbeck, Krupnick, and McWilliams (2006) shows that variability in best estimates differs by professional background and discipline as well as by expert characteristics. Respondents who identify government as their primary career setting have smaller ranges than those whose careers have been primarily in academia or industry; individuals with significant career experience in multiple sectors have the largest ranges, followed by those in industry and academia. For forecasts obtained by model, in addition to model inputs and assumptions, a range of reasons is listed to explain the forecast errors, such as the model type, the forecast period and projection horizon, the forecast accuracy measures used, additional information that becomes available, the size of the error, seasonal and geographical errors, and so on.
Overall, there are many factors affecting estimate accuracy, such as expert attributes, the calibration method, decision processes, the aggregation procedure, and so on. Inconsistencies caused by these elements are accepted as inherent variation in the modeling and assessment processes in this research study. The purpose is to capture actual errors (though the sources of these fallacies remain unknown) and examine the formulated likelihood functions in dealing with these variations. This is especially true for the generic likelihood function, which is domain-independent but is built from a pool of data from different fields.

A question may arise as to how to draw conclusions from information without any boundaries. It should be noted that there are circumstances in which an expert's previous performance is not fully known. There are also events that are beyond our direct experience. In these cases, decision makers are indeed puzzled about the quality level of the opinion, or in other words, the degree of confidence to place in the judgment. The generic likelihood function developed in this study can justly be used to update the expert estimate when facing such a lack of information.

3.4 Selection of Forecast Accuracy Measure

According to Armstrong and Fildes (1995), the objective of a forecast accuracy measure is to provide information about the error distribution. It has been shown by Chen and Yang (2004) that the Mean Square Error (MSE) is the optimal selection when the errors are normally distributed. However, MSE and similar measures are not suitable for this study since they are not unit-free. Absolute performance measures, such as the simple difference between the estimate and the true value, may produce very large numbers due to outliers, which can make the comparison of different estimates infeasible. It is generally accepted that there is no single best accuracy measure, and selecting an assessment method is essentially a subjective decision.

Figure 1 depicts the logic of selecting the forecast accuracy measure in this research. As reflected in this diagram, general and specific provisions were first defined. Among the most popular measures listed, the relative error measure was chosen since it is scale-independent, interpretable, minimally impacted by outlier observations or errors, and able to eliminate the bias introduced by possible trends and seasonal components. Among the relative error candidates, the simplest form was selected since it seemed able to satisfy the majority of the established requirements while being easy enough for numerical calculations:

\[ E = \frac{u'}{u} \]    (Equation 1)

where \(u\) is the quantity of interest, \(u'\) is the expert opinion, and \(E\) is the relative error.

[Figure 1. Process of Selecting Forecast Accuracy Measure (flowchart): START → identify general requirements (1. meaningful, 2. reliable, 3. accurate, 4. valid, 5. easy to use, 6. sensitive) → identify specific requirements (1. measurement to a reference point, 2. scale independence (unit-free), 3. minimum trend impact, 4. minimum seasonal-component impact, 5. minimum outlier impact, 6. simple for calculations) → identify forecast accuracy measures meeting the requirements: mean error \(ME = \frac{1}{n}\sum_{i=1}^{n}(u_i - u'_i)\); mean square error \(MSE = \frac{1}{n}\sum_{i=1}^{n}(u_i - u'_i)^2\); mean absolute deviation \(MAD = \frac{1}{n}\sum_{i=1}^{n}\lvert u_i - u'_i \rvert\); mean absolute percentage error \(MAPE = \frac{1}{n}\sum_{i=1}^{n}\left\lvert \frac{u_i - u'_i}{u_i} \right\rvert \times 100\); relative error \(E = \frac{u'_i}{u_i}\), where \(u_i\) is the actual value and \(u'_i\) is the forecast value → decision point: select the measure meeting most of the established requirements → END.]
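To make the candidate measures concrete, the following Python sketch (illustrative only; the actual and forecast values are hypothetical) evaluates each measure on a small set of estimates:

```python
import numpy as np

u = np.array([85.0, 60.0, 72.0])       # actual values u_i (hypothetical)
u_hat = np.array([78.0, 66.0, 70.0])   # expert forecasts u'_i (hypothetical)

err = u - u_hat
me = err.mean()                        # ME: sign-sensitive, unit-dependent
mse = (err ** 2).mean()                # MSE: penalizes outliers, unit-dependent
mad = np.abs(err).mean()               # MAD: unit-dependent
mape = np.abs(err / u).mean() * 100    # MAPE: unit-free percentage
E = u_hat / u                          # relative error of Equation 1, per estimate

print(me, mse, mad, mape)
print(E)  # e.g., 78/85 ~ 0.92 for the first forecast
```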
The unit of measurement has minimal impact on the development of the domain-independent (generic) likelihood function. For illustration purposes, suppose an expert predicts tomorrow's temperature to be 78°F. If the observed temperature turns out to be 85°F, the relative error of the estimate is 78/85 ≈ 0.92. One may argue that the same prediction expressed in Celsius (25.6°C predicted versus 29.4°C observed) yields a relative error of about 0.87. Therefore, the unit of measurement still plays a role in the calculations despite the fact that the dimensionless relative error is selected as the accuracy measure. However, it should be noted, first, that the impact of this type of error on the desired outcome for which it is used in this research (distribution identification of expert relative errors) fades as the population of data grows larger. Additionally, an expert should have the same accuracy in predicting the same unknown in different measuring systems; in the temperature example, the expert's relative error should be the same in both the Fahrenheit and Celsius systems, because the expert's knowledge or expertise (or any other attribute that qualifies one as an expert) does not change from one measuring system to another. Even if estimates do change across measuring systems, these kinds of inconsistencies are accepted as inherent variation in the modeling and assessment processes, and they serve to test the capability and robustness of the formulated likelihood functions in tolerating variations.
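A quick check of the unit-dependence point above (a minimal sketch; the conversion is standard and the numbers are those of the example):

```python
def f_to_c(f):
    """Convert Fahrenheit to Celsius."""
    return (f - 32.0) * 5.0 / 9.0

predicted_f, observed_f = 78.0, 85.0
print(predicted_f / observed_f)                   # ~0.92 relative error on the Fahrenheit scale
print(f_to_c(predicted_f) / f_to_c(observed_f))   # ~0.87 on the Celsius scale: the offset matters
```

Only measurement scales that share a zero point preserve the ratio exactly, which is consistent with the author's decision to treat such discrepancies as inherent variation.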
Chapter 4: Bayesian Formalism

4.1 Introduction
Conceptually, the formulation of the Bayesian method for the use of expert opinion is quite simple. The expert estimate is treated as a piece of evidence about the unknown quantity of interest. This evidence is then used to update the analyst's or decision-maker's own (prior) knowledge through Bayes' theorem:

$$\pi(u \mid u') = \frac{L(u' \mid u)\,\pi_0(u)}{\int L(u' \mid u)\,\pi_0(u)\,du}$$  (Equation 2)

where:
$u$: the quantity of interest;
$u'$: the set of the experts' opinions;
$\pi_0(u)$: the decision-maker's prior or initial state of knowledge about the unknown quantity $u$ (prior to obtaining the opinion of the experts). Prior distributions are used to describe the uncertainty surrounding the unknown;
$L(u' \mid u)$: the likelihood of the evidence $u'$ given that the true value of the unknown quantity is $u$. The likelihood function asks: if the true value is $u$, what is the probability that the expert estimates it as $u'$? As such, the likelihood function is a statement about the accuracy and credibility of the expert as viewed by the decision-maker;
$\pi(u \mid u')$: the posterior distribution representing the decision-maker's updated state of knowledge about the unknown quantity $u$. After observing the data (in this case, expert opinion), the posterior distribution provides a coherent post-data summary of the remaining uncertainty.

The first formal framework of the Bayesian methods for the use of expert opinion was presented by Morris (1974, 1977). Morris's work fully establishes the foundations of the Bayesian paradigm in the analysis of expert judgment. Building on Morris's method, Mosleh and Apostolakis (1986) proposed the use of "additive" and "multiplicative" error models for constructing the likelihood functions, expressing an expert's assessment as the sum (or ratio) of the true value of the unknown and an "error" term. Mathematically speaking:

1) Additive error model: $u' = u + E$  (Equation 3)
2) Multiplicative error model: $E = u'/u$  (refer to Equation 1)

Still, the main difficulty in applying the Bayesian technique remains the complications associated with developing a suitable likelihood function. This distribution is a probabilistic model for the data and must capture the interrelationships among the estimates and the unknown of interest. In particular, it must account for the bias of the individual estimate, represent the expert's expertise, and be able to model dependencies among experts.

4.2 Governing Model
The prior knowledge of $u$ is updated using the likelihood function developed from relative errors. The error distribution can be characterized in terms of a finite set of parameters ($\theta$), whose epistemic uncertainty is itself described by a population variability distribution $g(\theta)$. Using the likelihood-averaging technique:

$$L(u' \mid u, E) = \int_{\theta} L(u' \mid u, \theta)\, g(\theta \mid E)\, d\theta$$  (Equation 4)

Applying Equation 2:

$$\pi(u \mid u', E) = \frac{\left[\int_{\theta} L(u' \mid u, \theta)\, g(\theta \mid E)\, d\theta\right] \pi_0(u)}{\int_{u} \left[\int_{\theta} L(u' \mid u, \theta)\, g(\theta \mid E)\, d\theta\right] \pi_0(u)\, du}$$  (Equation 5)

where $u$ is the quantity of interest, $u'$ is the expert estimate, $E = (E_1 \ldots E_n)$ is the evidence, i.e., the relative errors of past estimates, and $\theta = (\theta_1 \ldots \theta_n)$ denotes the parameters of the error distribution.

In the next sections, the likelihood functions and posterior distributions are constructed for homogenous, nonhomogenous, and hybrid pools of data. The hybrid or mixed case has been formulated only in this research.

4.3 Construction of the Likelihood and Posterior: Homogenous Pool
As represented in Table 1 and illustrated in Figure 2, the available information regarding the quantity of interest ($u$) comprises the experts' estimates ($u'_1 \ldots u'_n$) and evidence in the form of errors of estimates ($E_1 \ldots E_n$). The overall distribution of the errors of estimates, $f(E)$, can be characterized in terms of a finite set of parameters. Postulating a lognormal distribution with $\theta = (E_{50}, \sigma_E)$, where $E_{50}$ is the median and $\sigma_E$ the (logarithmic) standard deviation of the error distribution:

$$f(E) = \frac{1}{E\,\sigma_E \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln E - \ln E_{50}}{\sigma_E}\right)^2\right]$$  (Equation 6)

The probability distribution of the errors also represents the likelihood of the errors given the distribution parameters. Assuming independence among experts:

$$L(E_1 \ldots E_n \mid E_{50}, \sigma_E) = \prod_{i=1}^{n} \frac{1}{E_i\,\sigma_E \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln E_i - \ln E_{50}}{\sigma_E}\right)^2\right]$$  (Equation 7)

Estimating the set of likelihood parameters:

$$\pi(E_{50}, \sigma_E \mid E_1 \ldots E_n) = \frac{L(E_1 \ldots E_n \mid E_{50}, \sigma_E)\, \pi_0(E_{50}, \sigma_E)}{\int_{\sigma_E}\int_{E_{50}} L(E_1 \ldots E_n \mid E_{50}, \sigma_E)\, \pi_0(E_{50}, \sigma_E)\, dE_{50}\, d\sigma_E}$$  (Equation 8)

The term $\pi_0(E_{50}, \sigma_E)$ is the prior, which is assumed to be lognormal as well. A generic likelihood function, $f(E)$, can be formulated by de-conditioning the posterior (Equation 8) using:

$$f(E) = \int_{\sigma_E}\int_{E_{50}} f(E \mid E_{50}, \sigma_E)\, \pi(E_{50}, \sigma_E \mid E_1 \ldots E_n)\, dE_{50}\, d\sigma_E$$  (Equation 9)

To construct the likelihood function $L(u' \mid u)$ from the likelihood of relative errors $L(E \mid u)$, the relation between the distribution of relative errors, $f(E)$, and the distribution of estimates, $f(u')$, must be established:

$$E = \frac{u'}{u} \;\Rightarrow\; dE = \frac{du'}{u} \;\Rightarrow\; \frac{dE}{du'} = \frac{1}{u}$$  (Equation 10)

$$f(u')\,du' = f(E)\,dE \;\Rightarrow\; f(u') = f(E)\,\frac{dE}{du'}$$  (Equation 11)

$$f(u') = \frac{1}{u}\,f(E)$$  (Equation 12)

Therefore the likelihood function $L(u' \mid u)$ can be linked to $L(E \mid u)$ as:

$$L(u' \mid u, \theta) = \frac{1}{u}\,L(E \mid u) = \frac{1}{u}\cdot\frac{1}{E\,\sigma_E\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{\ln E - \ln E_{50}}{\sigma_E}\right)^2\right] = \frac{1}{u'\,\sigma_E\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u' - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right]$$  (Equation 13)
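To make Equation 13 concrete, the sketch below updates a flat prior with a single expert estimate on a discretized grid (a minimal numerical illustration assuming fixed likelihood parameters; the grid bounds and parameter values are illustrative, not from the dissertation):

```python
import numpy as np

# Illustrative (assumed) likelihood parameters: median error E50 and log-std sigma_E.
E50, sigma_E = 1.0, 0.5
u_prime = 8.6                      # the expert's estimate of the unknown u

u = np.linspace(0.1, 30.0, 5000)   # grid over the unknown quantity of interest
prior = np.ones_like(u)           # flat (noninformative) prior, as used in Chapter 5

# Equation 13: lognormal relative-error likelihood L(u' | u, theta).
like = (1.0 / (u_prime * sigma_E * np.sqrt(2 * np.pi))) * np.exp(
    -0.5 * ((np.log(u_prime) - np.log(u) - np.log(E50)) / sigma_E) ** 2)

post = like * prior
post /= np.trapz(post, u)          # Equation 2: normalize to obtain pi(u | u')

post_mean = np.trapz(u * post, u)
post_median = u[np.searchsorted(np.cumsum(post) / np.sum(post), 0.5)]
print(post_mean, post_median)      # posterior markers, to be compared with the true value
```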
The above expression is the first term in Equation 4. The epistemic uncertainty of $\theta$ is estimated as:

$$g(\theta \mid E) = \frac{L(E \mid \theta)\,\pi_0(\theta)}{\int_{\theta} L(E \mid \theta)\,\pi_0(\theta)\,d\theta}$$  (Equation 14)

where

$$L(E \mid \theta) = \prod_{i=1}^{n} L(E_i \mid \theta)$$  (Equation 15)

The new expert estimate can now be updated using Equation 5. The mean or median of the posterior, both shown with the symbol (μ) in the figures as the distribution marker, is compared with the true value (via μ/u) in order to determine whether, and by how much, the formulated likelihood function has been able to reduce the error of the estimates. This process is depicted in Figure 3.

Table 1. Representation of Homogenous Data

  Estimate (i = 1...n) | True Value | Expert's Error (E_i = u'_i / u)
  u'_1                 | u          | E_1
  u'_2                 | u          | E_2
  ...                  | ...        | ...
  u'_n                 | u          | E_n

Figure 2. Construction of Likelihood Functions for Homogenous Data (the estimates u'_1...u'_n yield the error likelihood ∏ L(E_i | θ), the parameter posterior π(θ | E_1...E_n), and the estimate likelihood L(u' | u, θ))

Figure 3. Treatment of Homogenous Data (a new estimate u' is updated by Equation 5 using (a) the likelihood function formulated from relative errors and (b) the generic likelihood function; the posterior marker μ is compared with the true value u, i.e., the model error μ/u versus the expert error u'/u)

4.4 Construction of Likelihood and Posterior: Nonhomogenous Pool
As represented in Table 2 and Figure 4, the information regarding the true values ($u_1 \ldots u_n$) comprises the experts' estimates ($u'_1 \ldots u'_n$). The error distribution can be characterized in terms of a finite set of parameters ($\theta$), which is itself a variable described by a population variability distribution $g(\theta)$. This "hyper" distribution can be characterized by a set of "hyper-parameters" ($\phi$):

$$\phi = (\phi_1 \ldots \phi_n)$$  (Equation 16)

$$g(\theta) = g(\theta \mid \phi)$$  (Equation 17)

The likelihood function for the data point ($u'_i$, $u_i$), and therefore for $E_i$, is estimated by eliminating the epistemic uncertainty over $\theta$:

$$L(E_i \mid \phi) = \int_{\theta} L(E_i \mid \theta)\, g(\theta \mid \phi)\, d\theta$$  (Equation 18)

Under the assumption of independence among experts:

$$L(E \mid \phi) = \prod_{i=1}^{n} \int_{\theta} L(E_i \mid \theta)\, g(\theta \mid \phi)\, d\theta$$  (Equation 19)

Estimating the "hyper-parameters" using the likelihood function $L(E \mid \phi)$:

$$\pi(\phi \mid E) = \frac{\left[\prod_{i=1}^{n} \int_{\theta} L(E_i \mid \theta)\, g(\theta \mid \phi)\, d\theta\right] \pi_0(\phi)}{\int_{\phi} \left[\prod_{i=1}^{n} \int_{\theta} L(E_i \mid \theta)\, g(\theta \mid \phi)\, d\theta\right] \pi_0(\phi)\, d\phi}$$  (Equation 20)

The posterior expected distribution $g(\theta \mid E)$ is estimated by eliminating the aleatory uncertainty over $\phi$:

$$g(\theta \mid E) = \int_{\phi} g(\theta \mid \phi)\, \pi(\phi \mid E)\, d\phi$$  (Equation 21)

The new expert estimate can be updated using the general Bayesian procedure:

$$\pi(u \mid u', E) = \frac{\left[\int_{\theta} g(\theta \mid E)\, L(u' \mid u, \theta)\, d\theta\right] \pi_0(u)}{\int_{u} \left[\int_{\theta} g(\theta \mid E)\, L(u' \mid u, \theta)\, d\theta\right] \pi_0(u)\, du}$$  (Equation 22)

Table 2. Representation of Non-Homogenous Data

  Estimate (i = 1...n) | True Value (i = 1...n) | Expert's Error (i = 1...n)
  u'_1                 | u_1                    | E_1
  u'_2                 | u_2                    | E_2
  ...                  | ...                    | ...
  u'_n                 | u_n                    | E_n

Figure 4. Treatment of Non-Homogenous Data (the hyper-distribution g(θ) = g(θ | φ) leads to the likelihood of Equation 19, the hyper-posterior π(φ | E), the expected distribution g(θ | E) of Equation 21, and the update of Equation 22)

It can be shown that the homogenous pool is a special case of the nonhomogenous pool, arising when the evidence provides perfect knowledge of the parameter set $\theta$; the error distribution parameters then have no aleatory variability. The distribution $g(\theta \mid \phi)$ becomes a Dirac delta function, and hence, in Equation 20:

$$\pi(\phi \mid E) = \frac{\left[\prod_{i=1}^{n} \int_{\theta} L(E_i \mid \theta)\, \delta(\theta - \phi)\, d\theta\right] \pi_0(\phi)}{\int_{\phi} \left[\prod_{i=1}^{n} \int_{\theta} L(E_i \mid \theta)\, \delta(\theta - \phi)\, d\theta\right] \pi_0(\phi)\, d\phi}$$  (Equation 23)

Since for the Dirac delta function we have

$$\int f(x)\, \delta(x - x_0)\, dx = f(x_0)$$  (Equation 24)

Equation 23 becomes:

$$\pi(\phi \mid E) = \frac{\left[\prod_{i=1}^{n} L(E_i \mid \phi)\right] \pi_0(\phi)}{\int_{\phi} \left[\prod_{i=1}^{n} L(E_i \mid \phi)\right] \pi_0(\phi)\, d\phi}$$  (Equation 25)

From Equation 21 we have:

$$g(\theta \mid E) = \int_{\phi} \delta(\theta - \phi)\, \frac{\left[\prod_{i=1}^{n} L(E_i \mid \phi)\right] \pi_0(\phi)}{\int_{\phi} \left[\prod_{i=1}^{n} L(E_i \mid \phi)\right] \pi_0(\phi)\, d\phi}\, d\phi$$  (Equation 26)

Applying Equation 24 yields the same expression as for homogenous data:

$$g(\theta \mid E) = \frac{\left[\prod_{i=1}^{n} L(E_i \mid \theta)\right] \pi_0(\theta)}{\int_{\theta} \left[\prod_{i=1}^{n} L(E_i \mid \theta)\right] \pi_0(\theta)\, d\theta}$$  (Equation 27)
4.5 Construction of Likelihood and Posterior: Hybrid Pool
In the case of mixed or hybrid data, for each instance $k = 1 \ldots N$, estimate $i = 1 \ldots M_k$ of the true value $u_k$ is $u'_{ki}$, representing evidence $E_{ki}$. Therefore, as represented in Table 3, the relative error term has two dimensions $(i, k)$ to cover all $k$ instances:

$$\pi(\phi \mid E) = \frac{\left\{\prod_{k=1}^{N} \prod_{i=1}^{M_k} \int_{\theta} L(E_{ik} \mid \theta)\, g(\theta \mid \phi)\, d\theta\right\} \pi_0(\phi)}{\int_{\phi} \left\{\prod_{k=1}^{N} \prod_{i=1}^{M_k} \int_{\theta} L(E_{ik} \mid \theta)\, g(\theta \mid \phi)\, d\theta\right\} \pi_0(\phi)\, d\phi}$$  (Equation 28)

$$g(\theta \mid E) = \int_{\phi} g(\theta \mid \phi)\, \frac{\left\{\prod_{k=1}^{N} \prod_{i=1}^{M_k} \int_{\theta} L(E_{ik} \mid \theta)\, g(\theta \mid \phi)\, d\theta\right\} \pi_0(\phi)}{\int_{\phi} \left\{\prod_{k=1}^{N} \prod_{i=1}^{M_k} \int_{\theta} L(E_{ik} \mid \theta)\, g(\theta \mid \phi)\, d\theta\right\} \pi_0(\phi)\, d\phi}\, d\phi$$  (Equation 29)

As can be seen, the homogenous and nonhomogenous cases are special cases of the mixed pool. For example, Equation 28 reduces to Equation 20 when for each true value we have only one estimate, that is, when $M_k = 1$ for all $k$.

Table 3. Representation of Hybrid Data

  Cases (k = 1...N) | Estimates (i = 1...M_k)                   | True Value | Expert's Error (E_ki = u'_ki / u_k)
  1                 | u'_{1,1}, u'_{1,2}                        | u_1        | E_{1,1}, E_{1,2}
  2                 | u'_{2,3}, u'_{2,4}, u'_{2,5}, u'_{2,6}    | u_2        | E_{2,3}, E_{2,4}, E_{2,5}, E_{2,6}
  3                 | u'_{3,7}                                  | u_3        | E_{3,7}
  ...               | ...                                       | ...        | ...
  N                 | u'_{N,1}, ..., u'_{N,M_k}                 | u_N        | E_{N,1}, ..., E_{N,M_k}

Chapter 5: Data-Informed Calibration of Expert Opinions

5.1 Introduction
The objective of this chapter is data-driven expert calibration within the Bayesian formalism. Calibration is defined as the degree of agreement between the estimates of an event and its actual occurrence value. In some fields, experts have been shown to make relatively well-calibrated judgments; the typical example is meteorology (Murphy and Winkler, 1977). In contrast, financial analysts have been shown to significantly overestimate corporate earnings growth (Chatfield et al., 1989; Dechow and Sloan, 1997). Hawkins and Evans (1989) found that industrial hygienists provided reasonably accurate estimates of the mean and 90th percentile of the distribution of personal exposure of chemical-industry workers.

An investigation of several practical questions regarding the calibration of expert judgment is conducted using empirical data. The objectives are to:
[1] measure the uncertainty surrounding the unknown of interest in the Bayesian framework, given an expert estimate;
[2] formulate a "generic" likelihood function based on a large number of observed expert relative errors in different domains;
[3] explore whether use of the generic likelihood would reduce future prediction errors; and
[4] compare the performance of the posterior mean and median in reducing the overall errors of experts when using the generic likelihood distribution.
5.2 Methodology
Likelihood functions for homogenous, nonhomogenous, and hybrid data were developed in Chapter 4. The further steps taken to conduct the study in this chapter are:
I. Descriptive statistics of the empirical errors are produced to quantitatively summarize the data.
II. Relative errors are fitted to matching probability distributions to select the form of the likelihood function.
III. A generic error likelihood distribution for use in Bayesian assessment of expert opinion is developed using the empirical data.
IV. The Bayesian method is employed to update the expert estimate using:
 i. the case-specific likelihood function;
 ii. the domain-independent or generic likelihood function.

To perform the analyses, flat or noninformative priors are used. This approach can provide a basis for defining the knowledge or expertise of information sources (in the matter of estimating the true value) relative to the analyst. Additionally, if the decision-maker or analyst believes, as would normally be the case in consulting experts, that prior information should have little or no impact on the posterior, a noninformative prior on the true value is a proper modeling choice (Edwards, 1963).

In the Bayesian method, the posterior marker or estimator is compared with the true value to assess the error of the updated estimate. According to Christensen and Huffman (1985), the most often used posterior markers have been the mean, median, and mode, with no consensus among experts on which is the most appropriate. Barnett (1982) believes that there is no more useful criterion for choosing a single value than to use the most likely value, unless further information on the consequences of an incorrect choice is incorporated. Berger (1980) states that the mean and median are often better values than the mode. According to Cox and Hinkley (1974), if it is required to summarize the posterior distribution in a single quantity, the mean is frequently the most sensible; in particular, if the prior density is exactly or approximately constant, the use of the mean of the likelihood function with respect to the parameter is indicated.

For illustration purposes, step-by-step numerical calculations of posterior markers are presented for an example of each data type. The steps of the numerical execution of the first part are depicted in Figure 5.

Figure 5. Process Flow of Bayesian Treatment

5.3 Performance Assessment of Case-Specific Likelihood Functions
Assessment in the Bayesian framework was performed with the "Uncertainty Modeling" software released and validated by The Center for Risk and Reliability (CRR), Droguett and Mosleh (2003). Evaluation of the data included generation of descriptive statistics and distribution analysis with MathWave EasyFit and MINITAB.

Table 4 shows the empirical data reported in the benzene concentration case study (case #1), used as an example of a homogenous pool. Bayesian treatment of the homogenous data improves 62% of the estimates on average. For nonhomogenous data, an example can be found in Table 5 (case #1); here the percentage of improved estimates increases to 71%. Case #1 is also used as an example of hybrid data; the percentage of improved expert estimates is 71%, as shown in Table 6.

The histogram of the relative errors of the two homogenous and nonhomogenous example cases, Figure 6, shows that over 57% of the relative errors of estimates lie between 0.5 and 0.8, and about 71% of the data points fall between 0.5 and 1.0. The average of the relative errors is 1.3 with a standard deviation of 0.5. Figure 7 shows the best-fitted distributions to all of these relative errors. Considering a producer's risk of 5% (α = 0.05), lognormal is among the top three fitted distributions.
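The distribution analysis was done with EasyFit and MINITAB; the following Python sketch shows an equivalent check (a minimal illustration, not the author's tooling), exploiting the fact that if the relative errors E are lognormal then ln E is normal, so a normality test can be applied to the log-errors:

```python
import numpy as np
from scipy import stats

# Relative errors E_i = u'_i / u from the homogenous example in Table 4.
E = np.array([1.083, 0.889, 1.278, 2.167, 1.611, 0.889, 1.028,
              0.764, 0.861, 0.903, 2.250, 2.167, 1.556, 0.833,
              1.853, 0.933, 1.147, 1.493, 2.893, 1.613, 1.053])

# Fit a two-parameter lognormal: shape s = sigma_E, scale = E50.
s, loc, scale = stats.lognorm.fit(E, floc=0)
print(f"E50 = {scale:.2f}, sigma_E = {s:.2f}")

# Anderson-Darling normality test on ln(E), compared at alpha = 0.05 (producer's risk).
result = stats.anderson(np.log(E), dist='norm')
print(result.statistic, result.critical_values)
```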
Table 4. Bayesian Treatment of Homogenous Pool Using Case-Specific Likelihood Function

  True Value | Expert Estimate | Expert Relative Error | Bayesian Mean | Bayesian Mean Relative Error | Error Reduced (+) / Increased (-) / No Change (0)
  3.6 | 3.9  | 1.083 | 3.3  | 0.917 | 0
  3.6 | 3.2  | 0.889 | 3.1  | 0.861 | -
  3.6 | 4.6  | 1.278 | 3.3  | 0.917 | +
  3.6 | 7.8  | 2.167 | 3.8  | 1.056 | +
  3.6 | 5.8  | 1.611 | 4.8  | 1.333 | +
  3.6 | 3.2  | 0.889 | 3.1  | 0.861 | -
  3.6 | 3.7  | 1.028 | 3.5  | 0.972 | 0
  7.2 | 5.5  | 0.764 | 5.1  | 0.708 | -
  7.2 | 6.2  | 0.861 | 5.4  | 0.750 | -
  7.2 | 6.5  | 0.903 | 6.7  | 0.931 | +
  7.2 | 16.2 | 2.250 | 8.7  | 1.208 | +
  7.2 | 15.6 | 2.167 | 9.5  | 1.319 | +
  7.2 | 11.2 | 1.556 | 10.8 | 1.500 | +
  7.2 | 6.0  | 0.833 | 7.5  | 1.042 | +
  7.5 | 13.9 | 1.853 | 11.5 | 1.533 | +
  7.5 | 7.0  | 0.933 | 6.4  | 0.853 | -
  7.5 | 8.6  | 1.147 | 6.5  | 0.867 | +
  7.5 | 11.2 | 1.493 | 5.8  | 0.773 | +
  7.5 | 21.7 | 2.893 | 7.9  | 1.053 | +
  7.5 | 12.1 | 1.613 | 8.3  | 1.107 | +
  7.5 | 7.9  | 1.053 | 9.2  | 1.227 | -
  Average: 1.394 (expert), 1.038 (Bayesian); standard deviation: 0.587, 0.237; estimates improved: 62% (13 out of 21)

Table 5. Bayesian Treatment of Non-Homogenous Pool Using Case-Specific Likelihood Function

  True Value | Expert Estimate | Expert Relative Error | Bayesian Mean | Bayesian Relative Error | Error Reduced (+) / Increased (-)
  90  | 115 | 1.278 | 111 | 1.233 | +
  110 | 95  | 0.864 | 93  | 0.845 | -
  90  | 95  | 1.056 | 91  | 1.011 | +
  105 | 110 | 1.048 | 93  | 0.886 | -
  100 | 115 | 1.150 | 101 | 1.010 | +
  115 | 125 | 1.087 | 120 | 1.043 | +
  130 | 145 | 1.115 | 134 | 1.031 | +
  Average: 1.085, 1.008; standard deviation: 0.125, 0.125; estimates improved: 71% (5 out of 7)

Figure 6. Histogram of Accumulated Homogenous and Nonhomogenous Data (frequency and cumulative percentage of relative errors in bins at 0.8, 1.2, 1.6, 2.0, 2.5, and above)

Table 6. Bayesian Treatment of Hybrid Pool Using Case-Specific Likelihood Function

  True Value | Expert Estimate | Expert Relative Error | Bayesian Update | Bayesian Relative Error | Error Reduced (+) / Increased (-)
  3.6 | 3.9  | 1.083 | 5.0 | 1.389 | -
  3.6 | 5.8  | 1.611 | 4.6 | 1.278 | +
  7.5 | 13.9 | 1.853 | 4.9 | 0.653 | +
  7.5 | 21.7 | 2.893 | 8.7 | 1.160 | +
  7.5 | 8.6  | 1.147 | 7.3 | 0.973 | +
  7.2 | 5.5  | 0.764 | 5.4 | 0.750 | -
  7.2 | 11.2 | 1.556 | 8.0 | 1.111 | +
  Average: 1.558, 1.045; standard deviation: 0.695, 0.270; estimates improved: 71% (5 out of 7)

Figure 7. Distribution Identification for Accumulated Expert Relative Errors (probability plots with 95% CI; goodness-of-fit: 3-Parameter Weibull AD = 0.335, p > 0.500; Lognormal AD = 0.881, p = 0.021; Loglogistic AD = 0.794, p = 0.021; 2-Parameter Exponential AD = 0.491, p > 0.250; descriptive statistics (MINITAB): N = 34, mean = 1.30, StDev = 0.50, median = 1.08, minimum = 0.76, maximum = 2.89)

The courses of action demonstrated above for the example case studies were repeated for all of the empirical expert judgment data collected (1,922 data points). As reflected in Table 7, the study reveals that on average 77% of estimates improved when the case-specific homogenous and nonhomogenous likelihood functions were applied; a graphical presentation can be found in Figure 8. The histogram of expert relative errors depicted in Figure 9 shows that over 45% of relative errors are equal or close to one (expert estimate ≈ true value), about 45% of the data points fall between 1 and 2, and about 5% fall in the range of 2 to 3. The average relative error is 1.2, and only 5% of all empirical relative errors are greater than 3.
Table 8 shows the best-fitted probability distributions for the relative errors, considering a producer's risk of 5% (α = 0.05). Lognormal is among the top-fitting distributions; it arises when independent random variables are combined in a multiplicative fashion, as is the case for the relative error E selected as the accuracy measure. The distribution-fitting tests also point to the Wakeby and Cauchy distributions as the two best fits. This seems logical since they are also ratio-related distributions: the random variable associated with a ratio distribution arises as the proportion of two Gaussian-distributed variables with zero mean (the Cauchy distribution is also called the normal ratio distribution). The other best fits are the Log-Logistic, Burr, and Dagum distributions, which are continuous probability distributions for a nonnegative random variable. The Pearson distribution is a fit since it can visibly accommodate skewed observations.

Among the distributions discussed above, lognormal seems the better choice for Bayesian models due to its ease of use, flexibility in fitting many types of data, widespread application in many fields (e.g., environmental applications of the lognormal distribution, Ashok et al., 1997), and great utility in decision science (Johnson et al., 2003). Johnson et al. note that some practitioners maintain "that the lognormal distribution is as fundamental as the normal distribution" and that the lognormal distribution has found applications in fields including the physical sciences, life sciences, social sciences, and engineering, yet "practitioners find few - if any - tables of its cumulative distribution function available to support their work". Additionally, the distribution of the data appears positively skewed and confined to non-negative values, suggesting further reasons to select the lognormal distribution. The lognormal (3P) distribution of expert relative error is depicted in Figure 10.

Table 7. Bayesian Treatment of Non-Homogenous (NH), Homogenous (H), and Hybrid Pools Using Case-Specific Likelihood Function

  Case # | Pool | % Estimates Improved
  1  | H  | 62%
  2  | NH | 71%
  3  | NH | 100%
  4  | NH | 100%
  5  | NH | 71%
  6  | NH | 67%
  7  | NH | 67%
  8  | NH | 67%
  9  | NH | 100%
  10 | NH | 71%
  11 | NH | 100%
  12 | NH | 57%
  13 | NH | 71%
  14 | NH | 100%
  15 | NH | 86%
  16 | NH | 100%
  17 | NH | 57%
  18 | NH | 86%
  19 | H  | 80%
  20 | NH | 57%
  21 | NH | 86%
  22 | NH | 57%
  23 | NH | 86%
  24 | NH | 86%
  25 | NH | 57%
  26 | NH | 100%
  27 | NH | 57%
  28 | H (multiple cases) | 63%
  Average: 77%; minimum: 57%; maximum: 100%

Figure 8. Improvement by Bayesian Treatment in All Empirical Cases (% estimates improved per case; minimum 57%)

Figure 9. Histogram of All Relative Errors (descriptive statistics (MINITAB): N = 1922, mean = 1.2, StDev = 1.5, median = 1.0, minimum = 0.0003, maximum = 21.3)

Table 8. Best-Fitted Distributions for Expert Relative Errors (MathWave EasyFit)

  Distribution      | Kolmogorov-Smirnov Rank | Anderson-Darling Rank | Chi-Squared Rank
  Wakeby            | 1 | 1 | 1
  Cauchy            | 2 | 2 | 2
  Dagum (4P)        | 3 | 5 | 5
  Log-Logistic (3P) | 4 | 4 | 4
  Burr (4P)         | 5 | 3 | 3
  Burr              | 6 | 7 | 7
  Dagum             | 7 | 6 | 6
  Pearson 6 (4P)    | 8 | 8 | 8
  Lognormal (3P)    | 9 | 9 | 9
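A ranking like Table 8 can be approximated by fitting each candidate distribution and comparing Kolmogorov-Smirnov statistics. The sketch below is illustrative only (the file name is hypothetical, and it is limited to candidates available in scipy.stats; Wakeby, for instance, is not):

```python
import numpy as np
from scipy import stats

errors = np.loadtxt("relative_errors.txt")   # hypothetical file of the 1,922 relative errors

candidates = {
    "cauchy": stats.cauchy,
    "fisk": stats.fisk,          # the log-logistic distribution in scipy
    "burr": stats.burr,
    "lognorm": stats.lognorm,    # lognormal (3P when the location is left free)
}

ks = {}
for name, dist in candidates.items():
    params = dist.fit(errors)                               # maximum-likelihood fit
    ks[name] = stats.kstest(errors, name, args=params).statistic

# Smaller K-S statistic = better fit; sorting gives a rank order as in Table 8.
for rank, (name, stat) in enumerate(sorted(ks.items(), key=lambda kv: kv[1]), 1):
    print(rank, name, round(stat, 4))
```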
Figure 10. Lognormal (3P) Distribution of All Relative Errors (probability density function: Lognormal(σ = 0.46295, μ = 0.23985, γ = -0.23735))

5.4 Performance Assessment of Generic Likelihood Functions
The entire process of updating estimates was also repeated using a generic likelihood function. If E is lognormally distributed, its probability density function is:

$$f(E;\, \mu, \sigma) = \frac{1}{E\,\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln E - \mu}{\sigma}\right)^2\right]$$  (Equation 30)

From Figure 10, σ = 0.46 and μ = 0.24, where

$$\mu = \ln E_{50}$$  (Equation 31)

Median: $E_{50} = e^{\mu} = e^{0.24} = 1.27$  (Equation 32)

The above parameters are the prior values and are updated using the hybrid formulations. From Equation 13:

$$L(u' \mid u) = \frac{1}{u'\,\sigma_E\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u' - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right] = \frac{1}{0.46\,u'\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u' - \ln u - 0.24}{0.46}\right)^2\right]$$  (Equation 33)

Representing the independent expert estimates by $u' = (u'_1 \ldots u'_n)$:

$$L(u' \mid u) = \prod_{i=1}^{n} \frac{1}{0.46\,u'_i\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - 0.24}{0.46}\right)^2\right]$$  (Equation 34)

$$\pi(u \mid u') = \frac{\frac{1}{u}\prod_{i=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - 0.24}{0.46}\right)^2\right]}{\int_0^{\infty} \frac{1}{u}\prod_{i=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - 0.24}{0.46}\right)^2\right] du}$$  (Equation 35)

It can be shown that the parameters of the above posterior distribution are (Mosleh, 1981):

$$u_{50} = \left(\prod_{i=1}^{n} \frac{u'_i}{E_{50}}\right)^{1/n}$$  (Equation 36)

$$\sigma^2 = \frac{\sigma_E^2}{n}$$  (Equation 37)

$$\mu = u_{50}\, e^{\sigma^2/2}$$  (Equation 38)

where $u_{50}$ is the posterior median, $\sigma$ is the posterior (logarithmic) standard deviation, $n$ is the number of estimates or experts, and $\mu$ is the posterior mean.

The mean and the median of the posterior are compared with the true value to explore whether the formulated likelihood distribution is able to reduce the expert error. An example is presented in Table 9 and Table 10 using the TU Delft data (case #28). The complete study for all data tested reveals an overall improvement in the accuracy of experts when the formulated generic likelihood function is applied with the available case-independent evidence.

Table 9. Numerical Example to Measure Performance of Generic Likelihood Function Using Mean (μ) of Posterior

  u'    | u     | Other Available Expert Estimates of the True Value | μ = u50 exp(σ²/2) | u'/u  | μ/u   | Error Reduced (+) / Increased (-)
  0.019 | 0.027 | (0.05, 0.02, 0.02, 0.035)  | 0.023 | 0.704 | 0.866 | +
  0.05  | 0.027 | (0.019, 0.02, 0.02, 0.035) | 0.018 | 1.852 | 0.680 | +
  0.02  | 0.027 | (0.019, 0.05, 0.02, 0.035) | 0.023 | 0.741 | 0.855 | +
  0.02  | 0.027 | (0.019, 0.05, 0.02, 0.035) | 0.023 | 0.741 | 0.855 | +
  0.035 | 0.027 | (0.019, 0.05, 0.02, 0.02)  | 0.020 | 1.296 | 0.743 | +

Table 10. Numerical Example to Measure Performance of Generic Likelihood Function Using Median (u50) of Posterior

  u'    | u     | Other Available Expert Estimates of the True Value | u50 = (∏ u'_i/E50)^(1/n) | u'/u  | u50/u | Error Reduced (+) / Increased (-)
  0.019 | 0.027 | (0.05, 0.02, 0.02, 0.035)  | 0.023 | 0.704 | 0.844 | +
  0.05  | 0.027 | (0.019, 0.02, 0.02, 0.035) | 0.018 | 1.852 | 0.662 | +
  0.02  | 0.027 | (0.019, 0.05, 0.02, 0.035) | 0.022 | 0.741 | 0.833 | +
  0.02  | 0.027 | (0.019, 0.05, 0.02, 0.035) | 0.022 | 0.741 | 0.833 | +
  0.035 | 0.027 | (0.019, 0.05, 0.02, 0.02)  | 0.020 | 1.296 | 0.724 | +
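The closed-form markers of Equations 36-38 are easy to verify. This sketch approximately reproduces the first row of Tables 9 and 10, using E50 = 1.27 and σE = 0.46 from Figure 10 and treating the other experts' estimates as the evidence for the update:

```python
import numpy as np

E50, sigma_E = 1.27, 0.46   # generic lognormal error parameters (Equations 31-32)

def posterior_markers(estimates):
    """Posterior median and mean from Equations 36-38."""
    n = len(estimates)
    u50 = np.prod(np.asarray(estimates) / E50) ** (1.0 / n)   # Equation 36
    sigma2 = sigma_E ** 2 / n                                 # Equation 37
    return u50, u50 * np.exp(sigma2 / 2.0)                    # Equation 38

true_value = 0.027
others = [0.05, 0.02, 0.02, 0.035]       # evidence for the expert who estimated 0.019
u50, mu = posterior_markers(others)
print(round(u50, 3), round(mu, 3))       # ~0.023 and ~0.023, as in Tables 9-10
print(round(0.019 / true_value, 3), round(mu / true_value, 3))   # 0.704 -> ~0.866
```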
5.5 Conclusion
The questions answered include the empirical assessment of expert errors and whether the use of the formulated likelihood functions would reduce future prediction errors. The empirical assessment of the data revealed that, approximately:
1. 45% of errors were close to one (expert estimate ≈ true value);
2. 45% of data points were between 1 and 2;
3. 5% of relative errors fell in the range of 2 to 3;
4. 5% of all empirical errors were greater than 3;
5. lognormal was identified as one of the best-fitted distributions;
6. the average error was 1.2; and
7. the standard deviation was 1.5.

Applying the case-specific likelihood function developed from relative errors showed:
- 77% of estimates improved.
Application of the generic likelihood function using the posterior mean and case-independent evidence revealed:
- 50% of estimates improved.
Application of the generic likelihood function using the posterior median and case-independent evidence showed:
- 52% of estimates improved.
The results confirm that the developed generic likelihood function, in conjunction with available evidence, is able to improve at least half of the estimates.

Chapter 6: Data-Informed Aggregation of Expert Opinions

6.1 Introduction
In uncertain situations, combining data can reduce error (Armstrong, 2001). Speculation about the correlation between the accuracy of expert estimates and the number of experts elicited has led many to conclude that the more experts are elicited, the higher the accuracy of the estimate, much as increasing the sample size in an experiment. Ashton and Ashton (1985) studied judgmental forecasts of the number of advertising pages in Time magazine and concluded that combining the forecasts of four experts reduces the error of estimates by 3.5%. Batchelor and Dua (1995) showed an increase in accuracy when going from 10 to 22 economists, with only a small further improvement from adding the remaining 12.

The two well-established mathematical approaches to aggregating opinions are axiomatic and Bayesian models (Boring, 2007; Clemen and Winkler, 1997). The first formal framework of the Bayesian methods for the use of expert opinion was presented by Morris (1974, 1977). French (1985), Lindley (1985), and Genest and Zidek (1986) all conclude that a Bayesian updating scheme is the most appropriate method when a group of experts provides information to a decision-maker. Comprehensive reviews of the aggregation literature, including dependence, can be found in French (1985), Ouchi (2004), Genest and Zidek (1986), and French and Ríos Insua (2000).

The objectives of this part of the research are to:
1. investigate whether mathematical aggregation of expert opinions reduces the error of the aggregated estimate; and
2. assess the correlation between the number of experts and the accuracy of estimates through Bayesian aggregation.
These questions are addressed using empirical data in the Bayesian framework, applying the likelihood distributions formulated in Chapter 4 and considering:
- the case-specific likelihood function;
- the generic likelihood function.
In this chapter, the mathematical formulas for generic aggregation are presented. Using empirical data, the aggregation performance and the number of experts for optimum accuracy are then determined for each method based on the results obtained.

6.2 Mathematical Model
Expert opinions are aggregated in the Bayesian framework using the likelihood function formulated from the relative errors of estimates, as well as the generic likelihood function developed earlier. Postulating independent experts with lognormal likelihood distributions with parameters ($\nu_i$, $\sigma_i$), we have (see Chapter 4):

$$L(u' \mid u) = \prod_{i=1}^{n} L(u'_i \mid u) = \prod_{i=1}^{n} \frac{1}{u'_i\,\sigma_i\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u - \ln \nu_i}{\sigma_i}\right)^2\right], \qquad \ln \nu_i = \ln u'_i - \ln E_{50,i}$$  (Equation 39)

where $u'_i$ is the $i$th expert estimate, $u$ is the unknown of interest, and $u'$ is the set of expert estimates. Expanding the above equation and rearranging the terms as a function of $u$:

$$L(u' \mid u) = \frac{1}{(2\pi)^{n/2} \prod_{i=1}^{n} u'_i\,\sigma_i} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{\ln u - \ln \nu_i}{\sigma_i}\right)^2\right]$$  (Equation 40)
Using the above likelihood in Bayes' theorem, the posterior distribution of the unknown of interest given the set of estimates is:

$$\pi(u \mid u') = \frac{\frac{1}{u}\prod_{i=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{\ln u - \ln \nu_i}{\sigma_i}\right)^2\right]}{\int_0^{\infty} \frac{1}{u}\prod_{i=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{\ln u - \ln \nu_i}{\sigma_i}\right)^2\right] du}$$  (Equation 41)

For the generic likelihood formulated from relative errors in Chapter 5 ($E_{50} = 1.2$ and $\sigma_E = 0.69$), the assumption is that $E_{50}$ and $\sigma_E$ are the same for all experts. Therefore, for expert $i$:

$$L(u'_i \mid u) = \frac{1}{u'_i\,\sigma_E\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right]$$  (Equation 42)

Postulating independence among experts, as before:

$$L(u' \mid u) = \prod_{i=1}^{n} \frac{1}{u'_i\,\sigma_E\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right]$$  (Equation 43)

$$L(u' \mid u) = \frac{1}{(2\pi)^{n/2}\,\sigma_E^n \prod_{i=1}^{n} u'_i} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{\ln u'_i - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right]$$  (Equation 44)

This results in the posterior distribution:

$$\pi(u \mid u') = \frac{\frac{1}{u}\prod_{i=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right]}{\int_0^{\infty} \frac{1}{u}\prod_{i=1}^{n} \exp\left[-\frac{1}{2}\left(\frac{\ln u'_i - \ln u - \ln E_{50}}{\sigma_E}\right)^2\right] du}$$  (Equation 45)

It can be shown that the median of this posterior distribution is

$$u_{50} = \left(\prod_{i=1}^{n} \frac{u'_i}{E_{50}}\right)^{1/n}$$  (Equation 46)

with

$$\sigma^2 = \frac{\sigma_E^2}{n}$$  (Equation 47)

and that the mean of the posterior is

$$\mu = u_{50}\,e^{\sigma^2/2}$$  (Equation 48)

The relative error of the aggregated value (the mean or median of the posterior), $E_{\mathrm{Aggregate}} = u'_{\mathrm{Aggregate}}/u$, is compared with each expert's relative error, $E_i = u'_i/u$. The number of improved estimates is monitored as the number of experts ($n$) increases, in order to uncover whether this increase reduces the overall error of estimates and to unveil the minimum number of experts needed to obtain maximum accuracy.

6.3 Aggregation by Simulation
In this section, a simulation-based performance assessment of Bayesian aggregation and of representative axiomatic models for aggregating point estimates is conducted. In addition, the impact of the number of experts on Bayesian aggregation performance is assessed through replication. The simulation considers both independence and (for the Bayesian method) dependence among experts. The simulation process flow is depicted in Figure 11. Two loops are constructed: model inputs and random true values are produced in the first loop by sampling lognormal distributions; in the second loop, expert estimates are generated within the same data range and aggregation is performed. The process is repeated in each loop for the calculated number of iterations. The number of simulation iterations is calculated from the formula proposed by Winston (2001):

$$m = \frac{4\, z_{\alpha/2}^2\, \sigma^2}{D^2}$$  (Equation 49)

where $m$ is the number of iterations needed, $\sigma$ is the estimated standard deviation of the output, and $D$ is the desired width of the confidence interval. The simulation is first run with just 100 iterations (α = 0.05, and therefore $z_{\alpha/2}$ = 1.96) to obtain an estimate of the standard deviation; the required number of iterations is then calculated from the same formula with that standard deviation.

The selected axiomatic aggregation methods are the unweighted arithmetic mean and the unweighted geometric mean:

I. Arithmetic unweighted mean: an unweighted linear combination of the $n$ expert estimates $u'_i$:

$$u'_{\mathrm{Aggregate}} = \frac{1}{n}\sum_{i=1}^{n} u'_i$$  (Equation 50)

II. Unweighted geometric mean: the product of the estimates raised to the power of one over the number of estimates $n$:

$$u'_{\mathrm{Aggregate}} = \left(\prod_{i=1}^{n} u'_i\right)^{1/n}$$  (Equation 51)
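The following Python sketch mirrors the two-loop simulation on a small scale (the dissertation used MATLAB; the true-value range, panel size, and comparison rule here are illustrative assumptions). It draws a true value, generates lognormal-error expert estimates, and compares the axiomatic aggregates of Equations 50-51 with the Bayesian posterior mean of Equations 46-48:

```python
import numpy as np

rng = np.random.default_rng(1)
E50, sigma_E = 1.2, 0.69           # generic likelihood parameters from Chapter 5

def bayes_mean(estimates):
    """Posterior mean of the aggregate (Equations 46-48)."""
    n = len(estimates)
    u50 = np.prod(estimates / E50) ** (1.0 / n)
    return u50 * np.exp(sigma_E ** 2 / (2.0 * n))

wins = {"bayes": 0, "arithmetic": 0, "geometric": 0}
n_cases, n_experts = 1000, 5
for _ in range(n_cases):
    u = rng.uniform(1.0, 100.0)                                     # loop 1: random true value
    estimates = u * rng.lognormal(np.log(E50), sigma_E, n_experts)  # loop 2: u' = u * E

    aggregates = {
        "bayes": bayes_mean(estimates),
        "arithmetic": estimates.mean(),                        # Equation 50
        "geometric": np.prod(estimates) ** (1.0 / n_experts),  # Equation 51
    }
    mean_expert_error = np.mean(np.abs(estimates / u - 1.0))
    for name, agg in aggregates.items():
        if abs(agg / u - 1.0) < mean_expert_error:             # aggregate beats the average expert
            wins[name] += 1

for name, count in wins.items():
    print(f"{name}: beat the average expert in {count / n_cases:.0%} of cases")
```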
For the Bayesian aggregation simulation, the posterior distribution is formed, and the mean of the posterior, as the aggregated estimate, together with the individual expert estimates, is compared with the true value (refer to Chapter 4) imported from the first loop.

To address dependency among experts, a choice of copulas is used for the likelihood functions, as listed below. The basis for applying a copula distribution is that a copula-based model is constructed by joining the copula function with the marginal distributions. According to Sklar's theorem (1959), given a joint cumulative distribution function $F(x_1, \ldots, x_n)$ for random variables $(x_1 \ldots x_n)$ with marginal cumulative distributions $F_1(x_1) \ldots F_n(x_n)$, $F$ can be written as a function of its marginals:

$$F(x_1, \ldots, x_n) = c\left[F_1(x_1), \ldots, F_n(x_n)\right]$$  (Equation 52)

The function $c$ is called a copula. This means that the joint density $f(x_1 \ldots x_n)$ can be written as:

$$f(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n)\; c\left[F_1(x_1), \ldots, F_n(x_n)\right]$$  (Equation 53)

The copula density $c$ clearly captures the information about the dependence among the $X$s and is therefore called a dependence function. There are many families of copulas, typically with several parameters related to the strength and form of the dependence. Further discussion and properties of the selected copula functions can be found in Clayton (1978), Frank (1979), Gumbel (1960), Hougaard (2000), and Silva and Lopez (2008). The selected families of copulas are:

1. Gaussian (multivariate normal) copula: this copula captures dependence as the multivariate normal distribution does, using only pairwise correlations among the variables, while accommodating arbitrary marginal distributions. Moreover, the normal copula permits the use of any positive-definite correlation matrix, meaning that it is not limited to intraclass correlation matrices.
2. Archimedean copulas:
 2.1 Frank: can be used to capture positive dependence among random variables.
 2.2 Clayton: an asymmetric copula exhibiting greater dependence in the lower (left) tail; as its parameter approaches zero, the variables approach statistical independence.
 2.3 Gumbel: an asymmetric copula with more weight in the right tail.

The simulation process for dependent experts is depicted in Figure 12. The simulation is executed in MATLAB, a technical computing language for algorithm development and numerical computation produced by The MathWorks.

Figure 11. Aggregation Simulation Approach (Loop 1 defines the model inputs: A) identify the number of experts n; B) define the distributions and range of the data, ν_1...ν_n with ν_i = u + bias_i, and σ_1...σ_n; C) generate the true value u. Loop 2 performs the aggregation assessment: D) sample the distributions to obtain u'_1...u'_n; E) branch on dependent experts and, for axiomatic methods, select the aggregation rule; F) develop the posterior or calculate the aggregated estimate u'_ax; G) calculate the posterior mean μ. The recorded outputs are: 1. Bayesian model performance (μ/u) vs. expert performance (u'/u); 2. axiomatic model performance (u'_ax/u) vs. expert performance (u'/u); 3. Bayesian vs. axiomatic model performance; 4. assessment of the effect of the number of experts n on the error reduction of the aggregated estimate.)

Figure 12. Aggregation Simulation for Dependent Experts
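For the dependent-experts case, a Gaussian copula can be sampled by pushing correlated normals through the normal CDF and then through the lognormal error quantile function. A minimal sketch (the correlation value is illustrative, and the dissertation's MATLAB implementation may differ):

```python
import numpy as np
from scipy import stats

n_experts, rho = 3, 0.5
corr = np.full((n_experts, n_experts), rho)
np.fill_diagonal(corr, 1.0)                  # any positive-definite correlation matrix works

# Step 1: correlated standard normals (the Gaussian copula's dependence structure).
z = stats.multivariate_normal.rvs(mean=np.zeros(n_experts), cov=corr,
                                  size=10_000, random_state=7)

# Step 2: transform to uniforms via the normal CDF (Sklar's theorem, Equation 52).
uniforms = stats.norm.cdf(z)

# Step 3: impose the lognormal error marginals (E50 = 1.2, sigma_E = 0.69 from Chapter 5).
E = stats.lognorm.ppf(uniforms, s=0.69, scale=1.2)

print(np.corrcoef(E, rowvar=False).round(2))  # dependent expert errors, lognormal marginals
```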
6.4 Simulation Results

6.4.1 Aggregation Performance
The simulation shows that the Bayesian aggregation method results in less aggregation error than the axiomatic procedures, as depicted in Figure 13. In this graph, the x-axis is the number of experiments or cases simulated, each unique in its generated inputs in both loops; the y-axis is the relative error $E = u'/u$, where $u'$ is the estimate and $u$ is the true value. The spikes noticeable in the graphs correspond to selections of high standard deviations (low expert expertise), which clearly shows that a decrease in expertise increases the error.

6.4.2 Dependent Experts Performance
For dependent experts, the Gaussian, Frank, Clayton, and Gumbel copula families are used, and the minimum improvement among these choices is reported. The model shows about an 80% overall reduction in the error of the aggregated estimate compared to the mean of all expert errors at a correlation of 0.25, about 75% at a correlation of 0.50, and about 70% at a correlation of 0.75. This means that the more independent the experts are, the more accurate the aggregated estimate becomes; however, the amount of improvement is not significant.

6.4.3 Size of Expert Panel
The simulation reveals that there is not a strong correlation between the accuracy of the aggregated estimate and the number of experts. As depicted in Figure 14, about 50% of estimates are improved by increasing the number of experts to two. Selecting more than two experts can lead to a larger share of improved estimates (over 60%). However, from 3 to 10 experts, the additional percentage of improved estimates is not noteworthy (less than 10%).

Figure 13. Performance of Aggregation Methods (relative error versus number of cases simulated for Bayes' error, the error of the arithmetic unweighted mean, the error of the geometric mean, and the expert error)

Figure 14. Simulation Results: Expert Panel Size vs. % Estimates Improved

6.5 Aggregation Using Empirical Data
In this section, aggregation is performed using empirical data to assess (a) the Bayesian aggregation performance and (b) the impact of the number of experts on aggregation in a real-world application. Expert opinions are combined in the Bayesian framework using the likelihood function formulated from the relative errors of estimates, assuming independence among experts. The relative error of the aggregate is compared with the expert error. The procedure is illustrated using the sample data in Table 11.

Table 11. Numerical Example for Aggregation Procedure Illustration

  Expert ID | Estimate | True Value
  A | 0.019 | 0.027
  B | 0.05  | 0.027
  C | 0.02  | 0.027
  D | 0.02  | 0.027
  E | 0.035 | 0.027

The available evidence for updating the estimate of expert B is the relative error of expert A. The posterior is developed, and the mean of this distribution is taken as the Bayesian update. The result can be found in Table 12.

Table 12. Bayesian Update and Relative Error for Aggregation Example

  Expert ID | Estimate | True Value | Bayesian Update | Expert Relative Error | Bayesian Relative Error
  A | 0.019 | 0.027 | -    | 0.704 | -
  B | 0.05  | 0.027 | 0.06 | 1.852 | 2.222

In the next step, the first two experts' relative errors are considered available evidence to update the estimate of expert C, as reflected in Table 13. The aggregated estimate is compared with all three expert estimates, revealing a reduction in error. From the results obtained for this set of data, going from one to two experts increases the error relative to the estimates released by experts A and B; however, increasing the number of experts from two to three reduces the error for all of experts A, B, and C.
Table 13. Continuation of Aggregation Example

  Expert ID | Estimate | True Value | Bayesian Update | Expert Relative Error | Bayesian Relative Error | vs. Two-Expert Aggregate | vs. Three-Expert Aggregate
  A | 0.019 | 0.027 | -     | 0.704 | -     | - | +
  B | 0.05  | 0.027 | 0.06  | 1.852 | 2.222 | - | +
  C | 0.02  | 0.027 | 0.023 | 0.741 | 0.852 |   | +

This process is continued until all experts in the data set are included, as shown in Table 14.

Table 14. Aggregation Results for Example Data

  Estimate | True Value | Bayesian Update | Expert Relative Error | Bayesian Relative Error
  0.019 | 0.027 | -     | 0.704 | -
  0.05  | 0.027 | 0.06  | 1.852 | 2.222
  0.02  | 0.027 | 0.023 | 0.741 | 0.852
  0.02  | 0.027 | 0.025 | 0.741 | 0.926
  0.035 | 0.027 | 0.039 | 1.296 | 1.444

  Aggregate ID (panel size)         | A & B (2) | A, B & C (3) | A, B, C & D (4) | A, B, C, D & E (5)
  Bayesian aggregate relative error | 2.222     | 0.852        | 0.926           | 1.444
  Expert A (0.704)                  | -         | +            | +               | -
  Expert B (1.852)                  | -         | +            | +               | +
  Expert C (0.741)                  |           | +            | +               | -
  Expert D (0.741)                  |           |              | +               | -
  Expert E (1.296)                  |           |              |                 | -
  Estimates improved                | 0         | 3            | 4               | 1

To treat the data completely randomly, a further step is taken in which a sample of 10% of the data sets is processed out of the reported order. The rearrangement of the raw data of the previous example (Table 11) is shown in Table 15.

Table 15. Example for Aggregation Procedure: Out-of-Order Data

  Expert ID | Estimate | True Value
  A | 0.019 | 0.027
  D | 0.02  | 0.027
  C | 0.02  | 0.027
  B | 0.05  | 0.027
  E | 0.035 | 0.027

The same process as described in the example above is repeated for this random ordering; the results are shown in Table 16.

Table 16. Aggregation Results for Example: Out-of-Order Data

  Estimate | True Value | Bayesian Mean | Expert Relative Error | Bayesian Relative Error
  0.019 | 0.027 | -     | 0.704 | -
  0.02  | 0.027 | 0.029 | 0.741 | 1.074
  0.02  | 0.027 | 0.024 | 0.741 | 0.889
  0.05  | 0.027 | 0.040 | 1.296 | 1.481
  0.035 | 0.027 | 0.069 | 1.852 | 2.556

  Aggregate ID (panel size)         | A & D (2) | A, D & C (3) | A, D, C & B (4) | A, D, C, B & E (5)
  Bayesian aggregate relative error | 1.074     | 0.889        | 1.481           | 2.556
  Expert A (0.704)                  | +         | +            | -               | -
  Expert D (0.741)                  | +         | +            | -               | -
  Expert C (0.741)                  |           | +            | -               | -
  Expert B (1.296)                  |           |              | -               | -
  Expert E (1.852)                  |           |              |                 | -
  Estimates improved                | 2         | 3            | 0               | 0

These calculation steps were executed for the empirical data sets. The number of improved estimates is monitored as the number of experts increases, for estimates involving 2 to 10 experts, in order to:
1. investigate whether mathematical aggregation of expert opinions reduces the error of the aggregated estimate; and
2. assess the correlation between the number of experts and the accuracy of estimates through Bayesian aggregation.

Bayesian calculations were performed with the "Uncertainty Modeling" software released and validated by The Center for Risk and Reliability (CRR), Droguett and Mosleh (2003). The improvements (reductions in error) across all data sets per expert panel size are listed in Table 17 using the case-specific likelihood. Additionally, the correlation between the percentage of error reduction and the increase in the number of experts is investigated using a best-fitted line, as depicted in Figure 15. The fitted line reveals a positive correlation, but with a moderate adjusted coefficient of determination (R² = 63%). The computation was also performed using the mean and the median of the generic likelihood function, as summarized in Table 18 and Table 19.

Table 17. Aggregation Performance: Case-Specific Likelihood

  Expert Panel Size | Total Data | No. of Estimates Improved | % of Estimates Improved
  2  | 98  | 52  | 53%
  3  | 147 | 91  | 62%
  4  | 184 | 109 | 59%
  5  | 225 | 144 | 64%
  6  | 240 | 160 | 67%
  7  | 259 | 193 | 75%
  8  | 152 | 100 | 66%
  9  | 72  | 51  | 71%
  10 | 60  | 42  | 70%
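The monitoring loop behind Table 17 can be sketched as follows. This is a simplified stand-in based on my reading of the tables, not the dissertation's exact procedure: it substitutes the generic closed form of Equations 46-48 for the case-specific likelihood (so its aggregate values will not match Table 14's), and it counts an estimate as improved when the aggregate's relative error is closer to 1 than the expert's, which is the comparison rule implied by Tables 14 and 16:

```python
import numpy as np

E50, sigma_E = 1.2, 0.69

def bayes_mean(estimates):
    """Posterior mean of the aggregate (Equations 46-48)."""
    n = len(estimates)
    u50 = np.prod(np.asarray(estimates) / E50) ** (1.0 / n)
    return u50 * np.exp(sigma_E ** 2 / (2.0 * n))

estimates, true_value = [0.019, 0.05, 0.02, 0.02, 0.035], 0.027   # Table 11 data

for size in range(2, len(estimates) + 1):          # grow the panel: 2, 3, 4, 5 experts
    panel = estimates[:size]
    agg_error = bayes_mean(panel) / true_value     # relative error of the aggregate
    # An estimate counts as "improved" if the aggregate is closer to 1 than the expert was.
    improved = sum(abs(agg_error - 1) < abs(e / true_value - 1) for e in panel)
    print(f"panel size {size}: {improved} of {size} estimates improved")
```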
Figure 15. Fitted Line Plot: Improvement vs. Expert Panel Size (regression of % estimates improved on the number of experts: %Estimates Improved = 53.22 + 2.000 × No. of Experts; S = 4.063, R-Sq = 67.5%, R-Sq(adj) = 62.9%, with 95% CI)

Table 18. Aggregation Performance Summary: Generic Likelihood - Mean

  Expert Panel Size | Total Data | No. of Estimates Improved | % of Estimates Improved
  2  | 70  | 25  | 53%
  3  | 105 | 50  | 59%
  4  | 140 | 76  | 60%
  5  | 175 | 92  | 58%
  6  | 192 | 108 | 52%
  7  | 203 | 115 | 51%
  8  | 128 | 70  | 55%
  9  | 45  | 16  | 44%
  10 | 40  | 12  | 43%

Table 19. Aggregation Performance Summary: Generic Likelihood - Median

  Expert Panel Size | Total Data | No. of Estimates Improved | % of Estimates Improved
  2  | 70  | 41  | 46%
  3  | 105 | 62  | 52%
  4  | 140 | 87  | 57%
  5  | 175 | 111 | 54%
  6  | 192 | 113 | 49%
  7  | 203 | 118 | 46%
  8  | 128 | 100 | 66%
  9  | 45  | 26  | 42%
  10 | 40  | 24  | 38%

6.5.1 Aggregation Performance
Bayesian aggregation resulted in less relative error on average. Application of the likelihood function developed from relative errors improved, on average, 65% of estimates. Application of the generic likelihood for homogenous data using the posterior mean improved, on average, 53% of estimates; using the posterior median, 50% of estimates.

6.5.2 Expert Panel Size
The best-fitted line for the case-specific events, Figure 15, reveals that increasing the number of experts is positively correlated with the accuracy of the aggregated estimate. The moderate coefficient of determination (R² = 63%) suggests that this association is not very strong. Eliciting two experts (instead of one) can lead to a reduction in error for more than 50% of estimates, and increasing the number of experts from two to three reduces the error for approximately 60% of estimates. However, from 3 to 10 experts, the additional percentage of improved estimates is not significant.

Chapter 7: Summary of Results

7.1 Research Contribution
This research contributes to the body of knowledge on expert judgment. In contrast to many studies revealing shortcomings in expert judgment, this research reports how well experts are able to make predictions in the real world. The task was carried out by data-informed calibration and aggregation of experts in the Bayesian framework. A generic likelihood was developed and shown to be able to update expert estimates. Additionally, specific likelihood distributions for homogenous, nonhomogenous, and mixed data were formulated using experts' relative errors of estimates, revealing that the formulated likelihood functions can reduce future prediction errors.

To study the impact of the number of experts on the accuracy of the aggregated estimate, the collected expert judgments were combined in the Bayesian framework using the likelihood distributions developed in the first part of the research. The total number of estimates with reduced errors was plotted against the corresponding expert panel size. The objective achieved was the determination of the correlation between the number of experts and the accuracy of the combined estimate in order to recommend an expert panel size. The results showed a weak-to-moderate correlation between the expert panel size and the accuracy of the aggregate. It was noted that eliciting two experts (instead of one) could lead to a reduction in the relative error of estimates.

7.2 Data-Informed Calibration of Expert Judgment
The objective of this section was the empirical assessment of expert judgment in different disciplines, as well as the feasibility and value of data-driven expert calibration within the Bayesian formalism.
The conducted study revealed that:
1. 45% of errors are close to one (expert estimate ≈ true value);
2. 45% of data points are between 1 and 2;
3. 5% of relative errors fall in the range of 2 to 3;
4. 5% of all empirical errors are greater than 3;
5. lognormal is identified as one of the best-fitted distributions;
6. the average relative error is 1.2, with a standard deviation of 1.5.

Applying the case-specific likelihood function developed from relative errors for homogenous and nonhomogenous cases showed:
- 77% of estimates improved.
Application of the generic likelihood function using the posterior mean, considering the existing evidence, revealed:
- 57% of estimates improved.
Application of the generic likelihood using the posterior median, considering the existing evidence, showed:
- 52% of estimates improved.

7.3 Data-Informed Aggregation of Expert Judgment
The objectives of this section were:
1. to determine whether mathematical aggregation reduces the error of the aggregate; and
2. to explore the correlation between the number of experts and the accuracy of the aggregated estimate in order to recommend an expert panel size.
Figure 16 gives a quick overview of the results obtained:
1. Mathematical aggregation reduces the error of the estimate.
2. The accuracy of the aggregate increases with the number of experts.
3. The optimum expert panel size is 3, if improvement of 50-60% of estimates is satisfactory.

Overall, the decision to elicit more experts should be made in light of the governing circumstances of the case at hand. If possible, the panel should be large enough to capture complementary expertise and achieve diversity of opinion, ensuring a balanced and broad spectrum of viewpoints, expertise, and technical points of view. The decision-maker should assess whether the targeted improvement pays off the cost of hiring more experts.

Figure 16. Bayesian Treatment vs. Expert Panel Size (% estimates improved versus panel size for the generic likelihood (median), the generic likelihood (mean), and the domain-specific likelihood)

7.4 Research Limitations
The reader should be aware of the limitations and restrictions encountered in conducting this research in order to have a complete picture of the study:
- Besides expressing their subjective judgments directly, experts in this study could use prototypes, models, and destructive and nondestructive tests (among other tools) to gather data and gain practical knowledge for estimating the unknown.
- This study focused only on expert point estimates in discrete or continuous form.
- Estimates provided by forecasting models were considered "expert data" because of expert input into the construction of the model, or expert review and adjustment of the output.
- Experts were considered independent in the model development and numerical calculations.
- Inconsistencies among experts were accepted as inherent variation in the modeling and assessment processes. Inherent variation helps capture real-world error causes (though the sources of these errors remain unknown) and tests the formulated likelihood functions in dealing with these variations.
- The focus of this research was on mathematical procedures for the calibration and aggregation of expert point estimates. In particular, the Bayesian method was the central point of the study.

7.5 Future Research
There are many factors that can impact the accuracy of expert judgment, such as expert attributes, elicitation and aggregation methods, and so on.
This research focused on the calibration and aggregation of expert judgment in a Bayesian framework, considering independent experts. Dependency is a major factor affecting the quality of judgment. Future research using the methods presented here should consider the case of dependence in both the calibration and aggregation procedures and address the pertinent issues. Additionally, the empirical data available for collection allowed this research to consider only up to 10 experts. If data become available, the study should be extended to larger expert panel sizes to determine, perhaps, the variations in the percentage of improved estimates as the number of experts increases.

References
[1] Abdullah Khan, H. T. A Comparative Analysis of the Accuracy of the United Nations' Population Projections for Six Southeast Asian Countries. Interim Report, IR-03-015.
[2] Adler, M. & Ziglio, E. (1996). Gazing into the oracle. JK Publishers: Bristol, PA.
[3] Alpert, M. & Raiffa, H. (1982). A progress report on the training of probability assessors. In Judgement Under Uncertainty: Heuristics and Biases, eds. D. Kahneman, P. Slovic & A. Tversky, Cambridge University Press, New York, 294-305.
[4] Anderson, U. & Wright, W. F. (1988). Expertise and the explanation effect. Organization Behavior and Human Decision Processes, 13, 431-446.
[5] Apostolakis, B. E. (1990). Interfuel and energy-capital complementarity in manufacturing industries. Applied Energy, 35, 2, 83-107.
[6] Armstrong, J. S. (2001). Combining Forecasts. Chapter 13 in Armstrong, J. S. (ed.), Principles of Forecasting: A Handbook for Researchers and Practitioners. Norwell, MA: Kluwer Academic Publishers.
[7] Armstrong, J. S. & Fildes, R. (1995). On the selection of error measures for comparisons among forecasting methods. Journal of Forecasting, 14, 67-71.
[8] Armstrong, J. S. (1985). Long-term Forecasting: From Crystal Ball to Computer (2nd ed.). New York: John Wiley.
[9] Arnell, N.; Tompkins, E. & Adger, N. (2005). Vulnerability to abrupt climate change in Europe. Technical Report 34, Tyndall Centre for Climate Change Research, Norwich.
[10] Ashton, A. H. & Ashton, R. H. (1985). Aggregating subjective forecasts: Some empirical results. Management Science, 31, 1499-1508.
[11] Ashton, A. H. (1986). Combining the judgments of experts: How many and which ones? Organizational Behavior and Human Decision Processes, 38, 405-414.
[12] Ashton, R. H. (1974). An experimental study of internal control judgments. Journal of Accounting Research, 12, 143-157.
[13] Asian Development Bank (2005). Accuracy of Asian Development Outlook Forecasts.
[14] Ayyub, B. M. (2001). A Practical Guide on Conducting Expert-Opinion Elicitation of Probabilities and Consequences for Corps Facilities. Prepared for the U.S. Army Corps of Engineers, Institute for Water Resources.
[15] Ayyub, B. M. (2001). Elicitation of Expert Opinions for Uncertainty and Risks.
[16] BaFail, A. O. (2004). Applying Data Mining Techniques to Forecast Number of Airline Passengers in Saudi Arabia (Domestic and International Travels). King Abdul Aziz University.
[17] Baldwin, J. M. (1975). Thought and things: a study of the development and meaning of thought, or genetic logic. New York: The Macmillan Company.
[18] Batchelor, R. A. & Dua, P. (1995). Forecaster Diversity and the Benefits of Combining Forecasts. Management Science, 41, 68-75.
[19] Bates, J. M. & Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly, 20, 451-468.
[20] Batz, M. B.; Hoffmann, S. A.; Krupnick, A. J.; Morris, J. G.; Sherman, D. M.; Taylor, M. R. & Tick, J. S. (2004). Identifying the Most Significant Microbiological Foodborne Hazards to Public Health: A New Risk Ranking Model. Food Safety Research Consortium, Discussion Paper Series Number 1, September 2004 - FIRRM Food Attribution Percentages for Illnesses from Foodborne Campylobacter and Listeria monocytogenes.
[21] Bedford, T.; Quigley, J. & Walls, L. (2006). Expert Elicitation for Reliable System Design. Statistical Science, 21, 4, 428-450.
[22] Bonano, E. J. & Apostolakis, G. E. (1991). Theoretical foundation and practical issues for using expert judgments in uncertainty analysis of high-level radioactive waste disposal. Radioactive Waste Management and the Nuclear Fuel Cycle, 16, 2, 137-159.
[23] Bonano, E. J.; Hora, S. C.; Keeney, R. L. & von Winterfeldt, D. (1990). Elicitation and Use of Expert Judgment in Performance Assessment for High-Level Radioactive Waste Repositories. NUREG/CR-5411. Washington, DC: Nuclear Regulatory Commission.
[24] Bonano, E. J.; Hora, S. C.; Keeney, R. L. & von Winterfeldt, D. (1990). Elicitation and use of expert judgment in performance assessment for high-level radioactive waste repositories. Nuclear Regulatory Commission, NUREG/CR-5411, Washington, DC.
[25] Bond, C. E.; Gibbs, A. D.; Shipton, Z. K. & Jones, S. (2007). What do you think this is? "Conceptual uncertainty" in geoscience interpretation. GSA Today, 17, 11, 4-10.
[26] Booker, J. M. & Meyer, M. (1996). Elicitation and Analysis of Expert Judgment. Los Alamos National Laboratory, LA-UR-99-1659.
[27] Budescu, D. V. & Rantilla, A. K. (2000). Confidence in aggregation of expert opinions. Acta Psychologica, 104, 371-398.
[28] Campbell, P. R. (2002). Evaluating Forecast Error in State Population Projections Using Census 2000 Counts. U.S. Bureau of the Census, Population Division Working Paper No. 57.
[29] Cathers, C. A. & Thompson, G. D. (2005). Forecasting Short-Term Electricity Load Profiles. Cardon Research Papers.
[30] Chase, W. G. & Simon, H. A. (1973). The mind's eye in chess. In W. G. Chase (ed.), Visual Information Processing. New York: Academic Press, 215-281.
[31] Chen, Z. & Yang, Y. (2004). Assessing Forecast Accuracy Measures. Iowa State University.
[32] Chhibber, S. & Apostolakis, G. (1993). Some approximations useful to the use of dependent information sources. Reliability Engineering and System Safety, 42, 67-86.
[33] Christensen, C.; Christensen, R. & Huffman, M. D. (1985). Bayesian point estimation using the predictive distribution. The American Statistician, 39, 319-321.
[34] Claessens, M. (1990). An application of expert opinion in ground water transport. DSM Technical Report R 90 8840, University of Delft, the Netherlands.
[35] Clemen, R. T. & Reilly, T. (1999). Correlations and Copulas for Decision and Risk Analysis. Management Science, 45, 2.
[36] Clemen, R. T. & Winkler, R. L. (1985). Limits for the precision and value of information from dependent sources. Operations Research, 33, 427-442.
[37] Clemen, R. T. & Winkler, R. L. (1993). Aggregating Point Estimates: A Flexible Modeling Approach. Management Science, 39, 501-515.
[38] Clemen, R. T. & Winkler, R. L. (1999). Combining probability distributions from experts in risk analysis. Risk Analysis, 19, 187-203.
[39] Clemen, R. T. & Winkler, R. L. (1997). Combining Probability Distributions from Experts in Risk Analysis.
[40] Clemen, R. T.; Fischer, G. W. & Winkler, R. L. (2000). Assessing dependence: Some experimental results. Management Science, 46, 1100-1115.
[41] Cathers, C. A. & Thompson, G. D. (2006). Forecasting Short-Term Electricity Load Profiles. Sierra Southwest Cooperative Services, Inc., The University of Arizona, Cardon Research Papers, August 2006.
[42] Cooke, R. M. (1991). Experts in Uncertainty: Opinion and Subjective Probability in Science. New York: Oxford University Press.
[43] Cooke, R. M. (1994). Uncertainty in dispersion and deposition in accident consequence modeling assessed with performance-based expert judgment. Reliability Engineering and System Safety, 45, 35-46.
[44] Cooke, R. M. (2003). Book review of Elicitation of Expert Opinions for Uncertainty and Risks, by B. M. Ayyub. Fuzzy Sets and Systems, 133, 267-268.
[45] Cornish, E. (1977). The Study of the Future. World Future Society, Washington, D.C.
[46] Dalkey, N. C. (1970). The Delphi Method: An Experimental Study of Group Opinion. Technical Report RM-5888-PR, The Rand Corporation.
[47] Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571-582.
[48] Dawid, A. P. (1982). The Well-Calibrated Bayesian. Journal of the American Statistical Association, 77, 605-613.
[49] de Finetti, B. (1937). Foresight: Its logical laws, its subjective sources. In Kyburg & Smokler (eds.), Studies in Subjective Probability, 2nd ed., 53-118. Huntington, NY: Robert E. Krieger.
[50] de Finetti, B. (1964). La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré, 7, 1-68 (1937). English translation in Kyburg & Smokler (eds.), Studies in Subjective Probability, Wiley, New York.
[51] Dechow, P. M. & Sloan, R. G. (1997). Returns to contrarian investment: tests of the naïve expectations hypotheses. Journal of Financial Economics, 43, 3-28.
[52] DeGroot, M. H. (1988). A Bayesian view of assessing uncertainty and comparing expert opinion. Journal of Statistical Planning and Inference, 20, 295-306.
[53] DeWispelare, A. R.; Herren, L. T. & Clemen, R. T. (1995). The use of probability elicitation in the high-level nuclear waste regulation program. International Journal of Forecasting, 11, 5-24.
[54] Draper, D.; Pereira, A.; Prado, P.; Saltelli, A.; Cheal, R.; Eguilior, S.; Mendes, B. & Tarantola, S. (1999). Scenario and parametric uncertainty in GESAMAC: a methodological study in nuclear waste disposal risk assessment. Computer Physics Communications, 117, 142-155.
[55] Dumler, T. J. (2003). Rainfall and Farm Income. Risk and Profit Conference.
[56] Droguett, E. L. & Mosleh, A. Framework for Integrated Treatment of Model and Parameter Uncertainties. Submitted to Risk Analysis.
[57] Droguett, E. L. & Mosleh, A. Assessment of Model Uncertainty: Use of Model Performance Data. Submitted to Risk Analysis.
[58] Edwards, W. (1968). Conservatism in human information processing. In B. Kleinmuntz (ed.), Formal Representation of Human Judgment, 17-52. New York: John Wiley.
[59] Ehrman, C. M. & Shugan, S. M. (1995). The Forecaster's Dilemma. Marketing Science, 14, 2, 123-127.
[60] Einhorn, H. J.; Hogarth, R. M. & Klempner, E. (1977). Quality of group judgment. Psychological Bulletin, 84, 158-172.
[61] Ericsson, K. A.; Krampe, R. T. & Tesch-Römer, C. (1993). The role of deliberate practice in the acquisition of expert performance. Psychological Review, 100, 363-406.
[62] Ericsson, K. A. & Staszewski, J. J. (1989). Skilled memory and expertise: Mechanisms of exceptional performance. In Klahr & Kotovsky (eds.), Complex Information Processing: The Impact of Herbert A. Simon, 235-267. Hillsdale, NJ: Lawrence Erlbaum.
[63] Fischhoff, B. (1982). Debiasing. In Kahneman, Slovic & Tversky (eds.), Judgment under Uncertainty: Heuristics and Biases, 422-444. Cambridge University Press.
[64] Fischhoff, B. (1982). For those condemned to study the past: Heuristics and biases in hindsight. In Kahneman, Slovic & Tversky (eds.), Judgment under Uncertainty: Heuristics and Biases, 335-351. Cambridge University Press.
[65] Forrester, Y. (2005). The Quality of Expert Judgment: An Interdisciplinary Investigation. Doctor of Philosophy Dissertation, University of Maryland.
[66] French, S. (1985). Group Consensus Probability Distributions: A Critical Survey. In J. M. Bernardo et al. (eds.), Bayesian Statistics 2, 183-201. Amsterdam: North-Holland.
[67] French, S. & Ríos Insua, D. (2000). Statistical Decision Theory. London: Arnold.
[68] Fullerton Jr., H. N. Evaluating the 1995 BLS Labor Force Projections.
[69] Garthwaite, P. H.; Kadane, J. B. & O'Hagan, A. (2005). Statistical methods for eliciting prior distributions. Journal of the American Statistical Association, 100, 680-700.
[70] Genest, C. & McConway, K. J. (1990). Allocating the weights in the linear opinion pool. Journal of Forecasting, 9, 53-73.
[71] Genest, C. & Zidek, J. V. (1986). Combining Probability Distributions: A Critique and an Annotated Bibliography. Statistical Science, 1, 114-135.
[72] Genest, C. & Schervish, M. J. (1985). Modeling expert judgments for Bayesian updating. Annals of Statistics, 13, 1198-1212.
[73] Geomatrix Consultants (1998). Saturated zone flow and transport expert elicitation project. Deliverable Number SL5X4AM3. CRWMS M&O, Las Vegas, NV.
[74] Ghabayen, S. M. S.; McKee, M. & Kemblowski, M. (2006). Ionic and isotopic ratio for identification of salinity sources and missing data in the Gaza aquifer. Journal of Hydrology, 318, 1-4, 360-373.
[75] Gigone, D. & Hastie, R. (1997). Proper analysis of the accuracy of group judgments. Psychological Bulletin, 121, 149-167.
[76] Gilliland, M. (2002). Is Forecasting a Waste of Time? Supply Chain Management Review, July/August 2002.
[77] Goossens, L. H. J. & Harper, F. T. (1998). Joint EC/USNRC expert judgment driven radiological protection uncertainty analysis. Journal of Radiological Protection, 18, 4, 249-264.
[78] Goossens, L. H. J.; Cooke, R. M.; Hale, A. R. & Rodić-Wiersma, Lj. (2008). Fifteen years of expert judgement at TU Delft. Safety Science, 46, 234-244.
[79] Gustafson, D. H.; Shukla, R. K.; Delbecq, A. & Walster, G. W. (1973). A Comparative Study of Differences in Subjective Likelihood Estimates Made by Individuals, Interacting Groups, Delphi Groups, and Nominal Groups. Organizational Behavior and Human Performance, 9, 200-291.
[80] Hammitt, J. K. & Shlyakhter, A. I. (1999). The Expected Value of Information and the Probability of Surprise. Risk Analysis, 19, 1, 135-152.
[81] Hawkins, N. C. & Evans, J. S. (1989). Subjective estimation of toluene exposures: a calibration study of industrial hygienists. Applied Industrial Hygiene, 4, 3, 61-68.
[82] Helmer, O. (1977). Problems in futures research: Delphi and causal cross-impact analysis. Futures, 17-31.
[83] Hill, G. W. (1982). Group vs. individual performance: Are N+1 heads better than one? Psychological Bulletin, 91, 517-539.
[84] Hogarth, R. M. (1977). Methods for aggregating opinions. In Jungermann & DeZeeuw (eds.), Decision Making and Change in Human Affairs, 231-255. Dordrecht, Netherlands: Reidel.
[85] Hogarth, R. M. (1978). A note on aggregating opinions. Organizational Behavior and Human Performance, 21, 40-46.
[86] Hora, S. C. & Iman, R. L. (1989). Expert opinion in risk analysis: The NUREG-1150 methodology. Nuclear Science and Engineering, 102, 323-331.
[87] Hora, S. & von Winterfeldt, D. (1997). Nuclear waste and future societies: A look into the deep future. Technological Forecasting and Social Change, 56, 155-170.
[88] Hora, S. & Jensen, M. (2005). Expert panel elicitation of seismicity following glaciation in Sweden. SSI Report 2005:20, Swedish Radiation Protection Authority.
[89] Hynes, M. E. & Vanmarcke, E. H. (1977). Reliability of embankment performance predictions. In Mechanics in Engineering, 1st ASCE-EMD Specialty Conference, University of Waterloo.
[90] Johnson, A. C. & Thomopoulos, N. T. Tables and Characteristics of the Standardized Lognormal Distribution.
[91] Jouini, M. N. & Clemen, R. T. (1996). Copula models for aggregating expert opinions. Operations Research, 44, 444-457.
[92] Kadane, J. B. & Wolfson, L. J. (1998). Experiences in Elicitation. The Statistician, 47, 3-19.
[93] Kadane, J. B. & Winkler, R. L. (1988). Separating Probability Elicitation from Utilities. Journal of the American Statistical Association, 83, 402, 357-363.
[94] Kahn, H. & Wiener, A. J. (1967). The Year 2000: A Framework for Speculation on the Next Thirty-Three Years. London: Collier-Macmillan Limited.
[95] Kallen, M. J. & Cooke, R. M. (2002). Expert aggregation with dependence. In Bonano, Camp, Majors & Thompson (eds.), Probabilistic Safety Assessment and Management, 1287-1294. North-Holland, Amsterdam.
[96] Keeney, R. L. & von Winterfeldt, D. (1989). On the uses of expert judgment on complex technical problems. IEEE Transactions on Engineering Management, 36, 83-86.
[97] Klugman, S. F. (1945). Group judgments for familiar and unfamiliar materials. Journal of General Psychology, 32, 103-110.
[98] Krishnamurti, T. N. et al. (1999). Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285, 1548-1550.
[99] Lacke, C. (1998). Decision Analytic Modeling of Colorectal Cancer Screening Policies. Operations Research Program, North Carolina State University.
[100] Lannoy, A. & Procaccia, H. (2001). L'utilisation du jugement d'expert en sûreté de fonctionnement. Tec & Doc (in French).
[101] Libby, R. & Blashfield, R. K. (1978). Performance of a composite as a function of the number of judges. Organizational Behavior and Human Performance, 21, 121-129.
[102] Lichtenstein, S. & Fischhoff, B. (1980). Training for calibration. Organizational Behavior and Human Decision Processes, 26, 149-171.
[103] Lindley, D. V. (1985). Reconciliation of discrete probability distributions. In Bernardo, DeGroot, Lindley & Smith (eds.), Bayesian Statistics 2, 375-390. Amsterdam: North-Holland.
[104] Linstone, H. & Turoff, M. (1975). The Delphi Method: Techniques and Applications. Reading, Mass.: Addison-Wesley.
[105] McLaren, C. H. & McLaren, B. J. (2003). Electric Bill Data. Journal of Statistics Education, Online Edition, 11, 1.
[106] McIntosh, C. S. & Bessler, D. A. (1988). Forecasting Agricultural Prices Using a Bayesian Composite Approach. Southern Journal of Agricultural Economics.
[107] McKenna, S. A.; Walker, D. D. & Arnold, B. (2003). Modeling dispersion in three-dimensional heterogeneous fractured media at Yucca Mountain. Journal of Contaminant Hydrology, 62, 3, 577-594.
[108] McLean, I.; Phil, M.; Anderson, M. & White, C. (2003). The accuracy of guestimates by 11 Forensic Clinicians. Journal of the Royal Society of Medicine, 96, 497-498.
[109] Mendel, M. & Sheridan, T. (1989). Filtering Information from Human Experts. IEEE Transactions on Systems, Man & Cybernetics, 36, 6-16.
[110] Meyer, M. A. & Booker, J. M. (1991). Eliciting and Analyzing Expert Judgment: A Practical Guide. London: Academic Press.
[111] Miklas, M. P. J.; Norwine, J.; DeWispelare, A. R.; Herren, L. T. & Clemen, R. T. (1995). Future climate at Yucca Mountain, Nevada proposed high-level radioactive waste repository. Global Environmental Change, 5, 3, 221-234.
[112] Morgan, M. G. & Keith, D. W. (1995). Subjective judgments by climate experts. Environmental Policy Analysis, 29, 10, 468-476.
[113] Morgan, M. G. & Henrion, M. (1990). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, Cambridge.
[114] Morris, J. M. & D'Amore, R. J. (1980). Aggregating and Communicating Uncertainty. Pattern Analysis and Recognition Corp., 228 Liberty Plaza, NY.
[115] Morris, P. A. (1974). Decision analysis expert use. Management Science, 20, 1233-1241.
[116] Morris, P. A. (1977). Combining expert judgments: a Bayesian approach. Management Science, 23, 679-693.
[117] Morris, P. A. (1983). An axiomatic approach to expert resolution. Management Science, 29, 24-32.
[118] Morris, P. A. (1986). Observations on Expert Aggregation. Management Science, 32, 321-328.
[119] Mosleh, A. & Apostolakis, G. (1986). The Assessment of Probability Distributions from Expert Opinions with an Application to Seismic Fragility Curves. Risk Analysis Journal, 6, 4, 447-461.
[120] Mosleh, A. (1981). On the Use of Quantitative Judgment in Risk Assessment: A Bayesian Approach. Dissertation, University of California, Los Angeles.
[121] Murphy, A. H. & Winkler, R. L. (1977). Reliability of Subjective Probability Forecasts of Precipitation and Temperature. Applied Statistics, 26, 41-47.
[122] Nelsen, R. B. (1999). An Introduction to Copulas. Springer-Verlag, New York.
[123] O'Hagan, A. & Oakley, J. E. (2004). Probability is perfect, but we can't elicit it perfectly. Reliability Engineering & System Safety, 85, 239-248.
[124] O'Hagan, A. (1998). Eliciting expert beliefs in substantial practical applications. The Statistician, 47, 1, 21-35.
[125] Ouchi, F. (2004). A Literature Review on the Use of Expert Opinion in Probabilistic Risk Analysis. World Bank Policy Research Working Paper 3201.
[126] Parent, E. & Bernier, J. (2003). Encoding prior experts' judgments to improve risk analysis of extreme hydrological events via POT modeling. Journal of Hydrology, 283, 1-18.
[127] Pike, W. A. (2004). Modeling drinking water quality violations with Bayesian networks. Journal of the American Water Resources Association, 40, 6, 1563-1578.
[128] Quigley, M. A.; Chandramohan, D. & Rodrigues, L. (1999). Diagnostic Accuracy of Physician Review, Expert Algorithms & Data-derived Algorithms in Adult Verbal Autopsies. International Journal of Epidemiology, 28, 1081-1087.
[129] Ramachandran, G.; Banerjee, S. & Vincent, J. H. (2003). Expert Judgment and Occupational Hygiene: Application to Aerosol Speciation in the Nickel Primary Production Industry. Annals of Occupational Hygiene, 47, 6, 461-475. British Occupational Hygiene Society, Oxford University Press.
[130] Rantilla, A. K. & Budescu, D. V. (1999). Aggregation of Expert Opinions. Proceedings of the 32nd Hawaii International Conference on System Sciences.
[131] Rodier, C. J. (2005). Test of Model Error in Travel Forecasts: Verifying the accuracy of the land use model used in transportation and air quality planning: a case study in the Sacramento, California region. MTI Report 05-02.
[132] Rodríguez, J. L. (2005). Recent Applications of the Delphi Method in Social Sciences. Institute of Applied Business Economics, University of the Basque Country/Euskal Herriko Unibertsitatea (UPV/EHU).
[133] Sanders, N. R. (1992). Accuracy of judgmental forecasts. Omega: International Journal of Management Science, 20, 353-364.
[134] Savage, L. J. (1971). Elicitation of Personal Probabilities and Expectations. Journal of the American Statistical Association, 66, 783-801.
[135] Schmittlein, D. C.; Kim, J. & Morrison, D. G. (1990). Combining forecasts: Operational adjustments to theoretically optimal rules. Management Science, 36, 1044-1056.
[136] Schwartz, Z. & Cohen, E. (2004). Hotel Revenue Management Forecasting: Evidence of Expert-Judgment Bias. Cornell University.
[137] Shanteau, J. (2002). Domain differences in expertise.
[138] Slovic, P. (1972). From Shakespeare to Simon: Speculation and some evidence about man's ability to process information. Oregon Research Institute Research Monograph, 12, 2.
[139] Slovic, P. (1972). Information processing, situation specificity and the generality of risk-taking behavior. Journal of Personality and Social Psychology, 22, 128-134.
[140] Slovic, P. (1972). Psychological study of human judgment: Implications for investment decision making. The Journal of Finance, 27, 4, 779-799.
[141] Sniezek, J. A. & Henry, R. A. (1989). Accuracy and confidence in group judgment. Organizational Behavior and Human Decision Processes, 43, 1-28.
[142] Spiteri, A.; Torpiano, J.; Bailey, M.; Mercieca, V. & Grech, V. (2004). A comparison of clinical pediatric murmur assessment with echocardiography. Malta Medical Journal, 16, 4.
[143] Stark, K. C.; Wingstrand, A.; Dahl, J.; Møgelmose, V. & Lo Fo Wong, D. M. A. (2002). Differences and similarities among experts' opinions on Salmonella enterica dynamics in swine pre-harvest. Preventive Veterinary Medicine, 53, 7-20.
[144] Stekler, H. O. & Thomas, R. (2000). Evaluating BLS Labor Force, Employment, and Occupation Projections for 2000.
[145] Stiber, N. A.; Pantazidou, M. & Small, M. J. (1999). Expert system methodology for evaluating reductive dechlorination at TCE sites. Environmental Science and Technology, 33, 17, 3012-3020.
[146] Stiber, N. A.; Small, M. J. & Pantazidou, M. (2004). Site-specific updating and aggregation of Bayesian belief network models for multiple experts. Risk Analysis, 24, 6, 1529-1538.
[147] Stone, M. (1961). The opinion pool. Annals of Mathematical Statistics, 32, 1339-1342.
[148] Tennessee Valley Authority. Appendix B: Methodology and Results from Socioeconomic Modeling. Final Environmental Assessment.
[149] Tversky, A. & Kahneman, D. (1974). Judgment Under Uncertainty: Heuristics and Biases. Science, 185, 1124-1131.
[150] Vegelin, A. L.; Brukx, L. J. C. E.; Waelkens, J. J. & Van den Broeck, J. (2003). Influence of knowledge, training and experience of observers on the reliability of anthropometric measurements in children. Annals of Human Biology, 30, 1, 65-79.
[151] Walker, K. D.; Evans, J. S. & MacIntosh, D. (2001). Use of expert judgment in exposure assessment - Part 1. Characterization of personal exposure to benzene. Journal of Exposure Analysis and Environmental Epidemiology, 11, 308-322.
[152] Walker, K.; Catalano, P.; Hammitt, J. & Evans, J. (2003). Use of expert judgment in exposure assessment - Part 2. Calibration of expert judgments about personal exposures to benzene. Journal of Exposure Analysis and Environmental Epidemiology, 13, 1-16.
[153] Williams, A. M. & Ericsson, K. A. (2005). Perceptual-cognitive expertise in sport: Some considerations when applying the expert performance approach. Human Movement Science, 24, 283-307.
[154] Wilson, J. M. (1994). Network representations of knowledge about chemical equilibrium: Variations with achievement. Journal of Research in Science Teaching, 31, 1133-1147.
[155] Winkler, R. L. & Makridakis, S. (1983). The Combination of Forecasts. Journal of the Royal Statistical Society, Series A, 146, 150-157.
[156] Winkler, R. L. & Poses, R. M. (1993). Evaluating and combining physicians' probabilities of survival in an intensive care unit. Management Science, 39, 1526-1543.
[157] Winkler, R. L. (1968). The Consensus of Subjective Probability Distributions. Management Science, 15, 61-75.
[158] Winkler, R. L. (1981). Combining Probability Distributions from Dependent Information Sources. Management Science, 27, 479-488.
[159] Wisse, B.; Bedford, T. & Quigley, J. (2005). Combining expert judgments in the Bayes linear methodology. In Devictor, Moulin & Bolado-Lavin (eds.), Proc. CEA-JRC Workshop on the Use of Expert Judgment in Decision-Making. CEC, Aix-en-Provence.
[160] Wissema, G. (1982). Trends in technology forecasting. R&D Management, 12, 1, 27-36.
[161] Zajonc, R. B. (1962). A note on group judgments and group size. Human Relations, 15, 177-180.
[162] Zarnowitz, V. (1984). Business Cycles Analysis and Expectational Survey Data. NBER Working Paper 1378, National Bureau of Economic Research, Inc.
[163] Zio, E. & Apostolakis, G. E. (1996). Two methods for the structured assessment of model uncertainty by experts in performance assessments of radioactive waste repositories. Reliability Engineering and System Safety, 54, 225-241.