ABSTRACT

Title of dissertation: UNDERSTANDING INFORMATION USE IN MULTIATTRIBUTE DECISION MAKING
Jeffrey S. Chrabaszcz, Doctor of Philosophy, 2016
Dissertation directed by: Professor Michael R. Dougherty, Department of Psychology

An inference task is one in which some known set of information is used to produce an estimate about an unknown quantity. Existing theories of how humans make inferences include specialized heuristics that allow people to make these inferences in familiar environments quickly and without unnecessarily complex computation. Specialized heuristic processing may be unnecessary, however; other research suggests that the same patterns in judgment can be explained by existing patterns in encoding and retrieving memories. This dissertation compares and attempts to reconcile three alternate explanations of human inference. After justifying hierarchical Bayesian versions of three existing inference models, the three models are compared on simulated, observed, and experimental data. The results suggest that the three models capture different patterns in human behavior but, based on posterior prediction using laboratory data, potentially ignore important determinants of the decision process.

UNDERSTANDING INFORMATION USE IN MULTIATTRIBUTE DECISION MAKING

by Jeffrey Stephen Chrabaszcz

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2016

Advisory Committee:
Professor Michael Dougherty, Chair/Advisor
Professor D.J. Bolger, Dean's Representative
Professor L. Robert Slevc
Professor Tracy Riggins
Professor Alexander Shackman

© Copyright by Jeffrey S. Chrabaszcz 2016

Dedication

For and despite my son, Nicolas James Chrabaszcz.

Table of Contents

List of Tables
List of Figures
List of Abbreviations
1 Introduction
  1.1 Search
  1.2 Delta Inference
  1.3 Hypothesis Generation
    1.3.1 Summary
2 Bayesian implementations of inference models
  2.1 Search
  2.2 Delta Inference
  2.3 Hypothesis Generation
3 Cross-validation of models to simulated data
  3.1 Methods
  3.2 Results
  3.3 Discussion
    3.3.1 Summary
4 Prediction in a real-world data set
  4.1 Methods
  4.2 Results
  4.3 Discussion
    4.3.1 Summary
5 Modeling human inference: A novel behavioral experiment
  5.1 Methods
  5.2 Results
  5.3 Discussion
    5.3.1 Summary
6 General Discussion
  6.1 Psychological Plausibility
  6.2 Modeling Search Order
  6.3 Contamination
  6.4 Aggregation
  6.5 Summary
A Experimental Differences in Accuracy
Bibliography

List of Tables

1.1 Two fictional students and predictors of their probability of graduation.
3.1 Covariance matrices for both simulated ecologies.
3.2 Fixed parameters for data generation.
3.3 Summaries of models fit to data generated from each model using the first ecology.
3.4 Median fixed effects for all models fit to simulated data by generating model. γ is the probability of choosing consistent with the terminating model prediction (i.e., not failing to apply the model), µw is the average relative weight of CV and DR, and σw is the standard deviation of relative weight parameters. µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.
3.5 Summaries of models fit to data generated from each model using the second ecology.
4.1 Cues for the German Cities Task.
4.2 Model comparisons for HyGene, Search, and ∆I on the GCT.
4.3 Median fixed effects for all models fit to simulated participants with the GCT data. γ is the probability of choosing consistent with the terminating model prediction (i.e., not failing to apply the model), µw is the average relative weight of CV and DR, and σw is the standard deviation of relative weight parameters. µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.
4.4 Search order information by model.
4.5 Comparison of variance in outcome explained by cues in the ecologies from chapters 1 and 2 using multiple regression.
5.1 Frequencies of each stimulus for the test ecology.
5.2 Summary statistics for pony cue ecology.
5.3 Summary of multilevel logistic regression predicting accuracy using trial and varying both intercept and the effect of trial by participant.
5.4 Model comparisons for HyGene, Search, and ∆I on empirical data.
5.5 Median fixed effects for all models fit to empirical data. γ is the probability of choosing consistent with the terminating model prediction (i.e., not failing to apply the model), µw is the average relative weight of CV and DR, and σw is the standard deviation of relative weight parameters.
µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.
5.6 Model comparisons for HyGene, Search, and ∆I on the empirical data with fixed γ = .75.
5.7 Median fixed effects for all models fit to empirical data with fixed γ = 0.75. µw is the average relative weight of CV and DR, and σw is the standard deviation of relative weight parameters. µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.
A.1 Summary of multilevel logistic regression predicting accuracy using condition and varying intercept by participant. Intercept gives the average accuracy for the loss condition; the difference in accuracy for the gain condition is given by the Gain predictor.
A.2 Fixed effect estimates for a multilevel multinomial model predicting cue choice by time with varying effects by participant. Mean gives the mean of the marginal posterior distribution for each parameter, while the 95% confidence interval gives the 2.5% and 97.5% percentile samples for each marginal posterior distribution. pMCMC is an MCMC approximation of the p-value and gives the probability of observing an estimate of equal or greater magnitude given the estimated standard deviation centered at zero.

List of Figures

1.1 The lens model, from Brunswik (1952).
2.1 Graphical model of Search.
2.2 Graphical model of Delta Inference.
2.3 Graphical model of HyGene.
4.1 Density of tau distance between generating search orders and model search orders by participant, colored by model.
5.1 Comparison of pony drawings used in the learning phase.
5.2 Stimulus states for test phase.
5.3 Jittered scatterplot and logistic regression prediction for accuracy by trial during training. The intermediate tick marks on the y-axis show the average predicted accuracy for the first and last trials based on the multilevel logistic regression model in Table 5.3.
5.4 Distributions of first cue searched during the test phase for three example participants.
5.5 Probability of choosing each of the four cues first by source, faceted by participant (with the right bar showing empirical cue choice distributions). Two subjects omitted for space.
A.1 Boxplots for average participant accuracy by condition.
A.2 Probability of choosing each cue, averaged over subjects, over the course of the test trials. Error ribbon represents a single proportion standard error, √(p(1 − p)/N).
List of Abbreviations

AC      Threshold for subsetting episodic memory traces
β       In HyGene, activation of a cue
CC      Conditional echo content vector
∆       In ∆I, size of credible cue value difference
∆I      Delta Inference
γ       Probability of choosing counter to TTB
L       Learning rate
S       Similarity between a memory trace and probe
SA      Semantic activation
w       Relative weight parameter for combining CV and DR
CV      Cue Validity
DIC     Deviance Information Criterion
DR      Discrimination Rate
GCT     German cities task
HyGene  Hypothesis Generation
log L   Logarithm of the likelihood
MCMC    Markov chain Monte Carlo
SOC     Set of leading contenders
SSL     Strategy Selection Learning
TTB     Take-the-Best
WADD    Weighted Adding

Chapter 1: Introduction

An inference task is one in which some known set of information is used to produce an estimate about an unknown quantity. People make inferences all the time: any judgment based on indirect information about an outcome requires an inference. Inferences guide important decisions. Which stock is more likely to increase? Which applicant is more likely to improve a business? Better understanding of inferences would allow us to influence and improve such decisions. Psychologists have been interested in this problem for many years. The most recent resurgence in interest, motivated by Gigerenzer and Goldstein (1996), had garnered over 2,400 citations at the time of writing. One important concept from Gigerenzer and Goldstein (1996) is that people use only a subset of the available information when making an inference. Despite the volume of intervening research, the field is still in disagreement over the precise process used to select the information used in a given inference. This dissertation compares three hierarchical Bayesian implementations of existing models of inference. Comparing these three models gives insight into the problem of selecting information, both by demonstrating the effect this has on the ultimate inferences and by illuminating the differences between existing theories.

One of the first models designed to account for how people make inferences of the type described above was the lens model proposed by Brunswik (1952, 1955). In the lens model, knowledge about a judgment is encoded as a set of discrete cues. To make an inference, relevant cues are weighted by importance and combined, akin to making predictions with a linear model. Figure 1.1 illustrates the lens model. The distal variable on the left is the quantity that a person intends to predict. The distal variable is unknown by the decision maker, who instead has access to information about the distal variable via proximal-peripheral cues. These cues are in turn related to the distal variable by ecological validities, correlations between cue values and the distal variable. While rational decision makers would combine the cues according to their ecological validities to form a central response (i.e., make an inference), individuals are not always perfectly calibrated to a given environment. People combine the cues according to utilization weights, simultaneously bringing all available information to bear on a particular inference. Correspondence between inferences and the environment is indexed by the functional validity. Distal variables are not necessarily determined by cues, so even perfect cue utilization could lead to decision errors. As an example, imagine a person is asked to choose which of two doctoral students is more likely to graduate.
Probability of graduation is an unknown quantity, but it is likely related to some observable, intermediate information like number of publications, grade point average, and the advisor's number of previously graduated students. Assuming the person in question uses the lens model, he or she would combine these pieces of information about each student, weight them according to their utilization weights, and claim the higher weighted sum (Student B) has a higher probability of graduating (Table 1.1).

Student   Publications   GPA   Advisor   Weighted Sum
A         5              3.2   5         4.5
B         4              4.0   10        5.2
Weight    0.5            0.3   0.2

Table 1.1: Two fictional students and predictors of their probability of graduation.

One limitation of the lens model is that it does not accommodate information processing constraints imposed on the decision maker. For example, in many real world decision tasks, the decision maker is required to retrieve decision-relevant information from memory. The output of this memory retrieval process, and the inherent limitations of working memory, therefore constrain what information a decision maker brings to bear on any particular decision.

Figure 1.1: The lens model, from Brunswik (1952).

The lens model is just one of many weighted adding (WADD) models of inference (Anderson, 1990; Hammond, 1990). The class of weighted adding models all assume that cue values are multiplied by some set of weights and summed to produce an estimate of the judged outcome; they differ primarily in the way cue weights are calculated. This view of cognition is prevalent even outside of psychology (Yoon and Hwang, 1995; McCloskey, 1998). Meehl (1954) showed that perfect application of the weighted sums of cues, calculated by regressing a distal variable on relevant predictors, out-performs human clinical judgments. Many follow-up studies demonstrate that a variety of factors, including the type of environment, the availability of feedback, and the time spent learning, can affect the application of accurate cue weights (Karelaia and Hogarth, 2008).

One major limitation of this class of models is the computational demand placed on the individual. Cognitive demands for WADD models include the need to retrieve information about the decision environment from memory with very high accuracy (Gigerenzer et al., 1991), the need to compute cue validities (Dougherty et al., 2008), and the need to aggregate information across multiple cues (Newell, 2005). People also operate under scarcity, both of the time they have to gather information and of the resources available to remember and process information relevant to a decision. Though regression weights are the informationally optimal way to aggregate information in WADD, calculating regression-equivalent weights may require substantial cognitive processing (though see Chater et al., 2003). Alternative models reduce the computational burden of calculating and combining cue weights. Dawes (1979) showed that even improperly specified linear models, which preserve the valence of cue weights but mis-specify the exact weight, perform better than human judges. While improper linear models reduce the computational burden of cue weight generation, they still entail exhaustive search of relevant information. The exhaustive search process on its own may exceed human cognitive capacity. Herbert Simon argued that rationality for an actor with limited resources is bounded rather than absolute (Simon, 1955). A decision maker will always have competing goals.
At some point the cost of increasing predictive accuracy relative to one goal will interfere with a competing goal. Imagine buying a car, which is a large investment and important to many people for a variety of reasons. Though finding an optimal car is important, only finite time can be spent researching different makes and models of car, to say nothing of locating a specific, desired car that is available for purchase. Instead, Simon's bounded rationality suggests that a buyer will find a car that meets his or her needs well enough while also limiting the time spent researching and locating a car. Since then, a plethora of research shows that people do fail to maximize the single goal of accuracy or utility. In the laboratory, participants satisfice rather than maximize expected returns when buying information in an incentive-compatible economic decision task (Bowen and Qiu, 1992; Fellner et al., 2009). These effects are reflected in affective responses to decisions. Self-reported maximizers, compared with satisficers, have more regret and are less satisfied with economic game outcomes (Schwartz et al., 2002); maximizers also report worse life outcomes than satisficers (Parker et al., 2007). People seem to make social judgments based on heuristics (Rand et al., 2014), which are satisficed equivalents of moral rules (Gigerenzer, 2010). Computer simulations show that satisficing can lead to good performance in economic games (Stirling and Goodrich, 1999).

The concept of bounded rationality has been applied to models of inference. Gigerenzer and Goldstein (1996) proposed that people have a toolbox of fast and frugal heuristics that are adapted to different decision environments. Their theory reconciles the concepts of bounded rationality (Simon, 1955) and ecological validity (Brunswik, 1955) by assuming that people match their computationally efficient heuristics to the current decision environment (Gigerenzer et al., 1991; Payne et al., 1988, 1992). According to this theory, individuals select among available decision heuristics and apply one that fits the inferential environment while minimizing cognitive demands when confronted with the need to make an inference. Perhaps the most widely studied among these fast and frugal heuristics is take-the-best (TTB).

Take-the-best begins with a structure similar to the lens model. Information relevant to an inference is organized into discrete cues. TTB differs from earlier rational models at this point, however; the cues are then ordered by cue validity (CV) rather than combined to produce a central response. Instead of bringing all information to bear on a given problem, TTB searches sequentially through cues and makes an inference on the first discriminating cue. For example, the earlier inference about which student has a higher chance of graduating can be made with TTB. Assuming the weights are CVs, TTB starts with the most valid cue (number of publications) and compares the values for each alternative. In this example, Student A has published more papers, so TTB would infer that Student A is more likely to graduate than Student B. TTB is frugal relative to WADD because it can potentially ignore most of the relevant cues. Only when alternatives are tied on the first cue will TTB utilize additional information. If both students had published the same number of papers, TTB would look to see whether one had a higher GPA.
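To make the contrast concrete, the following R sketch applies both rules to the two students in Table 1.1. The data and weights come from that table; the function names and the tie-breaking guess are illustrative choices, not part of any published model.

```r
# Cue values for the two students in Table 1.1
students <- data.frame(pubs = c(5, 4), gpa = c(3.2, 4.0), advisor = c(5, 10),
                       row.names = c("A", "B"))
weights <- c(pubs = 0.5, gpa = 0.3, advisor = 0.2)

# WADD: weight every cue, sum, and choose the larger weighted sum
wadd <- function(cues, w) {
  rownames(cues)[which.max(as.matrix(cues) %*% w)]
}

# TTB: search cues in descending weight order, decide on the first
# cue that discriminates, and guess only if every cue is tied
ttb <- function(cues, w) {
  for (cue in names(sort(w, decreasing = TRUE))) {
    if (cues["A", cue] != cues["B", cue])
      return(rownames(cues)[which.max(cues[[cue]])])
  }
  sample(rownames(cues), 1)
}

wadd(students, weights)  # "B": weighted sums are roughly 4.5 vs. 5.2
ttb(students, weights)   # "A": publications, the most valid cue, decides
```

Note that the two rules disagree here: the single most valid cue favors Student A, while the full weighted sum favors Student B.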
While other cue quality metrics could be used to order cues, the most commonly assumed method of cue ordering is to search through cues based on CV. CV is the probability that an alternative has a higher criterion value given that it also has a higher cue value (Gigerenzer and Goldstein, 1996), and is contrasted with discrimination rate (DR), the probability that two objects will have unequal criterion values given that they have unequal cue values.

CV = p(A > B | cueA > cueB)    (1.1)
DR = p(A ≠ B | cueA ≠ cueB)    (1.2)

CV is practically bound between 0.5 and 1, since cues with CV of less than 0.5 are reverse-coded. In our graduating student example, if advisor's number of previous students had a CV of greater than 0.5, the reciprocal would have a CV of less than 0.5, indicating that among pairs of students, the one with an advisor who advised fewer students is more likely to graduate. DR gives the probability that a cue could be used to discriminate between two alternatives. DR does not specify that an alternative will have a higher criterion value contingent on cue value, only the probability that a cue/criterion pair is not tied. For the dichotomous cue values assumed in TTB, DR is bound between 0 and 0.5 and is negatively correlated with CV.
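Equations 1.1 and 1.2 translate directly into code. The sketch below estimates both metrics for a single cue from a set of objects; the function name and pairing scheme are mine, and the conditional definitions follow the equations exactly as printed above.

```r
# Estimate CV (Equation 1.1) and DR (Equation 1.2) for one cue from
# vectors of cue values and criterion values across objects
cue_metrics <- function(cue, criterion) {
  pairs <- combn(length(cue), 2)                       # all unordered object pairs
  dc <- cue[pairs[1, ]] - cue[pairs[2, ]]              # cue differences
  dy <- criterion[pairs[1, ]] - criterion[pairs[2, ]]  # criterion differences
  disc <- dc != 0                                      # pairs the cue discriminates
  c(CV = mean(sign(dy[disc]) == sign(dc[disc])),       # higher cue, higher criterion?
    DR = mean(dy[disc] != 0))                          # unequal criterion given unequal cue
}
```

For a cue with CV below 0.5, reverse-coding amounts to flipping the cue's values before calling the function.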
Many factors influence the application of TTB and other one-reason decision making algorithms. Martignon and Hoffrage (1999) show that a non-compensatory environmental structure allows TTB to approximate the accuracy of WADD despite purportedly requiring fewer computational resources. Ordered search produces models that are non-compensatory; no combination of later cues can compensate for the decision implied by an earlier discriminating cue. Consequently, ordered search models work particularly well in environments where potential cues are highly correlated or where one strong cue overwhelms the explanatory power of other cues (Lee and Zhang, 2012; Todd and Dieckmann, 2004). Thus, TTB performs best in environments where the first cue is most important, either because it explains more variance in the criterion than any combination of the following cues or because it partially encodes the same information as other cues. Empirical evidence shows additional constraints on decision heuristic use. Participants in a laboratory task are more likely to use TTB when additional information is costly, when cue validities are explicitly given rather than learned by trial-and-error, and when the environment is deterministic (Newell and Shanks, 2003). According to evidence from eye-tracking, participants use an ordered search heuristic with easily-accessible cues but use a compensatory strategy when cue information is more difficult to retrieve (Platzer et al., 2014).

Even experiments assuming a weighted-adding model for information aggregation can suggest satisficing. Newell et al. (2009) tested participants in a four-cue decision environment with the goal of predicting whether the share price for an unknown company would increase based on the values of those four cues. After each decision, participants got feedback on the trial based on their assigned conditions. In one condition, participants saw the probability of the share price increasing based on the observed cue pattern. The other condition simply saw a message stating whether the share price did or did not increase for that trial. The long-run frequencies of the dichotomous messages from the latter condition matched the stated probabilities seen in the former condition, so that information was held constant between conditions; only the format of the information differed. While this manipulation was sufficient to cause better performance for the probability-feedback group, the advantage disappeared in a second study. Rather than providing feedback after each trial, participants saw either probabilistic or dichotomous information for each cue at each trial. That is, for each cue, participants saw either the probability of the share price increasing for the current value of the cue, or a dichotomous increases/decreases message paired with each cue value. Though the actual probabilities contained more information than the dichotomies, information that would enable higher performance if used rationally, both conditions performed at approximately the same level. This pattern of results suggests that telling participants the unit value of a cue prior to learning was sufficient to remove any effect of metric feedback on cue utilization. The environment for this series of studies could be predicted with relatively high accuracy (80%) using only unit weights for each cue, so further specification of cue weights may not have justified the additional effort to discover relative cue weights. The use of unit weights is rather simple, requiring only tallying of cues in favor of each alternative. Use of specific cue weights, on the other hand, required adding multiple rational numbers together, a process that may be difficult for individuals even without the added, implicit time pressure of the task. Participants may have been inclined to achieve satisfactory performance by using the simpler rule, since the more complex rule yields only slightly higher accuracy and is much more taxing to implement. This study reveals either a preference for a simple decision rule that involves tallying unit weights for cues (Dawes and Corrigan, 1974; Dawes, 1979), or shows that participants satisfice by ignoring additional information, having achieved sufficient performance by their individual standards.

Some evidence suggests that TTB is a viable model of human inference. Hogarth and Karelaia (2007) investigated a variety of decision models (including TTB and WADD) across a range of decision environments. They found that predictive success of different decision models depends on the structure of the underlying ecology. Assuming a toolbox of decision rules, accuracy is maximized by applying tools that match the ecology. For example, they find that TTB is more accurate than alternate models in non-compensatory environments. Another study showed that TTB was among the multiple decision tools necessary to capture variability in participant responses in a decision task. Cognitive toolbox models have been criticized for allowing unlimited flexibility: a researcher can always add another tool to account for unexplained variance or patterns in decisions (Glöckner et al., 2010). Scheibehenne et al. (2013) fit hierarchical Bayesian models to data from a number of different experiments to validate the idea of a cognitive toolbox and to demonstrate a method of comparing toolbox models while accounting for flexibility in inferences. They find that, across many of these experiments, the combination of TTB and WADD provides the most parsimonious account of participant choices.
Despite the normative success of fast and frugal heuristics, they may not describe the actual decision process that people use. Hilbig et al. (2010) proposed the nonsensical alphabet heuristic and showed that its use in a decision task comparing city populations produced results comparable to the recognition and fluency heuristics. They argued that similar performance between a heuristic and participants on a task is insufficient to claim that participants use that heuristic. Instead, other aspects of the decision process must be considered. In TTB, for example, participants must demonstrably search cues in the same order as TTB and also make similar inferences in order to claim that TTB is applied. Despite TTB's simplicity, Dougherty et al. (2008) criticize the calculation of cue validity as implausible. They argue that an automatic event-counter is unsupported by existing evidence in frequency encoding and that cue validity requires people to remember the absence of information, conflicting with logic and memory research.

Some researchers argue that TTB is part of an ecologically rational toolbox of decision algorithms that are selected and used as a function of fit to the environment. While the contents of the toolbox are debated, most of these proposals include some mixture of simultaneous and sequential search tools that vary the use of cue weights. For example, people could combine all information (simultaneous search) either by weighting cues by their CV (WADD) or simply by aggregating cues with unit weights (Dawes, 1979). Though TTB requires search of cues ordered by their decreasing CV, search could also be in random cue order to limit the computational burden of calculating cue orders (Gigerenzer and Goldstein, 1996; Gigerenzer and Todd, 1999).

A number of studies alter the proposed WADD and TTB models originally compared in Gigerenzer and Goldstein (1996). One proposed change is allowing a probability of guessing rather than applying the specified model (Bergert and Nosofsky, 2007; Lee and Newell, 2011). This change accounts for the high variability in subject responses to most behavioral tasks. No single decision algorithm captures the responses made by participants in observed decision environments, in part because participants appear to be inconsistent in applying a given decision rule. Another change alters the function used to weight or order cues. Newell et al. (2004) show superior prediction in many environments using success, a cue value metric defined as CV times DR, added to the product of one minus the discrimination rate and the probability of a correct choice when guessing.

Success = CV · DR + (1/2)(1 − DR)    (1.3)

This method combines the probability that a cue will discriminate between alternatives with the probability of guessing and the probability of choosing the correct alternative given that a cue discriminates, producing an aggregate measure of single-cue usefulness.
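Equation 1.3 is a one-line computation; a sketch, reusing the output of the hypothetical cue_metrics() function from the earlier block:

```r
# Success (Equation 1.3): expected single-cue accuracy, combining accuracy
# when the cue discriminates (CV * DR) with guessing otherwise ((1 - DR) / 2)
success <- function(cv, dr) cv * dr + 0.5 * (1 - dr)
```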
Others suggest meta-heuristics to choose among heuristics in the toolbox for application in a given environment. Strategy Selection Learning theory (SSL) is one framework for comparing the use of simultaneous and sequential information search (Rieskamp and Otto, 2006). SSL claims that individuals learn which decision heuristic is best fit to an environment based on repeated feedback. One strength of SSL is that it encodes the heuristic toolbox, fully accounting for the flexibility of allowing many possible decision heuristics. In SSL, each decision heuristic is fully encoded as a model. SSL assumes that, over a number of learning trials, participants compare the predictive accuracy for each model and selectively reinforce models that provide higher accuracy within an environment. When a simulated learner is allowed to learn to apply either WADD or TTB over repeated feedback trials, SSL yields higher accuracy (measured by percent correct) than WADD, TTB, or a memory-like categorization model. This evidence has been expanded to suggest that certain environmental characteristics occupy a cognitive niche that predisposes decision makers to use one of a number of decision heuristics (Marewski and Schooler, 2011).

Based on the evidence reviewed above, it is obvious that there are competing models of how cues are generated and ordered in the context of inference tasks. In what follows, I outline three contemporary models of this process, which will then serve as the basis for the remainder of this dissertation.

1.1 Search

Cue metrics like CV and DR, or aggregation methods like TTB and WADD, may occupy the ends of two spectra used in decision making heuristics. While the success metric gives an optimal method for combining CV and DR, participants show variability in preferred search order that may be related to preference for valid or discriminating cues. Similarly, people may apply WADD or TTB in the same decision environment. Lee and Newell (2011) developed a pair of hierarchical Bayesian models, called Search and Stop, to describe individual differences in cue ordering and search termination. Search and Stop are a complementary pair of models that determine the order and number of searched cues based on compromises between CV and DR and between all-reason and one-reason decision making. The Search model assumes that participants order cues and make inferences similar to TTB. At the participant level, Search includes two parameters that distinguish it from TTB: γ and w. The γ parameter indexes the probability of choosing counter to the model prediction. TTB is normally a deterministic model: participants are assumed to choose whichever alternative has an earlier discriminating cue. Allowing for errors in applying the TTB model can reduce the penalty of incorrect estimates for participants who choose inconsistently or are otherwise poorly fit by the TTB model. The w parameter allows participants to weight CV and DR: weight = CV · w + (1 − w) · DR. Participants then search cues by ranking weight from largest to smallest. The Search model also includes hyperparameters for the mean and standard deviation of w. Partially pooling estimates of w in this case simultaneously improves individual estimates of w and summarizes the individual differences in search order. The Search model in isolation assumes that participants apply an error-prone TTB to make decisions but allows for individual differences in search order, as in the sketch below.
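In R, the ordering mechanism reduces to a one-line weighted compromise between the two cue metrics. In this sketch, v and d stand for the vectors of cue validities and discrimination rates; the variable names are mine.

```r
# Search: order cues by a participant-specific compromise between CV and DR
search_order <- function(v, d, w) {
  order(w * v + (1 - w) * d, decreasing = TRUE)  # weight = CV*w + (1-w)*DR
}

search_order(v = c(0.66, 0.67, 0.51), d = c(0.40, 0.50, 0.45), w = 0.9)
# a participant with w near 1 orders cues almost purely by validity;
# the same participant with w near 0 would order them by discrimination
```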
The Stop model captures differences between application of TTB and WADD by participant while assuming a fixed search order for cues. The model assumes that participants either apply TTB with probability θ or WADD with probability 1 − θ. Though more complicated than either WADD or TTB, a comparison of a modified Stop model with SSL shows that the complexity of SSL is almost never justified. Based on minimum description length, an information-theoretic measure of model complexity that balances fit and parsimony, Stop is preferred to stochastic WADD and TTB across a range of plausible error rates (Newell and Lee, 2011). A similar comparison has not been performed for the Search model with alternative models of stochastic search order.

One way Search and Stop prevail is by allowing individual differences in decision strategy. By including hierarchical structure allowing cue ordering and model selection to vary by participant, the models account for variation that could otherwise be attributed to inconsistency in decision heuristic application. Compared with SSL, Stop is able to model individuals' varying propensity to choose randomly or to prefer additional information. Stop also avoids the problem of fully encoding both TTB and WADD separately by adding a parameter to generalize between the alternative heuristics. Search and Stop are only tested on a single environment at a time, however, and can only account for individual differences in weighting of CV and DR or WADD and TTB. If decision making varies along any other dimensions, the Search and Stop models will be insufficient.

1.2 Delta Inference

The original conception of TTB operates on dichotomous cues, so all cue comparisons are either equal or differ by one. Luan et al. (2014) proposed Delta Inference (∆I) as an elaboration on TTB that allows for continuous cue values. This slight change alters the potential flexibility of TTB. TTB operates by choosing an alternative based on a single discriminating cue, regardless of the discrepancy between cue values for the alternative choices. In ∆I, the stopping rule of TTB is amended to stop cue search only when cue values for alternatives differ by more than a certain amount, ∆. This potentially allows search to continue despite a discriminating cue, when the difference between cue values is smaller than ∆, allowing some flexibility to accommodate compensatory ecological structures. Note that this is still one-reason decision making. While a mere difference may not be sufficient to motivate a decision in ∆I, a passed-over cue has no bearing on the decision process during later cue consideration. This is different from a change like that in the Stop model, which weights the number of cues to aggregate in a decision (Lee and Newell, 2011), or other models which assume that both TTB and WADD are accessible tools (Newell and Lee, 2011; Scheibehenne et al., 2013; Rieskamp and Otto, 2006). Though Luan et al. (2014) show that a ∆ of 0 is best on average, they do not explore the fitting of ∆ for subjects, environments, or cues. ∆I has also not been compared to human performance, so its predictive validity for real decisions with non-zero ∆ parameters (where ∆ = 0 is equivalent to TTB) is unknown.
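A minimal sketch of the amended stopping rule, assuming cue values have already been sorted into search order; delta can be a single value or one value per cue, and the guessing fallback is illustrative.

```r
# Delta Inference: stop at the first cue whose values differ by more than
# its delta; smaller differences count as ties and search continues
delta_inference <- function(cues_a, cues_b, delta) {
  delta <- rep(delta, length.out = length(cues_a))
  for (k in seq_along(cues_a)) {
    d_k <- cues_a[k] - cues_b[k]
    if (abs(d_k) > delta[k])
      return(if (d_k > 0) "A" else "B")  # one-reason decision on cue k
  }
  sample(c("A", "B"), 1)                 # no cue clears its delta: guess
}

delta_inference(c(0.4, 1.9), c(0.5, 0.2), delta = 0.3)
# returns "A": TTB (delta = 0) would have stopped at the first cue and
# chosen B, but the 0.1 difference is below delta, so the second cue decides
```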
1.3 Hypothesis Generation

Another decision modeling framework comes from the Hypothesis Generation (HyGene) model (Thomas et al., 2008). While not a decision making model per se, HyGene is a model of memory search based on MINERVA-2 (Hintzman, 1984). In addition to being consistent with memory research, HyGene has accurately modeled other psychological phenomena, including subadditivity (Dougherty et al., 1999) and visual search (Buttaccio et al., 2015). HyGene requires little substantial alteration to produce decisions on a paired comparison task. Modifying HyGene to make inferences provides an opportunity to evaluate existing decision rules in the context of a plausible theory of memory (a context missing in Search and violated by most fast and frugal heuristics).

In the HyGene model, memory is divided into episodic memory, semantic memory, and working memory. Episodic memory contains a memory trace, a vector of features taking the values 0, −1, or 1, for each event or experience. The traces in episodic memory are subject to degradation through forgetting and interference, governed by a learning rate, L. Traces in episodic memory also encode frequency information in the environment: events that occur and are encoded more frequently appear proportionally more often than less frequent events. In contrast, semantic memory encodes each potential outcome only once, regardless of the frequency of any individual event. For example, an emergency room doctor is likely to have diagnosed influenza much more often than smallpox. The doctor's episodic memory would contain a large number of feature vectors corresponding to influenza and few if any for smallpox, but each would appear a single time in semantic memory. In HyGene, working memory is a constraint on the number of semantic memory traces that can be considered as hypotheses at one time.

Hypothesis generation is motivated by a probe, an event about which a hypothesis is necessary. To continue the example, the symptoms of a patient would act as a probe. The HyGene model assumes that people decide among hypotheses by first probing episodic memory to determine similarity between each event and the probe. Mathematically, this is accomplished by calculating the dot product between the probe and each memory trace:

S_i = (Σ_{j=1}^{N} P_j T_ij) / N_i,    (1.4)

where P and T are a probe and trace of length N, i indexes the traces in episodic memory, and N_i normalizes by the number of features contributing to the comparison. The cube of similarity (S_i³) is then compared to A_C, the latter being a free parameter in the model. A_C acts as a cutoff similarity value to limit search of memory to relevant traces. All traces with cubed similarities higher than A_C are used to generate a hypothesis, while all traces with lower similarities are ignored for subsequent calculations. After identifying the relevant subset of memory, HyGene creates a conditional echo content vector (CC) using the following formula:

CC_j = Σ_{i=1}^{K} S_i³ T_ij.    (1.5)

Each of the K retained traces in episodic memory is multiplied by its cubed similarity and the resulting vectors are summed element-wise. The vector result of this process, normalized by dividing all values by the absolute value of the largest value in the vector, is an "unspecified probe" which combines the diagnostic information in the probe with base rate information from episodic memory. The dot product of this unspecified probe and each entry in semantic memory yields a semantic activation (SA) for each semantic trace. Traces with SA higher than zero then enter the set of leading contenders (SOC), a capacity-limited proxy for working memory. The entire search process (activate a subset of memory, create an unspecified probe, generate semantic activations, and populate the SOC) repeats for a pre-determined number of iterations. The end result is a short-term store (the SOC) filled with hypotheses about the probe and their associated activations from semantic memory.
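The episodic stage of this process (Equations 1.4 and 1.5) is compact enough to write out numerically. The traces, probe, and threshold below are invented for illustration; N_i counts the features that are nonzero in the probe or the trace, following MINERVA-2.

```r
# Episodic memory: one trace per row, features coded -1, 0, or 1
M <- rbind(c( 1, 1, 0, -1),
           c( 1, 0, 1, -1),
           c(-1, 1, 0,  1))
probe <- c(1, 1, 0, -1)

# Equation 1.4: trace-probe similarity, normalized per trace
N <- rowSums(M != 0 | matrix(probe, nrow(M), ncol(M), byrow = TRUE) != 0)
S <- as.vector(M %*% probe) / N

# Retain only traces whose cubed similarity clears the threshold A_C
A_C <- 0.2
keep <- S^3 > A_C

# Equation 1.5: conditional echo content, then normalization by the
# largest absolute entry to form the unspecified probe
CC <- colSums(S[keep]^3 * M[keep, , drop = FALSE])
CC <- CC / max(abs(CC))
```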
The HyGene process requires little change to search for cue orders based on the contents of memory. A probe of each cue value could be used to search memory, returning a short-term buffer's worth of cues that predict the highest values on the criterion for a given decision environment, along with their activations. Activation for each cue should condense DR and CV into a single measure and mimic success or Search model results. Thus, HyGene potentially gives a psychologically plausible method for calculating cue orders. This dissertation includes modeling studies that test this intuition and evaluate the necessity of complicated decision rules and metric derivation beyond memory processes. For example, calculation of CVs and cue ordering in general may be obviated by instead relying on emergent properties of episodic memory search. One might also reduce the stochasticity of WADD and TTB in Search by allowing cue orders to be determined by memory search, which is already a stochastic process.

1.3.1 Summary

Search, Delta Inference, and HyGene are models that seek to explain decision making behavior at different levels of analysis. Search captures abstract, individual differences in search order and error of strategy application. Though initial work with ∆I focused on average values of ∆ across environments, ∆I could be adapted to investigate whether individual differences in decision making are limited to differences in ∆, the difference between cue values necessary to motivate a decision. The HyGene framework contains different restrictions, with free (and potentially varying) parameters limited to those involved in memory processes. These levels of explanation coincide with David Marr's levels of analysis (Marr, 1982). Search exists at the computational level of analysis, focusing primarily on the types of information that are required to accurately capture patterns in cue search and judgment. ∆I parameterizes specific components of the algorithm used to combine information, while HyGene focuses on details of the implementation of a decision algorithm in the context of a subordinate memory system. Unfortunately, these models have all been evaluated in different ways. Search exists as a hierarchical Bayesian model with parameters that are partially pooled by individual, while ∆I and HyGene are both fit with maximum likelihood methods that average over individual differences. This dissertation formulates all three models as hierarchical Bayesian models to allow direct comparison across these models and corresponding levels of analysis.

Chapter 2: Bayesian implementations of inference models

Comparing Search, ∆I, and HyGene requires both common implementation methods and shared data. The present chapter includes descriptions of hierarchical Bayesian implementations for models of inference based on Search, ∆ Inference, and HyGene. Hierarchical Bayesian modeling allows natural extension to include individual differences, summarizes these differences with fixed parameters, and provides a very general method for relating cognitive models to observed data (Lee, 2010). There is a recurring structure present in all three models under consideration: the w for Search and ∆I, ∆ for ∆I, and β for HyGene all vary by participant but are drawn from a distribution with parameters that are fixed across participants.
Drawing participant parameters from shared distributions allows the models to bring the maximum information available to bear on estimating each parameter (Lee, 2008), and represents a compromise between fully pooled estimates, which assume that participants are identical, and unpooled estimates, which assume that participants are entirely unique (Gelman and Hill, 2006).

2.1 Search

The Search model, already a hierarchical Bayesian implementation, is drawn almost directly from Lee and Newell (2011). Given the focus on cue ordering, which is the major difference between Search and the comparison models, I ignore the Stop model entirely. The stopping rule in Stop is modeled independent of search order: Search determines search order by individual, after which Stop subsequently determines the number of cues searched. These studies focus on how the models order cues differently and whether this influences judgments, though later work could examine the interaction with varied stopping rules.

The Search model is completely described in Figure 2.1. The order of cue search is governed by a weighted combination of CV and DR. The individually-varying relative weight for CV is drawn from a bounded normal distribution with both mean and variance varying as beta distributions with α = β = 1 and bounds at .01 and .99. The DR weight is 1 minus the CV weight. Based on this balance of CV and DR, the cues are searched sequentially until any difference between the cues allows the model to stop and choose one of the two alternatives. The model then selects the TTB-chosen alternative with probability γ or the other alternative with probability 1 − γ. This allows the model to account for the fact that human participants very rarely apply a single decision rule consistently. In the event that two alternatives are exactly tied on all cue values, the model chooses between the alternatives with equal probability for each outcome.

µ ~ Beta(1, 1)
σ ~ Beta(1, 1)
w_i ~ N(µ, σ), w_i ∈ (a, b), 0.01 < a < b < 0.99
s_i = Rank(w_i · v + (1 − w_i) · d)
γ ~ Uniform(0.5, 1)
t_ij = γ      if TTB_{s_i}(a_j, b_j) = a
       1 − γ  if TTB_{s_i}(a_j, b_j) = b
       0.5    otherwise
y_ij ~ Bernoulli(t_ij)
(plates: i subjects, j problems)

Figure 2.1: Graphical model of Search.

In a small departure from earlier work, I have implemented the Search model with continuous, rather than dichotomous, cue values. While changing the cue support should not pose any problems for the Search framework, one goal of this dissertation is to assure that changing the support of these cue values does not interfere with the model.
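Read generatively, Figure 2.1 is a short forward simulation. The R sketch below draws one participant from the priors and produces one choice; it mirrors the model's data-generating side rather than the JAGS code actually used for fitting, and the rejection sampler for the truncated normal is an illustrative shortcut.

```r
# Truncated normal draw for w on (0.01, 0.99), by rejection
rw <- function(mu, sigma) {
  repeat { x <- rnorm(1, mu, sigma); if (x > 0.01 && x < 0.99) return(x) }
}

# One choice from one simulated participant under Figure 2.1:
# a, b are cue vectors for the two alternatives; v, d are CV and DR
search_choice <- function(a, b, v, d, w, gamma) {
  s <- order(w * v + (1 - w) * d, decreasing = TRUE)  # s_i: search order
  for (k in s) {
    if (a[k] != b[k]) {                               # first discriminating cue
      t <- if (a[k] > b[k]) gamma else 1 - gamma      # t_ij
      return(rbinom(1, 1, t))                         # y_ij ~ Bernoulli(t_ij)
    }
  }
  rbinom(1, 1, 0.5)                                   # all cues tied: guess
}

mu <- rbeta(1, 1, 1); sigma <- rbeta(1, 1, 1)         # hyperpriors
w <- rw(mu, sigma); gamma <- runif(1, 0.5, 1)
search_choice(a = c(0.3, 1.2, -0.5), b = c(0.3, 0.1, 0.9),
              v = c(0.66, 0.67, 0.51), d = c(1, 1, 1), w = w, gamma = gamma)
# returns 1 when alternative a is chosen
```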
2.2 Delta Inference

For the purpose of this study, the ∆ Inference model is a modified version of the Search model including an elaboration that potentially allows the TTB search algorithm to continue past a discriminating cue (Luan et al., 2014). In some cases, a small difference in cue values may not be sufficiently informative to terminate search. The ∆ parameter is allowed to vary by both cue and participant, so the model converges on the ∆ parameters most consistent with the data. While ∆I modifies the stopping rule for TTB, it does so in a way that preserves one-reason decision making: when ∆I makes a decision, it is based only on the value of a single cue; previous cues are treated as ties and ignored. Unlike the Stop model, the stopping rule from ∆I can interact with CV and DR weighting, influencing search order.

Earlier research showed that, on average, the best value of ∆ is zero (Luan et al., 2014). That research kept a consistent value of ∆ across all cues and individuals, though, precluding the possibility that individuals or cues might differ in their values of ∆. Varying ∆ by person amounts to the suggestion that individuals may differ in the amount of information they require before making a decision; varying ∆ by cue allows that cues can be differentially informative. Instead of a consistent value of ∆ for all cues and participants, I allow ∆ to vary across both of these dimensions. Though the ∆s for each cue are independent, I define a hyperparameter for each cue's ∆ and allow subject-varying deviations from this average value (Figure 2.2).

µ_∆k ~ Beta(1, 1)
σ_∆k ~ Beta(1, 1)
∆_ik ~ N(µ_∆k, σ_∆k), ∆_ik ∈ (a, b), 0 < a < b < ∞
µ ~ Beta(1, 1)
σ ~ Beta(1, 1)
w_i ~ N(µ, σ), w_i ∈ (a, b), 0.01 < a < b < 0.99
s_i = Rank(w_i · v + (1 − w_i) · d)
γ ~ Uniform(0.5, 1)
t_ij = γ      if TTB_{s_i}(a_j, b_j + ∆_i·) = a
       1 − γ  if TTB_{s_i}(a_j + ∆_i·, b_j) = b
       0.5    otherwise
y_ij ~ Bernoulli(t_ij)
(plates: i subjects, j problems, k cues)

Figure 2.2: Graphical model of Delta Inference.

2.3 Hypothesis Generation

The version of HyGene used in this dissertation is a Bayesian model inspired by Thomas et al. (2008). This HyGene implementation decides between pairs of choices by searching sequentially through cues (as in TTB) and using a minimal difference in cues to terminate search and select an alternative (Figure 2.3). Search order is determined by using weighted logistic regression on scaled cue values for a training set, deemed episodic memory or M in the graphical model, to generate normalized regression coefficients. Cues are searched in descending order of coefficient magnitude, serving as a proxy for CV and DR as used directly in both Search and ∆I. After determining search order based on the normalized regression weights, HyGene makes decisions like the Search model, complete with a γ parameter for error in application and TTB-like sequential cue use.

µ_βk ~ N(0, 1)
σ_βk ~ Exp(1)
µ_βik ~ N(µ_βk, σ_βk)
σ_βik ~ Exp(1)
β_ijk ~ N(µ_βik, σ_βik)
w_ij = τ([a_j, b_j], M)
y_train ~ Bernoulli(logit⁻¹(β_ik · w²_ij · M))
s_ij = Rank(−|β_ij·|)
γ ~ Uniform(0.5, 1)
t_ij = γ      if TTB_{s_ij}(a_j, b_j) = a
       1 − γ  if TTB_{s_ij}(a_j, b_j) = b
       0.5    otherwise
y_ij ~ Bernoulli(t_ij)
(plates: i subjects, j problems, k cues)

Figure 2.3: Graphical model of HyGene.

There are two major differences between the current instantiation of HyGene and the model specified by Thomas et al. (2008). The first is that the current instantiation uses continuous weighting for episodic memory traces. Rather than using a threshold (A_C) and ignoring memory traces below a modeled or assumed threshold, the current HyGene implementation accomplishes a similar goal by weighting observations in episodic memory more heavily as a function of the magnitude of their ordinal correlation with the cue values for a given observation. The current method of weighted regression should have performance similar to selecting relevant memory traces based on A_C. Assuming a true non-linearity in trace selection, the weighted regression technique in the current HyGene model provides a first-order, smoothed approximation of the discrete, underlying function (Shalizi, 2015). The A_C parameter is a context-free value without a comparable parameter in Search or ∆I. Inferences from the hierarchical Bayesian models are based on fixed parameters and search orders, so removing A_C in favor of continuous weighting of episodic memory improves interpretability of comparisons with Search and ∆I.
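A sketch of the cue-ordering stage just described, under simplifying assumptions: rank correlation between the probe and each stored trace stands in for the weighting function τ([a_j, b_j], M), and quasibinomial is used only to silence R's complaint about non-integer prior weights.

```r
# Order cues by weighted logistic regression on episodic memory.
# M: data frame of scaled cue values (one trace per row); y_train: binary
# outcomes paired with M; probe: cue values for the current problem.
hygene_order <- function(M, y_train, probe) {
  # Weight each trace by the magnitude of its rank correlation with the
  # probe, a stand-in for similarity-based activation of episodic memory
  w <- apply(M, 1, function(trace) abs(cor(trace, probe, method = "kendall")))
  fit <- glm(y_train ~ ., data = M, family = quasibinomial, weights = w)
  beta <- coef(fit)[-1]                 # drop the intercept
  order(abs(beta), decreasing = TRUE)   # search in descending |beta|
}
```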
The second difference is that this HyGene instantiation is allowed to search all cues in the ecology. The original HyGene model includes a finite SOC, which limits the number of simultaneously activated semantic traces, in this case cues, that an individual can consider. The limited SOC could easily be included in any of the models currently under consideration and could have a variety of difficult-to-predict effects on search order and judgment accuracy. Similarly, nothing about HyGene requires sequential search; the model could aggregate over a subset of cues to produce a response. Given the focus on search order, HyGene is implemented with only the weighting mechanic and uses a TTB-like stopping and decision rule on HyGene-ordered cues to minimize differences from the comparison models. These decisions about HyGene modeling produce a version of the model that is maximally comparable to Search and ∆I, allowing for a direct inspection of the cue ordering mechanism without interference from other model dissimilarities. The γ parameter, for example, increases model flexibility by allowing some proportion of responses that violate the direct predictions from sequential search of the cues. Exclusion of this modification, which is present only in Search and not in ∆I or HyGene as originally formulated, would potentially conflate the cue ordering mechanism and other, empirically-motivated modeling decisions.

Chapter 3: Cross-validation of models to simulated data

One aspect of comparing computationally-specified theories of decision making is understanding the relative flexibility of these theories. There are at least three ways to validate new modeling methods:

1. analytic proof of behavior in the limit;
2. validation on a standard dataset; or,
3. validation on simulated data.

The first method is intractable in the case of most cognitive process models, the current models included. Closed-form solutions to these models would be difficult both because of the diverse prior specifications and because of the unconventional likelihood statement. The second method will be useful later, when an external reference helps explore the usefulness of these cognitive models in naturalistic conditions. Such data lack a defined generating process, however; no model can be identified as correct in those circumstances. The only alternative is model comparison, but the best method to compare cognitive models is under debate. Therefore, simulated data will fuel an initial attempt to understand how Search, ∆I, and HyGene relate to one another.

This chapter will focus on two questions. The first is: how do Search, ∆I, and HyGene account for decision processes in simulated environments with known structure? This question is answered with two ecologies that vary in predictive difficulty. The simpler of these ecologies has orthogonal, non-compensatory cues. This means that the cues are uncorrelated with one another and that predictions based on the strongest cue cannot be reversed by any combination of subsequent cues. This first ecology is contrasted with a second ecology that has a compensatory cue structure and positive cue intercorrelations. The second question for this chapter is: how related are the predictions from Search, ∆ Inference, and HyGene?
Though this question will return regarding data generated by human decision makers, using simulated data allows for direct examination of how structure in the environment is represented in the fixed parameters of each model. Fitting the three models to the same environments also allows for understanding of interactions between the shared components among the models (e.g., γ) and their unique components.

3.1 Methods

The questions in this chapter require both generating structured ecologies and fitting the relevant cognitive models (Search, ∆I, HyGene).

          Ecology 1                     Ecology 2
          Outcome  1    2    3          Outcome  1    2    3
Outcome   1.0      0.5  0.3  0.1        1.0      0.3  0.2  0.1
1         0.5      1.0  0.0  0.0        0.3      1.0  0.2  0.2
2         0.3      0.0  1.0  0.0        0.2      0.2  1.0  0.2
3         0.1      0.0  0.0  1.0        0.1      0.2  0.2  1.0

Table 3.1: Covariance matrices for both simulated ecologies.

Ecologies

Both ecologies are generated from multivariate normal distributions with all means equal to zero; the two ecologies differ only in their covariances. The first ecology consists of 20 samples from a covariance matrix with three orthogonal cues and decreasing correlations with the outcome (Table 3.1). Though dissimilar from the empirical data in later chapters, this ecology will provide a reference for all models. The cues in the first ecology are non-compensatory, so sequential and simultaneous cue use both reach the same conclusions on these stimuli. The strict orthogonality of the cues also makes cue weighting easier, allowing inspection of the relative influence the priors in each model have on cue ordering. This ecology also permits assessment of the influence of individual differences in cue order drawn directly from the priors in each model. Direct fits to any fixed ecology should lead to a consistent search order; differing search orders in this ecology reflect the influence of prior information in each model.

The second ecology is intended to be closer to empirical data than the first. The cues are poorer indicators of the outcome on average and have non-zero covariances. The set of objects in the ecology is also more numerous, with 100 unique objects instead of 20 as in the simple ecology. While the simple ecology is intended as a reference distribution to help assess prior influence, the complex ecology serves to foreshadow the success of these models when fitting messier, empirical data. The difference is that this complex ecology still has a known structure: while empirical samples are useful for different reasons, we have no way of knowing the true population structure from which they are drawn.

Each ecology is used to generate three distinct sets of data, one corresponding to each of Search, ∆I, and HyGene. The priors from each model are used to generate parameters for 20 imaginary participants. These parameters are then used to produce responses to all paired comparisons of the objects in each of the two ecologies. For each set of generated data, shared parameter values are consistent across models. For example, γ is consistent across all three models and w for each "participant" is the same for both Search and ∆I within an ecology.
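Generating an ecology is a direct translation of Table 3.1; a sketch for Ecology 1 using MASS::mvrnorm (the seed and object names are arbitrary):

```r
library(MASS)

# Ecology 1 (Table 3.1): an outcome plus three mutually orthogonal cues
vars <- c("outcome", "cue1", "cue2", "cue3")
Sigma1 <- matrix(c(1.0, 0.5, 0.3, 0.1,
                   0.5, 1.0, 0.0, 0.0,
                   0.3, 0.0, 1.0, 0.0,
                   0.1, 0.0, 0.0, 1.0),
                 nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

set.seed(1)
ecology1 <- mvrnorm(n = 20, mu = rep(0, 4), Sigma = Sigma1)
pairs1 <- t(combn(20, 2))  # all 190 unique paired comparisons
```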
I generated hypothetical participants' responses for each ecology to allow for a representative range of individual differences for each model. After simulating these responses, I fit each of the three models to the generated responses and the same training set.

Analyses of the results both assess model performance and examine how different sources of variability are represented in each joint posterior distribution. Model performance can be examined in a variety of ways. I first present both the likelihood and the penalized likelihood using the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002), which communicates average model effectiveness when accounting for complexity:

DIC = −2 log L + v̂ar(−2 log L).    (3.1)

For the DIC, complexity is a measure of the range of data that could be fit by the model and is computed as the variance of the deviances observed in a convergent MCMC run.¹

¹The complexity term, v̂ar(−2 log L), is commonly referred to as the penalty.

Model comparison using the raw likelihood may make sense in this context. Human decision making is a complex, and perhaps stochastic, process, so any preference for simpler models, especially in a dataset of such limited size relative to the variation in the decision making system, may be unjustified. One caveat is that these densely-parameterized models allow for individual differences in a different set of parameters for each model. Model comparison using penalized likelihood is especially unstable in these types of models because the likelihood function is very flat and the appropriate penalty is contentious (Weng and Gelman, 2014).
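As an illustration of Equation 3.1, the penalty and DIC can be computed directly from monitored deviance samples. This is a minimal sketch; the `dev` vector below is a simulated stand-in for real MCMC output, used only to make the arithmetic concrete.

```r
# Sketch of Equation 3.1: DIC from posterior samples of the deviance (-2 log L)
set.seed(4)
dev <- rnorm(4000, mean = 2770, sd = 1.5)  # stand-in for monitored deviance draws

penalty <- var(dev)            # complexity term, var-hat(-2 log L)
dic     <- mean(dev) + penalty
c(penalty = penalty, DIC = dic)
```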
In addition to fit statistics (log likelihood and DIC), posterior distributions of the fixed effects are reported for each model. The summarized fixed effects include γ for all models and the means and standard deviations of w for Search and ∆I, ∆ in ∆I, and β in HyGene.

A major goal in this chapter is to assess overall model flexibility. Though both ∆ Inference and HyGene are instantiated as elaborations on the Search model in this series of studies, it is possible that the additional parameters for these models dramatically change model flexibility. The following analyses take data generated from each of the three models using the two ecologies specified in the methods and then fit each of the three models to this generated data.

All inferences are based on models fit using JAGS (Plummer et al., 2003) and called from the rjags package in R (Plummer, 2015). JAGS is a C++ implementation of a Gibbs sampler. Parameters in each model met minimum R̂ and effective-n diagnostic criteria prior to further analysis. R̂ is the ratio of between- and within-chain variability, with values substantially larger than 1 indicating poor mixing. Effective n is a measure of MCMC sample size that accounts for autocorrelation between successive samples. While no strict cutoff exists for effective n, a few hundred independent samples is considered sufficient to support inferences from the posterior distribution (Gelman et al., 2014).

            Parameter   Value
Ecology 1   γ           0.9
            CV          0.663, 0.668, 0.511
            DR          1, 1, 1
Ecology 2   γ           0.9
            CV          0.61, 0.69, 0.59
            DR          1, 1, 1

Table 3.2: Fixed parameters for data generation.

3.2 Results

First Ecology

Samples from multivariate normal distributions with the aforementioned parameters yielded the CV and DR rates seen in Table 3.2.² For both ecologies, the second cue gives the highest CV, followed by the first cue and then the third cue. There are no tied cue values, so all cues discriminate between all paired comparisons and DRk = 1.

²Values throughout this dissertation are rounded to an appropriate number of significant digits.

The results of fitting Search, ∆I, and HyGene to data generated using Ecology 1 show nearly equivalent performance for Search and HyGene (Table 3.3). ∆I is consistently, if only slightly, better able to account for variability in simulated participant responses, as reflected by higher average likelihoods. While counter to earlier research, this suggests that non-zero ∆ parameters may be useful in some decision environments. Particularly in ecologies like the one generated here, where many cue values are near the mean and explain relatively little variance in the outcome, ignoring small cue differences and continuing search may be more important than a simpler model (Search) or a more flexible method of ordering cues (HyGene). Preferring ∆I to Search and HyGene for these data grants that the small differences between likelihoods and DICs are credible and favor ∆I, which may be unjustified given the small differences between model fits.

Data: HyGene
  Model     HyGene   Search   ∆I
  log L     −1386    −1385    −1378
  Penalty   1.276    0.7953   7.074
  DIC       2773     2771     2763

Data: Search
  Model     HyGene   Search   ∆I
  log L     −1386    −1387    −1361
  Penalty   1.019    0.2386   28.71
  DIC       2774     2774     2750

Data: Delta Inference
  Model     HyGene   Search   ∆I
  log L     −1387    −1384    −1377
  Penalty   0.48     0.9434   8.579
  DIC       2774     2769     2762

Table 3.3: Summaries of models fit to data generated from each model using the first ecology.

While the patterns in fit quality for Ecology 1 make sense given the continuous cue values, posterior distributions for the parameters that summarize each model suggest poor calibration (Table 3.4). Though the γ parameter for all data generation processes was very close to 1 (Table 3.2), all models returned a posterior distribution on γ with a median closer to 1/2. ∆I gives higher values for γ, but nothing close to the true, fixed value of 9/10. All three models are nearly guessing at the outcome, given that the probability of a model-inconsistent response is nearly as high as that of a model-consistent response.

Ecology 1
                        Generating Model
  Model    Parameter    HyGene                Search                ∆I
  HyGene   γ            0.52                  0.51                  0.51
           µβ           −0.04, −0.06, −0.26   −0.03, −0.04, −0.3    −0.04, −0.04, −0.25
           σβ           0.27, 0.42, 1.07      0.28, 0.38, 1.13      0.24, 0.39, 1.07
  Search   γ            0.52                  0.5                   0.53
           µw           0.53                  0.53                  0.48
           σw           0.7                   0.71                  0.71
  ∆I       γ            0.55                  0.6                   0.56
           µ∆           0.66, 0.2, 0.59       0.69, 0.14, 0.91      0.16, 0.77, 0.2
           σ∆           0.2, 0.4, 1.74        0.89, 0.24, 0.13      0.37, 1.24, 0.48
           µw           0.51                  0.5                   0.51
           σw           0.7                   0.71                  0.71

Ecology 2
                        Generating Model
  Model    Parameter    HyGene                Search                ∆I
  HyGene   γ            0.55                  0.53                  0.55
           µβ           ≈0, 0.01, −0.19       ≈0, ≈0, −0.2          ≈0, 0.01, −0.22
           σβ           0.22, 0.31, 0.76      0.21, 0.27, 0.8       0.23, 0.29, 0.73
  Search   γ            0.53                  0.53                  0.53
           µw           0.51                  0.51                  0.5
           σw           0.71                  0.69                  0.71
  ∆I       γ            0.64                  0.64                  0.65
           µ∆           0.11, 0.72, 0.87      0.14, 0.97, 0.81      0.12, 0.55, 0.8
           σ∆           0.24, 0.23, 0.11      0.2, 0.08, 0.06       0.23, 0.08, 0.06
           µw           0.48                  0.5                   0.52
           σw           0.71                  0.71                  0.71

Table 3.4: Median fixed effects for all models fit to simulated data, by generating model. γ is the probability of choosing consistently with the terminating model prediction (i.e., not failing to apply the model), µw is the average relative weight of CV and DR, and σw is the standard deviation of the relative weight parameters.
µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.

Search order for HyGene follows the covariance between each cue and the outcome, on average searching the first, then the second, and finally the third cue. The standard deviation on the βs for HyGene is quite small for the first two µβ parameters but large for the third cue, suggesting that this cue is sometimes searched first. HyGene has separate search orders for each item; the high participant-wise variability seen in σβ3 reflects the fact that, for some items, β3 is larger than β1.

Both µw and σw for Search and ∆I follow the prior distributions for these parameters. No value of w will change the search order for these data because of the invariant DR values, so both models search in descending order of CV. The ∆ parameters for Ecology 1 are larger than zero with a relatively small σ∆. This means that Search claimed all participants searched only the second cue and then made a decision, while ∆I participants searched the second cue and occasionally continued on to the first and then the third cue, guessing only if all three cues differed by less than the applicable value of ∆.

Second Ecology

The results of fitting each model to data generated from Ecology 2 are similar to Ecology 1, with larger differences between the models (Table 3.5). On average, ∆I is preferred based on likelihood and DIC. This is a product of the noisy environment, in which small differences in cue values are even less likely to correctly favor the larger outcome value. No ties exist for any cues in this environment and DR is always 1, causing Search to examine exactly one cue (the second cue) and make a decision based on its values. HyGene is allowed to search in different orders but also picks based on whatever cue is searched first. Only ∆I stops as a function of informativeness for the first cue: when differences between the first-cue values are sufficiently large, ∆I ceases search; otherwise, it continues through the other cues and either stops or guesses.

Data: HyGene
  Model     HyGene   Search   ∆I
  log L     −1377    −1383    −1339
  Penalty   3.373    1.033    6.811
  DIC       2757     2767     2685

Data: Search
  Model     HyGene   Search   ∆I
  log L     −1384    −1383    −1332
  Penalty   1.807    0.9384   7.152
  DIC       2769     2767     2671

Data: Delta Inference
  Model     HyGene   Search   ∆I
  log L     −1376    −1384    −1340
  Penalty   3.636    0.9114   3.746
  DIC       2755     2769     2685

Table 3.5: Summaries of models fit to data generated from each model using the second ecology.

Table 3.4 gives the median values for all participant-varying (and cue-varying, as appropriate) parameters for the models fit to each set of data in each ecology. Compared with the posteriors for these parameters in Ecology 1, the models for Ecology 2 estimate approximately the same γ values despite the noisier ecology. This can only be attributed to bias in the models, which is not surprising given that such decision models are designed to accept higher bias in exchange for lower variance (Gigerenzer and Brighton, 2009).

3.3 Discussion

The simulations presented above illuminate how Search, ∆I, and HyGene operate on both structured and noisy data. To explore this, I cross-fit each of the three models to data generated from each of the three models in two separate underlying ecologies.
One major consideration, which in hindsight is obvious, is that the Search method of weighting CV and DR is only useful when the cues have unequal DRs. Unequal discrimination rates are much more likely with discretely-valued cues, unlike the continuously-valued cues used in these investigations. As a result, Search effectively simplifies to TTB with probability (1 − γ) of choosing counter to the TTB prediction. Thus, for these examples, Search models no individual differences in search order. In addition to a single search order governed entirely by CV, the perfect discrimination rate of all cues means that Search decided based on a single cue and never searched beyond it.

This equivalence of DR across cues also leads the ∆I model to a consistent search order across participants. The difference between Search and ∆I is that, because of the ∆ parameters, ∆I searches beyond the initial cue when the cues for the objects under consideration differ by less than ∆1. While Search acts like an error-prone TTB, ∆I sometimes searches additional cues to reach a decision.

In these environments, HyGene is the only model that can represent differing search orders. The relative weights of the cues for any given decision problem are influenced by the similarity between the cues for the current choices and the cues for each of the choices in episodic memory, allowing cues to vary in importance between choices. The model allows weights to vary on average between participants as well, meaning search orders can vary both by individual and by item. This gives HyGene an advantage when search order should actually vary along these dimensions, assuming this variance is sufficiently large relative to the error variance in the underlying ecology.

HyGene's flexibility is unwarranted in noisy data or data simulated from Search or ∆I processes, however, giving the model a higher penalized likelihood relative to the other models on data generated with a consistent search order. The consistent search order used in generating the Search and ∆I data leaves HyGene with unnecessary functionality in these simulated environments. This is especially true in the second, more complex environment. The positive covariances between the cues mean that, on average, the first cue is likely to partially encode the information present in later cues. Search, and to some extent ∆I, decide based on the first cue searched in this context, relying on the cue with the highest CV. HyGene behaves in much the same way, except that the single deciding cue is less likely to be the highest-CV cue, since search order for HyGene can vary even with continuously-valued cues.

These results are useful for understanding human decision making outside of this modeling context. While the Search model is intended to improve understanding of individual differences in decision making (Lee and Newell, 2011), it applies only when there are sufficient ties in cue values to produce differences in DR. This either substantially limits the generality of the Search model or requires that discretization/dichotomization of available information be a necessary component of the underlying representations people use to make decisions. The assumption that cues are learned or stored with discrete values is not necessarily unjustified or even novel (Gigerenzer and Goldstein, 1996), but it is a strong assertion that should be explicitly considered.
This assumption is even testable, providing qualitative falsifiability for the method of subject-varying cue ordering found in the Search model. Search is based on a fixed value of w. If individuals show different search orders with continuously-valued cues, then the proposed mechanism of cue ordering by combining weighted CV and DR cannot account for this pattern, and the Search method of cue ordering must be altered or abandoned.

3.3.1 Summary

These simulated ecologies give important information about what to expect in future model fitting. The CV/DR weighting parameter in Search and the current version of ∆I will be more relevant with dichotomous cues, but it presents a potential avenue for empirically testing the adequacy of Search cue ordering. HyGene is the only model under consideration that can account for search orders that differ by both individuals and test items, though this flexibility is unjustified in the currently modeled environments.

Chapter 4: Prediction in a real-world data set

Though fitting models to simulated data can increase understanding of the models themselves, inferential models must also fit data without known parameters. Data generated from real environmental sources may differ in unpredictable ways from data generated with known properties. In this chapter, I use a well-known decision environment, the nine-cue ecology predicting population in German cities, to compare the resulting posterior distributions on the parameters of Search, ∆I, and HyGene.

Synthetic decision environments that come from fixed parameters and well-behaved probability distributions may not accurately reflect natural environments. While it is trivial to design environments that are difficult or impossible to predict, people would simply guess in these circumstances. More interesting are environments that can be predicted despite uncertainty. Even difficult-but-predictable environments must be difficult in the same way that ecologies encountered by human decision makers are difficult. For these reasons, synthetic ecologies are of limited use.

Among existing decision ecologies, few are as thoroughly explored as the German cities task (GCT). This task has been used in a large number of studies on human memory and decision making (Gigerenzer et al., 1991; Gigerenzer, 1993; Gigerenzer and Goldstein, 1996), providing a reasonable baseline for the performance of different models of decision making. The ecology for this task includes 9 dichotomous cues used to predict the population of the 83 German cities with populations larger than 100,000 (as of 1993; Table 4.1).

Cue                                                       CV     DR
Is the city the national capital?                         1      0.02
Was the city an exposition site?                          0.91   0.28
Does the city have a major-league soccer team?            0.87   0.3
Is the city on the Intercity line?                        0.78   0.38
Is the city a state capital?                              0.77   0.3
Is the license plate abbreviation more than one letter?   0.75   0.34
Is a university located in that city?                     0.71   0.51
Is the city in the industrial belt?                       0.56   0.3
Was the city in East Germany?                             0.51   0.27

Table 4.1: Cues for the German Cities Task.

If the only goal were to predict city size, one could just as well check the CIA World Factbook and get the best predictors of population size. The strength of the GCT, in addition to being widely used in the decision making literature, is that all of the cues are relatively easy to remember and use, since each can take on only one of two values.
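For reference, a TTB decision on the GCT's dichotomous cues reduces to a few lines. This is a minimal sketch under the assumption that cue vectors are already sorted by validity, as in Table 4.1; it is illustrative, not the fitted model code.

```r
# Sketch of a TTB choice on 0/1 cues already ordered by cue validity
ttb_choose <- function(cues_a, cues_b) {
  for (k in seq_along(cues_a)) {
    if (cues_a[k] != cues_b[k])        # first discriminating cue decides
      return(if (cues_a[k] > cues_b[k]) "a" else "b")
  }
  sample(c("a", "b"), 1)               # no cue discriminates: guess
}

# Hypothetical cities: a is a state capital with a university; b is on the
# Intercity line, has a university, and sits in the industrial belt
a <- c(0, 0, 0, 0, 1, 0, 1, 0, 0)
b <- c(0, 0, 0, 1, 0, 0, 1, 1, 0)
ttb_choose(a, b)  # cue 4 (Intercity line) discriminates first, so "b"
```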
In addition to the supposed psychological plausibility of dichotomous cue values, the GCT allows a better comparison between cue ordering due to w in Search and ∆I, ∆ in ∆I, and β in HyGene. While the modeling in Chapter 3 compared these three models, it accomplished this comparison in a way that largely ignored the cue ordering aspects of Search and ∆I.

Delta Inference was designed to allow search of decision environments to continue past a marginal difference in cue values. Despite this, there is nothing in principle to prevent ∆I from operating on dichotomous cues. The model structure is unchanged, but the interpretation of ∆ changes with dichotomous input: the posterior distribution on a given participant's ∆ parameter for any cue only potentially changes decisions when it is ≥ 1. One can compare the density of the probability distribution for ∆ above and below 1 to see how likely the model is to consider a given cue, conditioned on the search order dictated by w.

These data reflect consistent subject-wise search orders. The original purpose of these data was to validate that the Search model can effectively model individual differences in search order. This puts HyGene at a disadvantage, but allows us to directly assess the influence of the β priors when the data are generated with a single search order per subject.

4.1 Methods

The following simulations use the models specified in Chapter 2. The only slight difference is that cue values are dichotomous, rather than continuous, which would be reflected in the aj and bj nodes for each model. The data for this chapter come from earlier work on the Search model (Lee and Newell, 2011). Twenty participants with 100 responses each are simulated using search orders of the GCT that differ between participants but are consistent within a participant across the 100 choices. These data are generated deterministically from the Search model, albeit with a stronger relationship to the criterion than in Chapter 3 and using dichotomous cues.

4.2 Results

  Metric    HyGene   Search   ∆I
  log L     −496.4   −482.7   −482.2
  Penalty   46.47    5.904    47.71
  DIC       1039     971.4    1012

Table 4.2: Model comparisons for HyGene, Search, and ∆I on the GCT.

Model fits using the GCT data are in Table 4.2. Search and ∆I are about equally effective at explaining variance in participant judgments, though the substantial complexity of ∆I is unjustified according to the DIC. HyGene yields only a slightly lower likelihood, though it is substantially more complicated than even ∆I and has a correspondingly higher penalized likelihood. This pattern of results is expected, given that the data are effectively generated from the Search model.

Model summaries are in Table 4.3. In this environment, all three models converge on γ values near one. Only very rarely do the models assume that participants misapply the decision rule and choose counter to the predictions of the model. The medians of the µβ parameters for HyGene suggest that this method of cue ordering produces slightly different search behavior on average than the relative weighting of CV and DR. This is likely to be relatively inconsistent across participants, given the accompanying σβ parameters, which are large relative to the sizes of the corresponding µβs. Though HyGene produces a different cue ordering, Search and ∆I have the same average search order (Table 4.4).
Model    Parameter   Value
HyGene   γ           0.944
         µβ          0.01, 0, 0.02, −0.05, −0.21, −0.1, 0, −0.12, −0.04
         σβ          0.31, 0.12, 0.19, 0.25, 0.38, 0.49, 0.18, 0.83, 0.92
Search   γ           0.947
         µw          0.548
         σw          0.26
∆I       γ           0.949
         µ∆          0.54, 0.59, 0.6, 0.64, 0.67, 0.64, 0.59, 0.72, 0.61
         σ∆          0.88, 2.31, 2.12, 1.86, 1.56, 1.68, 2.3, 1.38, 1.61
         µw          0.556
         σw          0.3

Table 4.3: Median fixed effects for all models fit to simulated participants with the GCT data. γ is the probability of choosing consistently with the terminating model prediction (i.e., not failing to apply the model), µw is the average relative weight of CV and DR, and σw is the standard deviation of the relative weight parameters. µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.

While search order differs by individual, the inclusion of ∆ has little influence on search order: on average, both Search and ∆I follow CV until the fourth cue. Though µw and σw are similar for Search and ∆I, the latter has non-zero probability density for each participant at ∆k ≥ 1, resulting in probabilistic search of each cue. Under the HyGene model, the average person searches in a completely novel order, though the large standard deviations on the β parameters suggest substantial variability in HyGene search order.

Model    Median Search Order          Unique Orders
HyGene   2, 3, 1, 6, 9, 7, 4, 8, 5    1,379
Search   1, 2, 3, 6, 4, 5, 8, 7, 9    174
∆I       1, 2, 3, 6, 4, 5, 8, 7, 9    269

Table 4.4: Search order information by model.

Recall that these data are generated with a consistent search order for each participant that results from a weighted combination of CV and DR for each cue; Search is the true model for these data. We can see that, despite 20 true search orders (one for each participant), Search still allocates some probability (via the w parameter) to 174 unique search orders. Adding variability in ∆ allows the ∆I model to identify 269 unique search orders. HyGene's βs give even more flexibility, exploring nearly 1,400 separate search orders. Each of these models explores far fewer than the total number of possible search orders, which for nine cues is 362,880.

The flexibility to identify additional search orders is not necessarily inappropriate, given the probabilities associated with the parameter values that give rise to these unique search orders. Figure 4.1 gives the distribution of τ order between the true search order and those proposed by each model for each of the 20 participants. τ order is the number of paired switches necessary to match the order between two vectors; it is related to Kendall's τ correlation coefficient (a sketch of this distance appears after this paragraph). Search and ∆I overlap substantially for all participants, though the ∆I densities are more dispersed than those for Search. HyGene produces more varied results: the mean of the HyGene density varies in relation to the Search mean by participant, with HyGene showing more discordances for some participants' fitted search orders and fewer for others.
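The τ-order distance can be sketched as an inversion count; `tau_order` below is a hypothetical helper that mirrors the definition above, applied here to the median orders from Table 4.4.

```r
# Number of pairwise swaps (discordant pairs) needed to align two search orders
tau_order <- function(o1, o2) {
  pos <- match(o1, o2)   # where each cue in o1 falls within o2
  n   <- length(o1)
  sum(outer(seq_len(n), seq_len(n), "<") & outer(pos, pos, ">"))
}

# Median Search order vs. median HyGene order from Table 4.4
tau_order(c(1, 2, 3, 6, 4, 5, 8, 7, 9), c(2, 3, 1, 6, 9, 7, 4, 8, 5))
```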
The differences in search order between models must be interpreted with two caveats in mind. First, the average likelihoods are comparable for all three models: despite the differences between the models in fitting the true search order for each participant, HyGene is only slightly worse at predicting participant responses. Second, the wider dispersion of HyGene τ orders is due to the variety of search orders, which is a feature (and not a limitation) of the model. These τ distributions for HyGene collapse over the 100 search orders by stimulus pair for each participant, each of which is potentially unique. If participants truly search cues differently based on the stimulus in question, then HyGene is almost certain to fit better in expectation than any model that assumes homogeneous search orders for a given participant.

[Figure 4.1: Density of tau distance between generating search orders and model search orders by participant, colored by model.]

4.3 Discussion

This chapter focuses on fitting the three models to simulated participants from a widely-studied, naturally-occurring ecology. While this set of simulations does not directly inform on human decision making, one can learn more about how uncertainty is differently represented by each model. This establishes a baseline against which to compare these models when fit to data generated by human participants.

One important observation is that, in the GCT ecology, the additional parameters in ∆I and HyGene lead to likelihoods that are comparable to Search, even though Search is the true generating model. The current results contrast with the results in Chapter 3, which found higher likelihoods for ∆I across most generating models. This difference may be due to the relative predictability of the outcome from the cues, which is much higher for the GCT than for the ecologies in Chapter 3 (Table 4.5).

Ecology   R²
1         0.35
2         0.333
GCT       0.868

Table 4.5: Comparison of the variance in the outcome explained by the cues in the ecologies from Chapters 3 and 4, using multiple regression.

At least one inference is consistent across all three models: the credible values of γ are very close (Table 4.3). Despite differing cue-ordering mechanisms, simulated participants make model-consistent responses at similar rates. Cue order matters very little for accurately predicting population in this ecology. The similarity across models could also reflect that a certain subset of comparisons is difficult or impossible to predict based on the observed cue values. This consistency across models is reassuring, though with a multiplicity of reasonable models, the only permissible inferences are those on which the models agree (Breiman, 2001). In this case, one can infer that participants consistently choose in line with the result of the TTB process, but one should remain agnostic about the method of ordering cues, since the models disagree on this while producing comparable likelihoods and DICs.

The interpretation of the ∆I model differs with dichotomous cues, because w and ∆ interact oddly on dichotomous cue values. If ∆ is less than the discrete step size for a cue, the value of this parameter cannot influence search. On the other hand, if ∆ is larger than this step size, then the model will always search the next cue. For these dichotomous cues, the only important question about ∆ for a given cue is whether it is less than one or not. If ∆1 is less than one, the model will only search past the first cue when it does not discriminate between the alternatives. If ∆1 is greater than or equal to one, the model will never stop at the first cue. Since the decision mechanism is non-compensatory and only focuses on a single cue at a time, the latter case means that the first cue would not influence the decision process. This same reasoning applies to cues in any position and potentially limits the application of ∆I to dichotomous-cue environments. The hierarchical Bayesian models used in this chapter place distributions on parameter values, and ∆ is continuously-valued with probability density both above and below one. The posterior on ∆ therefore turns stopping into a stochastic process: the ∆I model will only stop at a discriminating cue with the probability that the applicable ∆ parameter is less than one. Stochastic stopping effectively adds another source of variability into the model, though the addition appears to cause almost no change in search order.
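With posterior draws of a ∆ parameter in hand, that stopping probability is just the posterior mass below one. A minimal sketch, with `delta1_draws` simulated as a stand-in for real MCMC samples:

```r
# With 0/1 cues the step size is 1, so a discriminating first cue stops
# search only when delta_1 < 1; the stopping probability is the mass below 1
set.seed(6)
delta1_draws <- rgamma(4000, shape = 2, rate = 3)  # stand-in for posterior draws

p_stop     <- mean(delta1_draws < 1)   # stop at a discriminating first cue
p_continue <- 1 - p_stop               # search on despite discrimination
c(stop = p_stop, continue = p_continue)
```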
The GCT is one environment where working memory constraints might have played a role. Conditional selection can cause problems for later cues. For example, the ninth cue in the cities environment is whether or not a given city was in East Germany. Cities in East Germany tend to have larger populations than those that were in West Germany, but this is not necessarily the case once the earlier cues have already been searched. Given that the previous eight cues are all tied, cue nine might even have a negative cue validity. This would cause the model to make the incorrect choice once search has reached the ninth cue. Earlier work has explored the application of greedy algorithms that account for this conditional dependency with TTB, but found that decisions based on this strategy are worse in cross-validation than unconditioned decision rules, though that work did not compare against decision models that explicitly cease search after a set number of cues (Martignon and Hoffrage, 2002). In the case of truncated search, as in the original implementation of HyGene, a limited working memory would cause the model to exit search and guess after searching a set number of cues. If conditional cue validity is negative for later cues, ignoring those later cues potentially leads to fewer incorrect choices, though it does not guarantee better concordance with human decision processes.

4.3.1 Summary

This chapter uses Search, ∆I, and HyGene to explore an ecology with dichotomous cue values and a strong relationship between the cues and the outcome. The findings for Search replicate Lee and Newell (2011) and provide additional evidence for the flexibility that ∆I's and HyGene's specific parameters provide in search order.

Chapter 5: Modeling human inference: A novel behavioral experiment

This chapter focuses on the application of Search, ∆I, and HyGene to data from participants in a behavioral task. After unique training periods in a novel task environment, participants are allowed information on a single cue and asked to choose between a pair of stimuli. Data from this experiment allow for comparison of the three models in an environment that potentially requires the full flexibility of HyGene. Given the inconsistency in search orders observed in previous studies, I expect the inconsistent individual search orders allowed under HyGene to give this more complicated model an advantage relative to Search and ∆I. I also expect ∆I to guess slightly more often than Search, assuming some probability of ∆1 exceeding the difference in cue values for the first cue.

The current data also give a second, convergent method for validating the models.
While participant judgments and cue values will be used to fit each model, participants also generate information on which cue is searched first on each trial. These cue choices allow for a comparison of observed cue search and the cue search predicted by each model. Potential inconsistency in the first cue searched for each participant favors the assumptions built into HyGene, which allows search order to vary based on the probe vector. Search and ∆I should both produce more consistent predictions about which cue is searched first relative to the more flexible HyGene model.

5.1 Methods

Participants

Thirty-eight participants (60% female) from the Psychology department subject pool at the University of Maryland, College Park took part in this experiment. Participants received partial course credit for their participation.

Stimuli

Stimuli for this experiment consisted of line drawings of ponies that were identical except for four dichotomous cues: nose color, leg stripes, hind spots, and tail color. All combinations of four dichotomous cues yield a total of 16 unique figures; see Figure 5.1 for maximally different examples.

[Figure 5.1: Comparison of the pony drawings used in the learning phase. Panel (a): all cues absent. Panel (b): all cues present.]

The ecology in this experiment was designed so that CV, DR, and τ/success would each identify a separate, preferred cue.¹ The ecology consists of one set of cue weights and some probability for each unique pony. Stimulus pairs for both the learning and test phases were created by uniformly sampling from figures according to the frequencies listed in Table 5.1. This sampling process produces an ecology with the summary statistics found in Table 5.2.

¹Due to a coding error, a second ecology did not accurately allow discrimination between success and τ and is not reported.

   Stimulus   Frequency
1  0000       50
2  0001       5
3  0010       10
4  0011       1
5  0100       24
6  0101       3
7  0110       7
8  0111       1
9  1000       46
10 1001       5
11 1010       9
12 1011       5
13 1100       24
14 1101       6
15 1110       3
16 1111       1

Table 5.1: Frequencies of each stimulus for the test ecology.

             Cue 1   Cue 2   Cue 3   Cue 4
CV           0.704   0.871   0.867   0.997
DR           0.502   0.427   0.276   0.232
τa           0.205   0.317   0.202   0.231
τb           0.319   0.518   0.405   0.525
Success      0.603   0.659   0.601   0.615
p(Present)   0.495   0.345   0.185   0.135
Weight       0.100   0.200   0.200   0.400

Table 5.2: Summary statistics for the pony cue ecology.

Procedure

Participants gave consent and heard the following description:

Welcome to the world of pony consulting. This is a cut-throat industry in which desirable ponies are in high demand, pony buyers are extremely wealthy, and pony sellers are highly protective of their goods. You are training to become a pony consultant. As a pony consultant, your task is to pick ponies that your clients will like. As with any other competitive consulting industry, you will get [rewarded/penalized] for [satisfied/dissatisfied] clients for whom you select the [right/wrong] pony. Your payment today will be based on your overall performance - your goal is to [earn the most positive/receive the fewest negative] reviews.

To prepare for pony consultant work, you will first practice selecting ponies. You will see pictures on the screen of two different ponies and you must choose the more desirable pony. Ponies vary on four traits: face color, leg stripes, spots, and tail color. Pay attention because once you start working you will need to know which pony traits are considered most desirable so you pick the right ponies for your clients.
After completing training, work in the real industry begins at the pony auction house. As before, you must select the more desirable pony. However, in the real world where the pony sellers are highly protective of the ponies, you must pay to reveal traits. As a junior consultant, you have only enough budget to reveal one trait for each pair of ponies. After that trait is revealed, you must make your choice.

Now it's time to practice selecting ponies. On the screen you will see pairs of ponies. Use the mouse to indicate which pony is more desirable. After making your choice you will receive feedback - green means you made the correct choice; red means you made the incorrect choice; and yellow means that the ponies were equally desirable.

Participants then completed 40 learning trials. On each learning trial, participants saw two ponies side-by-side and clicked a button under the picture they judged to have higher value based on the set of cues. Following each choice, the buttons disappeared from the screen and a colored border appeared around the selected picture: green for correct, yellow for tied, and red for incorrect. For correct and tied choices, the border remained for 500 ms, while the incorrect border remained for 2000 ms.

After 40 trials, a research assistant read the following instructions to participants:

Congratulations! You're now a junior consultant. As before, you will see pairs of ponies but this time, their traits are covered. You have enough budget to reveal one trait for each pair of ponies; it is not necessary to pick the same trait every time. Once the trait is revealed, you must select the more desirable pony. Every time you select a pony that your client really [likes/dislikes], you will receive a [positive/negative] performance review. At the end of the pony auction, the number of [positive/negative] reviews will determine your pay - the [more positive/fewer negative] reviews you earned, the more sweets you get.

Participants then completed 160 test trials. Each test trial began with two masked stimuli (Figure 5.2). Participants were allowed to select a single cue to uncover by clicking on the corresponding named button, which removed the cover from only that cue. They then made their choice between the stimuli based on that single cue. After each decision, a tally of the earned points at the bottom of the screen updated.

[Figure 5.2: Stimulus states for the test phase. Panel (a): all cues masked. Panel (b): one cue masked.]

Modeling

Models are fit only to the 38 participants in the first ecology. HyGene, Search, and ∆I required slight modification to accommodate heterogeneous training samples. An implicit assumption with previous datasets is that participants have experience with the ecologies of interest. This assumption is instantiated by fitting the models using CV and DR calculated on the entire ecology (Search and ∆I) and by including episodic memory weights for all pairwise comparisons between objects in the ecology (HyGene). For this set of simulations, model CV and DR are calculated separately for each participant using only the 40 pairs of stimuli seen during training, as sketched below. The episodic memory for HyGene is also limited to this set of training stimuli. Participants saw limited feedback during test and cannot be expected to have previous experience with the artificial test environment created for this study. Failure to account for the different experiences participants had with the environment would potentially bias the subject-varying parameters fit to the test responses.
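A hypothetical sketch of this participant-specific preprocessing, assuming the Table 5.1 frequencies; a helper like `cue_stats` from Chapter 3's sketch could then be applied to each participant's 40 training pairs.

```r
# Sample one participant's 40 training pairs with the Table 5.1 stimulus
# frequencies, then compute CV/DR from only those pairs
freqs <- c(50, 5, 10, 1, 24, 3, 7, 1, 46, 5, 9, 5, 24, 6, 3, 1)

set.seed(7)
train_pairs <- t(replicate(40, sample(1:16, size = 2, prob = freqs)))
head(train_pairs)  # 40 rows of stimulus indices, one pair per training trial
```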
The current experiment allowed participants to search only a single cue before making a decision. The models are therefore also modified to make a choice after inspecting only a single cue, despite determining a search order for the entire cue population.

5.2 Results

Participants' accuracy improved over the course of training in spite of its short duration. Table 5.3 and Figure 5.3 show that, on average, accuracy increased from 74.5% to 87.8% over the course of training. The high accuracy demonstrates the overall ease of the task; the outcome is perfectly predicted from the cues assuming full knowledge of the environment. All of the cues are also positively related to value, and participants had access to all cues during training. Given a dearth of plausible alternatives, these results suggest that participants increased their accuracy on average by learning about the cue ecology.

[Figure 5.3: Jittered scatterplot and logistic regression prediction for accuracy by trial during training. The intermediate tick marks on the y-axis show the average predicted accuracy for the first and last trials based on the multilevel logistic regression model in Table 5.3.]

Fixed Effects
              Coefficient   Std. Error   z       Pr(>|z|)
  Intercept   1.07          0.185        5.793   6.93 × 10⁻⁹
  Trial       0.023         0.009        2.451   0.014

Varying Effects
  Group     Coefficient   Variance   Std. Dev.
  Subject   Intercept     0.554      0.744
            Trial         0.001      0.038

Table 5.3: Summary of a multilevel logistic regression predicting accuracy from trial, varying both the intercept and the effect of trial by participant.
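The model in Table 5.3 is a standard multilevel logistic regression; a sketch of the fit using lme4 follows, with simulated stand-in data built from the Table 5.3 estimates (the real trial-level data are not reproduced here).

```r
library(lme4)

# Simulated stand-in for the training data: 38 participants x 40 trials,
# with accuracy roughly following the Table 5.3 estimates
set.seed(8)
train <- expand.grid(subject = factor(1:38), trial = 1:40)
u <- rnorm(38, 0, 0.744)  # subject-varying intercepts, sd from Table 5.3
train$correct <- rbinom(nrow(train), 1,
                        plogis(1.07 + u[train$subject] + 0.023 * train$trial))

# Accuracy by trial, with intercept and trial effect varying by participant
fit <- glmer(correct ~ trial + (1 + trial | subject),
             data = train, family = binomial)
summary(fit)
```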
One important feature of these data is the variation in search orders within participants (Figure 5.4). Ideally, a model of decision making would both predict participant choices and mimic the decision process used by participants. Search and ∆I are specified with the assumption that each participant uses a consistent search order, a feature that is contradicted by the search patterns in these data. If HyGene is emulating the process of memory search participants use to make decisions, then the model should both fit the choices participants made and show similar distributions of cue choices.

[Figure 5.4: Distributions of the first cue searched during the test phase for three example participants (panels 100, 101, and 103).]

  Metric    HyGene   Search   ∆I
  log L     −4215    −4215    −4215
  Penalty   0.004    0.003    0.005
  DIC       8431     8431     8431

Table 5.4: Model comparisons for HyGene, Search, and ∆I on the empirical data.

Model    Parameter   Value
HyGene   γ           0.5
         µβ          0.027, 0.06, 0.09, 0.081
         σβ          0.114, 0.11, 0.105, 0.095
Search   γ           0.5
         µw          0.482
         σw          0.724
∆I       γ           0.5
         µ∆          0.515, 0.529, 0.502, 0.497
         σ∆          0.747, 0.831, 0.864, 0.699
         µw          0.516
         σw          0.711

Table 5.5: Median fixed effects for all models fit to the empirical data. γ is the probability of choosing consistently with the terminating model prediction (i.e., not failing to apply the model), µw is the average relative weight of CV and DR, and σw is the standard deviation of the relative weight parameters. µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.

Table 5.4 gives the log likelihood, penalty, and DIC for each of the three models on these data. These data are equally likely under each of the three models. The additional parameters in ∆I and HyGene have almost no effect in this circumstance, so the penalty used for the DIC is nearly equivalent for all three models as well. For all three models, γ is exactly .5 (Table 5.5). While the β parameters for HyGene are defined relative to the training data and are unaffected by this, the fixed effects for Search and ∆I reflect the prior distributions, because they have no effect on the likelihood calculation when γ = .5.

Other evidence suggests that participants are not guessing (Figure A.1). To get an idea of how the models would fit under the assumption that people were not always guessing at the outcome, each model was fit to the data with the γ parameter fixed at .75. The results are in Tables 5.6 and 5.7. These models all have lower average likelihoods after the forced increase in γ. While ∆I has a slight advantage relative to Search and HyGene in likelihood, it also has a much larger effective number of parameters, giving it a larger (and less favorable) DIC.

  Metric    HyGene        Search        ∆I
  log L     −5406         −5406         −5396
  Penalty   0             0             24.46
  DIC       1.081 × 10⁴   1.081 × 10⁴   1.082 × 10⁴

Table 5.6: Model comparisons for HyGene, Search, and ∆I on the empirical data with fixed γ = .75.

Model    Parameter   Value
HyGene   µβ          0.027, 0.06, 0.09, 0.081
         σβ          0.114, 0.11, 0.11, 0.095
Search   µw          0.515
         σw          0.729
∆I       µ∆          0.505, 0.506, 0.505, 0.504
         σ∆          0.385, 0.432, 0.374, 0.448
         µw          0.454
         σw          0.684

Table 5.7: Median fixed effects for all models fit to the empirical data with fixed γ = 0.75. µw is the average relative weight of CV and DR, and σw is the standard deviation of the relative weight parameters. µ∆ and σ∆ give the mean and standard deviation of the delta parameters for each cue, indicating the distribution of differences in cue values needed to terminate search. µβ and σβ give the distributions for the weights in HyGene, indicating average search order.

Figure 5.5 plots the probability of choosing each cue first, for each participant, for each of the models and for the empirical data, based on the search orders from the revised models that fix γ at .75. The task only allowed participants to search a single cue, so validation must be limited to predictions regarding the first cue searched. A successful model of cognition should mimic the patterns present in the data.

[Figure 5.5: Probability of choosing each of the four cues first by source (HyGene, Delta, Search, Real), faceted by participant, with the right bar showing empirical cue choice distributions. Two subjects omitted for space.]

Almost all participants searched each of the cues first at least once. Some participants had near-uniform rates of first-cue search, while others chose one or two cues much more often than the others. The preferred cue differed by participant; some of the non-uniform choosers preferred cue four (highest CV) while others preferred cue one (highest DR). The three models are also fit to the same training information as the participants, so a truly successful model would exhibit all of these patterns and also accurately predict the frequencies with which each participant chose to search cues first.
Prediction of cue search behavior is a particularly high bar, given that the model likelihood does not depend on these search orders; first-cue search is an emergent property for all three models.

Search and ∆I make very similar predictions for the first cue searched. Both models predict that cues one and four will be searched with some probability for most participants, with occasional looks at cue three. Both models also show some variability in the uniformity of cue choice: some participants show a strong preference for a single cue while others choose among three of the available cues more evenly. The variance in uniformity does not follow the observed patterns in participant cue choice, however; Search and ∆I do not follow a given participant's probabilities of choosing individual cues very closely, if at all. Neither model predicts even a single look at the second cue, a notable departure from the data.

HyGene makes very different predictions that are inaccurate in different ways. This model usually identifies only a single cue that a participant will search, though it occasionally shows a second cue with a non-zero probability of being searched first. HyGene does sometimes predict the second cue as the preferred cue, though, which is consistent with the observed choices and never predicted by Search or ∆I. The modal pick for HyGene is also reasonably consistent with the modal choice of each participant.

5.3 Discussion

The current chapter focuses on fitting Search, ∆I, and HyGene to human responses rather than generated data. While simulated data were useful for understanding the mechanics of each model, the goal is to understand something about human behavior. I focus on each model in turn before attempting to reconcile the inferences from the differing explanations.

For these experimental data, HyGene predicts very consistent search orders within participant. HyGene also predicts that no one will search the DR cue. Despite obvious deviations from the observed data, the single cue searched by HyGene often agrees with the first- or second-most searched cue in reality. Though HyGene's search orders are too consistent, the pattern of cue choices suggests that the ordering mechanism is picking up on the same environmental structure as the participants. HyGene's median values of µβ are quite small compared with σβ. Despite this, HyGene searches cues in a very consistent order. This suggests unmodeled but positive covariances between the β parameters, which encode the overlap between the cues in predicting the magnitude of interest. Especially with the small training samples in this experiment, collinearity between the βs may be reflected in cue search.

Search heavily favors initial search of the first (DR) and fourth (CV) cues. Search also occasionally searches the third cue, which is preferred by no suggested cue ordering metric, but avoids the τ/success cue. This behavior is the result of large uncertainty in the relative weighting of CV and DR described by σw. Search was formulated as a model of individual differences in cue search based on fixed, point values of w for each participant. Despite this, the variability of w within participant is what allows variation in cue search order and increases the similarity between modeled and actual cue search orders. Search is interesting because of the extremism it shows.
The model probabilistically searches the CV cue, which has very poor discrimination, or the DR cue, which has the poorest CV, despite the existence of cue two, which goes entirely unsearched and is a compromise between these features.

Delta Inference makes predictions that are largely consistent with the Search model. The ∆ parameter distributions once again have density above one, so some of the time a differentiating cue is being ignored. Large values of ∆ are infrequent, however, and do very little to change search order. The relative weighting parameters are very similar to those in the Search model and yield very similar first-cue choices.

Across all three models, participants are guessing quite often. The high rate of guessing is likely a function of the ecology, which contains a large number of tied values and no cues of exactly zero weight. Assuming the correct valence is learned for each cue, searching any cue will yield above-chance accuracy. Participants may have picked up on this strategy and been insufficiently motivated to maximize accuracy on the task by using a more difficult strategy. This would be consistent with findings from Newell et al. (2009), which also used an environment in which different strategies would yield only small differences in accuracy.

The current experimental setup may not be an ideal test for these models. All four cues have positive predictive value according to CV, DR, and success. If the cues are too similar, participants may be unwilling or unable to distinguish between them. The difference in accuracy between conditions provides some protection from this criticism. The only difference in information for a given trial, and the only way accuracy could differ, is through cue choices. Despite the variation in first cues chosen, the difference in accuracy between conditions provides indirect evidence for some consistent pattern in cue use.

Another potentially limiting factor is the single cue that participants are allowed during the test phase. Forcing a single cue choice could change cue ordering behavior. While there is no way of knowing whether limiting available information alters cue utilization, this is a potentially interesting question for future research. If limiting information changes decision making, then it may be an important feature to include in future decision models. While no a priori reason exists for limiting information to alter cue use, it could explain the failure of all three models to capture cue choices in this study.

5.3.1 Summary

Beyond a high proportion of guessing, these three models do not agree on much regarding participant behavior. No single model is a particularly good predictor of actual participant search orders despite their nearly equivalent likelihoods for these data. This is partially by design: each of these models exists to explain decision making at a different level. In this case, however, it produces models that are mutually incompatible while capturing qualitatively different features of cue search.

Chapter 6: General Discussion

The road to wisdom? — Well, it's plain
and simple to express:
Err
and err
and err again
but less
and less
and less.

Piet Hein

Search, ∆ Inference, and HyGene have disparate theoretical motivations but predict human behavior with similar success. Being simplifications, it is neither surprising nor discouraging that they also fail to account for potentially important response patterns (Box and Draper, 1987).
These limitations do not preclude culling useful information from computational models, however; the ways in which these models fail tell us something about what aspects of decision making could be explained by omitted components.

Chapter 3 summarizes the performance of Search, ∆I, and HyGene when fit to data generated from each of the three models using two well-defined underlying ecologies. ∆I consistently outperforms Search and HyGene in both penalized and unpenalized likelihood despite a large effective number of parameters. Searching past a continuous cue that barely discriminates between alternatives is potentially more important for model performance than a flexible order of search. Both Search and ∆I had fixed search orders for these data. Given the focus on psychology rather than normative model performance, these results serve merely as a baseline for understanding how different aspects of these models are related.

Each model is next fit to simulated participants with varying search orders governed by a weighted combination of CV and DR. The GCT is a commonly-used dataset for decision research with unknown but naturalistic ecological structure and strong relationships between the cues and criterion values. For these data, Search fit better than ∆I and HyGene, though all three models identify very similar distributions of search orders, albeit with varying precision and accuracy. While this still tells us nothing about the psychology of decision making as such, it suggests limitations of using ∆I on dichotomous cues and shows that, even with mis-specification, HyGene's ordering mechanism generates reasonable posterior distributions of cue search behavior.

The penultimate chapter compares Search, ∆I, and HyGene when fit to behavioral data using an ecology designed to assess cue preference. The models in that chapter suggest, above all else, that participants in this study were guessing a large percentage of the time despite statistical evidence of greater-than-chance accuracy. The three models produce highly similar average fit statistics, suggesting comparable success in explaining the data. HyGene also produces very different search behavior than Search and ∆I when focusing only on the first cue searched. Search and ∆I settle on values of w that vary the first cue searched between the first and fourth cues (those highest on DR and CV, respectively), with occasional looks at the third cue (highest on none of the included metrics) depending on the participant. HyGene produces search behavior that is very consistent within participant, but would have participants search the second, third, or fourth cue depending on their training set. These different behaviors are uniquely inconsistent with observed search behavior; both approaches should select among all four cues, though they omit different cues, and both should show more variability among cues within participant, though HyGene shows less variability than Search and ∆I.

Search, ∆I, and HyGene exhibit some consistency in the posterior estimates of their parameters. The guessing parameter, γ, is estimated quite consistently across all models when fit to the same data. Consistency in γ suggests that the models agree on the error rate with which participants apply TTB to the modeled search orders. Though ∆I tends to converge on a slightly higher mean value, this is caused by an increased number of guesses. In terms of the model, when TTB(aj, bj) ∉ {a, b}, the model chooses either outcome with a 50% chance.
∆I has a higher proportion of true guesses because of the possibility of ∆ > 1; true guesses do not affect the posterior of γ. Error of application and guessing are only equivalent when γ = .5; otherwise, the model is more likely to produce a correctly-applied response.

In the GCT with participants simulated from Lee and Newell (2011), all three models identify very similar search orders. Despite the true search order being dictated by a relative weighting of CV and DR, the HyGene cue ordering mechanic approximates the search orders quite well. HyGene has more diffuse distributions of τ order due to misspecification, but still manages to find some search orders that are closer to the true order than those of Search and ∆I, due to the varying search order within participant. The observed consistency in search order suggests that whatever variance in cue ordering is due to CV and DR can be at least partially recovered using the current version of HyGene's search of episodic memory.

Taken together, this agreement causes problems for the interpretation of the models in Chapter 5. For those data, HyGene's predicted search orders are quite different from those of Search and ∆I when fixing γ above the guessing threshold. The initial, low value for γ, however, suggests that regardless of the searched cue, the models predict that participants are effectively guessing. This consistent inference across the models, despite performance well above chance for nearly all participants, indicates that some important aspect of decision making behavior is entirely ignored by these models.

6.1 Psychological Plausibility

Like all research, these studies have limitations. The three tested models represent a very small subset of the possible models that encode cue ordering in a two-alternative, forced-choice context. Other models, like SSL (Rieskamp and Otto, 2006), mixtures of models (Scheibehenne et al., 2013), or cognitive neuroscience-inspired models (Donoso et al., 2014), could more closely resemble the decision process that people use. Search, ∆I, and HyGene are interesting particularly because of their similarity. While these studies are unlikely to uncover the true generating model for participants' responses, features of the process that are consistent across the available explanations are more likely to be true of human decision making.

Further work in decision making must focus on psychological plausibility. Though hierarchical Bayesian modeling of individual differences is a step beyond deterministic models, aspects of Search, ∆I, and HyGene are still potentially optimistic about the limits of human cognition. Search and ∆I, for instance, make use of CV and DR calculations, which require that people either store or calculate these quantities for the relevant cues when making a decision. Calculation of CV and DR is less psychologically plausible than something like HyGene, which is based directly on memory search and therefore has convergent evidence for its cue ordering mechanism. One could go further by including limitations on working memory, including temporal dynamics, or modeling the learning process within these models.

These models could be further constrained. For example, psychological constructs like working memory could be assessed and included as data in these models (Lee, 2010), rather than assumed and fit as free parameters. Adding features such as working memory potentially increases the psychological and biological plausibility of these models.
6.2 Modeling Search Order

The CV and DR weighting mechanism in the Search model is restrictive: it only allows search orders that are some combination of CV and DR. The data from chapter five serve as an existence proof that, at least some of the time, participants select cues that are never predicted by this metric. While there are mathematical and historical reasons for focusing on CV and DR, a model allowing all possible search orders would sacrifice some ease of interpretation for a more accurate estimate of the variance in search order attributable to individual differences. The w parameter is convenient because it can be interpreted as a participant's relative preference between the two cue metrics. A Dirichlet distribution or Gaussian process model for search order would allow all possible search orders, but would be nearly impossible to summarize with a single value. If the goal for the Search model is to give plausible estimates of sources of uncertainty in decision processes, then abandoning w for a more flexible mechanism makes sense.

Another problem with using w to establish cue order is that it maps non-linearly onto search order. Depending on the distribution of CV and DR in a given ecology, changes in w cause essentially unpredictable changes in search order. The non-linear mapping of w onto cue order also gives little reason to believe that the posterior distribution of w will be continuous and unimodal. Despite the apparently simple interpretation and obvious relationship with the success metric, relative weighting of CV and DR is a troublesome method for understanding search order.
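Both points can be illustrated with a small sketch, assuming the weighted-sum form w · CV + (1 − w) · DR implied by the description above; the CV and DR values here are invented for illustration. Sweeping w produces abrupt, ecology-dependent jumps among a handful of reachable orders rather than a smooth progression through all 4! = 24 possible orders.

```python
import numpy as np

def weighted_search_order(w, cv, dr):
    """Order cues by the weighted score w * CV + (1 - w) * DR, descending."""
    score = w * np.asarray(cv) + (1 - w) * np.asarray(dr)
    return [int(i) for i in np.argsort(-score)]

cv = [0.85, 0.70, 0.60, 0.55]   # cue 0 is highest on CV
dr = [0.15, 0.40, 0.35, 0.90]   # cue 3 is highest on DR

# The mapping from w to search order is a step function: many values of w
# yield the same order, and a small change near a crossing point can
# reorder several cues at once.
for w in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(w, weighted_search_order(w, cv, dr))
```

For these invented values, the order is constant from w = 0 through w = 0.4 and then reorders twice between w = 0.6 and w = 1.0, which is exactly the kind of discontinuity that complicates interpreting a posterior over w.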
6.3 Contamination

Lee (2010) also suggests a process for detecting contamination. Though the two preceding results chapters focused on simulated data, chapter five's empirical data almost certainly include features that are not the direct and sole result of a person making his or her best judgments about the task. The models in this dissertation include a misapplication parameter, γ, which fills this role in a limited way. Lee and Newell (2011) interpret this term as an error of application, with γ being the probability of choosing explicitly counter to the TTB prediction (and separate from guessing). Modeling contamination might make use of additional information, such as reaction time or changes in accuracy over time, to isolate and remove patterns in data that are unrelated to the underlying construct of interest. Removing contaminants prevents the substantive model parameters from attempting to account for variability in the observed data that is actually the result of a mixture process. In addition to guessing, participants may also search cues differently over time (Table A.2), a process that could be motivated by the limited feedback, boredom, or exhaustion.

The motivation for removing contaminants is the same as for removing outlying data points: extreme scores can bias statistical tests. Unfortunately, unprincipled outlier removal can also negatively influence test properties (Antonakis and Dietz, 2011). Removing participants with near-chance accuracy would partially alleviate the problem of modeling the mixture of true decision making behavior and contaminant guessing, but it would also introduce a selection problem into the modeling. Modeling contaminant processes instead provides a principled method of removing spurious or unrelated patterns in the data. One inference from chapter five is that participants are almost certainly guessing on some trials. Different collection methods and more constrained models would potentially allow us to isolate the guessed trials by participant and focus on trials that are the result of a legitimate decision process.

6.4 Aggregation

Search order is not the only question and might not even be a relevant question. Though the models under consideration assume sequential cue search, people may combine cues in various, unordered ways to produce decisions. Some modeling has even been done to capture the trade-off between sequential and simultaneous cue use (Lee and Newell, 2011; Ravenzwaaij et al., 2014), and in some cases both methods produce similar results. Many studies provide evidence that participants search through cues rather than combining them in some way (Newell and Shanks, 2003). The experimental data in chapter five allow only a single cue's information for each decision at test. Even if this scenario is only relevant to a subset of the decisions people make in the wider world, participants in the study had to choose among cues in some way. A more extensive comparison of the Search, ∆I, and HyGene models would explore aggregation as well. The Stop model, briefly discussed in chapter 1, provides one potential method for understanding the balance between effort and information in decision processes (Lee and Newell, 2011). The same methods used for cue ordering could be applied as weighting schemes or used to alter the stopping rules of the decision process.

One motivation for exploring non-normative models of decision making is to account for the information-processing constraints that humans impose on the process. Decision processes could be made informationally frugal in a large number of ways, and this is a question that ∆I attempts to address. With dichotomous cues, the ∆ parameters in our models were just another source of uncertainty, occasionally allowing the search process to continue past a dichotomous cue. With continuously-valued cues, however, ∆ provides a mechanism for adaptive equivalence, setting a threshold on which differences are meaningful enough to stop search (see the sketch below). Though this level of complication was not justified in the limited environments of the present dissertation, fitting to multiple environments and examining sources of individual variation, or combining the ∆ parameterization with other cue ordering methods, could be useful.
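The sketch below illustrates the adaptive-equivalence mechanism with hypothetical per-cue thresholds; it is an illustration of the stopping rule described above, not the full ∆I parameterization, and the names are illustrative.

```python
import random

def delta_inference_choice(values_a, values_b, search_order, deltas):
    """Stop at the first cue whose difference exceeds its threshold (a sketch).

    values_a, values_b : cue values for the two alternatives.
    deltas             : per-cue thresholds; a difference of at most delta is
                         treated as equivalence, and search continues.
    """
    for i in search_order:
        diff = values_a[i] - values_b[i]
        if abs(diff) > deltas[i]:        # a meaningful difference: stop search
            return "a" if diff > 0 else "b"
    return random.choice(["a", "b"])     # every cue was "equivalent": guess

# With dichotomous cues, |diff| is 0 or 1, so any delta >= 1 lets search run
# past even a discriminating cue; this is the source of the extra true
# guesses noted in the chapter summary above.
```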
6.5 Summary

People make seemingly difficult decisions constantly and with relative ease. Experimental work shows that these decisions do not conform to a variety of prescriptive or deterministic models, but it is as yet unclear how far uncertainty can be reduced in models of human decision processes. This dissertation shows that three qualitatively different computational models of cue ordering and decision making can fit empirical data with similar success but fail to capture important patterns in search order. Assuming sequential search of cue information, people might differ on the relative weighting of cue metrics or generate cue orders based on the similarity of a decision to memories of the ecology. Either way, other inputs to the decision process must explain variation in individual cue ordering, and people probably aggregate over multiple cues in some instances.

Appendix A: Experimental Differences in Accuracy

Fixed Effects
Term        Coefficient   Std. Error   z       Pr(>|z|)
Intercept   0.314         0.072        4.334   1.46 × 10^-5
Gain        0.228         0.103        2.223   0.026

Error Terms
Group      Coefficient   Std. Dev.
Subject    Intercept     0.259
Residual                 1.000

Table A.1: Summary of a multilevel logistic regression predicting accuracy from condition, with intercepts varying by participant. The intercept gives the average accuracy for the loss condition on the log-odds scale; the difference for the gain condition is given by the Gain coefficient.

There is some evidence that the gain/loss manipulation alters accuracy. Participants were generally not very accurate on the task, with an overall accuracy of 59.8% ± 0.5%. A multilevel logistic regression with varying intercepts by participant suggests that accuracy is higher for participants in the gain condition than for those in the loss condition (Table A.1). Very few participants had average accuracies below chance (Figure A.1).

Figure A.1: Boxplots of average participant accuracy by condition.

It is possible that, absent sufficient engagement or feedback, participants altered their cue search strategies over time. Table A.2 presents posterior estimates for a multilevel multinomial model predicting participant cue choices in the test phase by trial, using a logistic link function with MCMCglmm (Hadfield, 2010). This multinomial model allows for differences in cue choice, and in the effect of trial on cue choice, for each participant, which is the equivalent of varying intercepts and slopes in a multilevel regression. The first three parameters estimate the rates of choosing cues two, three, and four relative to cue one on the first trial. Participants appear to choose cues one and three at nearly equal rates on the first trial, while cues two and four are chosen less often than the first cue. The second three parameters estimate the change, per additional trial, in the rate of choosing a given cue relative to cue one. Relative to cue one, cue two is chosen more often in later trials. Figure A.2 shows the probability of choosing each cue over time, averaged over participants.

Parameter     Mean          95% CI                pMCMC
Cue 2         −0.41         (−0.7, −0.07)         0.013
Cue 3         −0.22         (−0.6, 0.2)           0.269
Cue 4         −0.48         (−0.9, −0.07)         0.025
Cue 2:Trial   0.002         (−4 × 10^-6, 0.004)   0.042
Cue 3:Trial   10^-5         (−0.002, 0.002)       0.967
Cue 4:Trial   6.6 × 10^-4   (−0.002, 0.003)       0.697

Table A.2: Fixed-effect estimates for a multilevel multinomial model predicting cue choice over time, with effects varying by participant. Mean gives the mean of the marginal posterior distribution for each parameter; the 95% credible interval gives the 2.5th and 97.5th percentiles of each marginal posterior distribution. pMCMC is an MCMC approximation of a p-value: the probability of observing an estimate of equal or greater magnitude under a distribution with the estimated standard deviation centered at zero.

Figure A.2: Probability of choosing each cue, averaged over subjects, over the course of the test trials. The error ribbon represents a single-proportion standard error, √(p(1 − p)/N).
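For readers reproducing these summaries, a brief sketch makes the appendix's two quantitative conventions explicit: the fixed effects in Table A.1 are on the log-odds scale, and the error ribbons in Figure A.2 use the single-proportion standard error. The estimates plugged in below come from Table A.1; the helper names are mine.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))

# Table A.1 fixed effects, converted from log-odds to (approximate)
# average accuracies:
loss_accuracy = inv_logit(0.314)           # about 0.578 in the loss condition
gain_accuracy = inv_logit(0.314 + 0.228)   # about 0.632 in the gain condition

def proportion_se(p, n):
    """Single-proportion standard error, sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)
```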
Bibliography

Anderson, J. R. (1990). The Adaptive Character of Thought. Psychology Press.

Antonakis, J. and Dietz, J. (2011). Looking for validity or testing it? The perils of stepwise regression, extreme-scores analysis, heteroscedasticity, and measurement error. Personality and Individual Differences, 50(3):409–415.

Bergert, F. B. and Nosofsky, R. M. (2007). A response-time approach to comparing generalized rational and take-the-best models of decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(1):107.

Bowen, J. and Qiu, Z.-l. (1992). Satisficing when buying information. Organizational Behavior and Human Decision Processes, 51(3):471–481.

Box, G. E. P. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons.

Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3):199–231.

Brunswik, E. (1952). The Conceptual Framework of Psychology, volume 1. University of Chicago Press.

Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62(3):193.

Buttaccio, D. R., Lange, N. D., Thomas, R. P., and Dougherty, M. R. (2015). Using a model of hypothesis generation to predict eye movements in a visual search task. Memory & Cognition, 43(2):247–265.

Chater, N., Oaksford, M., Nakisa, R., and Redington, M. (2003). Fast, frugal, and rational: How rational norms explain behavior. Organizational Behavior and Human Decision Processes, 90(1):63–86.

Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571.

Dawes, R. M. and Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81(2):95.

Donoso, M., Collins, A. G., and Koechlin, E. (2014). Foundations of human reasoning in the prefrontal cortex. Science, 344(6191):1481–1486.

Dougherty, M. R., Franco-Watkins, A. M., and Thomas, R. (2008). Psychological plausibility of the theory of probabilistic mental models and the fast and frugal heuristics. Psychological Review, 115(1):199.

Dougherty, M. R., Gettys, C. F., and Ogden, E. E. (1999). MINERVA-DM: A memory processes model for judgments of likelihood. Psychological Review, 106(1):180.

Fellner, G., Güth, W., and Maciejovsky, B. (2009). Satisficing in financial decision making: A theoretical and experimental approach to bounded rationality. Journal of Mathematical Psychology, 53(1):26–33.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2014). Bayesian Data Analysis, volume 2. Taylor & Francis.

Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gigerenzer, G. (1993). The bounded rationality of probabilistic mental models. In Manktelow, K. I. and Over, D. E., editors, Rationality: Psychological and Philosophical Perspectives, pages 284–313. Routledge, London.

Gigerenzer, G. (2010). Moral satisficing: Rethinking moral behavior as bounded rationality. Topics in Cognitive Science, 2(3):528–554.

Gigerenzer, G. and Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1):107–143.

Gigerenzer, G. and Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103(4):650.

Gigerenzer, G., Hoffrage, U., and Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98(4):506.

Gigerenzer, G. and Todd, P. M. (1999). Simple Heuristics That Make Us Smart. Oxford University Press.

Glöckner, A., Betsch, T., and Schindler, N. (2010). Coherence shifts in probabilistic inference tasks. Journal of Behavioral Decision Making, 23:439–462.

Hadfield, J. D. (2010). MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package. Journal of Statistical Software, 33(2):1–22.
Hammond, K. R. (1990). Functionalism and Illusionism: Can Integration Be Usefully Achieved? University of Chicago Press.

Hilbig, B. E., Erdfelder, E., and Pohl, R. F. (2010). One-reason decision making unveiled: A measurement model of the recognition heuristic. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(1):123.

Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16(2):96–101.

Hogarth, R. M. and Karelaia, N. (2007). Heuristic and linear models of judgment: Matching rules and environments. Psychological Review, 114(3):733.

Karelaia, N. and Hogarth, R. M. (2008). Determinants of linear judgment: A meta-analysis of lens model studies. Psychological Bulletin, 134(3):404.

Lee, M. D. (2008). Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15(1):1–15.

Lee, M. D. (2010). How cognitive modeling can benefit from hierarchical Bayesian modeling. Journal of Mathematical Psychology.

Lee, M. D. and Newell, B. R. (2011). Using hierarchical Bayesian methods to examine the tools of decision-making. Judgment and Decision Making, 6(8).

Lee, M. D. and Zhang, S. (2012). Evaluating the coherence of take-the-best in structured environments. Judgment and Decision Making, 7(4):360.

Luan, S., Schooler, L. J., and Gigerenzer, G. (2014). From perception to preference and on to inference: An approach-avoidance analysis of thresholds. Psychological Review, 121(3):501–525.

Marewski, J. N. and Schooler, L. J. (2011). Cognitive niches: An ecological model of strategy selection. Psychological Review, 118(3):393.

Marr, D. (1982). Vision: A Computational Investigation. Freeman, New York.

Martignon, L. and Hoffrage, U. (1999). Why does one-reason decision making work? In Gigerenzer, G. and Todd, P. M., editors, Simple Heuristics That Make Us Smart, pages 119–140. Oxford University Press.

Martignon, L. and Hoffrage, U. (2002). Fast, frugal, and fit: Simple heuristics for paired comparison. Theory and Decision, 52(1):29–71.

McCloskey, D. N. (1998). The Rhetoric of Economics. University of Wisconsin Press.

Meehl, P. E. (1954). Clinical versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. University of Minnesota Press.

Newell, B. R. (2005). Re-visions of rationality? Trends in Cognitive Sciences, 9(1):11–15.

Newell, B. R. and Lee, M. D. (2011). The right tool for the job? Comparing an evidence accumulation and a naive strategy selection model of decision making. Journal of Behavioral Decision Making, 24(5):456–481.

Newell, B. R., Rakow, T., Weston, N. J., and Shanks, D. R. (2004). Search strategies in decision making: The success of success. Journal of Behavioral Decision Making, 17(2):117–137.

Newell, B. R. and Shanks, D. R. (2003). Take the best or look at the rest? Factors influencing "one-reason" decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(1):53.

Newell, B. R., Weston, N. J., Tunney, R. J., and Shanks, D. R. (2009). The effectiveness of feedback in multiple-cue probability learning. The Quarterly Journal of Experimental Psychology, 62(5):890–908.

Parker, A. M., Bruine de Bruin, W., and Fischhoff, B. (2007). Maximizers versus satisficers: Decision-making styles, competence, and outcomes. Judgment and Decision Making, 2(6):342–350.

Payne, J. W., Bettman, J. R., and Johnson, E. J. (1988). Adaptive strategy selection in decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3):534.
Payne, J. W., Bettman, J. R., and Johnson, E. J. (1992). Behavioral decision research: A constructive processing perspective. Annual Review of Psychology, 43(1):87–131.

Platzer, C., Bröder, A., and Heck, D. W. (2014). Deciding with the eye: How the visually manipulated accessibility of information in memory influences decision behavior. Memory & Cognition, 42(4):595–608.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124, page 125. Technische Universität Wien.

Plummer, M. (2015). rjags: Bayesian Graphical Models Using MCMC. R package version 3-15.

Rand, D. G., Peysakhovich, A., Kraft-Todd, G. T., Newman, G. E., Wurzbacher, O., Nowak, M. A., and Greene, J. D. (2014). Social heuristics shape intuitive cooperation. Nature Communications, 5.

Ravenzwaaij, D., Moore, C. P., Lee, M. D., and Newell, B. R. (2014). A hierarchical Bayesian modeling approach to searching and stopping in multi-attribute judgment. Cognitive Science, 38(7):1384–1405.

Rieskamp, J. and Otto, P. E. (2006). SSL: A theory of how people learn to select strategies. Journal of Experimental Psychology: General, 135(2):207.

Scheibehenne, B., Rieskamp, J., and Wagenmakers, E.-J. (2013). Testing adaptive toolbox models: A Bayesian hierarchical approach. Psychological Review, 120(1):39.

Schwartz, B., Ward, A., Monterosso, J., Lyubomirsky, S., White, K., and Lehman, D. R. (2002). Maximizing versus satisficing: Happiness is a matter of choice. Journal of Personality and Social Psychology, 83(5):1178.

Shalizi, C. (2015). Advanced Data Analysis from an Elementary Point of View. Unpublished.

Simon, H. A. (1955). A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99–118.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639.

Stirling, W. C. and Goodrich, M. A. (1999). Satisficing games. Information Sciences, 114(1):255–280.

Thomas, R. P., Dougherty, M. R., Sprenger, A. M., and Harbison, J. (2008). Diagnostic hypothesis generation and human judgment. Psychological Review, 115(1):155–185.

Todd, P. M. and Dieckmann, A. (2004). Heuristics for ordering cue search in decision making. In Saul, L. K. and Bottou, L., editors, Advances in Neural Information Processing Systems, pages 1393–1400. MIT Press.

Wang, W. and Gelman, A. (2014). Difficulty of selecting among multilevel models using predictive accuracy. Statistics and Its Interface, 7(1).

Yoon, K. P. and Hwang, C.-L. (1995). Multiple Attribute Decision Making: An Introduction, volume 104. Sage Publications.