RESEARCH ARTICLE Scrutinizing LLAMA D as a measure of implicit learning aptitude Takehiro Iizuka* and Robert DeKeyser University of Maryland, College Park, Maryland, USA *Corresponding author. E-mail: tiizuka@terpmail.umd.edu (Received 26 May 2022; Revised 09 December 2022; Accepted 13 December 2022) Abstract Since Gisela Granena’s influential work, LLAMA D v2, a sound recognition subtest of LLAMA aptitude tests, has been used as a measure of implicit learning aptitude in second language acquisition research. The validity of this test, however, is little known and the results of studies with this instrument have been somewhat inconsistent. In this study, we tested the hypothesis that researchers’ variable test instructions are the source of the inconsistent results. One hundred fourteen English monolinguals were randomly assigned to take LLAMADv2 under one of three test instruction conditions. They also completed two implicit aptitude tests, three explicit aptitude tests, and a sound discrimination test. The results showed that, regardless of the type of test instructions, LLAMAD scores did not align with implicit aptitude test scores, indicating no clear evidence of the test being implicit. On the contrary, LLAMA D scores were negatively associated with scores on one implicit aptitude test, the Serial Reaction Time (SRT) task, but only in the condition where the instructions drew participants’ focal attention to the stimuli. This negative association was interpreted as focal attention working against learning in the SRT task. Implicit learning aptitude may be the degree to which one is able to process input without focal attention. Introduction Cognitive psychologist Arthur Reber, who coined the term “implicit learning,” viewed implicit learning mechanisms as a fundamental aspect of human cognition that varies minimally across individuals (A. Reber, 1967; A. Reber et al., 1991). This view has been embraced by some Second Language Acquisition (SLA) researchers as well (e.g., Krashen, 1981), leading to the implicit assumption that cognitive individual differ- ences, if examined in SLA, are concerned with explicit learning abilities (Wen et al., 2017). Although explicit learning abilities do indeed have predictive power for suc- cessful adult SLA (see, e.g., Abrahamsson & Hyltenstam, 2008; DeKeyser, 2000), it might also be true that the complex process of SLA involves implicit as well as explicit learning, and without empirical examination we might not want to dismiss the possibility of individual differences in implicit learning. Calls for this line of research ©TheAuthor(s), 2023. Published by Cambridge University Press. This is anOpenAccess article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited. Studies in Second Language Acquisition (2023), 1–23 doi:10.1017/S0272263122000559 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://orcid.org/0000-0003-4476-4697 mailto:tiizuka@terpmail.umd.edu http://creativecommons.org/licenses/by/4.0 https://doi.org/10.1017/S0272263122000559 http://crossmark.crossref.org/dialog?doi=https://doi.org/10.1017/S0272263122000559&domain=pdf https://doi.org/10.1017/S0272263122000559 (e.g., Kaufman et al., 2010; Woltz, 2003) were answered by preliminary findings that there may be individual differences in implicit learning abilities influencing adult SLA (e.g., Granena, 2013b; Suzuki & DeKeyser, 2015). This line of inquiry, however, is still in its infancy: We have little evidence that there is a reliable unitary construct of implicit learning aptitude (see P. Reber, 2013 for a review of relevant neuroimaging studies), and even if such a construct exists, we know little about how to measure it. The present study examined LLAMA D, a subtest of the LLAMA Language Aptitude Tests (Meara, 2005), which Granena (2013a, 2019) has proposed taps implicit language aptitude. Specifically, this study explored whether LLAMA D scores depend upon test administration procedures (i.e., test instructions) and, if they do, with which variant of test instructions the test best aligns with other measures of implicit learning abilities and is dissociated from measures of explicit learning abilities. A secondary purpose of the study was to see if currently available measures of implicit learning abilities (sequence learning and priming) are associated with each other, capturing the same underlying construct of implicit learning aptitude. Note that this study was conducted with LLAMADversion 2, and the findings reported here might not apply to version 3, a beta version of which is now available.1 Literature review Defining implicit learning aptitude As “implicit learning” is not always used consistently in the field, we will be explicit about what ismeant by the term. Following the research traditions of implicit/explicit learning in cognitive psychology andpsycholinguistics (e.g., DeKeyser, 2003; Jiménez, 2002; Kaufman et al., 2010; P. Reber, 2013; Rebuschat, 2013; Shanks, 2005; Williams, 2009), implicit learning was defined in this study as learning under conditions in which all the following criteria weremet: (a) no intention to learn the object of learning, (b) no awareness of what is being learned (i.e., process) or the product of learning, at least at the time of learning, and (c) no focal attention to the object of learning through one’s use of central executive attentional resources. Accordingly, we view “implicit learning aptitude” as the ability to learn something unintentionally, without awareness, irrespective of focal attention. Also, we see adult second language acquisition as a process largely driven by domain-general cognition (DeKeyser, 2003), while acknowledging some language- specific aspects (see, e.g., Skehan, 2016 for the discussion of domain generality and specificity). In this article, “implicit learning aptitude” (or, more simply, “implicit aptitude”) will be used for domain-general aptitude, and “implicit language (learning) aptitude” for language-specific aptitude. Measures of implicit learning aptitude Perhaps the most widely used measure of implicit learning aptitude to date is the implicit sequence learning paradigm, the Serial Reaction Time (SRT) task in particu- lar.2 In the SRT task, participants are instructed to respond as quickly and as accurately 1The latest version of the LLAMA tests can be accessed from Meara’s website: https://www.lognostics. co.uk/tools/ 2Some of the instruments used in the research of explicit/implicit learning are used in the research of declarative/procedural memory as well. The current study was framed by the paradigm of explicit/implicit learning (for a somewhat similar study with the framework of declarative/procedural memory, see, e.g., Buffington et al., 2021). 2 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://www.lognostics.co.uk/tools/ https://www.lognostics.co.uk/tools/ https://doi.org/10.1017/S0272263122000559 as possible to the location of a stimulus that appears on a computer screen. Unbe- knownst to the participants, a series of trials follows a certain regularity as to where the stimulus appears, and thus reaction time to the regular sequence decreases over the trials. As long as the task is probabilistic (as opposed to deterministic), intermixing the regular sequence with random sequences, participants are usually not aware of the regularity, and the learning is implicit (Janacsek & Nemeth, 2013; Jiménez, 2002). In fact, this type of sequence learning has been shown not to be significantly influenced by intention to learn (Jiménez et al., 1996), awareness of regularity (Cleeremans & Jiménez, 1998), or the amount of attention (Jiménez &Méndez, 1999), therebymeeting the criteria for implicit learning. Several studies have demonstrated that individual differences measured by this sequence learning paradigm predict success in adult SLA, particularly the likelihood of reaching high proficiency (Linck et al., 2013) and of developing grammatical sensitivity to subtle second language (L2) features (Granena, 2013b; Suzuki &DeKeyser, 2015), both of which presumably require a certain degree of implicit learning. Priming is another paradigm that has been proposed to tap implicit learning aptitude (Woltz, 2003). The basic idea of priming is that performance is facilitated by a past experience, which, in an experimental situation, is usually operationalized by a reaction time difference between a primed trial (i.e., preceded by a related trial) and a control trial. Priming can largely be divided into two kinds—perceptual and concep- tual priming, each of which is caused by a preceding stimulus similar either in form or meaning, respectively (see Tulving & Schacter, 1990; Woltz, 2003). This paradigm is also considered implicit on the grounds that participants have another task goal (e.g., deciding whether a stimulus is a word or nonword), priming being merely a by-product of task completion. Research indeed shows that increased memory load with a concurrent task does not affect priming effects, suggesting that priming is independent of explicit, attention-controlled recourses (e.g., Woltz & Was, 2006). Several studies have shown that individual differences measured by priming can predict success in cognitive skill acquisition in general and language acquisition in particular. Woltz (1988, 1999), for instance, reported that repetition (i.e., perceptual and conceptual) priming predicted a later stage of cognitive skill acquisition across verbal, numeric, and spatial content domains. Larkin, Woltz, Reynolds, and Clark (1996) also observed that conceptual priming was significantly associated with reading ability for sixth graders. Similarly, Was and Woltz (2007) found that the capacity for conceptual priming has a significant impact on first language (L1) listening ability in adults. Conceptual priming has been shown to be related to fluency of L2 speech production too (Granena, 2019). Although both paradigms more or less appear to succeed in measuring individual differences relevant to some kind of implicit learning, it is too early to say that they can be a yardstick of across-the-board implicit learning aptitude. In addition to the behavioral evidence that implicit learning measures often do not correlate with one another (e.g., Buffington et al., 2021; Gebauer & Mackintosh, 2007; Godfroid & Kim, 2021; Suzuki & DeKeyser, 2017), neuroimaging studies have shown that the locus of brain activation is not consistent across paradigms for implicit learning, which is in contrast to the case of explicit learning, where the learning process relies on the medial temporal lobe memory system (P. Reber, 2013). One of the implications is that we might not want to hastily assume that implicit learning outcomes of interest can be predicted by any one implicit aptitude measure, but instead might want to target the specific implicit process of interest. To measure implicit language learning aptitude then, a language-based paradigm may be preferable. The sequence learning paradigm LLAMA D as implicit learning aptitude 3 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 we just reviewed is not language based. The priming paradigm, particularly in the case of conceptual or semantic priming, involves language, but it only concerns the process of implicit activation of already acquired knowledge, excluding the encoding process of new knowledge. Hence, LLAMA D, which we will introduce in the next section, might hold promise for those who want to measure implicit language learning aptitude because it is a language-based test and involves both encoding and retrieval processes. Note again that, as mentioned in the introduction, the current study dealt with LLAMA D version 2 and that some of the features described in the following text may not apply to the newer version (more on this in the “Discussion” section). LLAMA D as a measure of implicit learning aptitude The LLAMA Language Aptitude Tests were developed by Paul Meara (2005). The tests are language-independent (see Granena, 2013a; Rogers et al., 2017 for negligible effects of test takers’ L1), and available to everyone for free, computer-delivered with auto- matic score calculation, making them easily accessible to a wide range of researchers. The test battery consists of four subtests, LLAMA B, D, E, and F, measuring vocabulary learning, sound recognition, sound-symbol association, and grammatical inferencing, respectively. Granena’s (2013a) exploratory validation study, using various cognitive ability tests along with LLAMA, demonstrated that there may be two different aptitude dimensions the LLAMA tests tap into, namely, explicit and implicit language learning aptitude. More specifically, her factor analysis showed that LLAMA B, E, and F loaded onto one factor, whereas LLAMA D loaded onto another factor. An additional factor analysis further revealed that LLAMA B, E, and F constituted a factor with intelligence (measured by an IQ test), while LLAMA D constituted a separate factor with implicit learning skills (measured by SRT), which was taken as evidence that LLAMA D is a measure of implicit language learning aptitude (see also Granena, 2019, for a similar finding, where LLAMA D aligned with conceptual priming). After this finding by Granena, researchers have started to use LLAMA D as a test of implicit learning aptitude, and the number of such studies has been increasing (e.g., Artieda & Muñoz, 2016; Forsberg Lundell & Sandgren, 2013; Granena, 2013b, 2016, 2019; Granena & Long, 2013; Lee, 2018; Li & Qian, 2021; Ma et al., 2018; Martens et al., 2016; Montero et al., 2018; Moorman, 2017; Mueller, 2017; Rodríguez Silva, 2017; Saito, 2017, 2019; Saito et al., 2019; Suzuki, 2021; Yalçın et al., 2016; Yalçın & Spada, 2016; Yi, 2018). The areas of these studies range fromphonology (Saito et al., 2019) to collocations (Yi, 2018) to grammar (Yalçın & Spada, 2016), covering beginners (Artieda & Muñoz, 2016) as well as advanced learners (Forsberg Lundell & Sandgren, 2013). The stakes, therefore, are getting higher and higher. However, the creator of the LLAMA tests noted that the tests have not been properly validated, and thus that “they should NOT be used in high-stakes situations” (Meara, 2005, p. 21). More than a decade later this still appears to be true (although some validation work is under way, e.g., Bokander & Bylund, 2020; Rogers et al., 2023). On top of that, the suggestion that LLAMA B, E, and F and LLAMA D tap explicit and implicit language learning aptitude, respectively, is essentially based solely on Grane- na’s (2013a, 2019) work, and such use of the tests is beyond the original intention of the test developer. The situation, therefore, may warrant a careful examination. When looking into empirical studies with LLAMA D, we see rather mixed results (see Appendix S1 in Supplementary Materials for the summary). The outcomes are in the expected direction: LLAMA D scores are associated with the attainment of, for 4 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 example, L2 collocations (Forsberg Lundell & Sandgren, 2013; Granena & Long, 2013), agreement structures (Granena, 2013b), and sound-symbolic intuitions (Mueller, 2017), all of which are assumed to be mainly the products of data-driven, implicit learning. LLAMA D scores have also been shown to predict long-term development of L2 speech production (Saito et al., 2019), which also aligns with the idea of LLAMADas a measure of implicit learning aptitude, if we think implicit learning becomes increas- ingly important in later stages of SLA. Yet the outcomes are not always easy to interpret: LLAMA D scores, in some cases, predict accuracy in L2 writing (Lee, 2018) and early stages of adult foreign language learning (Artieda & Muñoz, 2016), in which explicit learning presumably should play a more important role. Also noteworthy is that, when both LLAMA tests and the SRT task are used in the same study, the correlation coefficient between LLAMA D and SRT (implicit aptitude) is smaller than the one between LLAMADand LLAMAB, E, or F (explicit aptitude) (Granena, 2016; Yi, 2018). Furthermore, the relationships between LLAMA D and the other subtests are not consistent: There are sometimes moderate positive correlations (e.g., LLAMA D–B: r = .35, p < .01 in Yalçın & Spada, 2016; r = .34, p < .05 in Yi, 2018), while at other times no correlations were found (e.g., LLAMA D–B: r = .05, p > .05 in Saito, 2019; r = .03, p > .05 in Yalçın et al., 2016). To make sense of these apparent inconsistencies and use LLAMA D with more confidence, we might want to explore what is taking place in the test. Overall, the test goes like this: Test-takers listen to 10 unfamiliar sound strings—computer-generated words based on a Native American language from British Columbia. After that, they move on to the test stage, where they listen to 30 sound strings one by one and indicate whether they have heard the strings or not. Unlike the other LLAMA subtests, which include a study phase (2 minutes for LLAMA B and E; 5 minutes for LLAMA F), LLAMA D only exposes test-takers to the stimuli once, all in a row, before the test, arguably placing itself at the more implicit end of the spectrum for an aptitude test (Granena, 2013a). A thing to note, though, is that, partly as a result of its language- independent nature, the test does not have any standardized test instructions, and thus how to administer it is up to individual researchers. This potential inconsistency in test administration can be critical, particularly if we intend to use the test as a measure of implicit learning aptitude, because implicitness is, as mentioned in the earlier section, very sensitive to intention, awareness, and attention. In the next section, therefore, we will explore possible variations of LLAMA D test instructions and their relevance to implicit learning. Test instructions for LLAMA D While the LLAMA manual (Meara, 2005) encourages researchers to create their own test instructions, it provides some ideas of how to administer the test, which are considered default instructions and which many researchers presumably use. One publication by Meara and his colleagues demonstrates this default type of instructions: “Youmust listen to the sound recording and it will play with [sic] a number of made up words. Your task is to learn and memorise as many of these words as possible” (Rogers et al., 2016, p. 209). Now, reflecting on these instructions, we notice differences with other measures of implicit learning, namely, the LLAMA D instructions entail test- takers’ (a) intention to learn the stimuli, (b) awareness of what is being learned, and (c) focal attention to the stimuli, which all go against the criteria of implicit learning. Incidentally, the previously mentioned studies (Artieda & Muñoz, 2016; Lee, 2018), LLAMA D as implicit learning aptitude 5 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 which offered alleged counterevidence to LLAMA D as a measure of implicit learning aptitude, used this type of instructions. Granena (2013a, 2013b, 2016, 2019; Granena & Long, 2013) used a slightly different set of instructions: “You will hear a set of words in a language you are not familiar with. Your task is simply to listen carefully” (G. Granena, personal communication, April 13, 2018). As with the default instructions, focal attention is drawn to the stimuli by this type of instructions. The existence of intention and awareness, however, is less clear. Not being informed of the test phase in advance, test-takers would most likely have less intention and awareness of learning, compared with the default instructions, yet at least some participants would anticipate some kind of test, partly because, with the order of the LLAMA subtests randomized (Granena, 2013a), some of them have already taken other subtests before LLAMA D, becoming familiar with the sequence from the study/ exposure phase to the test phase. In any case, many individuals are expected to try to discover patterns, if not try to memorize them, under this condition. Yet another set of instructions was adopted by Saito (2017, 2019; Saito et al., 2019). To prevent test-takers from learning the stimuli intentionally, the researcher pretends that the exposure phase is just a sound check: They are only told to check if they can hear sound without any difficulty, which is followed by a surprise test. Their lack of intention to learn is confirmed by interview shortly after the test. Also, the test is always administered first among the entire test battery, making the “sound check” session reasonable and minimizing their anticipation of the test. Thus, the intention and awareness of learning are considered absent in this instruction condition. The existence of focal attention, however, is less clear, as some participants might pay attention to the sound strings, while others might just process them without any particular focus, only making sure the volume is okay. Table 1 summarizes the characteristics of the three types of test instructions, labeled as “memorization,” “just listen,” and “sound check” conditions, respectively. As seen in the table, the three types of instructions can be construed as varying from most explicit (“memorization”) to least explicit (“sound check”). In fact, previous studies suggest that the correlation coefficient between LLAMAD and the clearly explicit subtests is larger when the test instructions are more explicit (e.g., LLAMA D–B: r = .34, p < .05 in Yi [2018] with “memorization” instructions; r = .29, p < .05 in Granena [2016] with “just listen” instructions) than when less explicit (e.g., LLAMA D–B: r = .05, p > .05 in Saito [2019]; r = .05, p > .05 in Suzuki [2021] both with “sound check” instructions), giving us the impression that less explicit instructions are preferable if we want tomeasure implicit learning aptitude. Yet, the dissociation from explicit tests alone, of course, does not guarantee that the test is tapping the construct of implicit learning. Therefore, a systematic comparison of different types of instructions coupled with various cognitive aptitude tests appears to be justified. So far we have discussed the test instructions before the exposure phase. The instructions before the test phase may be equally as important. It is important to remember that LLAMA D is a sound recognition test, where participants decide Table 1. Characteristics of three types of test instructions for LLAMA D Instruction type Memorization Just listen Sound check Intention Yes Maybe No Awareness Yes Maybe No Focal attention Yes Yes Maybe Representative study Rogers et al. (2016) Granena (2013a) Saito et al. (2019) 6 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 whether each sound string was in the exposure phase or not. A recognition test in general is considered a test of explicit memory because it is accompanied by conscious memory retrieval. However, it may also be a test of implicit memory when (a) the test is forced choice, (b) the stimuli bear high similarity to one another, (c) there are no conceptual or contextual cues, and (d) the instructions discourage analytic retrieval strategies (Voss & Paller, 2009). Arguably, LLAMA D has the first three features. The fourth point is a procedural feature that researchers need to consider. Research has shown that when test instructions encourage participants to recall specific details of the learning objects, they tend to access their explicit memory, which is dependent on the amount of attention during encoding, whereas when test instructions encourage them to use a vague feeling of familiarity to approach the recognition, they tend to tap into their implicit memory, which is minimally affected by attentional resources during encoding (Mulligan, 1998; Whittlesea & Price, 2001). It has also been shown that familiarity-based recognition judgments (but not conscious recollection) are associated with conceptual implicit memory (Wang & Yonelinas, 2012). Taken together, if we were to measure implicit learning aptitude with LLAMA D, familiarity-based recog- nition judgments might need to be encouraged through test instructions before the test phase. Present study Overall research design The literature review has made it clear that scrutiny of LLAMA D test instructions is necessary for future use of the test as a measure of implicit learning aptitude. The present study attempted to uncover the impact of test instructions by empirically comparing three instruction conditions: (a) “memorization,” (b) “just listen,” and (c) “sound check.” Participants were randomly assigned to one of the three instruction conditions in which they took LLAMA D. The participants also completed two other relatively more establishedmeasures of implicit learning aptitude: (a) probabilistic SRT task and (b) Available Long-Term Memory (ALTM) task (i.e., conceptual priming). The participants further completed three measures of explicit learning aptitude: (a) paired associates task, (b) digit span task, and (c) Stroop task. These measures of rote learning ability and working memory were selected because of their hypothesized relevance to LLAMA D (see the next section). Phonological short-term memory and the central executive have long been held to be two components of working memory relevant to the storage and processing of verbal information (Baddeley & Hitch, 1974), which were measured by the digit span task and the Stroop task, respectively, in this study.3 The Stroop task, a test of inhibitory control ability, was chosen because inhibitory control ability appears to be implicated in various tasks of executive functions and is viewed as the core component of executive functions (Miyake & Friedman, 2012). Additionally, the participants completed a sound discrimination task as a measure of phonetic sensitivity because previous studies suggested the potential role of sound-related ability in LLAMA D; blind people performed significantly better on LLAMA D than sighted people (Smeds, 2015); LLAMA D scores were associated with musical aptitude (Martens et al., 2016) and phonetic acuity (Drozdova et al., n.d.). 3The updated version of Baddeley’s model of workingmemory also includes the episodic buffer, which is a temporary storage system that is capable of integrating information from a variety of sources (Baddeley, 2000). It is, however, yet unclear how to measure this component or how it relates to language learning. LLAMA D as implicit learning aptitude 7 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 We included this last instrument so that the role of implicit versus explicit learning aptitude in LLAMAD could be discussed without confound with sound-related ability. In sum, LLAMAD scores obtained with three types of test instructions were examined in light of two measures of implicit learning aptitude and three measures of explicit learning aptitude, holding specific sound-related aptitude constant. Research questions and hypotheses Before delving into the main question about LLAMA D test instructions, this study addressed the issue of whether currently available measures of implicit learning aptitude tap the same underlying construct. Thus, the first research question was formulated as follows: RQ1: Are two measures of implicit learning aptitude (SRT and ALTM) substantially associated with each other? Previous studies (e.g., Buffington et al., 2021; Gebauer &Mackintosh, 2007; Godfroid & Kim, 2021), though with different sets of instruments, have demonstrated that mea- sures of implicit learning often do not correlate substantially with one another. Thus, we proposed the following hypothesis: Hypothesis 1: Two measures of implicit learning aptitude (SRT and ALTM) will show no more than small correlation (r < .30). The main research question concerned the impact of test instructions on LLAMAD scores in light of implicit and explicit learning aptitude: RQ2: Which of the three types of test instructions (“memorization,” “just listen,” or “sound check”) best yields LLAMA D scores that align with scores on implicit learning aptitude measures (SRT and ALTM) without association with scores on explicit learning aptitudemeasures (paired associates, digit span, and Stroop) when controlling for phonetic sensitivity (sound discrimination)? We predicted that the “memorization” instructions, stimulating participants’ intention, awareness, and focal attention, would call upon their explicit learning aptitude, rather than implicit learning aptitude, and thus the scores would be affected by rote learning ability and working memory: Hypothesis 2a: Under the “memorization” instruction condition, LLAMAD scores will be significantly predicted by scores for rote learning ability (paired associates) and working memory (digit span and Stroop), when controlling for phonetic sensitivity (sound discrimination). The “just listen” instructions, encouraging participants’ focal attention to the stimuli, might make working memory play a role in the test: Hypothesis 2b: Under the “just listen” instruction condition, LLAMA D scores will be significantly predicted by scores for working memory (digit span and Stroop), when controlling for phonetic sensitivity (sound discrimination). The “sound check” instructions minimize participants’ intention and awareness of learning. Arguably, such less explicit learning condition might be conducive to 8 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 data-driven, implicit learning (see, e.g., DeKeyser, 1995; Granena & Yilmaz, 2019; Toomer & Elgort, 2019): Hypothesis 2c: Under the “sound check” instruction condition, LLAMA D scores will be significantly predicted by scores on implicit learning aptitude measures (SRT and ALTM), when controlling for phonetic sensitivity (sound discrimination). Methodology Participants One hundred fourteenmonolingual native English speakers (77 females; 18–38 years of age, M = 20.06, SD = 2.69) participated in this study for course credit or financial compensation at the University of Maryland, College Park. They were randomly assigned to one of three LLAMA D test instruction conditions. All the participants took an identical test battery, the only between-group difference being LLAMA D test instructions. Five participants were excluded because they did not follow instructions and/or were later found out to be ineligible for the study (see Appendix S2 for the eligibility criteria). The final sample size was N = 109 (“memorization” group, n = 37; “just listen” group, n = 36; “sound check” group, n = 36). Instruments LLAMA D LLAMAD is a sound recognition test. The participants listened to 10 unfamiliar sound strings once, all in a row (exposure phase). Then they listened to 30 sound strings one by one, some of whichwere in the exposure phase while others were not, andmade “old” or “new” decisions by clicking the corresponding icon on the computer screen (test phase). The stimuli were computer-synthesized sound strings based on words in a Native American language from British Columbia. The test phase was self-paced. The software calculated a score with a maximum of 75. Depending on assigned conditions, one of three test instructions was given as follows.4 Memorization condition. As with the original test instructions by Meara and col- leagues (e.g., Rogers et al., 2016), in this condition the participants were instructed to memorize the stimuli. They were also informed that there would be a test and what the test would be like. This set of instructions, therefore, activated the participants’ intention, awareness, and focal attention, along with conscious recollection. See Appendix S3 for the exact wording of this type of instructions as well as of the other two types. Just-listen condition. Following Granena’s (2013a, 2013b, 2016, 2019; Granena & Long, 2013) test instructions, in this condition the participants were instructed simply to listen to the stimuli carefully. The subsequent test was not mentioned. After the exposure phase, the participants were encouraged to adopt familiarity-based judgments for the recognition test. 4It should be noted that, although previous studies inspired our test instruction types, our test instructions were not identical to those in previous studies. Therefore, the results are not directly comparable. The present study neither validates nor invalidates the findings of the previous studies cited in this article. LLAMA D as implicit learning aptitude 9 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 Sound-check condition. Following Saito’s (2017, 2019; Saito et al., 2019) test instruc- tions, the participants in this condition were (falsely) informed that the exposure phase would be just a sound check. They were only told to check if they could hear sound without any difficulty. After the exposure phase, they were encouraged to adopt familiarity-based judgments for the recognition test. Shortly after the test, the researcher confirmed through a brief interview that the participants in fact did not try to learn the stimuli during the “sound check” phase. Probabilistic serial reaction time task The probabilistic SRT task (Kaufman et al., 2010) was used as one of implicit learning aptitude tests. In each trial, the participants saw a stimulus appear at one of four locations on the computer screen. Their task was to press the corresponding key as quickly and as accurately as possible. Unknown to them, the sequence of stimulus appearance followed a certain pattern (1–2–1–4–3–2–4–1–3–4–2–3) 85% of the time, which was intermixed with another pattern (3–2–3–4–1–2–4–3–1–4–2–1) 15% of the time. This version of SRT task was particularly hard for the participants to learn explicitly because the probability of stimulus appearance was governed not by first- order information but by second-order information; that is, the preceding trial alone did not provide any useful information for prediction, but the most recent two trials offered such information (e.g., after 1–2, one occurred 85% of the time while four occurred 15% of the time). After an initial practice block where the two patterns occurred with equal likelihood, the participants completed eightmain blocks (120 trials each) in which the sequence followed the previously mentioned differential probability. Following Granena’s (2013a, 2013b, 2016) studies, learning was quantified by calcu- lating the average reaction time difference between probable and improbable trials. Reaction time to the probable trials but not to the improbable trials was expected to decrease, and so the greater the reaction time difference (i.e., improbable � probable), the more learning.5 Available long-term memory task The ALTM category task (Was et al., 2012; Was & Woltz, 2007; Woltz & Was, 2006, 2007) was used as another implicit learning aptitude test. This test probed individual differences in conceptual priming. The test consisted of two tasks: a priming task and a comparison task. In the priming task, the participants saw five words, one at a time for 2 seconds each, presented on the computer screen. Three of the words were exemplars of one category and two were exemplars of another category (e.g., apple, dagger, banana, pear, and bomb). The participants were then asked to indicate which of two categories had more exemplars in the word list (e.g.,Were there more weapons or more fruits?). Following this priming task was the comparison task, where the participants saw pairs of words, a pair at a time, on the computer screen and indicated (for each pair) whether the two words were from the same or different categories as quickly and as accurately as possible by pressing one of two keys. To measure priming effects, there were two conditions for the comparison task, one in which one or both words were exemplars of one of the two categories from the preceding priming task (primed 5Following a reviewer’s request, the reaction time difference between probable and improbable trials was examined; there was a significant difference, t(108) = 10.35, p < .01, suggesting that learning occurred in this task. 10 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 condition), and another in which neither word was an exemplar of the categories from the preceding priming task (unprimed condition). The participants completed 18 rounds of the priming and comparison tasks after two rounds of practice. In half the rounds (nine rounds) the comparison task was with the primed condition and, in the rest of the rounds (nine rounds), with the unprimed condition. Each round included eight comparison task trials after four warm-up (unrelated) trials. The difference in the number of correct responses per minute between the primed and unprimed conditions (i.e., primed� unprimed) was used as an index of priming effects (Woltz &Was, 2006, 2007). Paired associates task The verbal paired associates task (Wechsler, 2009) was used as a test of rote learning ability. The participants saw a list of 14 word pairs (e.g., way and body), a pair at a time, on the computer screen. Their task was to memorize these word pairs. For each word pair, the participants saw the first word of the pair on the left side of the screen for one second, and then the second word of the pair on the right side of the screen for one second. All the 14 pairs were presented one after another with 2-second intervals between pairs. Immediately after this presentation, the participants were presented with the first word of each pair as a prompt and typed in the missing second word. Feedback (either correct or incorrect) was provided for each response. This procedure (presentation of word pairs followed by a recall test) was repeated four times. The number of correctly recalled items (max. 56) served as a score. Digit span task The forward digit span task (Woods et al., 2011) was used as a test of the phonological short-termmemory component of working memory. In each trial, the participants saw a sequence of digits, a digit at a time for one second each, presented on the computer screen. Their task was to recall them in the order presented. One second after the presentation of the last digit, the participants entered digits by clicking on-screen buttons. This was an adaptive test, where the number of digits for the next trial increased by one when the participants answered correctly on a given trial and the number of digits decreased by one after consecutive unsuccessful trials at the same level. The task began with three digits and continued for 14 trials. The Mean Span, the number of digits for which a given participant had a 50% chance of successful recall (Woods et al., 2011), served as a score. Stroop task The color-word Stroop task (Stroop, 1935) was used as a test of the central executive component of working memory. In each trial, the participants saw a color word (red, green, blue, or black) or rectangle shape, presented in red, green, blue, or black—the physical color may or may not match the meaning of the word—on the computer screen. Their task was to respond to the physical color of the word (or rectangle shape) as quickly and as accurately as possible by pressing the corresponding key, ignoring the meaning of the word. In an incongruent trial where a color word was presented in another color, the participants needed to inhibit their prepotent response (i.e., responding to the meaning of the word). The task consisted of 84 trials, in which each of the four colors appeared seven times in each of the three conditions (congruent word, incongruent word, and control rectangle shape). The average reaction time LLAMA D as implicit learning aptitude 11 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 difference between incongruent and control trials (i.e., control � incongruent) was used as an index of the participants’ ability of inhibitory control. Sound discrimination task A sound discrimination task was used as a test of phonetic sensitivity. In each trial, the participants listened to a pair of words from languages unfamiliar to them (Russian, Chinese, and Japanese). Their task was to indicate whether the pair was the same word in the given language. The stimuli wereminimal pairs, but the critical elements were not phonemic in the participants’ native language, namely Russian hard/soft consonants, Chinese tones, and Japanese short/long vowels. The task consisted of 72 target trials (24 trials with each of the languages) and 24 (easier) filler trials, with equal numbers of positive (same) and negative (different) response trials. The two words to be compared in each trial were spoken by different speakers of the same gender. The Russian materials were adopted from the study by Chrabaszcz and Gor (2014), and the Chinese and Japanese materials were recorded for this study. Feedback on accuracy was provided for six trials of practice, but not for the main trials. Percent accuracy was used as a score. Procedure The participants completed the test battery individually with the researcher in a research lab (for the order of the tests, see Table 2).6 All the participants took LLAMA D as their first test, regardless of their assigned test instruction conditions. This procedural decision was made, following Saito’s (2017, 2019; Saito et al., 2019) studies, most importantly to make the “sound check” condition reasonable, but also to preempt presumptions potentially brought in by other tests. After LLAMA D, the participants took the other six tests. To allow formore consistent individual differencemeasures, the Table 2. Order of tests Test Minutes Consent form and background questionnaire 5 LLAMA D 5 Digit span 5 Sound discrimination 10 LLAMA B 5 Break 5 Available long-term memory 30 LLAMA E 5 Break 5 Paired associates 10 LLAMA F 10 Probabilistic serial reaction time 10 Stroop 5 Note: The test battery included LLAMA B, LLAMA E, and LLAMA F, the results of which are not reported here because they are not within the scope of the present article. 6Aside from the seven tests, the participants also completed LLAMA B, LLAMA E, and LLAMA F, the results of which are not reported in this article because they are not within the scope of the present study. 12 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 participants completed these tests in the same order (see, e.g., Gebauer & Mackintosh, 2007; Miyake et al., 2000; Was &Woltz, 2007 for a similar practice). The entire session took about 2 hours. Analysis Descriptive statistics were calculated for all the tests. The test scores were coded such that greater values always indicated higher levels of the attributes. Test reliability was calculated using Cronbach’s alpha. To check if randomization was successful, a series of one-way ANOVAs were conducted. Pearson’s correlations among the variables were also computed. Following Plonsky and Oswald’s (2014) field-specific guideline, corre- lation coefficients were considered small when r was around .25, medium when r was around .40, and large when r was around .60. To answer the main research question, multiple linear regression was used, where LLAMA D scores of each group were regressed on the six predictor variables. The assumptions of linearity, homoscedasticity, and normality were examined through residual plots andQ-Q plots.When assumption violations were identified, data were transformed to overcome the problem. Multi- collinearity was inspected with tolerance values. These analyses were conducted using R version 4.0.3. Results Preliminary analysis Descriptive statistics for all the measures are summarized in Table 3. A series of one- way ANOVAs indicated that there was no significant difference in scores across the groups for the six tests that all participants took under the same condition,meaning that random assignment successfully resulted in roughly equivalent groups. Of those measures, three (ALTM, paired associates, and Stroop) had good reliability (α ≥ .74), whereas two (SRT and sound discrimination) had low reliability (α = .41–.46). A one- way ANOVA indicated that the effect of group was not significant for LLAMA D, F(2, 106) = 2.64, p = .08, suggesting that different test instructions did not have an impact on the level of LLAMA D scores. The reliability of LLAMA D also was low regardless of test instructions (α = .47–.58). Table 3. Descriptive statistics for the measures used in the study (N = 109) Measure Mean SD Min. Max. Reliability LLAMA D Memorization condition (n = 37) 31.76 15.60 0 65 .47 Just listen condition (n = 36) 28.61 14.37 0 55 .49 Sound check condition (n = 36) 23.61 15.75 0 50 .58 SRT 23.31 23.51 –48.71 96.26 .41 ALTM 9.21 4.76 –1.84 21.46 .74 Paired associates 27.38 12.27 0 50 .95 Digit span 6.82 0.99 4.83 9.75 N/Aa Stroop –221.58 149.84 –656.97 108.27 .95 Sound discrimination 63.43 7.06 43.10 80.60 .46 Note: SRT = probabilistic serial reaction time task; ALTM = available long-term memory task. aReliability could not be calculated for this task because it was an adaptive test. LLAMA D as implicit learning aptitude 13 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 Correlational analysis Correlations among the six tests that all participants took under the same condition are summarized in Appendix S4. Regarding the first research question, two measures of implicit learning aptitude, SRT and ALTM, were not significantly correlated, r(107) = –.09, p= .33 (Hypothesis 1 was confirmed). The ALTM task instead showedmarginally significant positive correlations with the paired associates task, r(107) = .19, p = .04, and with the Stroop task, r(107) = .21, p = .03. Correlations between LLAMA D and the other measures with each variant of LLAMA D test instructions are summarized in Table 4. With “memorization” instruc- tions, there was a small to medium marginally significant positive correlation between LLAMA D and the paired associates task, r(35) = .32, p = .05. With “just listen” instructions, there was a medium to large significant negative correlation between LLAMA D and the SRT task, r(34) = –.51, p < .01. With “sound check” instructions, LLAMA D was not correlated with any measures significantly. Regression analysis Based on residual plots, the assumptions of linearity and homoscedasticity did not appear to be violated. Q-Q plots, however, indicated violations of the normality assumption for themodels of the two groups (“memorization” and “just listen” groups). Therefore, the data was transformed to normalize it.7 Based on tolerance values, no multicollinearity issue was found. Regarding the main research question, each group’s LLAMA D scores were regressed on the six predictor variables. These regression models are summarized in Table 5. In the “memorization” model, the only significant predictor was paired associates (Hypothesis 2a was partially confirmed). Its squared semipartial correlation with LLAMA D scores was .11, meaning that paired associates scores uniquely explained about 11% of the variance in LLAMAD scores. It should be noted, however, that the overall model was not significant, R2 = .21, F(6, 30) = 1.33, p = .27, which suggests that this set of predictors did not account for a significant proportion of variance in LLAMAD. In the “just listen”model, the only significant predictor was SRT (Hypothesis 2b was not confirmed). Its squared semipartial correlation with LLAMAD scores was .27, meaning that SRT scores uniquely explained about 27% of the variance in LLAMA D scores. Note also that the association between these variables was in the negative direction. The overall model was significant, R2 = .35, F(6, 29) = 2.57, p = .04. Table 4. Correlations between LLAMA D and other measures with different LLAMA D test instructions LLAMA D instruction type SRT ALTM Paired associates Digit span Stroop Sound discrim. Memorization (n = 37) .16 –.09 .32† .09 .18 .17 Just listen (n = 36) –.51* –.03 .12 .04 –.11 –.11 Sound check (n = 36) .10 .03 .04 –.26 .02 .22 Note: SRT = probabilistic serial reaction time task; ALTM = available long-termmemory task; discrim. = discrimination. An asterisk (*) indicates statistical significance with the Bonferroni correction (i.e., p < .0028). A dagger (†) indicates marginal significance (i.e., not significant with the Bonferroni correction, but significant without the correction, p < .05). 7The data was transformed by ffiffiffiffiffiffiffiffiffiffiffiffi K�X p , where X was each score and K was the largest score of X þ 1 (Tabachnick & Fidell, 2013). 14 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 In the “sound check”model, none of the predictors was significant (Hypothesis 2c was not confirmed), and the overall model was also not significant, R2= .17, F(6, 29)= 0.99, p = .45. Discussion This study was conducted to examine the validity of LLAMA D v2 (Meara, 2005) as a measure of implicit learning aptitude. To see whether and how LLAMAD scores could be influenced by test instructions, participants were randomly assigned to take the test under one of three test instruction conditions. In the text that follows wewill discuss the results bearing in mind our guiding research questions and hypotheses. We will then discuss a rather unexpected finding and the future directions for the use of LLAMA D and research on implicit learning aptitude in SLA. For the first research question, we asked whether two measures of implicit learning aptitude, SRT and ALTM, would be substantially associated with each other. As was hypothesized, and in line with other recent work (e.g., Buffington et al., 2021; Godfroid & Kim, 2021), they were not correlated substantially, r = –.09, p = .33. This lack of convergence in this study and elsewhere suggests that we have not been successful in capturing a latent construct of implicit learning aptitude. Also, the ALTM task was found to be somewhat associated with rote learning ability (paired associates) and the processing component of working memory (Stroop). Given that other studies also reported some association between this task and explicit learning aptitude measures such as the Antisaccade task (Linck et al., 2013) and the letter span task (Granena, 2019), we need to reevaluate whether the ALTM task is truly a measure of implicit learning aptitude. For the second, and main research question, we explored the impact of test instruc- tions on LLAMA D scores. We asked which of the three types, “memorization,” “just listen,” or “sound check” instructions, would be the best for the test as a measure of implicit learning aptitude. Although there was no significant group difference in the level of LLAMAD scores, the relationship between LLAMAD scores and cognitive ability test scores was found to be different across the different test instruction groups, suggesting that different cognitive abilities came into play when performing LLAMA D depending on the instructions given. Table 5. Summary of LLAMA D regression models Memorization instructions (n = 37) Just listen instructions (n = 36) Sound check instructions (n = 36) Variable B SE B SE B SE SRT –0.01 0.01 0.04* 0.01 0.13 0.13 ALTM 0.05 0.06 0.03 0.05 0.01 0.65 Paired associates –0.05* 0.03 –0.03 0.02 0.06 0.22 Digit span –0.12 0.32 0.12 0.21 –4.76 3.04 Stroop 0.00 0.00 0.00 0.00 0.02 0.02 Sound discrimination 0.02 0.06 0.03 0.03 0.66 0.41 R2 .21 .35 .17 F 1.33 2.57* 0.99 Note: For normalization, the data for the “memorization” and “just listen” models were transformed by ffiffiffiffiffiffiffiffiffiffiffi K�X p , where X was each score and K was the largest score of X þ 1. SRT = probabilistic serial reaction time task; ALTM = available long- term memory task. *p < .05. LLAMA D as implicit learning aptitude 15 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 We hypothesized that under the “memorization” instruction condition, rote learn- ing ability and working memory would play a role in LLAMA D. This hypothesis was partially confirmed; in the regression analysis, paired associates was a significant predictor, indicating some involvement of rote learning, whereas digit span and Stroop (indices of working memory) were not significant predictors. However, the regression model was not significant and so the set of predictors did not sufficiently explain the overall variance of LLAMAD scores. Nevertheless, rote learning ability explainedmore than 10% of the variance; therefore, this test instruction type does not seem to be appropriate for LLAMA D as a measure of implicit learning aptitude. Our hypothesis for the “just listen” instruction condition was that working memory would play a role in LLAMA D. This hypothesis was not confirmed; neither of the indices of working memory, digit span or Stroop, was significant in the regression model. The only significant predictor was SRT, a measure of implicit learning aptitude. The direction of this association, however, was negative (we will revisit this point later in this section). Although working memory was not involved in the test performance, LLAMA D scores were negatively predicted by one of the implicit learning aptitude measures; therefore, this test instruction type does not seem to be appropriate either. We hypothesized that under the “sound check” instruction condition, LLAMA D scores would align with scores on implicit learning aptitude measures. This hypothesis was not confirmed; none of the predictors was significant in the regression model. It was not clear what kind of ability was involved in LLAMAD under this test instruction condition. In sum, regardless of test instruction types, LLAMA D scores did not align with scores on the two implicit learning aptitude measures. The two implicit measures were not associated with each other either. Therefore, no clear evidence of a unitary construct of implicit learning aptitude was found. Despite these somewhat disap- pointing results, one noticeable finding is that LLAMA D scores were negatively associated with SRT scores under the “just listen” instruction condition. The SRT task explained about a quarter of variance in LLAMA D in the regression model. This strong association is intriguing given their methodological differences—verbal, audi- tory, accuracy-based LLAMA D, on the one hand, and nonverbal, visual, reaction- time-based SRT, on the other. In other words, this association cannot be explained by method effects but rather was driven by a certain cognitive process involved in these tasks.What set the “just listen” instructions apart from the other instruction types was that the participants’ focal attention was drawn to the material without their strong intention or awareness of learning. Therefore, it would not be unreasonable to think that participants with good focusing ability (henceforth “good focusers”) did well on LLAMA D under the test instruction condition, and in turn these good focusers did not do well on the SRT task. It makes sense that focal attention worked against learning in the SRT task because the most adjacent trials did not provide useful information (see the “Methodology” section) and the participants needed to process the material at a macro level to succeed in the task. Potentially, this aspect—whether one is a good focuser or not—could be a key criterion for implicit learning aptitude in individual difference research. As seen in this study and similar work by others (e.g., Gebauer & Mackintosh, 2007; Godfroid & Kim, 2021), the attempt to construe implicit learning aptitude as something that accelerates learning across domains has not been successful, and so a better way to look at implicit learning aptitude may be to see it as an ability to let go of well-developed cognitive functions, perhaps focal attention in particular, and process input as it is. In other words, implicit learning aptitude may be better seen as lack of interference rather than something 16 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 that individuals have more or less of. Robinson (2005) reported that implicit learning of an artificial grammar was hindered by higher IQ, and other studies also showed that implicit learning was impeded by partial input enhancement (Toomer & Elgort, 2019) and explicit instructions (e.g., Granena & Yilmaz, 2019). All these could be interpreted as the results of focal attention being drawn to part of the learning material, thereby biasing the process of input unfavorably. This interpretation, however, is post hoc, and therefore follow-up studies are needed to (in)validate the finding.8 On the practical side, we now have serious reservations about the use of LLAMAD v2 as a measure of implicit learning aptitude (although this is not a criticism of the LLAMA D test and we are only commenting on this version of LLAMA D, and only on its use for assessing implicit language learning aptitude). This standpoint is based on the results of this study and a recent one by Suzuki. Suzuki (2021) modified the LLAMAD test and collected data on reaction time and confidence (i.e., how confident participants were in their judgments) as well as accuracy scores under the “sound check” instruction condition. The results showed that the participants responded faster and more accurately when they were confident, suggesting that they were using conscious knowledge on the test. In the current study we made the whole set of test instructions even more implicit by encouraging the participants to use familiarity- based judgments instead of conscious recollection (see the last paragraph of the literature review). Even with this effort, LLAMA D did not work well as an implicit test. Our recommendation, therefore, is to stop using LLAMA D as a measure of implicit aptitude and use the test as was originally intended, that is, as a test of sound recognition/listening ability (Meara, 2005; Rogers et al., 2023). This is the direction the LLAMA developer team is heading; the newer version of LLAMA D is accompa- nied by an explicit type of test instructions (Rogers et al., 2023). Additionally, the number of test items increased from 30 to 40 in the newer version, which should mitigate the issue of low reliability (e.g., α = .47–.58 in our study; .54 in Bokander & Bylund, 2020; .20 in Suzuki, 2021). For other changes from Version 2 to 3, see Rogers et al. (2023). From one point of view, the lack of convergence of the implicit aptitude measures in this study supports the idea that implicit learning aptitude is multidimensional (Li & DeKeyser, 2021). As a reviewer also suggested, this could mean that, when measuring implicit language aptitude, we should use a language-based test and perhaps further narrow it down to the specific domain of interest (e.g., grammar, pronunciation)—a case in point is a study by Saito, Sun, and Tierney (2019), where implicit pronunciation-specific language aptitude was measured through assessing participants’ neural encoding of speech. Examining a specific domain in this way is important. At the same time, though, if we want to discuss a cognitive construct of implicit learning aptitude, we eventually need to find out at least some commonal- ities among implicit aptitude measures. In the current study we might have found a 8Another way to look at this is to see this tendency to focus on patterns as a style rather than an aptitude. The ability to switch styles depending on context may be more advantageous than scoring very high on one aptitude or the other (or both!) without being able to switch (in this case between implicit and explicit learning). This is reminiscent of the literature on cognitive styles, in particular field dependence/indepen- dence, where field independence was seen as important for puzzle solving and various perceptual and motor skills, but field dependence more beneficial for smooth social interaction. See, e.g., Price (2004) and Witkin and Goodenough (1981). For examples of field (in)dependence in the SLA literature, see, e.g., DeKeyser (1984) and Johnson et al. (2000). LLAMA D as implicit learning aptitude 17 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 starting point for exploration of such commonalities; the negative association between LLAMA D and SRT led us to speculate about interference from focal attention in the SRT task (see the preceding discussion). This is all the more interesting because the SRT task is the most promising measure of implicit aptitude (Li & DeKeyser, 2021) with predictive power already documented in SLA (e.g., Godfroid & Kim, 2021; Granena, 2013b; Linck et al., 2013; Suzuki & DeKeyser, 2015). The next step then may be to examine the SRT task further, using, for example, an eye tracker to uncover the underlying mechanisms behind the task performance and subsequently develop more instruments that require similar cognitive operations. Lastly, the present study has raised important issues of reliability and validity. It appears that even the instruments that have been frequently used in published research are not necessarily reliable and they might not be measuring what they have been claimed to measure. A great deal more work needs to be done to validate research instruments. It is also important for researchers not to buy too hastily into what a single study has suggested (including our own). Limitations A couple of limitations are important to mention, one of which concerns the reliability of instruments. The reliability of the SRT task was particularly low (.41). This level of reliability is certainly not ideal, but a similar level of reliability was reported in previous studies with this instrument and it is considered standard for a measure of implicit learning (see Granena, 2016; Kaufman et al., 2010; Suzuki & DeKeyser, 2015). Because lower reliability results in attenuation of correlation, it is interesting that a strong negative correlation was found between SRT and LLAMA D in this study, despite the low reliability of these instruments. Nonetheless, replications with more reliable instruments are needed to confirm the findings of this study. We may, for example, be able to improve the reliability of the SRT task by increasing the number of blocks of trials. Also, fatigue could increase error variance, so keeping an experiment short may be another way to improve reliability (although the last two suggestions are somewhat conflicting and we need to find a good balance). Generalizability is another thing to note. As with many other studies in the field, participants were recruited at a university; that is, the sample was drawn from well- educated individuals. Follow-up studies with different kinds of people are needed to see if the findings of this study can be generalized to the population at large (seeAndringa& Godfroid, 2019 for a recent call for more diverse sampling). Conclusion Before summarizing the present study, a few caveats should be noted. First, the study was conducted with LLAMA version 2 (Meara, 2005) and the results should not be generalized to other versions. Second, the reliability of some instruments (SRT, sound discrimination, and LLAMAD)was low, and therefore the results should be interpreted with caution. Despite these limitations, the current study contributed to the field in some important ways, which are summarized as follows. In this study, we examined whether test instruction types had an impact on LLAMA D as ameasure of implicit learning aptitude. Although instruction types did change the relationship between LLAMADand other cognitive test scores, regardless of the type of 18 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 instructions, LLAMA D scores never aligned with scores on implicit learning aptitude measures, showing no evidence of the test being implicit. However, LLAMA D scores were negatively associated with scores on the SRT task, an implicit learning aptitude measure, under the test instruction condition where participants’ focal attention was drawn to the learning material. We interpreted this negative association as a result of focal attentionworking for (in the case of LLAMAD) versus against (in the case of SRT) learning and proposed the idea that implicit learning aptitude is the degree towhich one is able to let go of the tendency to look for patterns and process input without focal attention. Acknowledgments. This article is based on the first author’s qualifying paper under the supervision of the second author for the PhD program in Second Language Acquisition at the University of Maryland, College Park. We thank the committee, Kira Gor, Steven Ross, and Mike Long, for their constructive feedback. This article also benefited from comments by the Handling Editor Kazuya Saito and four anonymous reviewers. We are also grateful to Scott Barry Kaufman and Anna Chrabaszcz for sharing their research materials. Finally, many thanks to Joella (Mei) Huynh, Qi Zheng, KaitlynDorman, Yanlin Peng, Kenta Kurosawa, Jason Struck, and many others, who helped us at various stages of this project. Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/ 10.1017/S0272263122000559. Competing interests. The authors declare none. References Abrahamsson,N., &Hyltenstam, K. (2008). The robustness of aptitude effects in near-native second language acquisition. Studies in Second Language Acquisition, 30, 481–509. Andringa, S., & Godfroid, A. (2019). Call for participation: SLA for all? Reproducing second language acquisition research in non-academic samples. Language Learning, 69, 5–10. Artieda, G., & Muñoz, C. (2016). The LLAMA tests and the underlying structure of language aptitude at two levels of foreign language proficiency. Learning and Individual Differences, 50, 42–48. Baddeley, A. (2000). The episodic buffer: A new component of working memory? Trends in Cognitive Sciences, 4, 417–423. Baddeley, A. D., &Hitch, G. (1974).Workingmemory. In G. A. Bower (Ed.), Recent advances in learning and motivation (Vol. 8, pp. 47–89). Academic Press. Bokander, L., & Bylund, E. (2020). Probing the internal validity of the LLAMA language aptitude tests. Language Learning, 70, 11–47. Buffington, J., Demos, A. P., & Morgan-Short, K. (2021). The reliability and validity of procedural memory assessments used in second language acquisition research. Studies in Second Language Acquisition, 43, 635–662. Chrabaszcz, A., &Gor, K. (2014). Context effects in the processing of phonolexical ambiguity in L2. Language Learning, 64, 415–455. Cleeremans, A., & Jiménez, L. (1998). Implicit sequence learning: The truth is in the details. InM. A. Stadler & P. A. Frensch (Eds.), Handbook of implicit learning (pp. 323–364). Sage. DeKeyser, R.M. (1984). The role of field independence in foreign language instruction. ITL Review of Applied Linguistics, 63, 1–21. DeKeyser, R.M. (1995). Learning second language grammar rules: An experiment with aminiature linguistic system. Studies in Second Language Acquisition, 17, 379–410. DeKeyser, R. M. (2000). The robustness of critical period effects in second language acquisition. Studies in Second Language Acquisition, 22, 499–533. DeKeyser, R.M. (2003). Implicit and explicit learning. In C. J. Doughty &M.H. Long (Eds.), The handbook of second language acquisition (pp. 313–348). Blackwell. Drozdova, P., van Hout, R., & Scharenborg, O. (n.d.). Do noise and linguistic skills influence lexically-guided perceptual learning? http://www.inspire-itn.eu/files/spire2016/02-Drozdova.pdf LLAMA D as implicit learning aptitude 19 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press http://doi.org/10.1017/S0272263122000559 http://doi.org/10.1017/S0272263122000559 http://www.inspire-itn.eu/files/spire2016/02-Drozdova.pdf https://doi.org/10.1017/S0272263122000559 Forsberg Lundell, F., & Sandgren, M. (2013). High-level proficiency in late L2 acquisition: Relationships between collocational production, language aptitude and personality. In G. Granena & M. H. Long (Eds.), Sensitive periods, language aptitude, and ultimate L2 attainment (pp. 231–256). John Benjamins. Gebauer, G. F., & Mackintosh, N. J. (2007). Psychometric intelligence dissociates implicit and explicit learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 34–54. Godfroid, A., & Kim, K. M. (2021). The contributions of implicit-statistical learning aptitude to implicit second-language knowledge. Studies in Second Language Acquisition, 43, 606–634. Granena, G. (2013a). Cognitive aptitudes for L2 learning and the LLAMA Language Aptitude Test. In G. Granena & M. H. Long (Eds.), Sensitive periods, language aptitude, and ultimate L2 attainment (pp. 105–129). John Benjamins. Granena, G. (2013b). Individual differences in sequence learning ability and second language acquisition in early childhood and adulthood. Language Learning, 63, 665–703. Granena, G. (2016). Cognitive aptitudes for implicit and explicit learning and information-processing styles: An individual differences study. Applied Psycholinguistics, 37, 577–600. Granena, G. (2019). Cognitive aptitudes and L2 speaking proficiency: Links between LLAMA and Hi–LAB. Studies in Second Language Acquisition, 41, 313–336. Granena, G., & Long, M. H. (2013). Age of onset, length of residence, language aptitude, and ultimate L2 attainment in three linguistic domains. Second Language Research, 29, 311–343. Granena, G., & Yilmaz, Y. (2019). Corrective feedback and the role of implicit sequence‐learning ability in L2 online performance. Language Learning, 69, 127–156. Janacsek, K., & Nemeth, D. (2013). Implicit sequence learning and working memory: Correlated or complicated? Cortex, 49, 2001–2006. Jiménez, L. (2002). Intention, attention, and consciousness in probabilistic sequence learning. In L. Jiménez (Ed.), Attention and implicit learning (pp. 43–68). John Benjamins. Jiménez, L., & Méndez, C. (1999). Which attention is needed for implicit sequence learning? Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 236–259. Jiménez, L., Méndez, C., & Cleeremans, A. (1996). Comparing direct and indirect measures of sequence learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 948–969. Johnson, J., Prior, S., & Artuso, M. (2000). Field dependence as a factor in second language communicative production. Language Learning, 50, 529–567. Kaufman, S. B., DeYoung, C. G., Gray, J. R., Jiménez, L., Brown, J., &Mackintosh, N. (2010). Implicit learning as an ability. Cognition, 116, 321–340. Krashen, S. (1981). Aptitude and attitude in relation to second language acquisition and learning. In K. C. Diller (Ed.), Individual differences and universals in language learning aptitude (pp. 155–175). Newbury House. Larkin, A. A., Woltz, D. J., Reynolds, R. E., & Clark, E. (1996). Conceptual priming differences and reading ability. Contemporary Educational Psychology, 21, 279–303. Lee, J. (2018). The interactive effects of task complexity, task condition, and cognitive individual differences on L2 writing (Doctoral dissertation). University of Maryland, College Park, MD. https://drum.lib.umd.edu/ handle/1903/21755 Li, S., & DeKeyser, R. (2021). Implicit language aptitude: Conceptualizing the construct, validating the measures, and examining the evidence. Studies in Second Language Acquisition, 43, 473–497. Li, S., & Qian, J. (2021). Exploring syntactic priming as a measure of implicit language aptitude. Studies in Second Language Acquisition, 43, 574–605. Linck, J. A., Hughes, M. M., Campbell, S. G., Silbert, N. H., Tare, M., Jackson, S. R.,…Doughty, C. J. (2013). Hi–LAB: A newmeasure of aptitude for high‐level language proficiency. Language Learning, 63, 530–566. Ma, D., Yao, T., & Zhang, H. (2018). The effect of third language learning on language aptitude among English-major students in China. Journal of Multilingual and Multicultural Development, 39, 590–601. Martens, P., Nakatsukasa, K., & Percival, H. (2016). Music training correlates with visual but not phono- logical foreign language learning skills. Proceedings of 14th International Conference on Music Perception and Cognition, 352–354. Meara, P. M. (2005). Llama Language Aptitude Tests: The manual. Lognostics. Miyake, A., & Friedman, N. P. (2012). The nature and organization of individual differences in executive functions: Four general conclusions. Current Directions in Psychological Science, 21, 8–14. 20 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://drum.lib.umd.edu/handle/1903/21755 https://drum.lib.umd.edu/handle/1903/21755 https://doi.org/10.1017/S0272263122000559 Miyake, A., Friedman, N. P., Emerson, M. J., Witzki, A. H., Howerter, A., & Wager, T. D. (2000). The unity and diversity of executive functions and their contributions to complex “frontal lobe” tasks: A latent variable analysis. Cognitive Psychology, 41, 49–100. Montero, F., Donate, A., Dixon, D., & Long, M. H. (2018, February). Language aptitudes and L2 proficiency in Spanish noun gender assignment. Paper presented at Evolving Perspectives on Advanc- edness: A Symposium on Second Language Spanish, University of Minnesota-Twin Cities, Minneapolis, MN. Moorman, C. M. (2017). Individual differences and linguistic factors in the development of mid vowels in L2 Spanish learners: A longitudinal study (Doctoral dissertation). Georgetown University, Washington, DC. https://repository.library.georgetown.edu/handle/10822/1047824 Mueller, J. (2017).An examination of the influence of age on L2 acquisition of English sound-symbolic patterns (Doctoral dissertation). University of Maryland, College Park, MD. https://drum.lib.umd.edu/handle/ 1903/20315 Mulligan, N. W. (1998). The role of attention during encoding in implicit and explicit memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 27–47. Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. Price, L. (2004). Individual differences in learning: Cognitive control, cognitive style, and learning style. Journal of Educational Psychology, 24, 681–698. Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863. Reber, A. S., Walkenfeld, F., & Hernstadt, R. (1991). Implicit and explicit learning: Individual differences and IQ. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 888–896. Reber, P. J. (2013). The neural basis of implicit learning and memory: A review of neuropsychological and neuroimaging research. Neuropsychologia, 51, 2026–2042. Rebuschat, P. (2013). Measuring implicit and explicit knowledge in second language research. Language Learning, 63, 595–626. Robinson, P. (2005). Cognitive abilities, chunk-strength, and frequency effects in implicit artificial grammar and incidental L2 learning: Replications of Reber, Walkenfeld, and Hernstadt (1991) and Knowlton and Squire (1996) and their relevance for SLA. Studies in Second Language Acquisition, 27, 235–268. Rodríguez Silva, L. H. (2017). The role of cognitive individual differences and learning difficulty in instructed adults’ explicit and implicit knowledge of selected L2 grammar points: A study with Mexican learners of English (Doctoral dissertation). University of Essex, Essex, UK. http://repository.essex.ac.uk/id/eprint/ 20626 Rogers, V., Meara, P., Aspinall, R., Fallon, L., Goss, T., Keey, E., & Thomas, R. (2016). Testing aptitude: Investigating Meara’s (2005) LLAMA tests. In S. A. Liszka, P. Leclercq, M. Tellier, & G. D. Véronique (Eds.), EUROSLA Yearbook 16 (pp. 179–210). John Benjamins. Rogers, V., Meara, P., Barnett-Legh, T., Curry, C., & Davie, E. (2017). Examining the LLAMA aptitude tests. Journal of the European Second Language Association, 1, 49–60. Rogers, V., Meara, P., & Rogers, B. (2023). Testing language aptitude: LLAMA evolution and refinement. In Z. E. Wen, P. Skehan, & R. L. Sparks (Eds.), Language aptitude theory and practice. Cambridge University Press. Saito, K. (2017). Effects of sound, vocabulary, and grammar learning aptitude on adult second language speech attainment in foreign language classrooms. Language Learning, 67, 665–693. Saito, K. (2019). The role of aptitude in second language segmental learning: The case of Japanese learners’ English /ɹ/ pronunciation attainment in classroom settings. Applied Psycholinguistics, 40, 183–204. Saito, K., Sun, H., & Tierney, A. (2019). Explicit and implicit aptitude effects on second language speech learning: Scrutinizing segmental and suprasegmental sensitivity and performance via behavioural and neurophysiological measures. Bilingualism: Language and Cognition, 22, 1123–1140. Saito, K., Suzukida, Y., & Sun, H. (2019). Aptitude, experience, and second language pronunciation proficiency development in classroom settings: A longitudinal study. Studies in Second Language Acqui- sition, 41, 201–225. Shanks, D. R. (2005). Implicit learning. In K. Lamberts & R. Goldstone (Eds.), Handbook of cognition (pp. 202–220). Sage. LLAMA D as implicit learning aptitude 21 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://repository.library.georgetown.edu/handle/10822/1047824 https://drum.lib.umd.edu/handle/1903/20315 https://drum.lib.umd.edu/handle/1903/20315 http://repository.essex.ac.uk/id/eprint/20626 http://repository.essex.ac.uk/id/eprint/20626 https://doi.org/10.1017/S0272263122000559 Skehan, P. (2016). Foreign language aptitude, acquisitional sequences, and psycholinguistic processes. In G. Granena, D. O. Jackson, & Y. Yilmaz (Eds.), Cognitive individual differences in second language processing and acquisition (pp. 17–40). John Benjamins. Smeds, H. (2015). Blindness and second language acquisition: Studies of cognitive advantages in blind L1 and L2 speakers (Doctoral dissertation). Stockholm University, Stockholm, Sweden. https://www.diva-portal. org/smash/record.jsf?pid=diva2%3A790294&dswid=6124 Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643–662. Suzuki, Y. (2021). Probing the construct validity of LLAMA_D as a measure of implicit learning aptitude: Incidental instructions, confidence ratings, and reaction time. Studies in Second Language Acquisition, 43, 663–676. Suzuki, Y., &DeKeyser, R. (2015). Comparing elicited imitation andwordmonitoring asmeasures of implicit knowledge. Language Learning, 65, 860–895. Suzuki, Y., & DeKeyser, R. (2017). The interface of explicit and implicit knowledge in a second language: Insights from individual differences in cognitive aptitudes. Language Learning, 67, 747–790. Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson. Toomer, M., & Elgort, I. (2019). The development of implicit and explicit knowledge of collocations: A conceptual replication and extension of Sonbul and Schmitt (2013). Language Learning, 69, 405–439. Tulving, E., & Schacter, D. L. (1990). Priming and human memory systems. Science, 247, 301–306. Voss, J. L., & Paller, K. A. (2009). An electrophysiological signature of unconscious recognition memory. Nature Neuroscience, 12, 349–355. Wang, W. C., & Yonelinas, A. P. (2012). Familiarity is related to conceptual implicit memory: An examination of individual differences. Psychonomic Bulletin & Review, 19, 1154–1164. Was, C. A., Dunlosky, J., Bailey, H., & Rawson, K. A. (2012). The unique contributions of the facilitation of procedural memory and workingmemory to individual differences in intelligence.Acta Psychologica, 139, 425–433. Was, C. A., & Woltz, D. J. (2007). Reexamining the relationship between working memory and compre- hension: The role of available long-term memory. Journal of Memory and Language, 56, 86–102. Wechsler, D. (2009).Wechsler Memory Scale-Fourth Edition (WMS-IV): Technical and interpretive manual. Pearson. Wen, Z. E., Biedroń, A., & Skehan, P. (2017). Foreign language aptitude theory: Yesterday, today and tomorrow. Language Teaching, 50, 1–31. Whittlesea, B. W., & Price, J. R. (2001). Implicit/explicit memory versus analytic/nonanalytic processing: Rethinking the mere exposure effect. Memory & Cognition, 29, 234–246. Williams, J. N. (2009). Implicit learning in second language acquisition. InW.C. Ritchie &T. K. Bhatia (Eds.), The new handbook of second language acquisition (pp. 319–353). Emerald Group Publishing. Witkin, H. A., & Goodenough, D. R. (1981). Cognitive styles: Essence and origins. Field dependence and field independence. International Universities Press. Woltz, D. J. (1988). An investigation of the role of workingmemory in procedural skill acquisition. Journal of Experimental Psychology: General, 117, 319–331. Woltz, D. J. (1999). Individual differences in priming: The roles of implicit facilitation from prior processing. In P. L. Ackerman, P. C. Kyllonen, & R. D. Roberts (Eds.), Learning and individual differences: Process, trait, and content determinants (pp. 135–156). American Psychological Association. Woltz, D. J. (2003). Implicit cognitive processes as aptitudes for learning. Educational Psychologist, 38, 95–104. Woltz, D. J., &Was, C. A. (2006). Availability of related long-termmemory during and after attention focus in working memory. Memory & Cognition, 34, 668–684. Woltz, D. J., & Was, C. A. (2007). Available but unattended conceptual information in working memory: Temporarily active semantic content or persistent memory for prior operations? Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 155–168. Woods, D. L., Kishiyama, M. M., Yund, E. W., Herron, T. J., Edwards, B., Poliva, O., Hink, R. F., & Reed, B. (2011). Improving digit span assessment of short-term verbal memory. Journal of Clinical and Experimental Neuropsychology, 33, 101–111. 22 Takehiro Iizuka and Robert DeKeyser https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A790294&dswid=6124 https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A790294&dswid=6124 https://doi.org/10.1017/S0272263122000559 Yalçın, Ş., Çeçen, S., & Erçetin, G. (2016). The relationship between aptitude and working memory: an instructed SLA context. Language Awareness, 25, 144–158. Yalçın, Ş., & Spada, N. (2016). Language aptitude and grammatical difficulty: An EFL classroom-based study. Studies in Second Language Acquisition, 38, 239–263. Yi, W. (2018). Statistical sensitivity, cognitive aptitudes, and processing of collocations. Studies in Second Language Acquisition, 40, 831–856. Cite this article: Iizuka, T. and DeKeyser, R. (2023). Scrutinizing LLAMA D as a measure of implicit learning aptitude. Studies in Second Language Acquisition, 1–23. https://doi.org/10.1017/ S0272263122000559 LLAMA D as implicit learning aptitude 23 https://doi.org/10.1017/S0272263122000559 Published online by Cambridge University Press https://doi.org/10.1017/S0272263122000559 https://doi.org/10.1017/S0272263122000559 https://doi.org/10.1017/S0272263122000559 Scrutinizing LLAMA D as a measure of implicit learning aptitude Introduction Literature review Defining implicit learning aptitude Measures of implicit learning aptitude LLAMA D as a measure of implicit learning aptitude Test instructions for LLAMA D Present study Overall research design Research questions and hypotheses Methodology Participants Instruments LLAMA D Memorization condition Just-listen condition Sound-check condition Probabilistic serial reaction time task Available long-term memory task Paired associates task Digit span task Stroop task Sound discrimination task Procedure Analysis Results Preliminary analysis Correlational analysis Regression analysis Discussion Limitations Conclusion Acknowledgments Supplementary Materials Competing interests References